Data Science Full Course - Learn Data Science Beginners Final day
We have the support. We have the confidenceto lift the leverage and the
conviction. So guys, that's itfor this session. That is how you
createAssociation rules using the API. Real gold tone which helps a
lotin the marketing business. It runs on the principleof Market Basket
analysis, which is exactly what bigcompanies like Walmart. You have
Reliance and Target to even Ikea does itand I hope you got to know what
exactly isAssociation rule mining what is lift confidence and support
and how tocreate Association rules. So guys reinforcement learning.
Dying is a partof machine learning where an agent is putin an
environment and he learns to behavein this environment by performing
certain actions. Okay, so it basically performsactions and it either
gets a rewards on the actions or it gets a punishmentand observing the
reward which it gets from those actionsreinforcement learning is all
about taking an appropriateaction in order to maximize the rewardin a
particular situation. So guys in supervised learningthe training data
comprises of the input and the expected output And so the model is
trainedwith the expected output itself, but when it comesto
reinforcement learning, there is noexpected output here. The
reinforcement agentdecides what actions to take in order to performa
given task in the absence of a training data set. It is bound to
learnfrom its experience itself. Alright. So reinforcement learningis
all about an agent who's put inan unknown environment and he's going to
usea hit and trial method in order to figure outthe environment and then
come up with an outcome. Okay. Now, let's look at it. Reinforcement
learningwithin an analogy.
So consider a scenariowhere in a baby is
learning how to walk the scenariocan go about in two ways. Now in the
first casethe baby starts walking and makes it to the candy here. The
candy is basicallythe reward it's going to get so since the candy isthe
end goal the baby is happy. It's positive. Okay, so the baby is happyand
it gets rewarded a set of candies now another wayin which this could go
is that the baby starts walking but Falls due to some hurdlein between
The baby gets hot and it doesn't get any candyand obviously the baby is
sad. So this is a negative reward. Okay, or you can saythis is a
setback. So just like how we humans learnfrom our mistakes by trial and
error reinforcementlearning is also similar. Okay, so we have an agent
which is basicallythe baby and a reward which is the candy over here.
Okay, and with many hurdlesin between the agent is supposed to find the
best possible pathto read through the reward. So guys. I hope you all
are clear withthe reinforcement learning now, let's look at At
thereinforcement learning process. So generally a reinforcementlearning
system has two main components, right? The first is an agent and the
second oneis an environment. Now in the previous case, we saw that the
agent was the baby and the environmentwas the living room where in the
baby was crawling. Okay. The environment is the setting that the agent
is actingon and the agent over here represents the reinforcementlearning
algorithm.
So guys the reinforcementlearning process starts when the
environmentsends a state to the And then the agentwill take some actions
based on the observations in turn the environmentwill send the next
state and the respective rewardback to the agent. The agent willupdate
its knowledge with the reward returned bythe environment and it uses
that to evaluateits previous action. So guys thisLoop keeps continuing
until the environment sendsa terminal state which means that the agent
hasaccomplished all his tasks and he finally gets the reward. Okay. This
is exactly what was depictedin this scenario. So the agent
keepsclimbing up ladders until he reaches his rewardto understand this
better. Let's suppose that our agent islearning to play Counter Strike.
Okay. So let's break it downnow initially the RL agent which is
basicallythe player player 1. Let's say it's a player one who is trying
to learnhow to play the game. Okay. He collects some statefrom the
environment. Okay. This could be the first dateof Counter-Strike now
based on the state the agentwill take some action. Okay, and this
actioncan be anything that causes a result. So if the Almost left or
right it's alsoconsidered as an action. Okay, so initially the actionis
going to be random because obviously the first timeyou pick up
Counter-Strike, you're not goingto be a master at it. So you're going to
trywith different actions and you just want to pick up arandom action
in the beginning. Now the environment is goingto give a new state. So
after clearing that the environmentis now going to give a new state to
the agent or to the player. So maybe he's across th one now. He's in
stage 2. So now the playerwill get a reward our one from the
environment. Because it cleared stage 1. So this reward can be anything.
It can be additional pointsor coins or anything like that. Okay. So
basically this Loopkeeps going on until the player is deador reaches the
destination. Okay, and it continuouslyoutputs a sequence of States
actions and rewards. So guys, this wasa small example to show you how
reinforcementlearning process works. So you startwith an initial State
and once a player clothesthat state he gets a reward after that the
environment will give another stageto the player. And after it clears
that stateit's going to get another award and it's going to keep
happening until the playerreaches his destination. All right, so guys,I
hope this is clear now, let's move on and look at the
reinforcementlearning definitions. So there are a few Conceptsthat you
should be aware of while studyingreinforcement learning. Let's look at
thosedefinitions over here. So first we have the agentnow an agent is
basically the reinforcement learningalgorithm that learns from trial and
error. Okay, so an agent takes actionslike For example a soldier in
Counter-Strike navigatingthrough the game. That's also an action. Okay,
if he moves left rightor if he shoots at somebody that's also an action.
Okay. So the agent is responsible for taking actionsin the environment.
Now the environment isthe whole Counter-Strike game. Okay. It's
basically the worldthrough which the agent moves
the environment
takesthe agents current state and action as input and it Returns the
agency rewardand its next state as output. Alright next we have
actionnow all the possible. Steps that an agentcan take are called
actions. So like I said, it can be moving right leftor shooting or any
of that. Alright, then we havestate now state is basically the current
conditionreturned by the environment. So whichever State you are in if
you are in state 1 orif you're in state to that representsyour current
condition. All right. Next we have reward a rewardis basically
an
instant return from the environmentto appraise Your Last Action. Okay,
so it can beanything like coins or it can be audition. Two points. So
basically a rewardis given to an agent after it clearsthe specific
stages. Next we have policy policiesbasically the strategy that the
agent uses to findout his next action based on his current state policy
isjust the strategy with which you approach the game. Then we have
value. Now while you isthe expected long-term return with discount so
value in action value can be a littlebit confusing for you right now,
but as we move further, you'll understandwhat I'm talking. Kima okay. So
value is basicallythe long-term return that you get with discount. Okay
discount. I'll explain inthe furthest lines. Then we have action value
now action valueis also known as Q value. Okay. It's very similar to
Value except that it takesan extra parameter, which is the current
action. So basically here you'll findout the Q value depending on the
particular actionthat you took. All right. So guys don't get
confusedwith value and action value. We look at examples in the further
slides and youwill understand this better. Okay. So guys make sure that
you'refamiliar with these terms because you'll be seeinga lot of these
terms in the further slides. All right. Now before we move any further,
I'd like to discussa few more Concepts. Okay. So first we will
discussthe reward maximization.
So if you haven't alreadyrealized it the
basic aim of the RL agent isto maximize the reward now, how does that
happen? Let's try to understandthis in depth. So the agent must
betrained in such a way that he takes the best action sothat the reward
is Because the end goalof reinforcement learning is to maximize your
rewardbased on a set of actions. So let me explain thiswith a small game
now in the figure you can seethere is a fox there's some meat and
there's a tiger so our agent is basicallythe fox and his end goal is to
eat the maximum amount of meat before being eatenby the tiger now since
the fox is a cleverfellow he eats the meat that is closer to himrather
than the meat which is closer to the tiger. Now this is because
thecloser he is to the tiger the higher our his chancesof getting
killed. So because of this the rewardswhich are near the tiger, even if
they arebigger meat chunks, they will be discounted. So this is
exactlywhat discounting means so our agent is not goingto eat the meat
chunks which are closer to the tigerbecause of the risk. All right now,
even though the meat chunksmight be larger. He does not want to takethe
chances of getting killed. Okay. This is called discounting. Okay. This
is where you discount because it improviseand you just eat the meat
which are closer to youinstead of taking risks and eating the meatwhich
are The to your opponent. All right. Now the discountingof reward Works
based on a value called gammawill be discussing gamma in our further
slides but in short the valueof gamma is between 0 and 1. Okay. So the
smaller the
gamma thelarger is the discount value. Okay. So if the gamma
value is lesser, it means that the agentis not going to explore and
he's not goingto try and eat the meat chunks which are closer to the
tiger. Okay, but if the gamma valueis closer to 1 it means that our
agent is actuallyWe're going to explore and it's going to dryand eat the
meat chunks which are closer to the tiger. All right, now, I'll be
explaining thisin depth in the further slides. So don't worry if you
haven't gota clear concept yet, but just understand that reward
maximization isa very important step when it comesto reinforcement
learning because the agent hasto collect maximum rewards by the end of
the game. All right. Now, let's lookat another concept which is called
explorationand exploitation. So exploration likethe name suggests is
about exploring and capturing. More information aboutan environment on
the other hand exploitation is about using the already known exploited
informationto heighten the rewards. So guys consider the foxand tiger
example that we discussed now here thefox eats only the meat chunks
which are close to him, but he does not eatthe meat chunks which are
closer to the tiger. Okay, even though theymight give him more Awards.
He does not eat them if the fox only focuseson the closest rewards, he
will never reachthe big chunks of meat. Okay, this is whatexploitation
is the about you just going to usethe currently known information and
you're goingto try and get rewards based on that information. But if the
fox decidesto explore a bit, it can find the bigger awardwhich is the
big chunks of meat. This is exactlywhat exploration is. So the agent is
not goingto stick to one corner instead. He's going to explorethe entire
environment and try and collect bigger rewards. All right, so guys, I
hope you all are clear withexploration and exploitation. Now, let's
lookat the markers decision process. So guys this is basicallya
mathematical approach for mapping a solution inreinforcement learning in
a way. The purpose of reinforcementlearning is to solve a Markov
decision process. Okay. So there are a few parameters that are used to
getto the solution. So the parameters includethe set of actions the set
of states the rewards the policy that you're taking to approachthe
problem and the value that you get.
Okay, so to sum it upthe agent must
take an action a to transitionfrom a start state. The end State s while
doing so the agent will receivea reward are for each action that he
takes. So guys a series of actions taken bythe agent Define the policy
or it defines the approachand the rewards that are collectedDefine the
value. So the main goal here isto maximize the rewards by choosing the
optimum policy. All right. Now, let's try to understandthis with the
help of the shortest path problem. I'm sure a lot of you mighthave gone
through this problem when you are in college. So guys lookat the graph
over here. So our aim here isto find the shortest path between a and
dwith minimum possible cost. So the value that you seeon each of these
edges basically denotes the cost. So if I want to go from a to cit's
going to cost me 15 points. Okay. So let's look athow this is done. Now
before we moveand look at the problem in this problem the set ofstates
are denoted by the nodes, which is ABCD and the action is to
Traversefrom one node to the other. So if I'm going from a Be that's an
actionsimilarly a to see that's an action. Okay, the reward isbasically
the cost which is representedby each Edge over here. All right. Now the
policy isbasically the path that I choose toreach the destination. So
let's say I choosea seed be okay that's one policy in orderto get to D
and choosing a CD which is a policy. Okay. It's basically howI'm
approaching the problem.
So guys here youcan start off at node a and you
can take baby steps to your destination nowinitially you're Clueless.
So you can just takethe next possible node, which is visible to you. So
guys if you're smart enough, you're going to choose ato see instead of
ABCD or ABD. All right. So now if you areat nodes see you want to
Traverse to note D. Youmust again choose a wise path or red you just
haveto calculate which path has the highest cost or which path will
giveyou the maximum rewards. So guys, this isa simple problem. We just
drank to calculatethe shortest path between a and d by traversingthrough
these nodes. So if I travels from a CD itgives me the maximum reward.
Okay, it gives me 65 which is more than any otherpolicy would give me
okay. So if I go from ABD, it would be 40 when youcompare this to a CD.
It gives me more reward. So obviously I'm goingto go with a CB. Okay, so
guys wasa simple problem in order to understand howMarkov decision
process works. All right, so guys,I want to ask you a question. What do
you think? I did hear didI perform exploration or did I perform
exploitation? Now the policy for the aboveexample is of exploitation
because we didn't explorethe other nodes. Okay. We just selected three
notesand we Traverse through them. So that's why thisis called
exploitation. We must always explorethe different notes so that we can
finda more optimal policy. But in this case, obviouslya CD has the
highest reward and we're going with a CD, but generally it'snot so
simple. There are a lot of nodes therehundreds of notes to Traverse and
they're like 50 60 policies. Okay, 50 60 different policies.
So you make
sure you explore. All the policies and then decideon an Optimum policy
which will give youa maximum reward. So guys before we performthe
Hands-On part. Let's try to understandthe math behind our demo. Okay. So
in our demo will be usingthe Q learning algorithm which is a type of
reinforcementlearning algorithm. Okay, it's simple, it just means that
if youtake the best possible actions to reach your goalor to get the
most rewards. All right, let's try tounderstand this with an example. So
guys, this is exactlywhat be running in In our demo, so make sure
youunderstand this properly. Okay. So our goal here iswe're going to
place an agent in any one of the rooms. Okay. So basically these
squaresyou see here our rooms. OK 0 is a room for is a room three isa
room one is a room and 2:05 is also a room. It's basically a wayoutside
the building. All right. So what we're going to do iswe're going to
place an agent in any one of these rooms and the goal is to reachoutside
the building.
Okay outside. The building isroom number five. Okay, so
these are These spacesare basically doors, which means that you can
gofrom zero to four. You can go from 4to 3 3 to 1 1 to 5 and similarly 3
to 2, but you can't gofrom 5 to 2 directly. All right, so thereare
certain set of rooms that don't getconnected directly. Okay. So like of
mentioned here eachroom is numbered from 0 to 4, and the outside of the
buildingis numbered as five and one thing to note hereis Room 1 and room
for directly leadto room number five. All right. So room number one and
fourwill directly lead out to room number five. So basically our goal
over hereis to get to room number five. Okay to set this roomas a goal
will associate a reward value to each door. Okay. Don't worry. I'll
explain what I'm saying. So if you re present these roomsin a graph this
is how the graph is going to look. Okay. So for example from true, you
can go to threeand then three two, one one two five which will lead us
to our goalthese arrows represent the link between the dose. No, this is
quiteunderstandable now. Our next step isto associate a reward value to
each of these doors. Okay, so the rooms that are directly connectedto
our end room, which is room number five willget a reward of hundred.
Okay. So basically our room numberone will have a reward five now. This
is obviously because it's directlyconnected to 5 similarly for will also
be associatedwith a reward of hundred because it's directlyconnected to
5. Okay. So if you go outfrom for it will lead to five now the other
know. Roads are not directlyconnected to 5. So you can't directlygo from
0 to 5. Okay. So for this will be assigninga reward of zero.
So
basically other doorsnot directly connected to the Target roomhave a
zero reward. Okay now because the doorsare to weigh the two arrows are
assigned to each room. Okay, you can see two arrowsassigned to each
room. So basically zero leads to fourand four leads back to 0 now. We
have assigned 0 0 over here because 0 does not directly lead to five but
onedirectly leads to Five and that's why you can seea hundred over here
similarly for directly leadsto our goal State and that's why we were
signeda hundred over here and obviously five two fiveis hundred as well.
So here all the directconnections to room number five are rewarded
hundredand all the indirect connections are awarded zero. So guys in
q-learning the endgoal is to reach the state with the highest reward so
that the agentarrives at the goal. Okay. So let me just explainthis
graph to you in detail now these These roomsover here labeled one, two,
three to five they representthe state an agent is in so if I stay to one
It means that the agent is in room number one similarlythe agents
movement from one room to the otherrepresents the action. Okay. So if I
say one two, three,it represents an action. All right. So basically the
stateis represented as node and the action is representedby these
arrows. Okay. So this is what this graph isabout these nodes represent
the rooms and these Arrowsrepresent the actions. Okay. Let's look at a
small example. Let's set the initialstate to 0. So my agent is placedin
room number two, and he has to travel all the wayto room number five. So
if I set the initial stageto to he can travel to State 3. Okay from
three hecan either go to one or you can go back to to or you can go to
forif he chooses to go to for it will directly takehim to room number 5,
okay,
which is our end goal and evenif he goes from room number 3 2 1
it will takehim to room number. High five, so this is how our algorithm
works is goingto drivers different rooms. In order to reachthe Gold
Room, which is room number 5. Now, let's tryand depict these rewards in
the form of a matrix. Okay, because we'll beusing this our Matrix or the
reward Matrix tocalculate the Q value or the Q Matrix. Okay. We'll see
what the Q value isin the next step. But for now, let's see how this
rewardMatrix is calculated. Now the - ones that you seein the table,
they represent the null values. Now these -1 basically means that
Wherever there isno link between nodes. It's represented as minus 1 so 0
2 0 is minus 1 0 to 1there is no link. Okay, there's no directlink from
0 to 1. So it's represented asminus 1 similarly 0 to 2 or 2. There is
no link. You can see there'sno line over here. So this is also minus 1,
but when it comes to 0 to 4, there is a connectionand
we have numbered 0
because the reward for a state which is not directly connectedto the
goal is zero, but if you look at this 1 comma 5which is is basically
traversing from Node 1 to node 5, youcan see the reward is hundred.
Okay, that's basically because one and fiveare directly connected and
five is our end goal. So any node which will directly connectedto our
goal state will get a reward of hundred. Okay. That's why I've put
hundredover here similarly. If you look at thefourth row over here. I've
assigned hundred over here. This is because from 4 to 5that is a direct
connection. There's a direct connection which gives thema hundred
reward. Okay, you can see from 4 to 5. There is a direct link. Okay, so
from room number for to room numberfive you can go directly. That's why
there'sa hundred reward over here. So guys, this ishow the reward Matrix
is made. Alright, I hope thisis clear to you all. Okay. Now that we
havethe reward Matrix. We need to create another Matrixcalled The Q
Matrix. OK here, you'll storeor the Q values that will calculate nowthis
Q Matrix basically represents the memory of what the agent has
learnedthrough experience. Okay. So once he traversesfrom one room to
the final room, whatever he's learned. It is stored in this Q Matrix.
Okay, in orderfor him to remember that the next time he travelsthis we
use this Matrix. Okay.
It's basically like a memory. So guys the rows of
the Q Matrixwill represent the current state of the agent The Columns
willrepresent the possible actions and to calculate the Q valueuse this
formula. All right, I'll show youwhat the Q Matrix looks like, but
first, let'sunderstand this formula. Now this Q value will calculating
because wewant to fill in the Q Matrix. Okay. So this is basically a
Matrixover here initially, it's all 0 but as the agent Traverse isfrom
different nodes to the destination node. This Matrix will get filled up.
Okay. So basically it will belike a memory to the agent. He'll know
that okay, when he traversed usinga particular path, he found out that
his value was maximum oras a reward was maximum of year. So next time
he'llchoose that path.
This is exactly whatthe Q Matrix is. Okay. Let's
go back now guys, don't worry aboutthis formula for now because we'll be
implementingthis formula in an example. In the next slide. Okay, so
don't worryabout this formula for now, but here just remember that this Q
basically representsthe Q Matrix the r represents the reward Matrix and
the gamma is the gamma valuewhich I'll talk about shortly and here you
just finding outthe maximum from the Q Matrix. So basically the gamma
parameterhas a range from 0 to 1 so you can have a value of0.1 0.3 0.5
0.8 and all of that. So if the gamma is closerto zero it means That the
agent will consider only the immediaterewards which means that the agent
willnot explore the surrounding. Basically, it won'texplore different
rooms. It will just choosea particular room and then we'll trysticking
to it. But if the value of gammais high meaning that if it's closer to
one the agentwill consider future Awards with greater weight. This means
that the agentwill explore all the possible approaches or all the
possible policiesin order to get to the end goal. So guys, this is what
Iwas talking about when I mention ation and exploration. All right. So
if the gamma value is closerto 1 it basically means that you're actually
exploringthe entire environment and then choosingan Optimum policy. But
if your gamma valueis closer to zero, it means that the agentwill only
stick to a certain set of policies and it will calculatethe maximum
reward based on those policies. Now next. We have the Q learning
algorithm that we're going to useto solve this problem. So guys now this
is goingto look very confusing to y'all. So let me just explainIn this
with an example.
Okay. We'll see what we're actuallygoing to run in our
demo. We will do the math behind it. And then I'll tell you whatthis Q
learning algorithm is. Okay, you'll understand itas I'm showing you the
example. So guys in the Q learningalgorithm the agent learns from his
experience. Okay, so each episode, which is basically when the agents
are traversingfrom an initial room to the end goal is equivalentto one
training session and in every training sessionthe agent will explore the
environment itwill Receive some reward until it reaches the goal
statewhich is five. So there's a purpose of training is to enhancethe
brain of our agent. Okay only if he knowsthe environment very well,
will
he knowwhich action to take and this is why we calculatethe Q Matrix
okay in Q Matrix, which is going to calculatethe value of traversing
from every state to the endstate from every initial room to the end
room. Okay, so when wecalculate all the values or how much rewardwe're
getting from each policy that we We knowthe optimum policy that will
give usthe maximum reward. Okay, that's whywe have the Q Matrix. This is
very important because the moreyou train the agent and the more Optimum
your outputwill be so basically here the agent will not
performexploitation instead. He'll explore around and go back and
forththrough the different rooms and find the fastestroute to the goal.
All right. Now, let's look at an example. Okay. Let's see howthe
algorithm works. Okay. Let's go backto the previous slide and Here it
says that the first step isto set the gamma parameter. Okay. So let's do
that. Now the first stepis to set the value of the learning parameter,
which is gamma and wehave randomly set it to zero point eight. Okay. The
next step is to initializethe Matrix Q 2 0 Okay. So we've set Matrix Q 2
0 over here and then wewill select the initial stage Okay, the third
step is selecta random initial State and here we've selected the initial
Stateas room number one
Okay. So after you initializethe matter Q as a
zero Matrix from room number one, you can either go to room numberthree
or number five. So if you lookat the reward Matrix can see that from
room number one, you can only go to room numberthree or room number
five. The other valuesare minus 1 here, which means that there isno link
from 1 to 0 1 2 1 1 2 2 and 1 to 4. So the only possible actionsfrom
room number one is to go to room number 3 and to goto room number five.
All right. Okay. So let's selectroom number five, okay. So from room
number one, you can go to 3 and 5 and wehave randomly selected five. You
can also selectthree but for example, let's select five over here. Now
from Rome five, you're going to calculatethe maximum Q value for the
next state basedon all possible actions. So from number five, the next
state can beroom number one four or five. So you're going to
calculatethe Q value for traversing 5 to 1 5 2 4 5 2 5and you're going
to find out which has the maximum Q valueand that's how you're going.
Compute the Q value.
So let's Implement our formula. Okay, this isthe
q-learning formula. So right now we're traversing from room numberone to
room number 5. Okay. This is our state. So here I've writtenQ 1 comma
5. Okay one representsour current state which is room number one. Okay.
Our initial state was roomnumber one and we are traversing to room
number five. Okay. It's shown in this figure roomnumber 5 now for this
we need to calculate the Q valuenext in our formula. It says the
rewardMatrix State and action. So the reward Matrix for 1 comma5 let's
look at 1 comma 5 1 comma 5 correspondsto a hundred. Okay, so I reward
overhere will be hundred so r 1 comma 5 is basicallyhundred then you're
going to add the gamma value. Now the gamma valuewill be initialized it
to zero point eight. So that's what wehave written over here. And we're
going to multiply itwith the maximum value that we're going to getfor
the next date based on all possible actions. Okay. So from 5,the next
state is 1 4 and 5. So if Travis from five to one that's what I've
writtenover here 5 to 4. You're going to calculate the Qvalue of Fire 2 4
& 5 to 5. Okay. That's what Imentioned over here. So Q 5 comma 1
5 comma 4 and 5 comma 5 arethe next possible actions that you can take
from State V.
So r 1 comma 5 is hundred. Okay, because fromthe reward
Matrix, you can see that 1 comma5 is hundred 0.8 is the value of gamma
after that. We will calculate Qof 5 comma 1 5 comma 4 and 5 comma 5
LikeI mentioned earlier that we're going to initializeMatrix Q as zero
Matrix So based setting the value of 0 because initially obviouslythe
agent doesn't have any memory of what is happening. Okay, so he
juststarting from scratch. That's why allthese values are 0 so Q of 5
comma 1 will obviouslybe 0 5 comma 4 would be 0 and 5 comma 5 will also
be zero and to find out the maximumbetween these it's obviously 0. So
when youcompute this equation, you will get hundred sothe Q value of 1
comma 5 is So if I agent goes from roomnumber one to room number five,
he's going to havea maximum reward or Q value of hundred. All right. Now
in the nextslide you can see that I've updated the valueof Q of 1 comma
5. Okay, it said 200. All right now similarly, let's look at another
example sothat you understand this better. So guys, this is exactly what
we're goingto do in our demo. It's only going to be coded. Okay. I'm
just explainingour code right now. I'm just telling youthe math behind
it. Alright now, let's lookat another example. Example OK this time.
We'll start with a randomlychosen initial State.
Let's say thatwe've
chosen State 3. Okay. So from room 3, you can either goto room number
one two, or four randomlywill select room number one and from room
number one, you're going to calculatethe maximum Q value for the next
state basedon all possible actions. So the possible actionsfrom one is
to go to 3 and to go to 5 now if you calculate the Q valueusing this
formula, so let me explain thisto you once again now, 3 comma 1
basically represents that we're in room numberthree and we are going to
room number one. Okay. So this represents our action? Okay. So we're
going from 3 to 1 which is our action and three is our current statenext
we will look at the reward of going from 3 to 1. Okay, if you go to the
rewardMatrix 3 comma 1 is 0 okay. Now this is because there's no direct
linkbetween three and five. Okay, so that's whythe reward here is zero.
So the value here will be 0 after that we havethe gamma value, which is
zero point. Eight and then we're goingto calculate the Q Max of 1 comma
3 and 1 comma5 out of these whichever has the maximum valuewe're going
to use that. Okay, so Q of 1 comma 3 is 0. All right 0 you can seehere 1
comma 3 is 0 and 1 comma 5 if you remember we just calculated1 comma 5
in the previous slide. Okay 1 comma 5 is hundred. So here I'm goingto
put a hundred. So the maximum here is hundred. So 0.8 in 200 will give
us c tso that's the Q value. Going to get if you Traversefrom three two
one. Okay. I hope that was clear. So now we have Traversfrom room number
three to room numberone with the reward of 80. Okay, but we
stillhaven't reached the end goal which is room number five. So for our
next episodethe state will be room. Number one. So guys, like I
said,we'll repeat this in a loop because room numberone is not our end
goal. Okay, our end goalis room number 5. So now we need to figure out
how to get from room numberone to room number 5.
So from room number
one, you can either either goto three or five. That's what I'vedrawn
over here. So if we select five we knowthat it's our end goal. Okay. So
from room number 5, then you have to calculatethe maximum Q value for
the next possible actions. So the next possible actionsfrom five is to
go to room number one room numberfour or room number five. So you're
going to calculatethe Q value of 5 to 1 5 2 4 & 5 2 5 and find
out which is the maximum Q value here and you're goingto use that value.
All right. So let's lookat the formula now now again, we're in room
numberone and Want to go to room number 5. Okay, so that's exactly what
I've written here Q 1 comma5 next is the reward Matrix. So reward of 1
comma5 which is hundred. All right, then we have addedthe gamma value
which is 0.8. And then we're goingto find the maximum Q value from 5 to 1
5 2 4 & 5 to 5. So this is whatwe're performing over here. So 5
comma 1 5 comma 4and 5 comma 5 are all 0 this is because we initially
set allthe values of the Q Matrix as 0 so you get Hundred over hereand
the Matrix Remains the Same because we alreadyhad calculated Q 1
comma 5
so the value of 1 comma5 is already fed to the agent. So when he comes
back here,he knows our okay. He's already donethis before now. He's
going to tryand Implement another method. Okay is going to tryand take
another route or another policy. So he's going to try to gofrom
different rooms and finally land upin room number 5, so guys, this is
exactlyhow our code runs. We're going to Traversethrough each and every
node because we want an Optimum ball. See, okay. An Optimum policyis
attained only when you Traversethrough all possible actions. Okay. So if
you go throughall possible actions that you can perform onlythen will
you understand which is the best action which will lead usto the reward.
I hope this is clear now, let's move onand look at our code. So guys,
this is our codeand this is executed in Python and I'm assuming that all
of you havea good background in Python. Okay, if you don't
understandpython very well. I'm going to leave a linkin the description.
You can check outthat video on Python and then maybe comeback to this
later. Okay, but I'll be explainingthe code to you anyway, but I'm not
going to spend a lotof time explaining each and every line of
codebecause I'm assuming that you know python. Okay. So let's look at
the first lineof code over here. So what we're going to do iswe're going
to import numpy. Okay numpy is basicallya python library for adding
support forlarge multi-dimensional arrays and matrices and it'sbasically
for computing mathematical functions. Okay so first Want to import that
after that we're goingto create the our Matrix. Okay. So this is the
our Matrix nextwe're going to create a q Matrix and it's a 6 into 6
Matrix because obviously we havesix states starting from 0 to 5. Okay,
and we are goingto initialize the value to zero. So basically the Q
Matrixis going to be initialized to zero over here. All right, after
that we're settingthe gamma parameter to 0.8. So guys you can playwith
this parameter and you know move itto 0.9 or movement logo to 0.8. Okay,
you can see seewhat happens then then we'll set an initial stage. Okay
initial stageis set as 1 after that. We're defining a functioncalled
available actions. Okay. So basically whatwe're doing here is since our
initial state is one. We're going to checkour row number one. Okay, this
isour own number one. Okay. This is wrong number zero. This is zero
numberone and so on. So we're going to check
the rownumber one and we're
going to find the values which are greaterthan or equal to 0 because
these values basically The nodes thatwe can travel to now if you select
minus 1 you can Traverse 2-1. Okay, I explainedthis earlier the - one
represents all the nodesthat we can travel to but we can travel to these
nodes. Okay. So basically over herea checking all the values which are
equal to 0 or greater than 0 thesewill be our available actions. So if
our initial state is onewe can travel to other states whose value is
equal to 0 or greater than 0 and this is storedin this variable called.
All available act right now. This will basically getthe available
actions in the current state. Okay. So we're just storingthe possible
actions in this availableact variable over here. So basically over here
since our initial state isone we're going to find out the next possible
Stateswe can go to okay that is storedin the available act variable. Now
next is this functionchooses at random
which action to be
performedwithin the range. So if you remember over here, so guys
initially weare in stage number. Okay are available actions is to go to
stage number3 or stage number five. Sorry room number3 or room number 5.
Okay. Now randomly, we needto choose one room. So for that usingthis
line of code, okay. So here we are randomly goingto choose one of the
actions from the available actthis available act. Like I said earlier
storesall our possible actions. Okay from the initial State. Okay. So
once it chooses an actionis going to store it in next action, so guys
this action will Present the next available action totake now next is
our Q Matrix. Remember this formulathat we used. So guys this
formulathat we use is what we are going to calculatein the next few
lines of code. So in this block of code, which is executingand Computing
the value of Q. Okay, this is our formulafor computing the value of Q
current state Karma action. Our current state Karma actiongamma into the
maximum value. So here basically we're going to calculatethe maximum
index meaning
that To be going to check which of the possibleactions
will give us the maximum Q value read if you remember in our explanation
over herethis value over here Max Q or five comma 1 5 comma 4and 5
comma 5 we had to choose a maximum Q value that we get from these three.
So basically that's exactly what we're doingin this line of code, the
calculating the indexwhich gives us the maximum value after we finish
Computingthe value of Q will just have to update our Matrix. After that,
we'll beupdating the Q value and will be choosinga new initial State.
Okay. So this is the update functionthat is defined over here. Okay. So
I've just calledthe function over here. So guys this whole set of
codewill just calculate the Q value. Okay. This is exactly
what we didin
our examples after that. We have the training phase. So guys remember
the moreyou train an algorithm the better it's going to learn. Okay so
over hereI have provided around 10,000 titrations. Okay. So my range
is10 thousand iterations meaning that my age It will take10,000 possible
scenarios and in go to 10,000 titrationsto find out the best policy. So
you're exactly what I'm doing is I'm choosingthe current state randomly
after that. I'm choosing the availableaction from the current state. So
either I can go to stage3 or straight five then I'm calculating the
next action and then I'm finallyupdating the value in the Q Matrix and
next. We just normalize the Q Matrix. So sometimes in our Q Matrixthe
value might exceed. Okay, let's say it. Heated to 500 600 sothat time
you want to normalize The Matrix. Okay, we want to bringit down a little
bit. Okay, because larger numberswe won't be able to understand and
computation would bevery hard on larger numbers. That's why weperform
normalization. You're taking your calculatedvalue and you're dividing it
with the maximum Q value in 200. All right, so youare normalizing it
over here. So guys, this isthe testing phase. Okay here you will just
randomlyset a current state and you want given any other data because
you've alreadytrained our model. Okay, you're To givea Garden State then
you're going to tell your agentthat listen you're in room. Number one.
Now. You need to goto room number five. Okay, so he has to figure out
how to go to room number 5because we have trained him now. All right. So
here we have setthe current state to one and we need to make surethat
it's not equal to 5 because 5 is the end goal. So guys this is the same
Loopthat we executed earlier. So we're going to dothe same I trations
again now if I run this entire code,let's look at the result. So our
current statehere we've chosen as one. Okay and And if we goback to our
Matrix, you can see that there isa direct link from 1 to 5, which means
that the route that the agent shouldtake is one to five. Okay directly.
You should go from 1 to 5 because it will getthe maximum reward like
that.
Okay. Let's see if that's happening. So if I run this it should
giveme a direct path from 1 to 5. Okay, that's exactlywhat happened. So
this is the selected pathso directly from one to five it went and it
calculatedthe entire Q Matrix. Works for me. So guys this is exactlyhow
it works. Now. Let's try to setthe initial stage as that's a to so if I
set the initial stage asto and if I try to run the code, let's see the
path that it gives sothe selected path is 2 3 4 5 now chose this path
because it's givingus the maximum reward from this path. Okay. This is
the Q Matrixthat are calculated and this is the selected path. All
right, so guys with this wecome to the end of this demo. So basically
what we didwas we just placed an agent in a room random room and we ask
it to Traverse through and reachto the end room, which is room number
five. So basically we trainedour agent and we made sure that it went
through allthe possible paths. to calculate the bestpath the for a robot
and environment is a placewhere it has been put to use. Now. Remember
this reward isitself the agent for example an automobile Factorywhere a
robot is used to move materials from one place to another nowthe task we
discussed just now have a property in common. Now, these tasks
involveand environment and expect the agent to learnfrom the
environment. Now, this is where traditionalmachine learning phase and
hence the needfor reinforcement learning now, it is good to have
Establishoverview of the problem that is to be solvedusing the Q
learning or the reinforcement learning. So it helps to definethe main
components of a reinforcementlearning solution.
That is the agent
environmentaction rewards and States. So let's suppose we are to build a
few autonomous robots foran automobile building Factory. Now, these
robots will helpthe factory personal by conveying themthe necessary
parts that they would needin order to pull the car. Now these
differentparts are located at Nine different positions within the
factory warehousethe car part include the chassis Wheels dashboard the
engine and so on andthe factory workers have prioritized the location
that contains the body or the chassis to bethe topmost but they have
provided the prioritiesfor other locations as well, which will look into
the moment. Now these locations within the factory looksomewhat like
this. So as you can see here,we have L1 L2 L3 all of these stations. Now
one thing youmight notice here that there are little obstacleprison in
between the locations. So L6 is the toppriority location that contains
the chassisfor preparing the car bodies. Now the task isto enable the
robots so that they can findthe shortest route from any given location
toanother location on their own. Now the agents in this case arethe
robots the environment is the automobile factorywarehouse the let's talk
about the state's the states. Are the location in whicha particular
robot is present in the particularinstance of time which will denote it
statesthe machines understand numbers rather than let us so let's mapthe
location codes to number. So as you can see here, we have map location
l1 to this t 0 L 2 and 1 and so on we have L8 asstate 7 + L line at
state. So next what we're going to talkabout are the actions. So in our
example, the action will be the directlocation that a robot can. Call
from a particular location, right consider a robot that is a tel to
locationand the Direct locations to which it can moveour L5 L1 and L3.
Now the figure here may comein handy to visualize this now as you might
have alreadyguessed the set of actions here is nothing but the set of
all possible states of the robot for each locationthe set of actions
that a robot can takewill be different. For example, the setof actions
will change if the robot is. An L1 rather than L2. So if the robot is in
L1,
it can only go to L4 and L 2 directly now that we are done with the
statesand the actions. Let's talk about the rewards. So the states
arebasically zero one two, three four and theactions are also 0 1 2 3 4
up till 8:00. Now, the rewards nowwill be given to a robot. If a
location which is the stateis directly reachable from a particular
location. So let's take an examplesuppose l Lane is directly reachable
from L8. Right? So if a robot goes from LAto align and vice versa, it
will be rewarded by one and if a location isnot directly reachable from a
particular equation. We do not give any rewarda reward of 0 now the
reward is just a number and nothing else it enablesthe robots to make
sense of the movements helping them in deciding what locationsare
directly reachable and what are not nowwith this Q. We can construct a
reward tablewhich contains all the required. Use mapping betweenall
possible States. So as you can see herein the table the positions which
are marked greenhave a positive reward. And as you can see here, we have
all the possible rewardsthat a robot can get by moving in between the
different states. Now comes aninteresting decision. Now remember that
the factoryadministrator prioritized L6 to be the topmost. So how do we
incorporate thisfact in the above table now, this is done by
associatingthe topmost priority location with a very high reward. The
usual ones so let's put 999 in the cell L 6 commaand six now the table
of rewards with a higher reward for the topmost locationlooks something
like this. We have not formally definedall the vital components for the
solution. We are aiming forthe problem discussed now, we will shift
gearsa bit and study some of the fundamental concepts that Prevail in
the worldof reinforcement learning and q-learning the firstof all we'll
start with the Bellman equation nowconsider the following Square.
Rooms,
which is analogous to the actual environmentfrom our original problem.
But without the barriers nowsuppose a robot needs to go to the room
marked in the green from its current position ausing the specified
Direction. Now, how can we enable the robotto do this programmatically
one idea would be introducedsome kind of a footprint which the robot
will be ableto follow now here a constant value is specifiedin each of
the rooms, which will comealong the robots way if it follows the
directionsby Fight about now in this way if it starts at location a it
will be able to scanthrough this constant value and will move
accordingly but this will only work if the direction is prefix and the
robot always starts at the location a nowconsider the robot starts at
this location ratherthan its previous one. Now the robotnow sees
Footprints in two different directions. It is therefore unableto decide
which way to go in order to get the destinationwhich is the Green Room.
It happens. Primarily because the robotdoes not have a way to remember
the directions to proceed. So our job now is to enablethe robot with a
memory. Now, this is where the Bellmanequation comes into play. So as
you can see here, the main reasonof the Bellman equation is to enable
the rewardwith the memory. That's the thingwe're going to use. So the
equation goessomething like this V of s gives maximum a rof s comma a
plus gamma of vs - where s is a particular stateWhich is a room is the
Action Movingbetween the rooms as - is the state to whichthe robot goes
from s and gamma is the discount Factor now we'll getinto it in a moment
and obviously R of s commaa is a reward function which takes a state as
an actiona and outputs the reward now V of s is the value of beingin a
particular state which is the footprint now we consider allthe possible
actions and take the onethat yields the maximum value. Now there is one
constraint. However regardingthe value footprint that is the room marked
in the yellow justbelow the Green Room. It will always havethe value of
1 to denote that is one of the nearest roomadjacent to the green room.
Now. This is also to ensurethat a robot gets a reward when it goes from a
yellow roomto The Green Room.
Let's see how to makesense of the
equation which we have here. So let's assumea discount factor of 0.9 as
remember gamma isthe discount value or the discount Factor. So let's
Take a 0.9. Now for the room, which is Mark just below the oneor the
yellow room, which is the Aztec Markfor this room. What will be the V of
s that is the value of beingin a particular state? So for this V of
swould be something like maximum of a will take 0 which is the initialof
our s comma. Hey plus 0.9which is gamma into 1 that gives us zero
pointnine now here the robot will not get any reward for Owing to a
state marked in yellow hence the IR scomma a is 0 here but the robot
knows the valueof being in the yellow room. Hence V of s Dash isone
following this for the other states. We should get 0.9 then again, if we
put 0.9 in this equation, we get 0.81 then zero pointseven to nine and
then we again reached the starting point. So this is how the table looks
withsome value Footprints computer. From the Bellman equation now a
couple of thingsto notice here is that the max functionhas the robot to
always choose the state that gives it the maximum valueof being in that
state now the discount Factorgamma notifies the robot about how far it
isfrom the destination.
This is typically specified bythe developer of
the algorithm. That would be installedin the robot. Now, the other
states can alsobe given their respective values in a similar way. So as
you can see here the boxesInto the green one have one and if we move
away from one weget 0.9 0.8 1 0 1 7 to 9. And finally we reach0.66 now
the robot now can precede its way through the Green Room utilizingthese
value Footprints event if it's droppedat any arbitrary room in the given
location now, if a robot Lance up inthe highlighted Sky Blue Area, it
will still findtwo options to choose from but eventuallyeither of the
parties. It's will be good enoughfor the robot to take because Auto V
the value Footprintsare not only that out. Now one thing to note is that
the Bellman equation is oneof the key equations in the world of
reinforcementlearning and Q learning. So if we think realistically
oursurroundings do not always work in the way we expectthere is always a
bit of stochastic Cityinvolved in it. So this appliesto robot as well.
Sometimes it might so happen that the robotsMachinery got corrupted.
Sometimes the robot makes comeacross some hindrance on its way which may
not be knownto it beforehand. Right and sometimes evenif the robot
knows that it needs to takethe right turn it will not so how do we
introducethis to cast a city in our case now here comesthe Markov
decision process now consider the robot iscurrently in the Red Room and
it needs to goto the green room. Now. Let's now considerthe robot has a
slight chance of dysfunctioningand might take the left or the right or
the bottom. On instead updatingthe upper turn in order to get to The
Green Roomfrom where it is now, which is the Red Room. Now the question
is, how do we enable the robotto handle this when it is out in the given
environment right. Now, this is a situation where the decision making
regarding which turn isto be taken is partly random and partly another
controlof the robot now partly random because we are not sure when
exactly the robot minddysfunctional and partly under the control of the
robot because it is stillMaking a decision of taking a turn right on its
own and with the helpof the program embedded into it.
So a Markov
decision process is a discrete timestochastic Control process. It
provides a mathematicalframework for modeling decision-making in
situations where the outcomesare partly random and partly under
controlof the decision maker. Now we need to give this concepta
mathematical shape most likely an equation which then can be
takenfurther now you might be Price that we can do thiswith the help of
the Bellman equationwith a few minor tweaks. So if we have a look at the
original Bellman equationV of X is equal to maximum of our s comma a
plusgamma V of s stash what needs to be changedin the above equation so
that we can introducesome amount of Randomness here as long as we are
not sure when the robot might not takethe expected turn. We are then
also not surein which room it might end up in which is nothingbut the
room it. Moves from its currentroom at this point according to the
equation. We are not sure of the S stash which is the next stateor the
room, but we do know all the probableturns the reward might take now in
order to incorporate each of this probabilitiesinto the above equation.
We need to associatea probability with each of the turns toquantify the
robot if it has got any experts it ischance of taking this turn now if
we do, so We get PS is equal to maximumof our s comma a plus gamma into
summation of s - PS comma a comma s stash into Vof his stash now the PS
a-- and a stash is the probability of moving from room sto establish
with the action a and the submissionhere is the expectation of the
situation thatthe robot in curse, which is the randomness now, let's
take a lookat this example here. So when We associatethe probabilities
to each of these Stones. We essentially meanthat there is an 80% chance
that the robot willtake the upper turn. Now, if you put allthe required
values in our equation, we get V of s is equalto maximum of our of s
comma a + comma of 0.8 into Vof room up plus 0.1 into V of room down
0.03into a room of V of from left plus 0.03 into Vo Right now note that
the value Footprintswill not change due to the fact that we are
incorporatingstochastic Ali here. But this time wewill not calculate
those values Footprints instead. We will let the robotto figure it out.
Now up until this point. We have not consideredabout rewarding the robot
for its action of goinginto a particular room. We are only watering the
robot when it getsto the destination now, ideally there should be a
reward for each action the robottakes to help it better as Assess the
qualityof the actions, but there was neednot to be always be the same
but it is much betterthan having some amount of reward for the
actionsthan having no rewards at all. Right and this idea is known asthe
living penalty in reality. The reward systemcan be very complex and
particularly modelingsparse rewards is an active area of research in the
domainof reinforcement learning. So by now we havegot the equation
which we have a so what? To do is now transitionto Q learning. So this
equation givesus the value of going to a particular Statetaking the
stochastic city of the environment into account. Now, we have also
learnedvery briefly about the idea of living penalty which deals with
associatingeach move of the robot with a reward soQ learning processes
and idea of assessingthe quality of an action that is taken to moveto a
state rather than determining the possiblevalue of the state which is
being movedto So earlier we had 0.8 into V of s 1 0.03 into Vof S 2 0
point 1 into V of S 3 and so on now if you incorporate the ideaof
assessing the quality of the action for movingto a certain state so the
environmentwith the agent and the quality of the actionwill look
something like this. So instead of 0.8 V of s 1 will have q of s1 comma a
one will have q of S 2 comma 2 You of S3 not the robot now hasfour
different states to choose from and along with that. There are four
different actions also for the currentstate it is in so how do we
calculate Q of s comma a that is the cumulative qualityof the possible
actions the robot might take solet's break it down. Now from the
Now, from the equation V(s) = max_a [ R(s, a) + γ Σ_s' P(s, a, s') · V(s') ], if we discard the maximum function, we are left with R(s, a) + γ Σ_s' P(s, a, s') · V(s'). Essentially, in the equation that produces V(s) we are considering all possible actions and all possible states from the current state the robot is in, and then taking the maximum value caused by taking a certain action, whereas the expression without the maximum produces a value footprint for just one possible action. In fact, we can think of it as the quality of that action:

Q(s, a) = R(s, a) + γ Σ_s' P(s, a, s') · V(s')

Now that we have an equation to quantify the quality of a particular action, we are going to make a little adjustment. We can say that V(s) is the maximum of all the possible values of Q(s, a). Let's utilize this fact and replace V(s') with a function of Q:

Q(s, a) = R(s, a) + γ Σ_s' P(s, a, s') · max_a' Q(s', a')

So the equation for V has now turned into an equation for Q, which is the quality.
But why would we do that? This is done to ease our calculations, because now we have only one function, Q, which is also the core of the Q-learning algorithm: we have only one function Q to calculate, and R(s, a) is a quantified metric which produces the reward for moving to a certain state. The qualities of the actions are called the Q-values, and from now on we will refer to the value footprints as the Q-values. An important piece of the puzzle is the temporal difference. Temporal difference is the component that will help the robot calculate the Q-values with respect to the changes in the environment over time.
So consider that our robot is currently in the marked state and it wants to move to the upper state. One thing to note here is that the robot already knows the Q-value of making that action, that is, of moving to the upper state, and we know that the environment is stochastic in nature, so the reward the robot gets after moving to the upper state might be different from an earlier observation. So how do we capture this change? With the temporal difference: we calculate the new Q(s, a) with the same formula and subtract the previously known Q(s, a) from it:

TD(s, a) = ( R(s, a) + γ max_a' Q(s', a') ) − Q(s, a)

The equation we just derived gives the temporal difference in the Q-values, which further helps to capture the random changes the environment may impose.
So the new Q(s, a) is updated as follows:

Q_t(s, a) = Q_{t-1}(s, a) + α · TD_t(s, a)

Here α is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment; Q_t(s, a) is the current Q-value and Q_{t-1}(s, a) is the previously recorded Q-value. If we replace TD(s, a) with its full form, we get:

Q_t(s, a) = Q_{t-1}(s, a) + α · ( R(s, a) + γ max_a' Q(s', a') − Q_{t-1}(s, a) )
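As a minimal sketch of this update rule in numpy (the state indices and the reward below are illustrative placeholders; gamma and alpha use the values adopted later in the implementation):

import numpy as np

gamma, alpha = 0.75, 0.9            # discount factor and learning rate
Q = np.zeros((9, 9))                # Q-values for 9 states x 9 actions

s, a, s_next, reward = 0, 1, 1, 1   # placeholder transition (s, a) -> s_next with reward R(s, a)

# Temporal difference: new estimate minus previously recorded Q-value
td = reward + gamma * Q[s_next].max() - Q[s, a]

# Q-learning / Bellman update: Q_t(s, a) = Q_{t-1}(s, a) + alpha * TD
Q[s, a] += alpha * td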
Now that we have all the little pieces of Q-learning together, let's move forward to its implementation. This is the final equation of Q-learning, right? So let's see how we can implement it and obtain the best path for any robot to take. To implement the algorithm, we need to understand the warehouse layout and how it can be mapped to different states. So let's start by recollecting the sample environment: as you can see here, we have locations L1 through L9, and we also have certain borders. First of all, let's map each of these warehouse locations to numbers, or states, so that it will ease our calculations. What I'm going to do is create a new Python 3 file in the Jupyter notebook and name it something like q_learning_numpy. So let's define the states, but before that we need to import numpy, because we're going to use numpy throughout, and initialize the parameters, that is, the gamma and alpha parameters: gamma is 0.75, which is the discount factor, whereas alpha is 0.9, which is the learning rate. Next, we define the states and map them to numbers; as I mentioned earlier, L1 is 0 and so on, so we have the states defined in numerical form.
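A rough sketch of that setup could look like this (the variable names are illustrative; the gamma and alpha values and the L1 to L9 numbering follow what is described above):

import numpy as np

# Initialize the parameters
gamma = 0.75   # discount factor
alpha = 0.9    # learning rate

# Map each warehouse location to a state number
location_to_state = {
    'L1': 0, 'L2': 1, 'L3': 2,
    'L4': 3, 'L5': 4, 'L6': 5,
    'L7': 6, 'L8': 7, 'L9': 8,
}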
The next step is to define the actions which, as mentioned above, represent the transitions to the next state; as you can see here, we have an array of actions from 0 to 8.
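A minimal version of that, assuming one action per target state:

# Define the actions: one per possible target state
actions = [0, 1, 2, 3, 4, 5, 6, 7, 8]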
Now, what we're going to do is define the reward table. As you can see, this is the same matrix that we created, the one I showed you just now. If you understood it correctly, there isn't any real barrier limitation as depicted in the image: a transition we want to discourage is still allowed, but its reward is kept at 0 to discourage that path, or, in a tougher situation, we add a -1 there so that it gets a negative reward.
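The exact matrix is not reproduced here, so the adjacency below is an assumed reconstruction of a plausible L1 to L9 layout, one that is consistent with the L9, L8, L5, L2, L1 route obtained later; treat the 0/1 pattern as an assumption:

import numpy as np

# Reward table: rewards[i][j] = 1 if state j is directly reachable from state i, else 0
# (assumed layout; rows/columns are states 0..8, i.e. locations L1..L9)
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],   # L1
    [1, 0, 1, 0, 1, 0, 0, 0, 0],   # L2
    [0, 1, 0, 0, 0, 1, 0, 0, 0],   # L3
    [0, 0, 0, 0, 0, 0, 1, 0, 0],   # L4
    [0, 1, 0, 0, 0, 0, 0, 1, 0],   # L5
    [0, 0, 1, 0, 0, 0, 0, 0, 0],   # L6
    [0, 0, 0, 1, 0, 0, 0, 1, 0],   # L7
    [0, 0, 0, 0, 1, 0, 1, 0, 1],   # L8
    [0, 0, 0, 0, 0, 0, 0, 1, 0],   # L9
])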
So in the code snippet above, as you can see, we took each of the states and put 1s for the states that are directly reachable from it. If you refer to that reward table once again, the construction will be easy to understand. One thing to note here is that we did not consider the top-priority location, L6, yet. We will also need an inverse mapping from the states back to their original locations, and it will be cleaner to have it when we reach the deeper parts of the algorithm. So for that, what we're going to do is build the inverse map, from state to location: we take each distinct state and location pair and convert it back.
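A sketch of that inverse mapping, assuming the location_to_state dictionary defined above:

# Map state numbers back to their warehouse locations
state_to_location = {state: location for location, state in location_to_state.items()}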
Now, what we'll do is define a function called get_optimal_route, which takes a start location and an end location. Don't worry that the code is big; I'll explain each and every bit of it. The get_optimal_route function takes two arguments, the starting location in the warehouse and the end location in the warehouse respectively, and it returns the optimal route for reaching the end location from the starting location, in the form of an ordered list containing the location letters. We'll start by defining the function and initializing the Q-values to all zeros, but before that we need to copy the reward matrix to a new one, rewards_new, and then get the ending state corresponding to the ending location. With this information we will automatically set the priority of the given ending state to the highest value; we do not define it by hand, we simply set the reward of the ending state to 999. Then we initialize the Q-values to zero.
In the learning process, as you can see here, we loop for i in range(1000), and in each iteration we pick up a state at random using np.random.randint. To traverse through the neighbouring locations in the maze, we iterate through the new reward matrix and collect the actions whose reward is greater than 0; after that, we pick an action at random from this list of playable actions, which leads us to the next state. We then compute the temporal difference, TD, which is the reward plus gamma times the Q-value of the best action in the next state (using np.argmax over the Q-values of the next state) minus the Q-value of the current state. We then update the Q-values using the Bellman equation, as you can see here. After that we initialize the optimal route with the starting location. At this point we do not yet know the next location, so we initialize it with the value of the starting location. We also do not know the exact number of iterations needed to reach the final location, hence a while loop is a good choice for the iteration: we fetch the starting state, fetch the action with the highest Q-value pertaining to that state, go to the index of the next state, and, since we need the corresponding letter, we use the state_to_location mapping we just built. After that we update the starting location for the next iteration, and finally we return the route.
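Putting the pieces together, a runnable sketch of the function described above might look like this (the variable names and the rewards matrix layout are assumptions; the 999 goal reward, the 1000 iterations, and the gamma and alpha values follow what is described above):

import numpy as np

gamma = 0.75   # discount factor
alpha = 0.9    # learning rate

location_to_state = {'L1': 0, 'L2': 1, 'L3': 2, 'L4': 3, 'L5': 4,
                     'L6': 5, 'L7': 6, 'L8': 7, 'L9': 8}
state_to_location = {state: location for location, state in location_to_state.items()}
actions = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Assumed adjacency / reward matrix (1 = directly reachable)
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
])

def get_optimal_route(start_location, end_location):
    # Copy the reward matrix and give the ending state the highest priority (999)
    rewards_new = np.copy(rewards)
    ending_state = location_to_state[end_location]
    rewards_new[ending_state, ending_state] = 999

    # Initialize the Q-values to zero
    Q = np.zeros((9, 9))

    # Q-learning process
    for i in range(1000):
        current_state = np.random.randint(0, 9)            # pick a state at random
        playable_actions = [j for j in range(9)             # neighbours with reward > 0
                            if rewards_new[current_state, j] > 0]
        next_state = np.random.choice(playable_actions)     # pick an action at random
        # Temporal difference
        TD = (rewards_new[current_state, next_state]
              + gamma * Q[next_state, np.argmax(Q[next_state])]
              - Q[current_state, next_state])
        # Bellman / Q-learning update
        Q[current_state, next_state] += alpha * TD

    # Extract the route greedily from the learned Q-values
    route = [start_location]
    next_location = start_location
    while next_location != end_location:
        starting_state = location_to_state[next_location]
        next_state = int(np.argmax(Q[starting_state]))
        next_location = state_to_location[next_state]
        route.append(next_location)
    return route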
Let's take a starting location of L9 and an end location of L1 and see what path we actually get. As you can see here, we get L9, L8, L5, L2 and L1. And if you have a look at the image, starting from L9 and heading to L1, the route L9, L8, L5, L2, L1 is indeed the path that gives the maximum value, the maximum reward, for the robot.
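With those definitions, the call described here would be roughly:

print(get_optimal_route('L9', 'L1'))
# e.g. ['L9', 'L8', 'L5', 'L2', 'L1']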
So now we have come to the end of this Q-learning session, and I hope you got to know what exactly Q-learning is, with the analogy all the way from the number of rooms. I hope the example and the analogy I took were good enough for you to understand Q-learning: the Bellman equation, how to make the changes to the Bellman equation, how to create the reward table and the Q-values, how to update the Q-values using the Bellman equation, and what alpha and gamma do.
Thanks for reading!
onlyharish
golden knowledge
If you have any questions, please let me know.