Data Science Full Course - Learn Data Science Beginners Final day




We have the support, we have the confidence, the lift, the leverage and the conviction. So guys, that's it for this session. That is how you create association rules using the Apriori algorithm. It's a real goldmine which helps a lot in the marketing business. It runs on the principle of Market Basket analysis, which is exactly what big companies like Walmart, Reliance, Target and even IKEA do, and I hope you got to know what exactly association rule mining is, what lift, confidence and support are, and how to create association rules. So guys, reinforcement learning is a part of machine learning where an agent is put in an environment and it learns to behave in this environment by performing certain actions. Okay, so it basically performs actions and it either gets rewards for those actions or it gets a punishment, and by observing the reward which it gets from those actions, reinforcement learning is all about taking an appropriate action in order to maximize the reward in a particular situation. So guys, in supervised learning the training data comprises the input and the expected output, and so the model is trained with the expected output itself. But when it comes to reinforcement learning, there is no expected output here. The reinforcement agent decides what actions to take in order to perform a given task in the absence of a training data set. It is bound to learn from its experience itself. Alright. So reinforcement learning is all about an agent who's put in an unknown environment, and he's going to use a trial-and-error method in order to figure out the environment and then come up with an outcome. Okay. Now, let's look at reinforcement learning with an analogy.



So consider a scenario wherein a baby is learning how to walk. The scenario can go about in two ways. Now in the first case the baby starts walking and makes it to the candy. Here the candy is basically the reward it's going to get, so since the candy is the end goal, the baby is happy. It's positive. Okay, so the baby is happy and it gets rewarded a set of candies. Now another way in which this could go is that the baby starts walking but falls due to some hurdle in between. The baby gets hurt and it doesn't get any candy, and obviously the baby is sad. So this is a negative reward. Okay, or you can say this is a setback. So just like how we humans learn from our mistakes by trial and error, reinforcement learning is also similar. Okay, so we have an agent, which is basically the baby, and a reward, which is the candy over here. Okay, and with many hurdles in between, the agent is supposed to find the best possible path to reach the reward. So guys, I hope you all are clear with the reinforcement learning analogy. Now, let's look at the reinforcement learning process. So generally a reinforcement learning system has two main components, right? The first is an agent and the second one is an environment. Now in the previous case, we saw that the agent was the baby and the environment was the living room wherein the baby was crawling. Okay. The environment is the setting that the agent is acting on, and the agent over here represents the reinforcement learning algorithm.





So guys, the reinforcement learning process starts when the environment sends a state to the agent, and then the agent will take some actions based on the observations. In turn, the environment will send the next state and the respective reward back to the agent. The agent will update its knowledge with the reward returned by the environment and it uses that to evaluate its previous action. So guys, this loop keeps continuing until the environment sends a terminal state, which means that the agent has accomplished all his tasks and he finally gets the reward. Okay. This is exactly what was depicted in this scenario. So the agent keeps climbing up ladders until he reaches his reward. To understand this better, let's suppose that our agent is learning to play Counter-Strike. Okay. So let's break it down. Now initially the RL agent, which is basically the player, player 1, let's say it's player one who is trying to learn how to play the game. Okay. He collects some state from the environment. Okay. This could be the first state of Counter-Strike. Now based on the state, the agent will take some action. Okay, and this action can be anything that causes a result. So if the player moves left or right, it's also considered as an action. Okay, so initially the action is going to be random because obviously the first time you pick up Counter-Strike, you're not going to be a master at it. So you're going to try with different actions and you just want to pick a random action in the beginning. Now the environment is going to give a new state. So after clearing that stage, the environment is now going to give a new state to




the agent or to the player. So maybe he's crossed stage one; now he's in stage 2. So now the player will get a reward R1 from the environment because it cleared stage 1. So this reward can be anything. It can be additional points or coins or anything like that. Okay. So basically this loop keeps going on until the player is dead or reaches the destination. Okay, and it continuously outputs a sequence of states, actions and rewards. So guys, this was a small example to show you how the reinforcement learning process works. So you start with an initial state, and once a player clears that state he gets a reward. After that the environment will give another state to the player, and after he clears that state he's going to get another reward, and it's going to keep happening until the player reaches his destination. Alright, so guys, I hope this is clear. Now, let's move on and look at the reinforcement learning definitions. So there are a few concepts that you should be aware of while studying reinforcement learning. Let's look at those definitions over here. So first we have the agent. Now an agent is basically the reinforcement learning algorithm that learns from trial and error. Okay, so an agent takes actions, like for example a soldier in Counter-Strike navigating through the game. That's also an action. Okay, if he moves left or right, or if he shoots at somebody, that's also an action. Okay. So the agent is responsible for taking actions in the environment. Now the environment is the whole Counter-Strike game. Okay. It's basically the world through which the agent moves.
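To make that state-action-reward loop concrete, here is a minimal sketch of the agent-environment interaction in Python. This is not the code from the demo; the toy environment, its reset()/step() interface and the random agent are assumptions purely for illustration.

```python
import random

class SimpleEnv:
    """A toy environment with 5 stages; reaching stage 5 is the terminal state."""
    def reset(self):
        self.stage = 1
        return self.stage                      # initial state sent to the agent

    def step(self, action):
        self.stage += 1                        # any action moves us to the next stage
        reward = 10                            # reward for clearing a stage
        done = self.stage >= 5                 # terminal state reached?
        return self.stage, reward, done        # next state, reward, terminal flag

env = SimpleEnv()
state = env.reset()
done = False
while not done:                                # the loop continues until the terminal state
    action = random.choice(["left", "right", "shoot"])   # initially the action is random
    state, reward, done = env.step(action)
    print(f"action={action}, new state={state}, reward={reward}")
```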






The environment takes the agent's current state and action as input, and it returns the agent's reward and its next state as output. Alright, next we have action. Now, all the possible steps that an agent can take are called actions. So like I said, it can be moving right, left, or shooting, or any of that. Alright, then we have state. Now state is basically the current condition returned by the environment. So whichever state you are in, if you are in state 1 or if you're in state 2, that represents your current condition. All right. Next we have reward. A reward is basically


an instant return from the environment to appraise your last action. Okay, so it can be anything like coins or it can be additional points. So basically a reward is given to an agent after it clears the specific stages. Next we have policy. A policy is basically the strategy that the agent uses to find out his next action based on his current state; policy is just the strategy with which you approach the game. Then we have value. Now value is the expected long-term return with discount. So value and action value can be a little bit confusing for you right now, but as we move further, you'll understand what I'm talking about. Okay. So value is basically the long-term return that you get with discount. Okay, discount I'll explain in the further slides. Then we have action value. Now action value is also known as Q value. Okay. It's very similar to value except that it takes an extra parameter, which is the current action. So basically here you'll find out the Q value depending on the particular action that you took. All right. So guys, don't get confused with value and action value. We'll look at examples in the further slides and you will understand this better. Okay. So guys, make sure that you're familiar with these terms because you'll be seeing a lot of them in the further slides. All right. Now before we move any further, I'd like to discuss a few more concepts. Okay. So first we will discuss reward maximization.





So if you haven't already realized it, the basic aim of the RL agent is to maximize the reward. Now, how does that happen? Let's try to understand this in depth. So the agent must be trained in such a way that he takes the best action so that the reward is maximum, because the end goal of reinforcement learning is to maximize your reward based on a set of actions. So let me explain this with a small game. Now in the figure you can see there is a fox, there's some meat, and there's a tiger. So our agent is basically the fox and his end goal is to eat the maximum amount of meat before being eaten by the tiger. Now since the fox is a clever fellow, he eats the meat that is closer to him rather than the meat which is closer to the tiger. Now this is because the closer he is to the tiger, the higher are his chances of getting killed. So because of this, the rewards which are near the tiger, even if they are bigger meat chunks, will be discounted. So this is exactly what discounting means. So our agent is not going to eat the meat chunks which are closer to the tiger because of the risk. All right, now, even though the meat chunks might be larger, he does not want to take the chance of getting killed. Okay. This is called discounting. Okay. This is where you discount the reward: you improvise and you just eat the meat chunks which are closer to you instead of taking risks and eating the meat which is closer to your opponent. All right. Now the discounting of reward works based on a value called gamma. We'll be discussing gamma in our further slides, but in short the value of gamma is between 0 and 1. Okay. So the smaller the


gamma, the larger is the discount value. Okay. So if the gamma value is lesser, it means that the agent is not going to explore and he's not going to try and eat the meat chunks which are closer to the tiger. Okay, but if the gamma value is closer to 1, it means that our agent is actually going to explore and it's going to try and eat the meat chunks which are closer to the tiger. All right, now, I'll be explaining this in depth in the further slides. So don't worry if you haven't got a clear concept yet, but just understand that reward maximization is a very important step when it comes to reinforcement learning, because the agent has to collect maximum rewards by the end of the game. All right. Now, let's look at another concept, which is called exploration and exploitation. So exploration, like the name suggests, is about exploring and capturing more information about an environment. On the other hand, exploitation is about using the already known, exploited information to heighten the rewards. So guys, consider the fox and tiger example that we discussed. Now here the fox eats only the meat chunks which are close to him, but he does not eat the meat chunks which are closer to the tiger. Okay, even though they might give him more rewards.



He does not eat them. If the fox only focuses on the closest rewards, he will never reach the big chunks of meat. Okay, this is what exploitation is about: you're just going to use the currently known information and you're going to try and get rewards based on that information. But if the fox decides to explore a bit, it can find the bigger reward, which is the big chunks of meat. This is exactly what exploration is. So the agent is not going to stick to one corner; instead, he's going to explore the entire environment and try to collect bigger rewards. All right, so guys, I hope you all are clear with exploration and exploitation. Now, let's look at the Markov decision process. So guys, this is basically a mathematical approach for mapping a solution in reinforcement learning. In a way, the purpose of reinforcement learning is to solve a Markov decision process. Okay. So there are a few parameters that are used to get to the solution. The parameters include the set of actions, the set of states, the rewards, the policy that you're taking to approach the problem, and the value that you get.



Okay, so to sum it up, the agent must take an action A to transition from a start state S to the end state S'. While doing so, the agent will receive a reward R for each action that he takes. So guys, the series of actions taken by the agent defines the policy, or it defines the approach, and the rewards that are collected define the value. So the main goal here is to maximize the rewards by choosing the optimum policy. All right. Now, let's try to understand this with the help of the shortest path problem. I'm sure a lot of you might have gone through this problem when you were in college. So guys, look at the graph over here. Our aim here is to find the best path between A and D. The value that you see on each of these edges basically denotes the reward of traversing that edge. So if I go from A to C, it's going to give me 15 points. Okay. So let's look at how this is done. Now before we move on and look at the problem: in this problem, the set of states is denoted by the nodes, which are A, B, C and D, and the action is to traverse from one node to the other. So if I'm going from A to B, that's an action; similarly A to C, that's an action. Okay, the reward is basically the value represented by each edge over here. All right. Now the policy is basically the path that I choose to reach the destination. So let's say I choose A-C-D, okay, that's one policy in order to get to D, and choosing A-B-D is also a policy. Okay. It's basically how I'm approaching the problem.



So guys, here you can start off at node A and you can take baby steps to your destination. Now initially you're clueless, so you can just take the next possible node which is visible to you. So guys, if you're smart enough, you're going to choose A to C instead of A-B-C-D or A-B-D. All right. So now if you are at node C and you want to traverse to node D, you must again choose a wise path. Alright, you just have to calculate which path has the highest value, or which path will give you the maximum rewards. So guys, this is a simple problem. We just tried to calculate the best path between A and D by traversing through these nodes. So if I traverse A-C-D, it gives me the maximum reward. Okay, it gives me 65, which is more than any other policy would give me. Okay. So if I go A-B-D, it would be 40; when you compare this to A-C-D, A-C-D gives me more reward. So obviously I'm going to go with A-C-D. Okay, so guys, this was a simple problem in order to understand how the Markov decision process works. All right, so guys, I want to ask you a question. What do you think I did here? Did I perform exploration or did I perform exploitation? Now, the policy for the above example is one of exploitation, because we didn't explore the other nodes. Okay. We just selected three nodes and we traversed through them. So that's why this is called exploitation. We must always explore the different nodes so that we can find a more optimal policy. But in this case, obviously A-C-D has the highest reward and we're going with A-C-D, but generally it's not so simple. There are a lot of nodes, there are hundreds of nodes to traverse, and there are like 50, 60 policies. Okay, 50-60 different policies.
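Just to make the idea of comparing policies by total reward concrete, here is a tiny Python sketch. The path totals (65 for A-C-D, 40 for A-B-D) are the ones quoted in this example; the dictionary layout is just for illustration.

```python
# Total reward collected by each candidate policy (path), as quoted in the example.
policy_rewards = {
    ("A", "C", "D"): 65,
    ("A", "B", "D"): 40,
}

# Pick the policy with the maximum total reward.
best_policy = max(policy_rewards, key=policy_rewards.get)
print("Optimum policy:", " -> ".join(best_policy), "with reward", policy_rewards[best_policy])
```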

So make sure you explore all the policies and then decide on an optimum policy which will give you a maximum reward. So guys, before we perform the hands-on part, let's try to understand the math behind our demo. Okay. So in our demo we'll be using the Q-learning algorithm, which is a type of reinforcement learning algorithm. Okay, it's simple: it just means that you take the best possible actions to reach your goal or to get the most rewards. All right, let's try to understand this with an example. So guys, this is exactly what we'll be running in our demo, so make sure you understand this properly. Okay. So our goal here is we're going to place an agent in any one of the rooms. Okay. So basically these squares you see here are rooms. Okay, 0 is a room, 4 is a room, 3 is a room, 1 is a room, and 2 is also a room. 5 is basically the area outside the building. All right. So what we're going to do is we're going to place an agent in any one of these rooms, and the goal is to reach outside the building.



Okay, outside the building is room number five. Okay, so these spaces are basically doors, which means that you can go from 0 to 4, you can go from 4 to 3, 3 to 1, 1 to 5, and similarly 3 to 2, but you can't go from 5 to 2 directly. All right, so there is a certain set of rooms that aren't connected directly. Okay. So like I mentioned here, each room is numbered from 0 to 4, the outside of the building is numbered as five, and one thing to note here is that room 1 and room 4 directly lead to room number five. All right. So room number one and four will directly lead out to room number five. So basically our goal over here is to get to room number five. Okay, to set this room as a goal, we'll associate a reward value with each door. Okay. Don't worry, I'll explain what I'm saying. So if you represent these rooms in a graph, this is how the graph is going to look. Okay. So for example from 2, you can go to 3, and then 3 to 1, 1 to 5, which will lead us to our goal. These arrows represent the links between the doors. Now, this is quite understandable. Our next step is to associate a reward value with each of these doors. Okay, so the rooms that are directly connected to our end room, which is room number five, will get a reward of hundred. Okay. So basically our room number one will have a reward of hundred. This is obviously because it's directly connected to 5. Similarly, 4 will also be associated with a reward of hundred because it's directly connected to 5. Okay. So if you go out from 4, it will lead to five. Now the other nodes are not directly connected to 5, so you can't directly go from 0 to 5. Okay. So for these we'll be assigning a reward of zero.



So basically, other doors not directly connected to the target room have a zero reward. Okay, now because the doors are two-way, two arrows are assigned to each room. Okay, you can see two arrows assigned to each room. So basically zero leads to four and four leads back to 0. Now, we have assigned 0 over here because 0 does not directly lead to five, but one directly leads to five and that's why you can see a hundred over here. Similarly, 4 directly leads to our goal state and that's why we've assigned a hundred over here, and obviously 5 to 5 is hundred as well. So here all the direct connections to room number five are rewarded hundred, and all the indirect connections are awarded zero. So guys, in Q-learning the end goal is to reach the state with the highest reward, so that the agent arrives at the goal. Okay. So let me just explain this graph to you in detail. Now these rooms over here, labeled 0 through 5, represent the state an agent is in. So if I say state 1, it means that the agent is in room number one. Similarly, the agent's movement from one room to the other represents the action. Okay. So if I say 1 to 3, it represents an action. All right. So basically the state is represented as a node and the action is represented by these arrows. Okay. So this is what this graph is about: these nodes represent the rooms and these arrows represent the actions. Okay. Let's look at a small example. Let's set the initial state to 2. So my agent is placed in room number two, and he has to travel all the way to room number five. So if I set the initial state to 2, he can travel to state 3. Okay, from three he can either go to one, or he can go back to two, or he can go to four. If he chooses to go to four, it will directly take him to room number 5, okay,




which is our end goal, and even if he goes from room number 3 to 1, it will take him to room number five. So this is how our algorithm works: it's going to traverse different rooms in order to reach the goal room, which is room number 5. Now, let's try and depict these rewards in the form of a matrix. Okay, because we'll be using this R matrix, or the reward matrix, to calculate the Q value or the Q matrix. Okay. We'll see what the Q value is in the next step. But for now, let's see how this reward matrix is calculated. Now the minus ones that you see in the table represent the null values. These -1 values basically mean that wherever there is no link between nodes, it's represented as minus 1. So 0 to 0 is minus 1; 0 to 1, there is no link. Okay, there's no direct link from 0 to 1, so it's represented as minus 1. Similarly 0 to 2, there is no link. You can see there's no line over here, so this is also minus 1. But when it comes to 0 to 4, there is a connection and

we have put a 0, because the reward for a state which is not directly connected to the goal is zero. But if you look at this 1 comma 5, which is basically traversing from node 1 to node 5, you can see the reward is hundred. Okay, that's basically because one and five are directly connected and five is our end goal. So any node which is directly connected to our goal state will get a reward of hundred. Okay. That's why I've put hundred over here. Similarly, if you look at the fourth row over here, I've assigned hundred over here. This is because from 4 to 5 there is a direct connection, and that direct connection gives a hundred reward. Okay, you can see from 4 to 5 there is a direct link. Okay, so from room number four to room number five you can go directly. That's why there's a hundred reward over here. So guys, this is how the reward matrix is made. Alright, I hope this is clear to you all. Okay. Now that we have the reward matrix, we need to create another matrix called the Q matrix. Okay, here you'll store all the Q values that we'll calculate. Now this Q matrix basically represents the memory of what the agent has learned through experience. Okay. So once he traverses from one room to the final room, whatever he's learned is stored in this Q matrix. Okay, in order for him to remember it the next time he traverses, we use this matrix. Okay.
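For reference, here is a minimal sketch of what that reward matrix and the zero-initialized Q matrix might look like in numpy, reconstructed from the door connections described above (0-4, 4-3, 3-1, 3-2, 1-5, 4-5 and 5-5). Treat the exact array as an assumption based on that description.

```python
import numpy as np

# Reward matrix R: rows = current state (room), columns = next state (room).
# -1  -> no door between the rooms
#  0  -> a door exists, but it does not lead to the goal (room 5)
# 100 -> a door that leads directly to the goal
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

# Q matrix: the agent's "memory", initialized to zero (6 states x 6 actions).
Q = np.zeros([6, 6])
```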




It's basically like a memory. So guys, the rows of the Q matrix will represent the current state of the agent, the columns will represent the possible actions, and to calculate the Q value we use this formula. All right, I'll show you what the Q matrix looks like, but first, let's understand this formula. Now this Q value is what we'll be calculating, because we want to fill in the Q matrix. Okay. So this is basically a matrix over here; initially it's all 0, but as the agent traverses from different nodes to the destination node, this matrix will get filled up. Okay. So basically it will be like a memory to the agent. He'll know that, okay, when he traversed using a particular path, he found out that the value was maximum or the reward was maximum over there. So next time he'll choose that path.

This is exactly what the Q matrix is. Okay. Let's go back. Now guys, don't worry about this formula for now, because we'll be implementing this formula in an example in the next slide. Okay, so don't worry about this formula for now, but here just remember that Q basically represents the Q matrix, R represents the reward matrix, and gamma is the gamma value, which I'll talk about shortly; and here you're just finding out the maximum from the Q matrix. So basically the gamma parameter has a range from 0 to 1, so you can have a value of 0.1, 0.3, 0.5, 0.8 and all of that. So if the gamma is closer to zero, it means that the agent will consider only the immediate rewards, which means that the agent will not explore the surroundings. Basically, it won't explore different rooms. It will just choose a particular room and then it will try sticking to it. But if the value of gamma is high, meaning that it's closer to one, the agent will consider future rewards with greater weight. This means that the agent will explore all the possible approaches, or all the possible policies, in order to get to the end goal. So guys, this is what I was talking about when I mentioned exploitation and exploration. All right. So if the gamma value is closer to 1, it basically means that you're actually exploring the entire environment and then choosing an optimum policy. But if your gamma value is closer to zero, it means that the agent will only stick to a certain set of policies and it will calculate the maximum reward based on those policies. Next, we have the Q-learning algorithm that we're going to use to solve this problem. So guys, this is going to look very confusing to y'all, so let me just explain it with an example.
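Before the worked example, here is the formula written out as a small Python helper, just to make the notation concrete: Q(state, action) = R(state, action) + gamma * max(Q(next_state, all actions)). This is a sketch; the function name and arguments are mine, not from the demo code.

```python
import numpy as np

def q_value(R, Q, state, action, gamma=0.8):
    """Q(state, action) = R(state, action) + gamma * max over all actions of Q(next_state, :).

    In this room example the action IS the next room we move to, so next_state = action.
    """
    next_state = action
    return R[state, action] + gamma * np.max(Q[next_state, :])
```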



Okay. We'll see what we're actually going to run in our demo, we'll do the math behind it, and then I'll tell you what this Q-learning algorithm is. Okay, you'll understand it as I'm showing you the example. So guys, in the Q-learning algorithm the agent learns from his experience. Okay, so each episode, which is basically the agent traversing from an initial room to the end goal, is equivalent to one training session, and in every training session the agent will explore the environment and receive some reward until it reaches the goal state, which is five. So the purpose of training is to enhance the brain of our agent. Okay, only if he knows the environment very well




will he know which action to take, and this is why we calculate the Q matrix. Okay, the Q matrix is going to hold the value of traversing from every state to the end state, from every initial room to the end room. Okay, so when we calculate all the values, or how much reward we're getting from each policy, then we know the optimum policy that will give us the maximum reward. Okay, that's why we have the Q matrix. This is very important, because the more you train the agent, the more optimum your output will be. So basically here the agent will not perform exploitation; instead, he'll explore around, go back and forth through the different rooms, and find the fastest route to the goal. All right. Now, let's look at an example. Okay. Let's see how the algorithm works. Okay. Let's go back to the previous slide. Here it says that the first step is to set the gamma parameter. Okay. So let's do that. Now, the first step is to set the value of the learning parameter, which is gamma, and we have randomly set it to zero point eight. Okay. The next step is to initialize the matrix Q to 0. Okay. So we've set matrix Q to 0 over here, and then we will select the initial state. Okay, the third step is to select a random initial state, and here we've selected the initial state as room number one.





Okay. So after you initialize the matrix Q as a zero matrix, from room number one you can either go to room number three or room number five. So if you look at the reward matrix, you can see that from room number one, you can only go to room number three or room number five. The other values are minus 1 here, which means that there is no link from 1 to 0, 1 to 1, 1 to 2 and 1 to 4. So the only possible actions from room number one are to go to room number 3 and to go to room number five. All right. Okay. So let's select room number five. Okay, so from room number one, you can go to 3 and 5 and we have randomly selected five. You could also select three, but for this example, let's select five over here. Now from room five, you're going to calculate the maximum Q value for the next state based on all possible actions. So from room number five, the next state can be room number one, four or five. So you're going to calculate the Q value for traversing 5 to 1, 5 to 4 and 5 to 5, and you're going to find out which has the maximum Q value, and that's how you're going to compute the Q value.



So let's implement our formula. Okay, this is the Q-learning formula. So right now we're traversing from room number one to room number 5. Okay. This is our state. So here I've written Q(1, 5). Okay, one represents our current state, which is room number one. Okay. Our initial state was room number one and we are traversing to room number five. Okay. It's shown in this figure, room number 5. Now for this we need to calculate the Q value. Next in our formula, it says the reward matrix at that state and action. So the reward matrix for 1 comma 5: let's look at 1 comma 5. 1 comma 5 corresponds to a hundred. Okay, so our reward over here will be hundred, so R(1, 5) is basically hundred. Then you're going to add the gamma value. Now the gamma value we have initialized to zero point eight. So that's what we have written over here. And we're going to multiply it with the maximum value that we're going to get for the next state based on all possible actions. Okay. So from 5, the next state is 1, 4 and 5. So if we traverse from five to one, that's what I've written over here; similarly 5 to 4 and 5 to 5, and you're going to calculate the Q value of each. Okay. That's what I mentioned over here. So Q(5, 1), Q(5, 4) and Q(5, 5) are the next possible actions that you can take from state 5.



So R(1, 5) is hundred. Okay, because from the reward matrix, you can see that 1 comma 5 is hundred, and 0.8 is the value of gamma. After that, we will calculate Q of 5 comma 1, 5 comma 4 and 5 comma 5. Like I mentioned earlier, we're going to initialize matrix Q as a zero matrix, so we're setting these values to 0, because initially the agent obviously doesn't have any memory of what is happening. Okay, so he's just starting from scratch. That's why all these values are 0. So Q of 5 comma 1 will obviously be 0, 5 comma 4 would be 0, and 5 comma 5 will also be zero, and the maximum between these is obviously 0. So when you compute this equation, you will get hundred, so the Q value of 1 comma 5 is hundred. So if the agent goes from room number one to room number five, he's going to have a maximum reward or Q value of hundred. All right. Now in the next slide you can see that I've updated the value of Q of 1 comma 5. Okay, it's set to hundred. All right, now similarly, let's look at another example so that you understand this better. So guys, this is exactly what we're going to do in our demo, only it's going to be coded. Okay. I'm just explaining our code right now; I'm just telling you the math behind it. Alright, now let's look at another example. Okay, this time we'll start with a randomly chosen initial state.



Let's say that we've chosen state 3. Okay. So from room 3, you can either go to room number one, two, or four. Randomly we'll select room number one, and from room number one you're going to calculate the maximum Q value for the next state based on all possible actions. So the possible actions from one are to go to 3 and to go to 5. Now we calculate the Q value using this formula, so let me explain this to you once again. Now, 3 comma 1 basically represents that we're in room number three and we are going to room number one. Okay. So this represents our action. Okay. So we're going from 3 to 1, which is our action, and three is our current state. Next we will look at the reward of going from 3 to 1. Okay, if you go to the reward matrix, 3 comma 1 is 0. Now this is because going from 3 to 1 does not lead directly to the goal, which is room number five. Okay, so that's why the reward here is zero. So the value here will be 0. After that we have the gamma value, which is zero point eight, and then we're going to calculate the max of Q(1, 3) and Q(1, 5); out of these, whichever has the maximum value, we're going to use that. Okay, so Q of 1 comma 3 is 0. All right, 0, you can see here 1 comma 3 is 0, and 1 comma 5, if you remember, we just calculated 1 comma 5 in the previous slide. Okay, 1 comma 5 is hundred. So here I'm going to put a hundred. So the maximum here is hundred. So 0.8 into 100 will give us 80, so that's the Q value you're going to get if you traverse from three to one. Okay. I hope that was clear. So now we have traversed from room number three to room number one with a reward of 80. Okay, but we still haven't reached the end goal, which is room number five. So for our next episode the state will be room number one. So guys, like I said, we'll repeat this in a loop, because room number one is not our end goal. Okay, our end goal is room number 5. So now we need to figure out how to get from room number one to room number 5.
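As a quick sanity check of the arithmetic in these two steps, here is the same calculation done with the hypothetical q_value helper sketched earlier (it assumes the R matrix and the all-zero Q matrix from the earlier sketches are already defined):

```python
# Episode step 1: from room 1, take the action "go to room 5".
print(q_value(R, Q, state=1, action=5))   # 100 + 0.8 * max(Q[5, :]) = 100 + 0.8 * 0 = 100.0
Q[1, 5] = 100                             # record what the agent has just learned

# Episode step 2: from room 3, take the action "go to room 1".
print(q_value(R, Q, state=3, action=1))   # 0 + 0.8 * max(Q[1, :]) = 0.8 * 100 = 80.0
```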





So from room number one, you can either go to three or five. That's what I've drawn over here. So if we select five, we know that it's our end goal. Okay. So from room number 5, you then have to calculate the maximum Q value for the next possible actions. So the next possible actions from five are to go to room number one, room number four or room number five. So you're going to calculate the Q value of 5 to 1, 5 to 4 and 5 to 5, find out which is the maximum Q value here, and use that value. All right. So let's look at the formula now. Now again, we're in room number one and we want to go to room number 5. Okay, so that's exactly what I've written here, Q(1, 5). Next is the reward matrix. So the reward of 1 comma 5 is hundred. All right, then we have added the gamma value, which is 0.8. And then we're going to find the maximum Q value from 5 to 1, 5 to 4 and 5 to 5. So this is what we're performing over here. So 5 comma 1, 5 comma 4 and 5 comma 5 are all 0; this is because we initially set all the values of the Q matrix to 0. So you get hundred over here and the matrix remains the same, because we already had calculated Q of 1



comma 5, so the value of 1 comma 5 is already fed to the agent. So when he comes back here, he knows, okay, he's already done this before. Now he's going to try and implement another method. Okay, he's going to try and take another route or another policy. So he's going to try to go through different rooms and finally land up in room number 5. So guys, this is exactly how our code runs. We're going to traverse through each and every node because we want an optimum policy. Okay. An optimum policy is attained only when you traverse through all possible actions. Okay. So if you go through all possible actions that you can perform, only then will you understand which is the best action which will lead us to the reward. I hope this is clear. Now, let's move on and look at our code. So guys, this is our code and it's executed in Python, and I'm assuming that all of you have a good background in Python. Okay, if you don't understand Python very well, I'm going to leave a link in the description. You can check out that video on Python and then maybe come back to this later. Okay, but I'll be explaining the code to you anyway; I'm just not going to spend a lot of time explaining each and every line of code, because I'm assuming that you know Python. Okay. So let's look at the first line of code over here. So what we're going to do is we're going to import numpy. Okay, numpy is basically a Python library that adds support for large multi-dimensional arrays and matrices, and it's basically for computing mathematical functions. Okay, so first we want to import that. After that, we're going to create the R matrix. Okay. So this is the R matrix. Next, we're going to create a Q matrix, and it's a 6 by 6 matrix because obviously we have six states, starting from 0 to 5. Okay, and we are going to initialize the values to zero. So basically the Q matrix is going to be initialized to zero over here. All right, after that we're setting the gamma parameter to 0.8. So guys, you can play with this parameter, you know, move it to 0.9 or move it lower than 0.8. Okay, you can see what happens then. Then we'll set an initial state. Okay, the initial state is set as 1. After that, we're defining a function called available_actions. Okay. So basically what we're doing here is, since our initial state is one, we're going to check row number one. Okay, this is row number zero, this is row number one and so on. So we're going to check




row number one, and we're going to find the values which are greater than or equal to 0, because these values basically represent the nodes that we can travel to. Now, if a value is minus 1, you cannot traverse to it. Okay, I explained this earlier: the minus one represents all the nodes that we cannot travel to. Okay. So basically over here we're checking all the values which are equal to 0 or greater than 0; these will be our available actions. So if our initial state is one, we can travel to other states whose value is equal to 0 or greater than 0, and this is stored in this variable called available_act. Right now, this will basically get the available actions in the current state. Okay. So we're just storing the possible actions in this available_act variable over here. So basically over here, since our initial state is one, we're going to find out the next possible states we can go to. Okay, that is stored in the available_act variable. Now next is this function that chooses at random


which action is to be performed, within the range of available actions. So if you remember, over here, guys, initially we are in state number one. Okay, our available actions are to go to state number 3 or state number five, sorry, room number 3 or room number 5. Okay. Now randomly, we need to choose one room. So for that we're using this line of code, okay. So here we are randomly going to choose one of the actions from available_act; this available_act, like I said earlier, stores all our possible actions from the initial state. Okay. So once it chooses an action, it's going to store it in next_action. So guys, this will represent the next action to take. Now next is our Q matrix. Remember this formula that we used? So guys, this formula that we used is what we are going to calculate in the next few lines of code. So this block of code is executing and computing the value of Q. Okay, this is our formula for computing the value of Q: Q of current state comma action equals R of current state comma action plus gamma into the maximum value. So here basically we're going to calculate the maximum index, meaning





that we're going to check which of the possible actions will give us the maximum Q value. Right, if you remember, in our explanation over here, this value over here, max of Q(5, 1), Q(5, 4) and Q(5, 5), we had to choose the maximum Q value that we get from these three. So basically that's exactly what we're doing in this line of code: calculating the index which gives us the maximum value. After we finish computing the value of Q, we'll just have to update our matrix. After that, we'll be updating the Q value and we'll be choosing a new initial state. Okay. So this is the update function that is defined over here. Okay. So I've just called the function over here. So guys, this whole set of code will just calculate the Q value. Okay. This is exactly



what we did in our examples. After that, we have the training phase. So guys, remember the more you train an algorithm, the better it's going to learn. Okay, so over here I have provided around 10,000 iterations. Okay. So my range is 10,000 iterations, meaning that my agent will take 10,000 possible scenarios and go through 10,000 iterations to find out the best policy. So here's exactly what I'm doing: I'm choosing the current state randomly; after that, I'm choosing the available action from the current state, so either I can go to state 3 or state 5; then I'm calculating the next action, and then I'm finally updating the value in the Q matrix. And next, we just normalize the Q matrix. So sometimes in our Q matrix the value might exceed, okay, let's say it exceeded to 500 or 600; at that time you want to normalize the matrix. Okay, we want to bring it down a little bit, because with larger numbers we won't be able to understand, and computation would be very hard on larger numbers. That's why we perform normalization. You're taking your calculated value and you're dividing it by the maximum Q value, into 100. All right, so you are normalizing it over here. So guys, this is the testing phase. Okay, here you will just randomly set a current state and you don't give any other data, because you've already trained our model. Okay, you just have to give a current state; then you're going to tell your agent, listen, you're in room number one, now you need to go to room number five. Okay, so he has to figure out how to go to room number 5, because we have trained him now. All right. So here we have set the current state to one, and we need to make sure that it's not equal to 5, because 5 is the end goal. So guys, this is the same loop that we executed earlier. So we're going to do the same iterations again. Now if I run this entire code, let's look at the result. So our current state here we've chosen as one. Okay, and if we go back to our matrix, you can see that there is a direct link from 1 to 5, which means that the route that the agent should take is one to five. Okay, directly. You should go from 1 to 5 because that will get the maximum reward.
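Putting the pieces just described together, here is a consolidated sketch of what that demo code looks like. It is a reconstruction based on the walkthrough above (the R matrix, gamma = 0.8, available_actions, a random action sampler, the update rule, roughly 10,000 training iterations, normalization, and the testing loop); the exact function and variable names are assumptions, not necessarily the instructor's.

```python
import numpy as np

# Reward matrix for the 6 rooms: -1 = no door, 0 = door, 100 = door leading to the goal (room 5).
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

Q = np.zeros([6, 6])        # Q matrix, initialized to zero
gamma = 0.8                 # discount parameter
initial_state = 1

def available_actions(state):
    """All actions (next rooms) whose reward is >= 0, i.e. rooms connected by a door."""
    current_state_row = R[state, :]
    return np.where(current_state_row >= 0)[0]

def sample_next_action(available_act):
    """Pick one of the available actions at random."""
    return int(np.random.choice(available_act))

def update(current_state, action, gamma):
    """Q(state, action) = R(state, action) + gamma * max(Q(next_state, all actions))."""
    max_value = np.max(Q[action, :])
    Q[current_state, action] = R[current_state, action] + gamma * max_value

# Training phase: 10,000 iterations, each starting from a randomly chosen current state.
for _ in range(10000):
    current_state = np.random.randint(0, 6)
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

# Normalize the trained Q matrix (divide by the maximum Q value, times 100).
print("Trained Q matrix:")
print(Q / np.max(Q) * 100)

# Testing phase: start in room 1 and follow the highest Q value until we reach room 5.
current_state = 1
steps = [current_state]
while current_state != 5:
    next_step = int(np.argmax(Q[current_state, :]))
    steps.append(next_step)
    current_state = next_step
print("Selected path:", steps)
```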




Okay. Let's see if that's happening. So if I run this, it should give me a direct path from 1 to 5. Okay, that's exactly what happened. So this is the selected path: it went directly from one to five, and it calculated the entire Q matrix for me. So guys, this is exactly how it works. Now, let's try to set the initial state as, let's say, 2. So if I set the initial state as 2 and I try to run the code, let's see the path that it gives. So the selected path is 2, 3, 4, 5. Now it chose this path because it's giving us the maximum reward. Okay. This is the Q matrix that it calculated and this is the selected path. All right, so guys, with this we come to the end of this demo. So basically what we did was we just placed an agent in a random room and we asked it to traverse through and reach the end room, which is room number five. So basically we trained our agent and we made sure that it went through all the possible paths to calculate the best path. Now for a robot, an environment is a place where it has been put to use. Remember, this robot is itself the agent, for example in an automobile factory where a robot is used to move materials from one place to another. Now, the tasks we discussed just now have a property in common: these tasks involve an environment and expect the agent to learn from that environment. Now, this is where traditional machine learning fails, and hence the need for reinforcement learning. Now, it is good to have an established overview of the problem that is to be solved using Q-learning or reinforcement learning. It helps to define the main components of a reinforcement learning solution,



that is, the agent, environment, actions, rewards and states. So let's suppose we are to build a few autonomous robots for an automobile building factory. Now, these robots will help the factory personnel by conveying to them the necessary parts that they would need in order to build the car. Now these different parts are located at nine different positions within the factory warehouse. The car parts include the chassis, wheels, dashboard, the engine and so on, and the factory workers have prioritized the location that contains the body or the chassis to be the topmost, but they have provided the priorities for other locations as well, which we'll look into in a moment. Now these locations within the factory look somewhat like this. So as you can see here, we have L1, L2, L3, all of these stations. Now one thing you might notice here is that there are little obstacles present in between the locations. So L6 is the top priority location that contains the chassis for preparing the car bodies. Now the task is to enable the robots so that they can find the shortest route from any given location to another location on their own. Now the agents in this case are the robots, and the environment is the automobile factory warehouse. Now let's talk about the states. The states are the locations in which a particular robot is present at a particular instance of time, which we'll denote as states. Machines understand numbers rather than letters, so let's map the location codes to numbers. So as you can see here, we have mapped location L1 to state 0, L2 to state 1, and so on; we have L8 as state 7 and L9 as state 8. So next, what we're going to talk about are the actions. So in our example, the action will be the direct location that a robot can go to from a particular location. Right, consider a robot that is at the L2 location, and the direct locations to which it can move are L5, L1 and L3. Now the figure here may come in handy to visualize this. Now as you might have already guessed, the set of actions here is nothing but the set of all possible states of the robot. For each location, the set of actions that a robot can take will be different. For example, the set of actions will change if the robot is in L1 rather than L2. So if the robot is in L1,





it can only go to L4 and L2 directly. Now that we are done with the states and the actions, let's talk about the rewards. So the states are basically 0, 1, 2, 3, 4 and so on up till 8, and the actions are also 0, 1, 2, 3, 4 up till 8. Now, the rewards will be given to a robot if a location, which is the state, is directly reachable from a particular location. So let's take an example: suppose L9 is directly reachable from L8, right? So if a robot goes from L8 to L9 and vice versa, it will be rewarded by 1, and if a location is not directly reachable from a particular location, we give a reward of 0. Now the reward is just a number and nothing else; it enables the robots to make sense of their movements, helping them in deciding what locations are directly reachable and what are not. Now with this cue, we can construct a reward table which contains all the reward values mapping between all possible states. So as you can see here in the table, the positions which are marked green have a positive reward, and as you can see here, we have all the possible rewards that a robot can get by moving in between the different states. Now comes an interesting decision. Now remember that the factory administrator prioritized L6 to be the topmost. So how do we incorporate this fact in the above table? Now, this is done by associating the topmost priority location with a much higher reward than the usual ones. So let's put 999 in the cell (L6, L6). Now the table of rewards, with a higher reward for the topmost location, looks something like this. We have now formally defined all the vital components for the solution we are aiming for, for the problem discussed. Now, we will shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning and Q-learning. First of all, we'll start with the Bellman equation. Now consider the following square



rooms, which is analogous to the actual environment from our original problem, but without the barriers. Now suppose a robot needs to go to the room marked in green from its current position A using the specified directions. Now, how can we enable the robot to do this programmatically? One idea would be to introduce some kind of a footprint which the robot will be able to follow. Now here a constant value is specified in each of the rooms, which will come along the robot's way if it follows the directions specified above. Now in this way, if it starts at location A, it will be able to scan through these constant values and will move accordingly. But this will only work if the directions are prefixed and the robot always starts at location A. Now consider the robot starts at this location rather than its previous one. The robot now sees footprints in two different directions. It is therefore unable to decide which way to go in order to get to the destination, which is the green room. This happens primarily because the robot does not have a way to remember the directions to proceed. So our job now is to enable the robot with a memory. Now, this is where the Bellman equation comes into play. So as you can see here, the main purpose of the Bellman equation is to enable the robot with a memory. That's the thing we're going to use. So the equation goes something like this: V(s) = max_a [R(s, a) + gamma * V(s')], where s is a particular state (a room), a is the action of moving between the rooms, s' is the state to which the robot goes from s, and gamma is the discount factor; we'll get into it in a moment. And obviously R(s, a) is a reward function which takes a state s and an action a and outputs the reward. Now V(s) is the value of being in a particular state, which is the footprint. Now we consider all the possible actions and take the one that yields the maximum value. Now there is one constraint, however, regarding the value footprint: that is, the room marked in yellow just below the green room will always have a value of 1, to denote that it is one of the nearest rooms adjacent to the green room. Now, this is also to ensure that a robot gets a reward when it goes from the yellow room to the green room.
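Written out cleanly (my reconstruction of the spoken equation), the Bellman equation being described is:

```latex
% s: current state (room), a: action, s': the state reached from s, gamma: discount factor
V(s) \;=\; \max_{a}\big(\, R(s,a) \;+\; \gamma\, V(s') \,\big)
```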





Let's see how to make sense of the equation which we have here. So let's assume a discount factor of 0.9; remember, gamma is the discount value or the discount factor. So let's take 0.9. Now for the room which is marked just below the yellow one, the room with the asterisk mark, what will be the V(s), that is, the value of being in that particular state? So for this, V(s) would be something like: the maximum over a of R(s, a), which is 0, plus 0.9, which is gamma, into 1, and that gives us zero point nine. Now here the robot will not get any reward for going to the state marked in yellow, hence the R(s, a) is 0 here, but the robot knows the value of being in the yellow room, hence V(s') is one. Following this for the other states, we should get 0.9; then again, if we put 0.9 into this equation, we get 0.81, then zero point seven two nine, and then we again reach the starting point. So this is how the table looks with some value footprints computed from the Bellman equation. Now a couple of things to notice here: the max function makes the robot always choose the state that gives it the maximum value of being in that state, and the discount factor gamma notifies the robot about how far it is from the destination.
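A quick way to see where those 0.9, 0.81, 0.729 footprints come from is to run the recursion backwards from the goal with gamma = 0.9. This is just a sketch of the arithmetic, not the robot's actual program: only the room next to the goal carries a value of 1, every other step earns no immediate reward, so each step away simply multiplies the neighbouring value by gamma.

```python
gamma = 0.9

# The yellow room right next to the goal has a value footprint of 1.
# Every other room earns no immediate reward (R = 0), so by the Bellman equation
# V(s) = max_a (R(s, a) + gamma * V(s')) its value is just gamma times the next room's value.
v = 1.0
footprints = [v]
for _ in range(3):
    v = 0 + gamma * v          # 0.9, then 0.81, then 0.729
    footprints.append(round(v, 3))

print(footprints)              # [1.0, 0.9, 0.81, 0.729]
```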


The discount factor is typically specified by the developer of the algorithm that will be installed in the robot. Now, the other states can also be given their respective values in a similar way. So as you can see here, the boxes adjacent to the green one have 1, and as we move away from 1 we get 0.9, 0.81, 0.729, and finally we reach 0.66. Now the robot can proceed on its way to the green room utilizing these value footprints, even if it's dropped at any arbitrary room in the given location. Now, if a robot lands up in the highlighted sky blue area, it will still find two options to choose from, but eventually either of the paths will be good enough for the robot to take, because of the way the value footprints are now laid out. Now one thing to note is that the Bellman equation is one of the key equations in the world of reinforcement learning and Q-learning. So if we think realistically, our surroundings do not always work in the way we expect; there is always a bit of stochasticity involved. So this applies to the robot as well. Sometimes it might so happen that the robot's machinery gets corrupted. Sometimes the robot may come across some hindrance on its way which may not be known to it beforehand. Right, and sometimes even if the robot knows that it needs to take the right turn, it will not. So how do we introduce this stochasticity in our case? Now here comes the Markov decision process. Now consider the robot is currently in the red room and it needs to go to the green room. Let's now consider that the robot has a slight chance of dysfunctioning and might take the left or the right or the bottom turn instead of taking the upper turn in order to get to the green room from where it is now, which is the red room. Now the question is, how do we enable the robot to handle this when it is out in the given environment? Right. Now, this is a situation where the decision making regarding which turn is to be taken is partly random and partly under the control of the robot: partly random because we are not sure when exactly the robot might dysfunction, and partly under the control of the robot because it is still making the decision of taking a turn on its own, with the help of the program embedded into it.




So a Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision-making in situations where the outcomes are partly random and partly under the control of the decision maker. Now we need to give this concept a mathematical shape, most likely an equation, which then can be taken further. Now you might be surprised that we can do this with the help of the Bellman equation with a few minor tweaks. So if we have a look at the original Bellman equation, V(s) = max_a [R(s, a) + gamma * V(s')], what needs to be changed in this equation so that we can introduce some amount of randomness here? As long as we are not sure when the robot might not take the expected turn, we are then also not sure in which room it might end up, which is nothing but the room it moves to from its current room. At this point, according to the equation, we are not sure of s', which is the next state or room, but we do know all the probable turns the robot might take. Now in order to incorporate each of these probabilities into the above equation, we need to associate a probability with each of the turns, to quantify the chance of the robot taking that turn. Now if we do so, we get V(s) = max_a [R(s, a) + gamma * sum over s' of P(s, a, s') * V(s')]. Now P(s, a, s') is the probability of moving from room s to s' with the action a, and the summation here is the expectation over the randomness that the robot incurs. Now, let's take a look at this example here. So when we associate the probabilities to each of these turns, we essentially mean that there is an 80% chance that the robot will take the upper turn. Now, if we put all the required values in our equation, we get V(s) = max_a [R(s, a) + gamma * (0.8 * V(room up) + 0.1 * V(room down) + 0.03 * V(room left) + 0.03 * V(room right))]. Now note that the value footprints will not change due to the fact that we are incorporating stochasticity here, but this time we will not calculate those value footprints by hand; instead, we will let the robot figure them out. Now up until this point, we have not considered rewarding the robot for its action of going into a particular room; we are only rewarding the robot when it gets to the destination. Now, ideally there should be a reward for each action the robot takes, to help it better assess the quality of its actions. The rewards need not always be the same, but it is much better to have some amount of reward for the actions than having no rewards at all. Right, and this idea is known as the living penalty. In reality, the reward system can be very complex, and particularly modeling sparse rewards is an active area of research in the domain of reinforcement learning. So by now we have got the equation, and what we do now is transition to Q-learning. So this equation gives us the value of going to a particular state, taking the stochasticity of the environment into account.
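Again written out cleanly (my reconstruction of the equation as it is described), the stochastic version of the Bellman equation used here is:

```latex
% P(s, a, s'): probability of landing in state s' when taking action a from state s
V(s) \;=\; \max_{a}\Big(\, R(s,a) \;+\; \gamma \sum_{s'} P(s,a,s')\, V(s') \,\Big)
```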
Now, we have also learned very briefly about the idea of the living penalty, which deals with associating each move of the robot with a reward. So Q-learning brings in the idea of assessing the quality of an action that is taken to move to a state, rather than determining the possible value of the state being moved to. So earlier we had 0.8 * V(s1) + 0.03 * V(s2) + 0.1 * V(s3) and so on. Now if we incorporate the idea of assessing the quality of the action for moving to a certain state, the environment with the agent and the quality of the actions will look something like this. So instead of 0.8 * V(s1) we'll have Q(s1, a1), then Q(s2, a2), Q(s3, a3), and so on. Note the robot now has four different states to choose from, and along with that there are four different actions for the current state it is in. So how do we calculate Q(s, a), that is, the cumulative quality of the possible actions the robot might take? So let's break it down. Now, from the equation V(s) = max_a [R(s, a) + gamma * sum over s' of P(s, a, s') * V(s')], if we discard the max function we have R(s, a) + gamma * sum of P * V. Now essentially, in the equation that produces V(s), we are considering all possible actions and all possible states from the current state that the robot is in, and then we are taking the maximum value caused by taking a certain action. The expression without the max produces a value footprint for just one possible action; in fact, we can think of it as the quality of the action. So Q(s, a) = R(s, a) + gamma * sum of P * V. Now that we have got an equation to quantify the quality of a particular action, we are going to make a little adjustment in the equation. We can now say that V(s) is the maximum of all the possible values of Q(s, a), right? So let's utilize this fact and replace V(s') with a function of Q, so Q(s, a) becomes R(s, a) + gamma * sum over s' of P(s, a, s') * max over a' of Q(s', a'). So




So the equation of V has now turned into an equation of Q, which is the quality. But why would we do that? This is done to ease our calculations, because now we have only one function, Q, which is also the core of the Q-learning algorithm. We have only one function Q to calculate, and R(s, a) is a quantified metric which produces the reward of moving to a certain state. The qualities of the actions are called the Q-values, and from now on we will refer to the value footprints as the Q-values.

An important piece of the puzzle is the temporal difference. Temporal difference is the component that will help the robot calculate the Q-values with respect to the changes in the environment over time. So consider that our robot is currently in the marked state and it wants to move to the upper state. One thing to note here is that the robot already knows the Q-value of making that action, that is, of moving to the upper state, and we know that the environment is stochastic in nature, so the reward the robot gets after moving to the upper state might be different from an earlier observation. So how do we capture this change? With the temporal difference: we calculate the new Q(s, a) with the same formula and subtract the previously known Q(s, a) from it:

TD(s, a) = R(s, a) + γ max_a' Q(s', a') - Q_{t-1}(s, a)

The equation we just derived gives the temporal difference in the Q-values, which further helps to capture the random changes that the environment may impose. The new Q(s, a) is then updated as follows:

Q_t(s, a) = Q_{t-1}(s, a) + α TD_t(s, a)

Here α is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment; Q_t(s, a) is the current Q-value and Q_{t-1}(s, a) is the previously recorded Q-value. If we replace TD(s, a) with its full-form equation, we get:

Q_t(s, a) = Q_{t-1}(s, a) + α [ R(s, a) + γ max_a' Q(s', a') - Q_{t-1}(s, a) ]

Now that we have all the little pieces of Q-learning together, let's move forward to the implementation part. This is the final equation of Q-learning, so let's see how we can implement it and obtain the best path for any robot to take. To implement the algorithm, we need to understand the warehouse locations and how they can be mapped to different states. So let's start by reconstructing the sample environment. As you can see here, we have the locations L1, L2, L3 and so on, and we also have certain borders. First of all, let's map each of the above locations in the warehouse to numbers, or states, so that it eases our calculations. So what I'm going to do is create a new Python 3 file in the Jupyter notebook, and I'll name it Q-learning.

So let's define the states. But before that, we need to import numpy, because we're going to use it for this purpose, and let's initialize the parameters, that is, the gamma and alpha parameters. Gamma is 0.75, which is the discount factor, whereas alpha is 0.9, which is the learning rate. Next we define the states and map them to numbers, so as I mentioned earlier, L1 is zero, and so on; we have defined the states in numerical form. The next step is to define the actions, which, as mentioned above, represent the transitions to the next states. As you can see here, we have an array of actions from 0 to 8. After that, what we're going to do is define the reward table.
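Before we get to the reward table, here is a rough sketch of the setup just described: the import, gamma, alpha, the state mapping and the actions. I'm reconstructing this from the narration, so the variable names (location_to_state, actions) are my assumptions rather than the exact code shown on screen.

```python
import numpy as np

# Discount factor and learning rate, as described above
gamma = 0.75
alpha = 0.9

# Map each warehouse location L1..L9 to a numeric state 0..8
location_to_state = {
    'L1': 0, 'L2': 1, 'L3': 2,
    'L4': 3, 'L5': 4, 'L6': 5,
    'L7': 6, 'L8': 7, 'L9': 8,
}

# One action per state: action j simply means "move to state j"
actions = list(range(9))  # [0, 1, ..., 8]
```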
So as you can see here, this is the same matrix that we created just now. If you look at it carefully, the barriers are not an absolute limitation as depicted in the image; for example, the transition L4 to L1 is allowed, but its reward is 0 to discourage that path, and in tougher situations we add a -1 there so that it gets a negative reward.

In the above code snippet, we took each of the states and put 1s at the states that are directly reachable from a given state. If you refer to the reward table we created above once again, this reconstruction will be easy to understand. One thing to note here is that we did not consider the top-priority location L6 yet. We also need an inverse mapping from the states back to their original locations; it will be cleaner when we reach the deeper parts of the algorithm. For that, we build the inverse map, state_to_location: we take the distinct states and locations and convert them back.

Next, we define a function get_optimal_route, which takes a start location and an end location. Don't worry, the code is big, but I'll explain each and every bit of it. The get_optimal_route function takes two arguments, the starting location in the warehouse and the end location in the warehouse, respectively, and it returns the optimal route for reaching the end location from the starting location in the form of an ordered list containing the letters.

We start the function by initializing the Q-values to all zeros, but before that we copy the reward matrix to a new one, rewards_new. Next, we get the ending state corresponding to the ending location, and with this information we automatically set the priority of the given ending state to the highest value, 999. Then, with the Q-values initialized to 0, the learning process begins: we take i in range(1000) and pick a state randomly using np.random.randint. For traversing to a neighbouring location in the maze, we iterate through the new reward matrix and get the actions whose reward is greater than 0, and after that we pick an action randomly from the list of playable actions, which leads us to the next state.
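The reward matrix and the state_to_location inverse map described a moment ago might look roughly like the snippet below. The exact adjacency was shown on screen rather than spoken, so the 1s here encode one plausible 3x3 warehouse layout (L1 L2 L3 / L4 L5 L6 / L7 L8 L9, with walls) that is consistent with the route obtained at the end; treat the matrix as an assumption, not the video's exact values. It reuses location_to_state from the earlier sketch.

```python
import numpy as np

# rewards[i, j] = 1 if the robot can move directly from state i to state j,
# 0 otherwise. Assumed layout: L1 L2 L3 / L4 L5 L6 / L7 L8 L9 with walls.
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],  # L1 <-> L2
    [1, 0, 1, 0, 1, 0, 0, 0, 0],  # L2 <-> L1, L3, L5
    [0, 1, 0, 0, 0, 1, 0, 0, 0],  # L3 <-> L2, L6
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # L4 <-> L7
    [0, 1, 0, 0, 0, 0, 0, 1, 0],  # L5 <-> L2, L8
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # L6 <-> L3
    [0, 0, 0, 1, 0, 0, 0, 1, 0],  # L7 <-> L4, L8
    [0, 0, 0, 0, 1, 0, 1, 0, 1],  # L8 <-> L5, L7, L9
    [0, 0, 0, 0, 0, 0, 0, 1, 0],  # L9 <-> L8
])

# Inverse mapping: numeric state back to its location letter
state_to_location = {state: location for location, state in location_to_state.items()}
```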




We then compute the temporal difference, TD, which is the reward plus gamma times the Q-value of the next state (we take np.argmax over the next state's Q-values) minus the Q-value of the current state. We then update the Q-values using the Bellman equation, as you can see here.

After that, we initialize the optimal route with the starting location. We do not know the next location yet, so we initialize it with the value of the starting location as well. We also do not know the exact number of iterations needed to reach the final location, hence a while loop is a good choice for the iteration. Inside it, we fetch the starting state and the highest Q-value pertaining to that state, which gives us the index of the next state; but we need the corresponding letter, so we use the state_to_location mapping we just defined. After that, we update the starting location for the next iteration, and finally we return the route.

So let's take a starting location of L9 and an end location of L1 and see what path we actually get.
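Putting the pieces together, the get_optimal_route function described above could be sketched as follows. This is a reconstruction from the narration (it reuses gamma, alpha, rewards, location_to_state and state_to_location from the earlier sketches), so details may differ slightly from the notebook shown in the video.

```python
def get_optimal_route(start_location, end_location):
    # Work on a copy of the reward matrix and give the ending state top priority
    rewards_new = np.copy(rewards)
    ending_state = location_to_state[end_location]
    rewards_new[ending_state, ending_state] = 999

    # Initialize all Q-values to zero
    Q = np.zeros((9, 9))

    # Q-learning loop
    for i in range(1000):
        current_state = np.random.randint(0, 9)  # pick a state at random
        # Actions with a positive reward are the playable ones from this state
        playable_actions = [j for j in range(9) if rewards_new[current_state, j] > 0]
        next_state = np.random.choice(playable_actions)
        # Temporal difference: reward + gamma * best Q of the next state - current Q
        TD = (rewards_new[current_state, next_state]
              + gamma * Q[next_state, np.argmax(Q[next_state, :])]
              - Q[current_state, next_state])
        # Bellman update with learning rate alpha
        Q[current_state, next_state] += alpha * TD

    # Walk greedily from start to end using the learned Q-values
    route = [start_location]
    next_location = start_location
    while next_location != end_location:
        starting_state = location_to_state[start_location]
        next_state = np.argmax(Q[starting_state, :])
        next_location = state_to_location[next_state]
        route.append(next_location)
        start_location = next_location
    return route

print(get_optimal_route('L9', 'L1'))  # expected to print ['L9', 'L8', 'L5', 'L2', 'L1']
```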




As you can see here, we get L9, L8, L5, L2 and L1. And if you have a look at the image, starting from L9 and going to L1, the route L9 -> L8 -> L5 -> L2 -> L1 is the one that produces the maximum value, that is, the maximum reward for the robot.

So now we have come to the end of this Q-learning session, and I hope you got to know what exactly Q-learning is, with the analogy starting all the way from the number of rooms. I hope the example and the analogy I took were good enough for you to understand Q-learning: the Bellman equation, how to make quick changes to the Bellman equation, how to create the reward table and the Q-values, how to update the Q-values using the Bellman equation, and what alpha and gamma do.

Thanks for Reading
onlyharish
golden knowledge


If you have any questions, please let me know!
