They also have to take care of the database backups and recoveries. So some of the skills that are needed to become a database administrator include database backup and recovery, data security, and data modeling and design.

Next, we have the business analyst. Now the role of a business analyst is a little different from all of the other data science jobs. Don't get me wrong, they have a very good understanding of the data-oriented technologies. They know how to handle a lot of data and process it, but they are also very focused on how this data can be linked to actionable business insights. So they mainly focus on business growth. A business analyst acts like a link between the data engineers and the management executives. So in order to become a business analyst, you have to have an understanding of business finances and business intelligence, and also IT technologies like data modeling, data visualization tools, etc.

At last, we have a data and analytics manager. A data and analytics manager is responsible for the data science operations. The main responsibility of a data and analytics manager is to oversee the data science operations.
They're responsible for assigning duties to the team according to their skills and expertise. Their strengths should include technologies like SAS, R, and SQL, and of course they should have good management skills. Apart from that, they must have excellent social skills, leadership qualities, and an out-of-the-box thinking attitude. And like I said earlier, you need to have a good understanding of technologies like Python, SAS, R, Java, etc. So guys, these were the different job roles in data science. I hope you all found this informative. Now, let's move ahead and look at the data life cycle.

So guys, there are basically six steps in the data life cycle. It starts with a business requirement. Next is data acquisition. After that you process the data, which is called data processing. Then there is data exploration, modeling, and finally deployment. Before you even start on a data science project, it is important that you understand the problem you're trying to solve. So in this stage, you're just going to focus on identifying the central objectives of the project, and you will do this by identifying the variables that need to be predicted. Next up, we have data acquisition. Now that you have your objectives defined, it's time for you to start gathering the data.
So data mining is the process of gathering your data from different sources. At this stage, some of the questions you can ask yourself are: What data do I need for my project? Where does it live? How can I obtain it? And what is the most efficient way to store and access all of it?

Next up, there is data processing. Usually all the data that you've collected is a huge mess. It's not formatted, it's not structured, it's not cleaned. So if you find any data set that is cleaned and packaged well for you, then you've actually won the lottery, because finding the right data takes a lot of time and effort. One of the most time-consuming tasks in the data science process is data cleaning. It requires a lot of time and effort because you have to go through the entire data set to find any missing values, inconsistent values, or corrupted data, and you also find the unnecessary data and remove it. So this was all about data processing.

Next, we have data exploration. Now that you have a sparkling clean set of data, you are finally ready to get started with your analysis. The data exploration stage is basically the brainstorming of data analysis. In order to understand the patterns in your data, you can use a histogram. You can just pull up a random subset of data and plot a histogram. You can even create interactive visualizations. This is the point where you dive deep into the data and try to explore the different models that can be applied to your data.

Next up, we have data modeling. So after processing the data, what you're going to do is carry out model training.
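Before getting into model training, here is a minimal sketch of the cleaning and exploration steps just described. The `raw_ages` values, the plausible range of 0 to 120, and the helper names `clean` and `histogram` are all assumptions made up for illustration; they are not from the session.

```python
from collections import Counter

# Hypothetical raw records (assumed for illustration): ages gathered from
# several sources, with missing (None) and corrupted entries mixed in.
raw_ages = [23, 31, None, 27, -4, 31, 250, 45, None, 38, 29, 52]

BIN_WIDTH = 10


def clean(values, low=0, high=120):
    """Drop missing values and values outside a plausible range."""
    return [v for v in values if v is not None and low <= v <= high]


def histogram(values, bin_width=BIN_WIDTH):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


ages = clean(raw_ages)
age_bins = histogram(ages)

# Print a quick text histogram, one row per bin.
for lower, count in age_bins.items():
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * count}")
```

On a real project you would more likely use a plotting library for the histogram; the text version here just keeps the sketch self-contained.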
Now model training is basically about finding a model that answers the questions more accurately. The process of model training involves a lot of steps. Firstly, you'll start by splitting the input data into the training data set and the testing data set. You're going to take the entire data set and separate it into two parts: one is the training data and one is the testing data. After that, you'll build a model by using the training data set, and once you're done with that, you'll evaluate it on the training and the test data sets. To do this evaluation, you'll be using a series of machine learning algorithms. After that, you'll find out the model which is the most suitable for your business requirement. So this was mainly data modeling. This is where you build a model out of your training data set and then you evaluate this model by using the testing data set.
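The split, train, and evaluate steps above can be sketched roughly as follows. This is only an illustration: the data is synthetic (roughly y = 2x + 1 with noise), the model is a plain least-squares line chosen because it needs no extra libraries, and the function names and the 25% test fraction are assumptions rather than anything prescribed in the session.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, pairs):
    """Average squared gap between the model's prediction and the true y."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pairs) / len(pairs)


# Synthetic data (an assumption for this sketch): y is about 2x + 1 plus noise.
noise = random.Random(0)
data = [(x, 2 * x + 1 + noise.uniform(-1, 1)) for x in range(40)]

train, test = train_test_split(data)     # 1. split
model = fit_line(train)                  # 2. build the model on training data only
train_error = mean_squared_error(model, train)
test_error = mean_squared_error(model, test)   # 3. evaluate on unseen data

print(f"slope={model[0]:.3f} intercept={model[1]:.3f}")
print(f"train MSE={train_error:.3f} test MSE={test_error:.3f}")
```

The point of keeping the test rows out of `fit_line` is that the test error then tells you how the model behaves on data it has never seen.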
Next, you have deployment. So guys, the goal of this stage is to deploy the model into a production or maybe a production-like environment. This is basically done for final user acceptance: the users have to validate the performance of the models, and if there are any issues with the model or with the algorithm, then they have to be fixed in this stage. So guys, with this we come to the end of the data life cycle. I hope this was clear.

Statistics and probability are essential because these disciplines form the basic foundation of all machine learning algorithms, deep learning, artificial intelligence, and data science. In fact, mathematics and probability are behind everything around us. From shapes, patterns, and colors to the count of petals in a flower, mathematics is embedded in each and every aspect of our lives. With this in mind, I welcome you all to today's session. So I'm going to go ahead and discuss the agenda for today with you all. We're going to begin the session by understanding what data is. After that, we'll move on and look at the different categories of data, like quantitative and qualitative data. Then we'll discuss what exactly statistics is, the basic terminologies in statistics, and a couple of sampling techniques. Once we're done with that, we'll discuss the different types of statistics, which involve descriptive and inferential statistics.
Then in the next section we'll mainly be focusing on descriptive statistics. Here we'll understand the different measures of center, measures of spread, information gain, and entropy. We'll also understand all of these measures with the help of a use case, and finally we'll discuss what exactly a confusion matrix is. Once we've covered the entire descriptive statistics module, we'll discuss the probability module. Here we'll understand what exactly probability is and the different terminologies in probability. We'll also study the different probability distributions. Then we'll discuss the types of probability, which include marginal, joint, and conditional probability. Then we'll move on and discuss a use case where we'll see examples that show us how the different types of probability work, and to better understand Bayes' theorem, we'll look at a small example.

Also, I forgot to mention that at the end of the descriptive statistics module I'll be running a small demo in the R language. So for those of you who don't know much about R, I'll be explaining every line in depth, but if you want to have a more in-depth understanding of R, I'll leave a couple of blogs and a couple of videos in the description box. You all can definitely check out that content.

Now after we've completed the probability module, we'll discuss the inferential statistics module. We'll start this module by understanding what point estimation is. We will discuss what a confidence interval is and how you can estimate the confidence interval. We will also discuss margin of error, and we'll understand all of these concepts by looking at a small use case. We'll finally end the inferential statistics module by looking at what hypothesis testing is. Hypothesis testing is a very important part of inferential statistics. So we'll end the session by looking at a use case that discusses how hypothesis testing works, and to sum everything up, we'll look at a demo that explains how inferential statistics works.

Alright, so guys, there's a lot to cover today. So let's move ahead and take a look at our first topic, which is: what is data? Now, this is quite a simple question. If I ask any of you what data is, you'll say that it's a set of numbers or some sort of documents that are stored on your computer. But data is actually everything. Look around you, there is data everywhere. Each click on your phone generates more data than you know. This generated data provides insights for analysis and helps us make better business decisions.
This is why data is so important. To give you a formal definition, data refers to facts and statistics collected together for reference or analysis. This is the definition of data in terms of statistics and probability. So as we know, data can be collected, it can be measured and analyzed, and it can be visualized by using statistical models and graphs.

Now data is divided into two major subcategories: qualitative data and quantitative data. Under qualitative data, we have nominal and ordinal data, and under quantitative data, we have discrete and continuous data.

Let's focus on qualitative data first. This type of data deals with characteristics and descriptors that can't be easily measured but can be observed subjectively. Qualitative data is further divided into nominal and ordinal data. Nominal data is any sort of data that doesn't have any order or ranking. An example of nominal data is gender. There is no ranking in gender. There's only male, female, or other, right? There is no one, two, three, four, or any sort of ordering in gender. Race is another example of nominal data.
Now ordinal data is basically an ordered series of information. Let's say that you went to a restaurant. Your information is stored in the form of a customer ID, so basically you are represented by a customer ID. Now you would have rated their service as either good or average. That's what ordinal data is, and similarly they'll have a record of other customers who visit the restaurant along with their ratings. So any data which has some sort of sequence or order to it is known as ordinal data. All right, so guys, this is pretty simple to understand.

Now, let's move on and look at quantitative data. Quantitative data basically deals with numbers. You can understand that from the word quantitative itself: quantitative is basically quantity. So it deals with numbers; it deals with anything that you can measure objectively. There are two types of quantitative data: discrete and continuous data. Discrete data, which this session also refers to as categorical data, can hold a finite number of possible values. The number of students in a class is a finite number. You can't have an infinite number of students in a class. Let's say in your fifth grade, there were a hundred students in your class. There weren't an infinite number, but there was a definite, finite number of students in your class. That's discrete data.

Next, we have continuous data. This type of data can hold an infinite number of possible values. So when I say the weight of a person is an example of continuous data, what I mean to say is my weight can be 50 kg, or it can be 50.1 kg, or 50.001 kg, or 50.0001, or 50.023, and so on, right? There are an infinite number of possible values. So this is what I mean by continuous data. This is the difference between discrete and continuous data.

I'd also like to mention a few other things over here. There are a couple of types of variables as well. We have a discrete variable and a continuous variable. A discrete variable is also known as a categorical variable, and it can hold values of different categories. Let's say that you have a variable called message, and there are two types of values that this variable can hold: your message can either be a spam message or a non-spam message. That's when you call a variable a discrete or categorical variable, because it can hold values that represent different categories of data. Now continuous variables are basically variables that can store an infinite number of values. So the weight of a person can be denoted as a continuous variable. Let's say there is a variable called weight, and it can store an infinite number of possible values. That's why we will call it a continuous variable. So guys, basically a variable is anything that can store a value, right? So if you associate any sort of data with a variable, then it will become either a discrete variable or a continuous variable.

There are also dependent and independent types of variables. Now, we won't discuss all of that in depth because that's pretty understandable. I'm sure all of you know what an independent variable and a dependent variable are, right? A dependent variable is any variable whose value depends on some other independent variable. So guys, that much knowledge I expect all of you to have.

So now let's move on and look at our next topic, which is: what is statistics? Coming to the formal definition, statistics is an area of applied mathematics which is concerned with data collection, analysis, interpretation, and presentation. Now usually when I speak about statistics, people think statistics is all about analysis, but statistics has other parts to it: data collection is also a part of statistics, as are data interpretation and presentation. All of this comes under statistics. You are going to use statistical methods to visualize data, to collect data, and to interpret data.
Alright, so this area of mathematics deals with understanding how data can be used to solve complex problems. Now I'll give you a couple of examples of problems that can be solved by using statistics. Let's say that your company has created a new drug that may cure cancer. How would you conduct a test to confirm its effectiveness? Now, even though this sounds like a biology problem, it can be solved with statistics. You will have to create a test which can confirm the effectiveness of the drug. This is a common problem that can be solved using statistics. Let me give you another example. You and a friend are at a baseball game, and out of the blue, he offers you a bet that neither team will hit a home run in that game.
Should you take the bet? Here you just work out the probability of whether you'll win or lose. This is another problem that comes under statistics. Let's look at another example. The latest sales data has just come in, and your boss wants you to prepare a report for management on places where the company could improve its business. What should you look for, and what should you not look for? This problem involves a lot of data analysis. You'll have to look at the different variables that are causing your business to go down, or you'll have to look at the variables that are increasing the performance of your models and thus growing your business. The basic idea behind data analysis is to use statistical techniques in order to figure out the relationship between different variables or different components in your business.

So now let's move on and look at our next topic, which is basic terminologies in statistics. Before you dive deep into statistics, it is important that you understand the basic terminologies used in statistics. The two most important terminologies in statistics are population and sample. Throughout the statistics course, or throughout any problem that you're trying to solve with statistics, you will come across these two words. Now, a population is a collection or a set of individuals, objects, or events whose properties are to be analyzed. So basically you can refer to the population as the subject that you're trying to analyze. A sample, just like the word suggests, is a subset of the population. You have to make sure that you choose the sample in such a way that it represents the entire population. It shouldn't focus on one part of the population; instead, it should represent the entire population. That's how your sample should be chosen. A well-chosen sample will contain most of the information about a particular population parameter.
Now, you must be wondering how one can choose a sample that best represents the entire population. Sampling is a statistical method that deals with the selection of individual observations within a population. Sampling is performed in order to infer statistical knowledge about a population. If you want to understand the different statistics of a population, like the mean, the median, the mode, the standard deviation, or the variance, then you're going to perform sampling, because it's not reasonable for you to study a large population and find out the mean, median, and everything else.

So why is sampling performed, you might ask? What is the point of sampling? We could just study the entire population. Now guys, think of a scenario wherein you're asked to perform a survey about the eating habits of teenagers in the US. At present there are over 42 million teens in the US, and this number is growing as we speak. Is it possible to survey each of these 42 million individuals about their health? Well, it might be possible, but this will take forever to do. Obviously, it's not reasonable to go around knocking on each door and asking what your teenage son eats and all of that, right?
This is not very reasonable. That's why sampling is used. It's a method wherein a sample of the population is studied in order to draw inferences about the entire population. So it's basically a shortcut to studying the entire population. Instead of taking the entire population and finding out all the solutions, you're just going to take a part of the population that represents the entire population, and you're going to perform all your statistical analysis, your inferential statistics, on that small sample. All right, so I'm sure I've made this clear to y'all: what a sample is and what a population is.

Now, there are two main types of sampling techniques: probability sampling and non-probability sampling. In this video we'll only be focusing on probability sampling techniques, because non-probability sampling is not within the scope of this video. We'll only discuss the probability part because we're focusing on statistics and probability. Under probability sampling, we have three different types: random sampling, systematic sampling, and stratified sampling. And just to mention the different types of non-probability sampling, we have snowball, quota, judgment, and convenience sampling.

So let's move on and look at the different types of probability sampling. What is probability sampling? It is a sampling technique in which samples from a large population are chosen by using the theory of probability. First, we have random sampling. In this method, each member of the population has an equal chance of being selected in the sample. So each and every individual or object in the population has an equal chance of being a part of the sample. That's what random sampling is all about.
You are randomly going to select any individual or any object, so this way each individual has an equal chance of being selected.

Next, we have systematic sampling. In systematic sampling, every nth record is chosen from the population to be a part of the sample. Now refer to this image that I've shown over here. Out of these six groups, every second group is chosen as a sample. So every second record is chosen here, and this is how systematic sampling works. You're selecting every nth record and adding it to your sample.

Next, we have stratified sampling. In this type of technique, a stratum is used to form samples from a large population. So what is a stratum? A stratum is basically a subset of the population that shares at least one common characteristic. Let's say that your population has a mix of both male and female, so you can create two strata out of this: one will have only the male subset and the other will have the female subset. In our example, the common characteristic is gender. After you've created the strata, you're going to use random sampling on these strata and choose a final sample. Random sampling here means that all of the individuals in each stratum will have an equal chance of being selected in the sample. So guys, these were the three different types of sampling techniques.

Now, let's move on and look at our next topic, which is the different types of statistics. After this, we'll be looking at the more advanced concepts of statistics. So far we've discussed the basics of statistics, which is basically what statistics is, the different sampling techniques, and the terminologies in statistics. Now we'll look at the different types of statistics. There are two major types of statistics that we'll cover in today's session: descriptive statistics and inferential statistics.
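The three probability sampling techniques described above (random, systematic, and stratified) can be sketched as follows. The twelve-person `population`, the field names, and the function names are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, seed=0):
    """Random sampling: every member has an equal chance of being picked."""
    return random.Random(seed).sample(population, k)


def systematic_sample(population, step, start=0):
    """Systematic sampling: take every `step`-th record, beginning at `start`."""
    return population[start::step]


def stratified_sample(population, key, per_stratum, seed=0):
    """Stratified sampling: group by `key`, then sample randomly in each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[key(member)].append(member)
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample


# Hypothetical population of 12 people, alternating between two genders.
population = [
    {"id": i, "gender": "female" if i % 2 else "male"} for i in range(1, 13)
]

random_pick = simple_random_sample(population, k=4)
every_second = systematic_sample(population, step=2, start=1)
by_gender = stratified_sample(population, key=lambda p: p["gender"], per_stratum=2)

print([p["id"] for p in random_pick])
print([p["id"] for p in every_second])
print([(p["id"], p["gender"]) for p in by_gender])
```

Here `every_second` mirrors the "every second group" picture from the slide, and `by_gender` uses gender as the shared characteristic that defines each stratum.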
We will be discussing both of these types of statistics in depth. We'll also be looking at a demo which I'll be running in the R language in order to make you understand what exactly descriptive and inferential statistics are. So guys, we're just going to look at the basics, so don't worry if you don't have much knowledge. I'm explaining everything from the basic level.

All right, so guys, descriptive statistics is a method which is used to describe and understand the features of a specific data set by giving a short summary of the data. It is mainly focused upon the characteristics of data. It also provides a graphical summary of the data. Now in order to make you understand what descriptive statistics is, let's suppose that you want to gift all your classmates a t-shirt, so you need to study the average shirt size of a student in the classroom. If you were to use descriptive statistics to study the average shirt size of students in your classroom, then what you would do is record the shirt size of all students in the class, and then you would find out the maximum, minimum, and average shirt size of the class.

Coming to inferential statistics: inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population. In simple words, it generalizes from a sample to a large data set, and it applies probability to draw a conclusion. It allows you to infer population parameters based on a statistical model by using sample data. So if we consider the same example of finding the average shirt size of students in a class, in inferential statistics we'll take a sample set of the class, which is basically a few people from the entire class. You will already have grouped the class into large, medium, and small. In this method you basically build a statistical model and extend it to the entire population in the class. So guys, that was a brief understanding of descriptive and inferential statistics.
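To make the shirt-size comparison concrete, here is a rough sketch. The class of 30 students and their numeric sizes are made up for illustration, and the "model" on the inferential side is just a sample mean with a standard error, which is a deliberately simple stand-in for whatever statistical model you would really build.

```python
import random
import statistics

# Hypothetical shirt sizes (chest size in inches) for a class of 30 students.
rng = random.Random(1)
class_sizes = [rng.choice([36, 38, 40, 42, 44]) for _ in range(30)]

# Descriptive statistics: measure everyone and summarize what you measured.
summary = {
    "min": min(class_sizes),
    "max": max(class_sizes),
    "mean": statistics.mean(class_sizes),
}

# Inferential statistics: measure only a few students and estimate the
# class average from them, along with how uncertain that estimate is.
sample = rng.sample(class_sizes, 8)
estimate = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(summary)
print(f"estimated mean from sample: {estimate:.2f} (standard error {standard_error:.2f})")
```

The descriptive summary is exact for this class; the inferential estimate is only as good as the sample is representative.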
So that's the difference between descriptive and inferential. Now in the next section, we will go in depth about descriptive statistics.

So let's discuss more about descriptive statistics. Like I mentioned earlier, descriptive statistics is a method that is used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. There are two important kinds of measures in descriptive statistics. We have measures of central tendency, which are also known as measures of center, and we have measures of variability, also known as measures of spread. Now what are measures of center? Measures of center are statistical measures that represent the summary of a data set. The three main measures of center are mean, median, and mode. Then there are the measures of variability, or measures of spread.
We have range, interquartile range, variance, and standard deviation. So now let's discuss each of these measures in a little more depth, starting with the measures of center.

Now, I'm sure all of you know what the mean is. The mean is basically the average of all the values in a sample. How do you measure the mean? If there are 10 numbers and you want to find the mean of these 10 numbers, all you have to do is add up all the 10 numbers and divide the sum by 10. Ten here represents the number of samples in your data set. This will give us the average, or the mean.

So to better understand the measures of central tendency, let's look at an example. The data set over here is basically the cars data set, and it contains a few variables. It has something known as cars, and it has mileage per gallon, cylinder type, displacement, horsepower, and rear axle ratio. All of these measures are related to cars. So what you're going to do is use descriptive analysis and analyze each of the variables in the sample data set for the mean, standard deviation, median, mode, and so on. Let's say that you want to find out the mean or the average horsepower of the cars among the population of cars. Like I mentioned earlier, you'll check the average of all the values. So in this case we will take the sum of the horsepower of each car and divide that by the total number of cars. That's exactly what I've done here in the calculation part. This 110 basically represents the horsepower of the first car. Similarly, I've just added up all the values of horsepower for each of the cars and divided the sum by 8. Now 8 is basically the number of cars in our data set.
All right, so 103.625 is what our mean, or the average horsepower, is.

Now, let's understand what the median is with an example. To define it, the median is basically a measure of the central value of the sample set. You can see that it is the middle value. So if we want to find out the center value of the mileage per gallon among the population of cars, first we'll arrange the MPG values in ascending or descending order and choose the middle value. In this case we have eight values, which is an even number of entries. Whenever you have an even number of data points or samples in your data set, you're going to take the average of the two middle values. If we had nine values over here, we could easily figure out the middle value and choose that as the median. But since there is an even number of values, we are going to take the average of the two middle values. So 22.8 and 23 are my two middle values, and I'm taking the mean of those two, and hence I get 22.9, which is my median.

Lastly, let's look at how the mode is calculated. So what is the mode? The value that is most recurrent in the sample set is known as the mode, or basically the value that occurs most often. Let's say that we want to find out the most common type of cylinder among the population of cars. What we have to do is check the value which is repeated the most number of times. Here we can see that the cylinders come in two types: we have cylinders of type 4 and cylinders of type 6. Take a look at the data set. You can see that the most recurring value is 6. We have three type-4 cylinders and five type-6 cylinders.
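The mean, median, and mode calculations above can be sketched with Python's `statistics` module. The full cars table is only shown on screen, so the three lists below are hypothetical values, picked so that they reproduce the figures quoted in the session (a first horsepower of 110, a mean of 103.625 over 8 cars, middle MPG values of 22.8 and 23, and three 4-cylinder versus five 6-cylinder cars). They are not the actual data set.

```python
import statistics

# Hypothetical stand-ins for the on-screen cars table (8 cars).
horsepower = [110, 110, 93, 96, 90, 110, 110, 110]
mpg = [21.0, 22.8, 24.4, 23.0, 18.7, 21.4, 23.5, 26.0]
cylinders = [6, 6, 4, 6, 6, 4, 6, 4]

# Mean: sum of the values divided by how many there are.
mean_hp = sum(horsepower) / len(horsepower)

# Median: sort, then average the two middle values because the count is even.
ordered = sorted(mpg)
middle = len(ordered) // 2
median_mpg = (ordered[middle - 1] + ordered[middle]) / 2

# Mode: the value that occurs most often.
mode_cyl = statistics.mode(cylinders)

print(f"mean horsepower: {mean_hp}")
print(f"median mpg:      {median_mpg:.1f}")
print(f"mode cylinders:  {mode_cyl}")
```

The manual steps are spelled out to match the explanation; `statistics.mean` and `statistics.median` would give the same answers in one call each.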
So our mode is going to be 6, since 6 is more recurrent than 4. So guys, those were the measures of center, or the measures of central tendency.

Now, let's move on and look at the measures of spread. What is a measure of spread? A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population. You can think of it as some sort of deviation in the sample. You measure this with the help of the different measures of spread: range, interquartile range, variance, and standard deviation.

Now range is pretty self-explanatory, right? It is a measure of how spread apart the values in a data set are. The range can be calculated as shown in this formula: you're basically going to subtract the minimum value in your data set from the maximum value. That's how you calculate the range of the data.

Next, we have interquartile range. Before we discuss interquartile range, let's understand what a quartile is. Quartiles basically tell us about the spread of a data set by breaking the data set into quarters. Just like how the median breaks the data into two parts, the quartiles will break it into quarters. So to better understand how quartiles and the interquartile range are calculated, let's look at a small example. This data set basically represents the marks of a hundred students, ordered from the lowest to the highest scores. The quartiles lie in the following ranges. The first quartile, which is also known as Q1, lies between the 25th and 26th observation.
So if you look at this, I've highlighted the 25th and the 26th observation. You can calculate Q1, or the first quartile, by taking the average of these two values. Since both the values are 45, when you add them up and divide by two you'll still get 45. Now the second quartile, or Q2, is between the 50th and the 51st observation. So you're going to take the average of 58 and 59, and you will get a value of 58.5. This is my second quartile. The third quartile, Q3, is between the 75th and the 76th observation. Here again, we'll take the average of the two values, the 75th value and the 76th value, and you'll get a value of 71. All right, so guys, this is exactly how you calculate the different quartiles.

Now, let's look at what the interquartile range is. IQR, or the interquartile range, is a measure of variability based on dividing a data set into quartiles. The interquartile range is calculated by subtracting Q1 from Q3. So basically your IQR is Q3 minus Q1. Each quartile represents a quarter, which is 25%. So guys, I hope all of you are clear about the interquartile range and what quartiles are.

Now, let's look at variance. Variance is basically a measure that shows how much a random variable differs from its expected value. It's basically the variation in any variable. Variance can be calculated by using this formula right here: x basically represents any data point in your data set, n is the total number of data points in your data set, and x bar is basically the mean of the data points. Variance is basically computed from the squares of deviations. That's why it says s squared there. Now let's look at what deviation is. Deviation is just the difference between each element and the mean.
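The range, quartile, and IQR calculations walked through above can be sketched like this. The 100 marks are only shown on screen, so the `marks` list is a hypothetical data set constructed so that the 25th and 26th values are 45, the 50th and 51st are 58 and 59, and the 75th and 76th are 71, matching the numbers quoted. The helper name `between` is also an assumption.

```python
# Hypothetical marks of 100 students, already ordered from lowest to highest.
marks = (
    [20 + i for i in range(24)]           # observations 1 to 24
    + [45, 45]                            # observations 25 and 26
    + [46 + i // 2 for i in range(23)]    # observations 27 to 49
    + [58, 59]                            # observations 50 and 51
    + [60 + i // 2 for i in range(23)]    # observations 52 to 74
    + [71, 71]                            # observations 75 and 76
    + [72 + i for i in range(24)]         # observations 77 to 100
)


def between(ordered, k):
    """Average of the k-th and (k+1)-th observations, counting from 1."""
    return (ordered[k - 1] + ordered[k]) / 2


data_range = max(marks) - min(marks)   # maximum minus minimum
q1 = between(marks, 25)
q2 = between(marks, 50)                # same as the median
q3 = between(marks, 75)
iqr = q3 - q1

print(f"range={data_range} Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")
```

Libraries use several slightly different conventions for where a quartile falls between observations, so a function such as `statistics.quantiles` may return values that differ a little from this positional rule.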
Deviation can be calculated using a simple formula, xᵢ − μ, where xᵢ represents a data point and μ is the mean of the population. That is exactly how you calculate the deviation.

Population variance and sample variance differ in whether you're calculating the variance over your whole population or over a sample drawn from it. In the formula for population variance, x is each data point, μ is the mean of the population, and N is the number of data points in the population.

Now let's look at sample variance. Sample variance is the average of squared differences from the sample mean. Here xᵢ is any data point in your sample and x̄ is the mean of your sample, not the mean of your population. And if you notice, n here is a lowercase n: the number of data points in your sample. That is the difference between sample and population variance. I hope that is clear.

Coming to standard deviation: it is the measure of dispersion of a set of data from its mean, calculated as the square root of the variance.

To better understand how the measures of spread are calculated, let's look at a small use case. Let's say Daenerys has 20 dragons. They have the numbers 9, 2, 5, 4 and so on, as shown on the screen, and you have to work out the standard deviation. To do that you need the mean, so first you find the mean of your sample set: add all the numbers in your data set and divide by the total number of data points. You get a value of 7. Then you calculate the right-hand side of the standard deviation formula: from each data point you subtract the mean, and you square the result.
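The same steps can be written out in Python as a rough sketch. Only the first four numbers (9, 2, 5, 4) are spoken in the talk; the other sixteen values below are an assumption, picked so that the data set has 20 values and a mean of 7 as stated. The sample variance line uses the usual n − 1 divisor, which is the standard convention rather than something spelled out in the talk.

```python
import math

# First four values are from the talk; the remaining sixteen are ASSUMED
# for illustration so that there are 20 values with a mean of 7.
dragons = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

n = len(dragons)
mean = sum(dragons) / n                              # step 1: the mean
squared_diffs = [(x - mean) ** 2 for x in dragons]   # step 2: squared deviations
population_variance = sum(squared_diffs) / n         # step 3: mean of the squares
population_sd = math.sqrt(population_variance)       # step 4: square root

# Sample variance conventionally divides by n - 1 instead of n
sample_variance = sum(squared_diffs) / (n - 1)

print(mean)                      # 7.0
print(squared_diffs[:5])         # [4.0, 25.0, 4.0, 9.0, 25.0]
print(round(population_sd, 3))   # 2.983
```

Python's standard library offers `statistics.pstdev` and `statistics.variance` for the population standard deviation and the sample variance, which can be used to cross-check a hand calculation like this one.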
All right. When you do that, you get the following result: 4, 25, 4, 9, 25 and so on. Finally you find the mean of the squared differences, and once you take the square root, your standard deviation comes to 2.983. So guys, it's pretty simple. It's a simple mathematical technique; all you have to do is substitute the values into the formula. I hope this was clear to all of you.

Now let's move on to the next topic, which is information gain and entropy. This is one of my favorite topics in statistics. It's very interesting, and it is mainly used in machine learning algorithms like decision trees and random forests. It's important to know how information gain and entropy work and why they are so essential in building machine learning models. We'll focus on the statistical side first, and after that we'll discuss a use case and see how information gain and entropy are used in decision trees. For those of you who don't know what a decision tree is: it is a machine learning algorithm. You don't have to know anything about it; I'll explain everything in depth, so don't worry.

Now guys, entropy is a measure of the uncertainty present in the data. It can be measured using the formula H(S) = −Σ pᵢ log₂(pᵢ), summed from i = 1 to N. Here S is the set of all instances in the data set, N is the number of different classes in the data set, and pᵢ is the event probability, that is, the proportion of instances belonging to class i. This might seem a little confusing, but when we go through the use case you'll understand these terms better.

Coming to information gain: as the name suggests, information gain indicates how much information a particular feature or variable gives us about the final outcome.
Information gain can be measured using the formula Gain(A, S) = H(S) − Σ (|Sⱼ| / |S|) · H(Sⱼ), summed over the values j in V; the sum on the right is written H(A, S). Here H(S) is the entropy of the whole data set, |Sⱼ| is the number of instances with the j-th value of attribute A, |S| is the total number of instances in the data set, V is the set of distinct values of attribute A, H(Sⱼ) is the entropy of the subset of instances Sⱼ, and H(A, S) is the entropy of attribute A. Even though this seems confusing, I'll clear up the confusion.

Let's discuss a small problem statement where we will see how information gain and entropy are used to study the significance of a model. Like I said, information gain and entropy are very important statistical measures that let us understand the significance of a predictive model. To get a clearer understanding, let's look at a use case.

Suppose we are given this problem statement: you have to predict whether a match can be played or not by studying the weather conditions. The predictor variables here are outlook, humidity and wind; day is also a predictor variable. The target variable is play. The target variable is the variable that you're trying to predict, and its value decides whether or not a game can be played. That's why play has two values, no and yes: no meaning the weather conditions are not good and you cannot play the game, yes meaning the weather conditions are suitable for playing. So that was our problem statement; I hope it is clear to all of you.

To solve such a problem we use something known as a decision tree. So guys, think of an inverted tree, where each branch of the tree denotes some decision. Each branch point is known as a branch node, and at each branch node you take a decision in such a manner that you get an outcome at the end of the branch.
Now this figure here shows that out of 14 observations, 9 result in a yes, meaning that out of 14 days the match can be played on only nine days. If you look at day 1, day 2, day 8, day 9 and day 11, the outlook has been sunny. So we cluster the data set depending on the outlook: this is our data set when the outlook is sunny, this is what we have when the outlook is overcast, and this is what we have when the outlook is rain. When it is sunny, we have two yeses and three nos. When the outlook is overcast, we have all four as yes, meaning that on the four days when the outlook was overcast we can play the game. All right.
Now when it comes to rain, we have three yeses and two nos. If you notice, the decision is being made by choosing the outlook variable as the root node. The root node is the topmost node in a decision tree. What we've done here is create a decision tree that starts with the outlook node and then splits further depending on its values: sunny, overcast and rain. Let me explain this in more depth.

The outlook node has three branches coming out from it, because outlook can have three values: it can be sunny, overcast or rainy. These three values are assigned to the immediate branch nodes, and for each of these values the possibility of play being equal to yes is calculated. The sunny and the rain branches give you an impure output, meaning that there is a mix of yes and no: two yeses and three nos here, three yeses and two nos over here. But the overcast value results in a hundred percent pure subset. This shows that the overcast value will result in a definite and certain output.

This is exactly what entropy is used to measure: it calculates the impurity, or the uncertainty. The lower the uncertainty or entropy of a variable, the more significant that variable is. When it comes to overcast, there's literally no impurity in the data set; it is a hundred percent pure subset. We want variables like these in order to build a model. But we don't always get lucky, and we don't always find variables that result in pure subsets.
That's why we have the measure of entropy. The lower the entropy of a particular variable, the more significant that variable will be. In a decision tree, the root node is assigned the best attribute so that the tree can predict the most precise outcome, meaning that the root node should hold the most significant variable. That's why we've chosen outlook. Now some of you might ask me why I haven't chosen overcast. Overcast is not a variable; it is a value of the outlook variable. We've chosen outlook because it has a hundred percent pure subset, which is overcast.

Now the question in your head is: how do I decide which variable or attribute best splits the data? Right now I looked at the data and told you that here we have a hundred percent pure subset. But what if it's a more complex problem and you can't tell which variable will best split the data? When it comes to decision trees, information gain and entropy help you understand which variable will best split the data set, that is, which variable you have to assign to the root node. Whichever variable is assigned to the root node should split the data set best, so it has to be the most significant variable.

So out of the 14 instances that we saw, nine said yes and five said no, meaning you cannot play on that particular day. How do you calculate the entropy? You just substitute the values into the formula: H(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14), which gives a value of 0.940. This is the entropy, or the uncertainty, of the data present in the sample. Now, to make sure we choose the best variable for the root node, let us look at all the possible choices for the root node.
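As a quick check of the entropy figure above, here is a minimal Python sketch of the calculation. The function name `entropy` and its input format (a list of class counts) are choices made for this example, not something taken from the talk.

```python
import math


def entropy(class_counts):
    """Shannon entropy (base 2) of a set, given the number of instances in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # a class with no instances contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result


print(round(entropy([9, 5]), 3))   # whole data set, 9 yes and 5 no: 0.94
print(entropy([4, 0]))             # overcast subset, 4 yes and 0 no: 0.0
print(round(entropy([2, 3]), 3))   # sunny subset, 2 yes and 3 no: 0.971
```

The pure overcast subset has an entropy of zero, while the mixed sunny subset is close to the maximum of 1 for two classes, which matches the pure and impure subsets described earlier.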
Okay, so these are all the possible choices: you can have outlook, windy, humidity or temperature. These are the four variables, and any one of them could be your root node. But how do you select which variable best fits the root node? That's what we are going to see using information gain and entropy. So guys, the task at hand is to find the information gain for each of these attributes: for outlook, for windy, for humidity and for temperature. A point to remember is that the variable that results in the highest information gain must be chosen, because it gives us the most precise output information.

We'll calculate the information gain for the attribute windy first. Here we have six instances of true and eight instances of false. When you substitute all the values in the formula, you get a value of 0.048. This is a very low value for information gain, so the information you get from the windy attribute is pretty low.

Next, let's calculate the information gain of the attribute outlook. From the total of 14 instances, we have five instances that are sunny, four that are overcast and five that are rainy. For sunny we have two yeses and three nos, for overcast we have all four as yes, and for rainy we have three yeses and two nos. When you calculate the information gain of the outlook variable, you get a value of 0.247. Compared to the information gain of the windy attribute, this value is actually pretty good.

Now let's look at the information gain of the attribute humidity. Here we have seven instances with high and seven instances with normal. Take the high branch node first.
There we have three instances that say yes and the remaining four say no. Similarly, under the normal branch we have six instances that say yes and one instance that says no. When you calculate the information gain for the humidity variable, you get a value of 0.151. This is also a pretty decent value, but it is less than the information gain of the attribute outlook.

Now let's look at the information gain of the attribute temperature. The temperature attribute can hold the values hot, mild and cool. Under hot we have two instances of yes and two of no, under mild we have four instances of yes and two of no, and under cool we have three instances of yes and one of no. When you calculate the information gain for this attribute, you get a value of 0.029,
which is again very low. So to summarize: if we look at the information gain for each of these variables, we see that outlook has the maximum gain, 0.247. You must always choose the variable with the highest information gain to split the data at the root node. That's why we assign the outlook variable to the root node. All right, so guys, I hope this use case was clear. If any of you have doubts, please post them in the comments.

Now let's move on and look at what exactly a confusion matrix is. The confusion matrix is the last topic for descriptive statistics. After this I'll run a short demo where I'll show you how to calculate mean, median, mode, standard deviation, variance and all of those values using R. So let's talk about the confusion matrix. Now guys, what is a confusion matrix? Don't get confused.
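Before leaving the decision tree use case, the information gain comparison can be reproduced with a short Python sketch. The (yes, no) counts per attribute value follow the walkthrough above, with two caveats: the yes/no breakdown for windy is not given in the talk, so the counts used here are an assumption that happens to reproduce the quoted 0.048, and the function and variable names are made up for this example.

```python
import math


def entropy(class_counts):
    """Shannon entropy (base 2) from the number of instances in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent_counts, subsets):
    """Entropy of the parent set minus the size-weighted entropy of its subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted


# (yes, no) counts for each value of each attribute
splits = {
    "outlook": [(2, 3), (4, 0), (3, 2)],      # sunny, overcast, rain
    "windy": [(3, 3), (6, 2)],                # true, false (breakdown ASSUMED)
    "humidity": [(3, 4), (6, 1)],             # high, normal
    "temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
}

gains = {name: information_gain((9, 5), subsets) for name, subsets in splits.items()}
root = max(gains, key=gains.get)  # attribute with the highest information gain

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root node:", root)   # root node: outlook
```

With these counts the gains come out at roughly 0.247 for outlook, 0.048 for windy, 0.152 for humidity and 0.029 for temperature. The humidity figure differs from the 0.151 quoted in the talk only in the last digit, which looks like a rounding difference.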
This is not a complex topic. A confusion matrix is a matrix that is often used to describe the performance of a model, specifically a classification model, or classifier. It calculates the accuracy, or the performance, of your classifier by comparing your actual results with your predicted results. This is what it looks like: true positive, true negative and all of that. This is a little confusing; I'll get back to what exactly true positive, true negative and the rest stand for. For now, let's look at an example and try to understand what exactly a confusion matrix is.

So guys, I made sure to put an example after each and every topic, because it's important that you understand the practical part of statistics. Statistics is not just about theory; you need to understand how the calculations are done. So let's look at a small use case. Consider that you're given data about 165 patients, out of which 105 patients have a disease and the remaining 60 patients don't.
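To make the idea of comparing actual and predicted results concrete, here is a small Python sketch. The eight labels below are hypothetical and are not the patient data from the use case; the helper name `confusion_counts` is also made up. It uses the standard meanings of the four cells: a true positive is an actual yes predicted as yes, a false negative is an actual yes predicted as no, and so on.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives by comparing actual and predicted labels."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels, for illustration only
actual    = ["yes", "yes", "yes", "no", "no",  "yes", "no", "yes"]
predicted = ["yes", "no",  "yes", "no", "yes", "yes", "no", "yes"]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)

print(dict(counts))   # {'TP': 4, 'FP': 1, 'TN': 2, 'FN': 1}
print(accuracy)       # 0.75
```

Accuracy here is simply the share of predictions that match the actual label, which is the kind of performance figure a confusion matrix lets you read off.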