Data Science Full Course - Learn Data Science Beginners day
Linear regression is one of the easiest algorithms in machine learning. It is a statistical model that attempts to show the relationship between two variables with a linear equation. But before we drill down into the linear regression algorithm in depth, I'll give you a quick overview of today's agenda. We'll start the session with a quick overview of what regression is, as linear regression is one type of regression algorithm. Once we learn about regression, its use cases, and its various types, we'll learn about the algorithm from scratch, where I'll take you through its mathematical implementation first, and then we'll drill down to the coding part and implement linear regression using Python. In today's session we'll deal with the linear regression algorithm using the least squares method, check its goodness of fit, or how close the data is to the fitted regression line, using the R-squared method, and then finally we'll optimize it using the gradient descent method. In the last part, the coding session, I'll teach you to implement linear regression using Python. The coding session will be divided into two parts: the first part will consist of linear regression using Python from scratch, where you will use the mathematical algorithm that you have learned in this session, and in the next part of the coding session we'll be using scikit-learn for a direct implementation of linear regression.
All right, I hope the agenda is clear to you guys. So let's begin our session with what regression is. Regression analysis is a form of predictive modeling technique which investigates the relationship between a dependent and an independent variable. A regression analysis involves graphing a line over a set of data points that most closely fits the overall shape of the data. In other words, regression shows the changes in a dependent variable on the y-axis against the changes in the explanatory variable on the x-axis.
Now you would ask, what are the uses of regression? Well, there are three major uses of regression analysis. The first is determining the strength of predictors: regression might be used to identify the strength of the effect that the independent variables have on the dependent variable. For example, you can ask questions like: what is the strength of the relationship between sales and marketing spending, or what is the relationship between age and income? The second is forecasting an effect: regression can be used to forecast the effects or impact of changes. That is, regression analysis helps us understand how much the dependent variable changes with a change in one or more independent variables. For example, you can ask a question like: how much additional sales income will I get for each thousand dollars spent on marketing? The third is trend forecasting: regression analysis predicts trends and future values, and can be used to get point estimates. Here you can ask questions like: what will be the price of Bitcoin in the next six months?
So the next topic is linear versus logistic regression. By now, I hope that you know what regression is.
So let's move on and understand its types. There are various kinds of regression, like linear regression, logistic regression, polynomial regression, and others. But for this session we'll be focusing on linear and logistic regression. So let me tell you what linear regression is and what logistic regression is, and then we'll compare both of them. Starting with linear regression: in simple linear regression, we are interested in things like y = mx + c. What we are trying to find is the correlation between the x and y variables. This means that every value of x has a corresponding value of y, and y is continuous. However, in logistic regression we are not fitting our data to a straight line like in linear regression. Instead, we are mapping y versus x to a sigmoid function. In logistic regression, what we find out is whether y is 1 or 0 for this particular value of x, so we are essentially deciding a true or false value for a given value of x.
So as a core concept, you can say that in linear regression the data is modeled using a straight line, whereas in logistic regression the data is modeled using a sigmoid function. Linear regression is used with continuous variables; logistic regression, on the other hand, is used with categorical variables. The output or prediction of a linear regression is the value of the variable, while the output or prediction of a logistic regression is the probability of occurrence of the event. Now, how will you check the accuracy and goodness of fit? In the case of linear regression, there are various methods: it is measured by loss, R-squared, adjusted R-squared, etc. In the case of logistic regression you have accuracy, precision, recall, the F1 score, which is nothing but the harmonic mean of precision and recall, the ROC curve for determining the probability threshold for classification, the confusion matrix, etc. There are many.
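The F1 score mentioned above, the harmonic mean of precision and recall, can be sketched in a few lines of Python. The function name and the counts (8 true positives, 2 false positives, 4 false negatives) are made up purely for illustration and are not from the session.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to show the arithmetic.
score = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(round(score, 4))  # 0.7273
```

Here precision is 0.8 and recall is about 0.667, and the harmonic mean sits closer to the smaller of the two, which is why F1 penalizes a classifier that is good at only one of them.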
So, summarizing the difference between linear and logistic regression, you can say that the type of function you are mapping to is the main point of difference between them. A linear regression maps a continuous x to a continuous y; a logistic regression, on the other hand, maps a continuous x to a binary y. So we can use logistic regression to make category or true/false decisions from the data. Fine, so let's move on ahead.
Next is the linear regression selection criteria, or you can say, when will you use linear regression? The first is classification and regression capabilities. Regression models predict a continuous variable, such as a value recorded over a day or the temperature of a city. Their reliance on a polynomial, like a straight line, to fit a data set poses a real challenge when it comes to building a classification capability. Let's imagine that you fit a line to the training points that you have. Now imagine you add some more data points. In order to fit them, you have to change your existing model; that is, maybe you have to change the threshold itself. This will happen with each new data point you add to the model. Hence, linear regression is not good for classification models. Next is data quality. Each missing value removes one data point that could optimize the regression. In simple linear regression, outliers can significantly disrupt the outcome. For now, just know that if you remove the outliers, your model will become much better.
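To get a feel for how much a single outlier can disrupt a simple linear regression, here is a minimal sketch. It uses the least squares slope formula that is covered later in the session; the function name and the extra point (6, 20) are made up for illustration.

```python
def slope(xs, ys):
    """Least squares slope: sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


clean_x = [1, 2, 3, 4, 5]
clean_y = [3, 4, 2, 4, 5]

# Hypothetical outlier appended to the same data.
outlier_x = clean_x + [6]
outlier_y = clean_y + [20]

print(round(slope(clean_x, clean_y), 2))      # 0.4
print(round(slope(outlier_x, outlier_y), 2))  # 2.57
```

One extra point is enough to move the slope from 0.4 to about 2.57, which is the kind of disruption described above.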
So this is about data quality. Next is computational complexity. Linear regression is often not computationally expensive compared to a decision tree or a clustering algorithm. The order of complexity for n training examples and x features usually falls in either O(x²) or O(xn). Next is comprehensible and transparent: linear regression models are easily comprehensible and transparent in nature. They can be represented by simple mathematical notation to anyone and can be understood very easily. So these are some of the criteria based on which you will select the linear regression algorithm.
Next is where linear regression is used. First is evaluating trends and sales estimates. Linear regression can be used in business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, then conducting a linear analysis on the sales data, with monthly sales on the y-axis and time on the x-axis, will give you a line that depicts the upward trend in sales. After creating the trend line, the company could use the slope of the line to forecast sales in future months. Next is analyzing the impact of price changes. Linear regression can be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price of a certain product several times, it can record the quantity sold at each price level and then perform a linear regression with sold quantity as the dependent variable and price as the independent variable. This would result in a line that depicts the extent to which customers reduce their consumption of the product as the price increases. This result would help us in future pricing decisions. Next is assessment of risk in the financial services and insurance domain.
Linear regression can be used to analyze risk. For example, a health insurance company might conduct a linear regression analysis. How can it do that? By plotting the number of claims per customer against age, and it might discover that older customers tend to make more health insurance claims. The result of such an analysis might guide important business decisions. All right, so by now you have a rough idea of what the linear regression algorithm is like: what it does, where it is used, and when you should use it. Now, let's move on and understand the algorithm in depth.
So suppose you have the independent variable on the x-axis and the dependent variable on the y-axis. Suppose these are the data points: the independent variable is increasing on the x-axis, and so is the dependent variable on the y-axis. What kind of linear regression line would you get? You would get a positive linear regression line, as the slope would be positive. Next, suppose you have an independent variable on the x-axis which is increasing, and on the other hand the dependent variable on the y-axis is decreasing. What kind of line will you get in that case? You will get a negative regression line, as the slope of the line is negative. This particular line, y = mx + c, is the line of linear regression, which shows the relationship between the independent variable and the dependent variable. Okay? So let's add some data points to our graph.
So these are some observations or data points on our graph. Let's plot some more. Okay, now all our data points are plotted, and our task is to create a regression line, or the best fit line. Once our regression line is drawn, it's time for prediction. Suppose this is our estimated or predicted value and this is our actual value. Our main goal is to reduce this error, that is, to reduce the distance between the estimated or predicted value and the actual value. The best fit line would be the one which has the least error, or the least difference between the estimated value and the actual value. In other words, we have to minimize the error. This was a brief understanding of the linear regression algorithm; soon we'll jump to the mathematical implementation.
But before that, let me tell you this. Suppose you draw a graph with speed on the x-axis and distance covered on the y-axis, with the time remaining constant. If you plot a graph between the speed of the vehicle and the distance traveled in a fixed unit of time, then you will get a positive relationship. So suppose the equation of the line is y = mx + c. Then in this case y is the distance traveled in a fixed duration of time, x is the speed of the vehicle, m is the positive slope of the line, and c is the y-intercept of the line. Now suppose the distance remains constant, and you have to plot a graph between the speed of the vehicle and the time taken to travel a fixed distance. In that case you will get a negative relationship, which we describe here with a line of negative slope. The equation of the line changes to y = -mx + c, where y is the time taken to travel a fixed distance, x is the speed of the vehicle, -m is the negative slope of the line, and c is the y-intercept of the line. Now, let's get back to our independent and dependent variables. In those terms, y is our dependent variable and x is our independent variable.
Now, let's move on and see the mathematical implementation. We have x = 1, 2, 3, 4, 5; let's plot them on the x-axis, so 0, 1, 2, 3, 4, 5, 6. And we have y as 3, 4, 2, 4, 5, so let's plot 1, 2, 3, 4, 5 on the y-axis. Now let's plot our coordinates one by one. With x = 1 and y = 3, we have the point (1, 3). Similarly we have (1, 3), (2, 4), (3, 2), (4, 4), and (5, 5). Moving on ahead, let's calculate the mean of x and y and plot it on the graph. The mean of x is (1 + 2 + 3 + 4 + 5) divided by 5, that is 3. Similarly, the mean of y is (3 + 4 + 2 + 4 + 5), that is 18, divided by 5, which is nothing but 3.6. Next we'll plot our mean, that is (3, 3.6), on the graph.
Our goal is to find the best fit line using the least squares method. In order to do that, we first need to find the equation of the line, so let's find the equation of our regression line. Let's suppose this is our regression line, y = mx + c. Now we have an equation of a line, so all we need to do is find the values of m and c, where m equals the summation of (x - x bar) times (y - y bar), upon the summation of (x - x bar) squared. Don't get confused; let me work it out for you. As part of the formula, we are going to calculate x - x bar. We have x as 1 and x bar as 3, so 1 - 3, that is -2. Next we have x equal to 2 minus its mean 3, that is -1. Similarly, we have 3 - 3 = 0, 4 - 3 = 1, and 5 - 3 = 2. So x - x bar is nothing but the distance of each point from the line x = 3. And what does y - y bar imply? It is the distance of each point from the line y = 3.6. So let's calculate the value of y - y bar, starting with y = 3 and the value of y bar, that is 3.6.
So it is 3 - 3.6, that is -0.6. Next is 4 - 3.6, that is 0.4. Next, 2 - 3.6, that is -1.6. Next is 4 - 3.6, that is 0.4 again, and 5 - 3.6, that is 1.4. Now we are done with y - y bar. Next we will calculate (x - x bar) squared. So it is (-2) squared, that is 4; (-1) squared, that is 1; 0 squared is 0; 1 squared is 1; 2 squared is 4. So now in our table we have x - x bar, y - y bar, and (x - x bar) squared. Now we need the product of (x - x bar) times (y - y bar). So let's see: -2 times -0.6, that is 1.2; -1 times 0.4, that is -0.4; 0 times -1.6, that is 0; 1 multiplied by 0.4, that is 0.4.
And next, 2 multiplied by 1.4, that is 2.8. Now almost all the parts of our formula are done, so what we need to do is get the summation of the last two columns. The summation of (x - x bar) squared is 10, and the summation of (x - x bar) times (y - y bar) is 4, so the value of m will be equal to 4 by 10, that is 0.4. So let's put this value of m = 0.4 into our line y = mx + c, fill in the mean point, and find the value of c. We have y as 3.6, remember the mean of y; m as 0.4, which we calculated just now; and x as the mean value of x, that is 3. So the equation is 3.6 = 0.4 times 3 + c, that is 3.6 = 1.2 + c. So what is the value of c? It is 3.6 - 1.2, that is 2.4. So we have m = 0.4 and c = 2.4, and finally the equation of the regression line is y = 0.4x + 2.4. So that is the regression line, and these plotted points are your actual points.
Now, for the given m = 0.4 and c = 2.4, let's predict the value of y for x = 1, 2, 3, 4, and 5. When x = 1, the predicted value of y will be 0.4 times 1 plus 2.4, that is 2.8. Similarly, when x = 2, the predicted value of y will be 0.4 times 2 plus 2.4, which equals 3.2. Similarly, for x = 3, y will be 3.6; for x = 4, y will be 4.0; and for x = 5, y will be 4.4. So let's plot them on the graph: the line passing through all these predicted points and cutting the y-axis at 2.4 is the line of regression. Now your task is to calculate the distance between the actual and the predicted values, and your job is to reduce that distance. In other words, you have to reduce the error between the actual and the predicted values.
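The hand calculation above can be checked with a short Python sketch. It uses plain Python rather than any library, and the function name fit_line is just a label chosen here for illustration.

```python
# Worked example from the session: x = 1..5, y = 3, 4, 2, 4, 5.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]


def fit_line(xs, ys):
    """Return slope m and intercept c of the least squares line y = m*x + c."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of (x - x_bar) * (y - y_bar)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of (x - x_bar) squared
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    # The line passes through the mean point (x_bar, y_bar).
    c = y_bar - m * x_bar
    return m, c


m, c = fit_line(xs, ys)
predictions = [m * x + c for x in xs]
print(round(m, 2), round(c, 2))            # 0.4 2.4
print([round(p, 2) for p in predictions])  # [2.8, 3.2, 3.6, 4.0, 4.4]
```

The rounding is only there to hide tiny floating point differences; the values match the table worked out by hand.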
The line with the least error will be the line of linear regression, or regression line, and it will also be the best fit line. So this is how things work in a computer: it performs a number of iterations for different values of m. For each value of m, it will calculate the equation of the line y = mx + c. As the value of m changes, the line changes. The iterations start from one, and after every iteration it will calculate the predicted values according to the line and compare the distance of the actual values to the predicted values. The value of m for which the distance between the actual and the predicted values is minimum will be selected for the best fit line.
Now that we have calculated the best fit line, it's time to check the goodness of fit, or how well the model is performing. In order to do that, we have a method called the R-squared method. So what is R-squared? The R-squared value is a statistical measure of how close the data are to the fitted regression line. In general, a model with a high R-squared value is considered a good model, but you can also have a low R-squared value for a good model, or a high R-squared value for a model that does not fit at all. It is also known as the coefficient of determination, or the coefficient of multiple determination. Let's move on and see how R-squared is calculated. These are our actual values plotted on the graph. We had calculated the predicted values of y as 2.8, 3.2, 3.6, 4.0, and 4.4. Remember, we calculated the predicted values of y from the equation y predicted = 0.4x + 2.4 for every x = 1, 2, 3, 4, and 5; from there we got the predicted values of y. So let's plot them on the graph. These are the points, and the line passing through these points is nothing but the regression line.
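The idea of trying different values of m and keeping the one with the smallest distance between actual and predicted values can be sketched as a simple search. This is only an illustration of that idea, not the exact procedure any particular library uses: the grid of candidate slopes is hypothetical, and each candidate line is assumed to pass through the mean point so that c follows from m.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)


def squared_error(m):
    """Sum of squared distances between actual and predicted y for slope m."""
    # Assumption: every candidate line passes through the mean point,
    # so the intercept follows from the slope.
    c = y_bar - m * x_bar
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical grid of candidate slopes: 0.0, 0.1, ..., 1.0
candidates = [i / 10 for i in range(11)]
best_m = min(candidates, key=squared_error)
print(best_m, round(squared_error(best_m), 2))  # 0.4 3.6
```

The search lands on the same slope, 0.4, that the least squares formula gave for this data.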
Now, what you need to do is compare the distance of actual minus mean versus the distance of predicted minus mean. Basically, you are comparing the distance of the actual values from the mean to the distance of the predicted values from the mean. That is nothing but R-squared. Mathematically, you can represent R-squared as the summation of (yp - y bar) squared, divided by the summation of (y - y bar) squared, where y is the actual value, yp is the predicted value, and y bar is the mean value of y, which is nothing but 3.6. Remember, this is our formula. Next we'll calculate y - y bar. We have y as 3 and y bar as 3.6, so we calculate 3 - 3.6, which is -0.6. Similarly, for y = 4 and y bar = 3.6, we have y - y bar as 0.4. Then 2 - 3.6 is -1.6, 4 - 3.6 is again 0.4, and 5 - 3.6 is 1.4. So we have the values of y - y bar. Now we have to take their squares: (-0.6) squared is 0.36, 0.4 squared is 0.16, (-1.6) squared is 2.56, 0.4 squared is 0.16, and 1.4 squared is 1.96. As part of the formula, we also need our yp - y bar values. These are the yp values, and we have to subtract the mean from them. So 2.8 - 3.6 is -0.8. Similarly, 3.2 - 3.6 is -0.4, 3.6 - 3.6 is 0, 4.0 - 3.6 is 0.4, and 4.4 - 3.6 is 0.8.
So we have calculated the values of yp - y bar; now it's time to calculate (yp - y bar) squared. We have (-0.8) squared as 0.64, (-0.4) squared as 0.16, 0 squared as 0, 0.4 squared as again 0.16, and 0.8 squared as 0.64. Now the formula tells us to take the summation of (yp - y bar) squared and the summation of (y - y bar) squared. On summing (y - y bar) squared you get 5.2, and on summing (yp - y bar) squared you get 1.6. So the value of R-squared can be calculated as 1.6 upon 5.2, and the result is approximately 0.3. Well, this is not a good fit; it suggests that the data points are far away from the regression line. This is how your graph will look when R-squared is 0.3. When you increase the value of R-squared to 0.7, you'll see that the actual values lie closer to the regression line. When it reaches 0.9, they come closer still, and when the value is approximately equal to 1, the actual values lie on the regression line itself. On the other hand, if you get a very low value of R-squared, suppose 0.02, you'll see that the actual values are very far away from the regression line, or you can say that there are too many outliers in your data. You cannot forecast anything from the data.
So this was all about the calculation of R-squared. Now you might ask: are low values of R-squared always bad? Well, in some fields it is entirely expected that the R-squared value will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than around 50%, from which you can conclude that humans are simply harder to predict than physical processes.
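The R-squared calculation worked through above, the sum of (yp - y bar) squared divided by the sum of (y - y bar) squared, can be reproduced with a small Python sketch. The variable names are chosen here for illustration.

```python
ys = [3, 4, 2, 4, 5]                   # actual values
predicted = [2.8, 3.2, 3.6, 4.0, 4.4]  # from y = 0.4x + 2.4
y_bar = sum(ys) / len(ys)              # mean of y, 3.6

# Sum of squared distances of the predicted values from the mean
ss_regression = sum((yp - y_bar) ** 2 for yp in predicted)
# Sum of squared distances of the actual values from the mean
ss_total = sum((y - y_bar) ** 2 for y in ys)

r_squared = ss_regression / ss_total
print(round(ss_regression, 2), round(ss_total, 2), round(r_squared, 2))  # 1.6 5.2 0.31
```

The unrounded value is about 0.308, which is the "approximately 0.3" quoted in the session.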
If your R-squared value is low but you also have statistically significant predictors, then you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R-squared, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding the other predictors in the model constant. Obviously, this type of information can be extremely valuable.
All right, so this was all about the theoretical concepts. Now let's move on to the coding part and understand the code in depth. For implementing linear regression using Python, I will be using Anaconda with Jupyter Notebook installed on it. So this is the Jupyter notebook, and we are using Python 3.0 on it. We are going to use a data set consisting of the head size and brain weight of different people.
All right, so let's import our data set. We have %matplotlib inline; we are importing numpy as np, pandas as pd, and from matplotlib we are importing pyplot as plt. Next we will import our data, headbrain.csv, and store it in the data variable. Let's hit the Run button and see the output. This asterisk symbol indicates that it is still executing. So here is the output: our data set consists of 237 rows and 4 columns. We have the columns gender, age range, head size in cubic centimeters, and brain weight in grams. So that is how our sample data set looks.
Now that we have imported our data, as you can see there are 237 values in the training set, so we can find a linear relationship between the head size and the brain weight. What we'll do now is collect X and Y: X will consist of the head size values and Y will consist of the brain weight values. So, collecting X and Y, let's hit Run. Done. Next we need to find the values of b1 and b0, or you can say m and c. For that we'll need the mean of the X and Y values, so first of all we'll calculate those: mean_x equals np.mean(X), where mean is a predefined function of NumPy, and similarly mean_y equals np.mean(Y), which returns the mean of the Y values. Next we'll get the total number of values, so m equals the length of X. Then we'll use the formula to calculate the values of b1 and b0, or m and c. Let's hit the Run button and see the result. As you can see here on the screen, we have got b1 as 0.263 and b0 as 325.57. So now we have our coefficients. Comparing with the equation y = mx + c, you can say that brain weight equals 0.263 times head size plus 325.57, so the value of m here is 0.263 and the value of c here is 325.57.
All right, so that's our linear model. Now let's plot it and see it graphically. Let's execute it. So this is how our plot looks. This model is not so bad, but we need to find out how good our model is. There are many methods to do that, like the root mean square error method and the coefficient of determination, or R-squared method. In this tutorial I have told you about the R-squared method, so let's focus on that and see how good our model is. Let's calculate the R-squared value. Here ss_t is the total sum of squares and ss_r is the sum of squares of residuals, and the formula for R-squared is 1 minus the sum of squares of residuals upon the total sum of squares. When you execute it, you will get the value of R-squared as 0.63, which is pretty good. Now that you have implemented a simple linear regression model using the least squares method, let's move on and see how you will implement the model using a machine learning library called scikit-learn. scikit-learn is a simple machine learning library in Python, and building machine learning models is very easy with it. So suppose this is the Python code: using the scikit-learn library, your code shortens to this length. Let's hit the Run button, and you will see you get the same R2 score as well. Well, this was all for today's discussion.
Most of the entities in this world are related in one way or another, and at times finding the relationships between entities can help you take valuable business decisions. Today I'm going to talk about logistic regression, which is one such approach towards predicting relationships. Now, let us see what we are going to cover in today's training. We'll start off the session by getting a quick introduction to what regression is. Then we'll see the different types of regression, and we'll discuss the what and why of logistic regression. So in this part, we'll discuss what exactly it is.
How it is used, why it is used, and all those things. Moving ahead, we'll compare linear regression versus logistic regression along with various real-life use cases, and finally, towards the end, I will practically implement the logistic regression algorithm. So let's quickly start off with the very first topic: what is regression? Regression analysis is a predictive modeling technique, so it always involves predictions. In this session, we'll just talk about predictive analysis and not prescriptive analysis. Why? Because for prescriptive analysis you need to have a good base and a stronghold on the predictive part first. Now, regression estimates the relationship between a dependent variable and an independent variable. For those of you who are not aware of these terminologies, let me give you a quick summary. A dependent variable is nothing but a variable which you want to predict. Let's say I want to know what the sales will be on the 26th of this month. Sales becomes the dependent variable, or you can say the target variable. Now this dependent or target variable is going to depend on a lot of factors: the number of products you have sold to date, what the season is, the availability of the product, the product quality, and so on. These are the never-ending factors, which are nothing but the different features that lead to a sale. These variables are called independent variables, or you can say predictors.
Now if you look at the graph over here, we have some values of x and some values of y. As you can see, if x increases, the value of y also increases. Let me explain this with an example. Let's say we have values of x up to 6.75, and somebody asks you what the value of y is when the value of x is 7. The way you can do it, or how regression comes into the picture, is by fitting a straight line through all these points and getting the values of m and c.
So this is a straight line, guys, and the formula for the straight line is y = mx + c. Using this we can try to predict the value of y. Here, if you notice, the x variable can increase as much as it can, but the y variable will increase according to x, so y is basically dependent on your x variable. For any arbitrary value of x you can predict the value of y, and this is always done through regression. So that is how regression is useful. Now, regression is basically classified into three types: linear regression, logistic regression, and polynomial regression. Today we will be discussing logistic regression, so let's move forward and understand the what and why of logistic regression.
This algorithm is most widely used when the dependent variable, or you can say the output, is in a binary format. Here you need to predict the outcome of a categorical dependent variable, so the outcome should always be discrete or categorical in nature. By discrete, I mean the value should be binary, or you can say you just have two values: it can be either 0 or 1, yes or no, true or false, high or low. Only these can be the outcomes, so the value which you need to predict should be discrete, or you can say categorical, in nature. Whereas in linear regression, the value of y, or the value you need to predict, lies in a range. That is the difference between linear regression and logistic regression. You must be having a question: why not linear regression? Now guys, in linear regression the value of y, or the value which you need to predict, is in a range, but in our case, in logistic regression, we just have two values: it can be either 0 or 1. It should not entertain values which are below zero or above one.
But in linear regression, we have the value of y in a range, so in order to implement logistic regression we need to clip this part: we don't need the values that are below zero or above one. Since the value of y will only be between 0 and 1, which is the main rule of logistic regression, the linear line has to be clipped at 0 and 1. Once we clip this graph, it would look somewhat like this. Here you're getting a curve which is nothing but three different straight lines, and this cannot be formulated into a single equation. So we need a new way to solve this problem, one that can be formulated into an equation, and hence we come up with logistic regression. Here the outcome is either 0 or 1, which is the main rule of logistic regression, so our main aim to bring the values to 0 and 1 is fulfilled.
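To see why the clipped line ends up as three straight pieces, here is a minimal sketch. The slope and intercept are hypothetical values picked only for illustration, and clipped_line is a name made up here.

```python
def clipped_line(x, m, c):
    """Straight line m*x + c, clipped so the output stays between 0 and 1."""
    return min(1.0, max(0.0, m * x + c))


# Hypothetical slope and intercept, chosen only for illustration.
m, c = 0.5, 0.5
outputs = [clipped_line(x, m, c) for x in [-3, -1, 0, 1, 3]]
print(outputs)  # [0.0, 0.0, 0.5, 1.0, 1.0]
```

The output is flat at 0 on the left, follows the line in the middle, and is flat at 1 on the right, which is the three-segment shape described above.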
So that is how we came up with logistic regression. Once it gets formulated into an equation, it looks somewhat like this. So guys, this is nothing but an S-curve, or you can say the sigmoid curve, the curve of the sigmoid function. This sigmoid function basically converts any value from minus infinity to infinity into a value between 0 and 1, which leads to the discrete values that logistic regression wants, or you can say the values which are in binary format, either 0 or 1. If you see here, the values are either 0 or 1, and this is nothing but just a transition between them. But guys, there's a catch over here. Let's say I have a data point that is 0.8. How can you decide whether your value is 0 or 1? Here you have the concept of a threshold, which basically divides your line. The threshold value basically indicates the probability of either winning or losing; by winning I mean the value is equal to 1, and by losing I mean the value is equal to 0. But how does it do that? Let's take a data point over here; let's say my cursor is at 0.8. Here I check whether this value is less than the threshold value or not. If it is more than the threshold value, it should give me the result as 1; if it is less than that, it should give me the result as 0. Here my threshold value is 0.5. I need to define that if my value, let's say 0.8, is more than 0.5, then the value shall be rounded up to 1, and if it is less than 0.5, let's say I have a value of 0.2, then it should be reduced to 0. So you can use the concept of a threshold value to find the output, which should be discrete: it should be either 0 or 1. I hope you caught this curve of logistic regression.
So guys, this is the sigmoid S-curve. To make this curve we need an equation, so let me address that part as well. Let's see how an equation is formed to imitate this functionality. Over here, we have the equation of a straight line, which is y = mx + c.
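The sigmoid curve and the threshold rule described above can be sketched as follows. The function names are made up for illustration, and treating a value exactly equal to the threshold as 1 is an assumption, since the session only covers values above and below 0.5.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1 (the S-curve)."""
    return 1 / (1 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn a value between 0 and 1 into a discrete 0 or 1."""
    # Assumption: a value exactly at the threshold counts as 1.
    return 1 if probability >= threshold else 0


print(sigmoid(0))     # 0.5
print(classify(0.8))  # 1
print(classify(0.2))  # 0
```

Large negative inputs give values close to 0 and large positive inputs give values close to 1, so the threshold step is what finally produces the binary output.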
So in the straight-line equation y = mx + c, I have only one independent variable, but if we have many independent variables, the equation becomes m1x1 + m2x2 + m3x3 and so on till mnxn. Now let us write the coefficients as b. So the equation becomes y = b1x1 + b2x2 + b3x3 and so on till bnxn, plus c. So guys, the equation of a straight line has a range from minus infinity to infinity, but in our case, or you can say in the logistic equation, the value we need to predict, the y value, can only range from 0 to 1. So we need to transform this equation. To do that, we divide y by 1 minus y. If y is equal to 0, we get 0 over 1 minus 0, which is 0 over 1, which is again 0. If we take y equal to 1, we get 1 over 1 minus 1, which is 1 over 0, which is infinity. So my range is now between 0 and infinity. But again, we want the range from minus infinity to infinity, so what we'll do is take the log of this expression. So we transform it further to get the range between minus infinity and infinity: over here we have log of y over 1 minus y, that is log(y / (1 - y)) = b1x1 + b2x2 + ... + bnxn + c, and this is your final logistic regression equation. So guys, don't worry. You don't have to write this formula or memorize it. In Python, you just need to call this function, which is logistic regression, and everything will be done automatically for you. So I don't want to scare you with the maths and the formulas behind it, but it is always good to know how this formula was generated. So I hope you guys are clear on how logistic regression comes into the picture. Next, let us see the major differences between linear regression and logistic regression. First of all, in linear regression we have the value of y as a continuous variable, or you can say the variables we need to predict are continuous in nature. Whereas in logistic regression,
we have a categorical variable, so here the value which you need to predict should be discrete in nature. It should be either 0 or 1, or it should have just two values. For example, whether it is raining or not raining, whether it is humid outside or not humid outside, whether it is going to snow or not going to snow. So these are a few examples where the values we need to predict are discrete, or you can say we just predict whether something is happening or not. Next, linear regression solves your regression problems. So here you have the concept of an independent variable and a dependent variable, and you can calculate the value of y, which you need to predict, using the value of x. So here your y variable, or you can say the value that you need to predict, lies in a range. Whereas in logistic regression, you have discrete values. So logistic regression basically solves a classification problem: it can classify, and it gives you the result of whether an event is happening or not. So I hope it is pretty much clear till now. Next, in linear regression the graph that you have seen is a straight-line graph, so over here you can calculate the value of y with respect to the value of x, whereas in logistic regression the graph that we got was an S curve, or you can say the sigmoid curve. So using the sigmoid function you can predict your y values. So I hope you guys are clear on the differences between linear regression and logistic regression. Moving ahead, let's see the various use cases where logistic regression is implemented in real life. The very first is weather prediction. Logistic regression helps you to predict your weather. For example, it is used to predict whether it is raining or not, whether it is sunny, whether it is cloudy or not. All these things can be predicted using logistic regression. You need to keep in mind that both linear regression and logistic regression can be used in predicting the weather. In that case, linear regression helps you to predict what the temperature will be tomorrow, whereas logistic regression will only tell you whether it is going to rain or not, whether it is cloudy or not, whether it is going to snow or not. So these values are discrete.
If you apply linear regression, on the other hand, you will be predicting things like what the temperature will be tomorrow, or the day after tomorrow, and all those things. So these are the slight differences between linear regression and logistic regression. Moving ahead, we have classification problems. Logistic regression can perform multi-class classification, so it can help you tell whether it's a bird or not a bird. Then you can classify different kinds of mammals, let's say whether it's a dog or not a dog. Similarly, you can check for reptiles, whether it's a reptile or not a reptile. So logistic regression can perform multi-class classification, and this point I've already discussed, that it is used in classification problems. Next, it also helps you to determine illness.
So let me take an example. Let's say a patient goes for a routine check-up in a hospital. What the doctor will do is perform various tests on the patient and check whether the patient is actually ill or not. So what will the features be? The doctor can check the sugar level and the blood pressure, then the age of the patient, whether it is a very young or an old person, then the previous medical history of the patient, and all of these features will be recorded by the doctor. Finally, the doctor checks the patient data and determines the outcome of the illness and the severity of the illness. So using all the data, a doctor can identify whether a patient is ill or not. So these are the various use cases in which you can use logistic regression. Now, I guess that is enough of the theory part, so let's move ahead and see some of the practical implementation of logistic regression. Over here, I'll be implementing two projects. In the first I have the data set of the Titanic, so over here we'll predict what factors made people more likely to survive the sinking of the Titanic ship, and in my second project we'll do data analysis on SUV cars. Over here we have data on who can purchase an SUV and what factors made people more interested in buying one. So these will be the major questions as to why you should implement logistic regression and what output you will get from it. So let's start with the very first project, that is the Titanic data analysis. Some of you might know that there was a ship called the Titanic, which hit an iceberg and sank to the bottom of the ocean, and it was a big disaster at that time because it was the first voyage of the ship. It was supposed to be really, really strongly built and one of the best ships of that time, so it was a big disaster. And of course there is a movie about this as well, so many of you might have watched it. So what we have is data of the passengers, those who survived and those
who did not survive in this particular tragedy. So what you have to do is look at this data and analyze which factors would have contributed the most to the chances of a person's survival on the ship. So using logistic regression, we can predict whether the person survived or the person died. Apart from this, we'll also have a look at the various features. So first let us explore the data set. Over here, we have the index value, then the first column is PassengerId, then my next column is Survived. Over here we have two values, a 0 and a 1: 0 stands for did not survive and 1 stands for survived. So this column is categorical, where the values are discrete. Next, we have the passenger class, Pclass. Over here we have three values, 1, 2 and 3, so this basically tells you whether a passenger is travelling in first class, second class or third class. Then we have the name of the passenger. We have the Sex, or you can say the gender of the passenger, whether the passenger is male or female. Then we have the Age. We have SibSp, which basically means the number of siblings or spouses aboard the Titanic, so over here we have values such as 1, 0 and so on. Then we have Parch, which is basically the number of parents or children aboard the Titanic, so over here we also have some values. Then we have the ticket number. We have the fare. We have the cabin number, and we have the Embarked column. In my Embarked column, we have three values: S, C and Q. S basically stands for Southampton, C stands for Cherbourg and Q stands for Queenstown. So these are the features that we'll be applying our model on. Here we'll perform various steps and then we'll be implementing logistic regression. So now these are the various steps which are required to implement any algorithm.
So now in our case we are implementing logistic regression. The very first step is to collect your data, or to import the libraries that are used for collecting your data and then taking it forward. Then my second step is to analyze your data. Over here, I can go through the various fields and analyze the data. I can check: did the females or children survive better than the males? Did the rich passengers survive more than the poor passengers? Did money matter, as in, were those who paid more to get onto the ship evacuated first? And what about the workers: did the workers survive, or what is the survival rate if you were a worker on the ship and not just a travelling passenger? So all of these are very, very interesting questions, and you will be going through all of them one by one. So in this stage, you need to analyze and explore your data as much as you can. Then the third step is to wrangle your data. Data wrangling basically means cleaning your data, so over here you can simply remove the unnecessary items, or if you have null values in the data set, you can just clear that data and then take it forward. Then in the fourth step you build your model using the train data and test it using the test data, so over here you will be performing a split, which basically splits your data set into training and testing data sets. And finally, in the fifth step, you check the accuracy, so as to ensure how accurate your values are. So I hope you guys got these five steps that you're going to implement in logistic regression. So now let's go into all these steps in detail. Number one: we have to collect your data, or you can say import the libraries.
So let me show you the implementation part as well. I'll just open my Jupyter notebook and implement all of these steps side by side. So guys, this is my Jupyter notebook. First, let me just rename the Jupyter notebook to, let's say, Titanic data analysis. Now, our first step was to import all the libraries and collect the data. So let me just import all the libraries first. First of all, I'll import pandas. Pandas is used for data analysis, so I'll say import pandas as pd. Then I will be importing numpy, so I'll say import numpy as np. Numpy is a library in Python which basically stands for numerical Python, and it is widely used to perform scientific computation. Next, we will be importing seaborn. Seaborn is a library for statistical plotting, so I'll say import seaborn as sns. I'll also import matplotlib. The matplotlib library is again for plotting, so I'll say import matplotlib.pyplot as plt. Now, to show the plots from this library in the Jupyter notebook, all I have to write is %matplotlib inline. Next, I will be importing one more module, so as to calculate the basic mathematical functions, so I'll say import math. So these are the libraries that I will be needing in this Titanic data analysis. So now let me just import my data set. I will take a variable, let's say titanic_data, and using pandas I will just read my CSV, or you can say the data set. I'll write the name of my data set, that is Titanic.csv. Now, I have already shown you the data set, so over here let me just print the top 10 rows. For that, I will take the variable and say titanic_data.head, and I'll ask for the top ten rows. So now I'll just run this. To run this cell, I have to press Shift + Enter, or else I can directly click on Run for this cell. So over here I have the index. We have the PassengerId, which is nothing but again an index, starting from 1. Then we have the Survived column, which has categorical values, or you can say discrete values, in the form of 0 or 1.
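The loading step described above might look like the sketch below. The real notebook reads Titanic.csv from disk; here a tiny made-up CSV held in memory stands in for that file so the snippet runs anywhere, and the three sample rows are invented purely for illustration.

```python
import io

import pandas as pd

# In the notebook this would be: titanic_data = pd.read_csv("Titanic.csv")
# A tiny made-up CSV stands in for the file here (assumption, not real data).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
    "3,1,3,Passenger C,female,26,0,0,T3,7.92,,S\n"
)
titanic_data = pd.read_csv(sample_csv)

# head(10) returns the first ten rows (or fewer if the frame is shorter).
print(titanic_data.head(10))
```

In a Jupyter notebook the line %matplotlib inline goes in its own cell alongside the imports; it is a notebook command rather than plain Python, so it is left out of the sketch.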
Then we have the passenger class. We have the name of the passenger, sex, age and so on. So this is the data set that I will be going forward with. Next, let us print the number of passengers which are there in this original data set. For that, I'll just simply type in print, and I'll write the label "number of passengers". Using the len function, I can calculate the total length, so I'll say len, and inside this I will be passing the variable titanic_data, so I'll just copy it from here and paste it, then .index, and next let me just print this. So here the number of passengers in the original data set is 891, so around this number were travelling on the Titanic ship. So over here, my first step is done: we have collected the data, imported all the libraries and found out the total number of passengers on the Titanic. So now let me just go back to the presentation and see what my next step is. We're done with collecting data. The next step is to analyze your data, so over here we will be creating different plots to check the relationship between variables, as in how one variable is affecting the other. You can simply explore your data set by making use of various columns and then plot a graph between them. So you can plot a correlation graph, for example.
Or you can plot a distribution curve; it's up to you guys. So let me just go back to my Jupyter notebook and analyze some of the data. Over here, my second part is to analyze data, so I'll put this in as a Header 2. To do that, I just have to go to Code, click on Markdown, and run this. So first let us plot a count plot, where you can compare the passengers who survived and who did not survive. For that I will be using the seaborn library. Over here I have imported seaborn as sns, so I don't have to write the whole name. I'll simply say sns.countplot, I'll set x as Survived, and the data that I'll be using is titanic_data, or you can say the name of the variable in which you have stored your data set. So now let me just run this. Over here, as you can see, I have the Survived column on my x-axis, and on the y-axis I have the count. 0 basically stands for did not survive and 1 stands for the passengers who did survive. So over here you can see that around 550 passengers did not survive and only around 350 passengers survived, so you can basically conclude that there are far fewer survivors than non-survivors. So this was the very first plot. Now let us draw another plot to compare the sex, as in, out of all the passengers who survived and who did not survive, how many were men and how many were women. To do that, I'll simply say sns.countplot and add the hue as Sex, since I want to know how many females and how many males survived. Then I'll be specifying the data, so I'm using titanic_data, and let me just run this. I have made a mistake over here, so let me fix that. So over here you can see I have the Survived column on the x-axis and the count on the y-axis. Here the blue colour stands for your male passengers and orange stands for your female passengers. So as you can see here, for the passengers who did not survive, that is the value 0, we can see that
the majority of males did not survive, and if we see the people who survived, we can see that the majority of females survived. So this basically tells us how survival depended on gender. It appears that on average women were more than three times more likely to survive than men. Next, let us draw another plot where we have the hue as the passenger class, so over here we can see which class the passenger was travelling in, whether it was class one, two or three. For that I'll just write the same command. I'll say sns.countplot, I'll keep my x-axis as Survived, and I'll change my hue to the passenger class, so my variable is named Pclass, and the data set that I'll be using is titanic_data. So this is my result. Over here you can see I have blue for first class, orange for second class and green for third class. Here the passengers who did not survive were mostly from third class, or you can say the lowest class, the cheapest class to get onto the Titanic, and the people who did survive mostly belonged to the higher classes. So among the survivors, classes 1 and 2 come out higher than the passengers who were travelling in third class.
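The three count plots above can be sketched as follows. The numbers a count plot draws are just group sizes, so the sketch first computes them with pandas on a handful of made-up rows (a stand-in for the 891-row titanic_data), and then wraps the seaborn calls in a helper. The helper name draw_count_plots is hypothetical, and it assumes seaborn and matplotlib are installed.

```python
import pandas as pd

# Made-up rows standing in for titanic_data (assumption, not real data).
titanic_data = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "Sex": ["male", "male", "male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 3, 2, 1, 2, 3, 1, 3],
})

# Bar heights of the plain count plot: rows per Survived value.
survived_counts = titanic_data["Survived"].value_counts().sort_index()
# Bar heights when a hue column splits each Survived group.
by_sex = pd.crosstab(titanic_data["Survived"], titanic_data["Sex"])
by_class = pd.crosstab(titanic_data["Survived"], titanic_data["Pclass"])
print(survived_counts, by_sex, by_class, sep="\n\n")

def draw_count_plots(data):
    """Draw the three count plots from the walkthrough (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    for hue in (None, "Sex", "Pclass"):
        plt.figure()
        sns.countplot(x="Survived", hue=hue, data=data)
    plt.show()
```

In the notebook, calling draw_count_plots(titanic_data) on the real data set would produce the three figures discussed in this step.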
So here we have concluded that the passengers who did not survive were mostly from third class, or you can say the lowest class, and the passengers who were travelling in first and second class tended to survive more. Next, I'll just plot a graph for the age distribution. Over here I can simply use my data, and we'll be using the pandas library for this. I'll take titanic_data and pass in the column, that is Age. Then I'll say plot, and I want a histogram, so I'll say plot.hist(). So you can notice over here that we have more young passengers, or you can say children between the ages of 0 and 10, then we have the average-aged people, and as you go further the population gets smaller. So this is the analysis on the Age column: we saw that we have more young passengers and more middle-aged passengers travelling on the Titanic. Next, let me plot a graph of the fare as well. So I'll say titanic_data, I'll pass in Fare, and again I want a histogram, so I'll say hist. Here you can see the fares are mostly between 0 and 100. Now let me add the bin size, so as to make it more clear. Over here I'll say bins equals, let's say, 20, and I'll increase the figure size as well, so I'll say figsize, and let's say I'll give the dimensions as 10 by 5. And it should be bins. So this is more clear now. Next, let us analyze the other columns as well. I'll just type in titanic_data.info(), since I want the information as to what columns are left. So here we have PassengerId, which I guess is of no use. Then we saw how many passengers survived and how many did not. We also saw the analysis on the basis of gender, whether the females tended to survive more or the males tended to survive more. Then we saw the passenger class, whether the passenger was travelling in first class, second class or third class. Then we have the name, and on the name we cannot do any analysis. We saw the sex, and we saw the age as well. Then we have SibSp, which we'll look at next.
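The histogram and info steps above can be sketched like this. The eight rows are made up to stand in for titanic_data, numpy's histogram function is used only to show the bar heights a 20-bin fare histogram would have, and the helper name draw_histograms is hypothetical and assumes matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Made-up ages and fares standing in for titanic_data (assumption, not real data).
titanic_data = pd.DataFrame({
    "Age": [4.0, 8.0, 22.0, 26.0, 35.0, 38.0, 54.0, None],
    "Fare": [21.08, 16.70, 7.25, 7.92, 8.05, 71.28, 51.86, 8.46],
})

# Bar heights and bin edges of a 20-bin fare histogram.
counts, edges = np.histogram(titanic_data["Fare"], bins=20)
print(counts)

# Column names, dtypes and non-null counts, used to see which columns are left.
titanic_data.info()

def draw_histograms(data):
    """Plot the age and fare histograms (needs matplotlib)."""
    import matplotlib.pyplot as plt

    plt.figure()
    data["Age"].plot.hist()
    plt.figure()
    data["Fare"].plot.hist(bins=20, figsize=(10, 5))
    plt.show()
```

More bins give a finer picture of where the fares cluster, which is why the plot becomes clearer after setting bins to 20.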
SibSp stands for the number of siblings or spouses which are aboard the Titanic, so let us do this as well. I'll say sns.countplot, I'll mention x as SibSp, and I will be using titanic_data. So you can see the plot over here. You can conclude that it has the maximum value at zero, so most passengers had neither a sibling nor a spouse on board the Titanic. The second highest value is 1, and then we have various values for 2, 3, 4 and so on. Next we have Parch, which is the number of parents or children which were aboard the Titanic, so similarly you can do this as well. Then we have the ticket number, and I don't think any analysis is required for the ticket. Then we have the fare, which we have already discussed, as in the people travelling in first class would have paid the highest fare. Then we have the cabin number, and we have Embarked. So these are the columns that we'll be doing data wrangling on. So we have analyzed the data and we have seen quite a few graphs from which we can conclude which variable is better than another, or what the relationship between them is. My third step is data wrangling. Data wrangling basically means cleaning your data. If you have a large data set, you might have some null values, or you can say NaN values, so it's very important that you remove all the unnecessary items that are present in your data set, because this directly affects your accuracy. So I'll just go ahead and clean my data by removing all the NaN values and the unnecessary columns which have null values in the data set. So next we are performing data wrangling. First of all, I'll check whether my data set has null values or not. So I'll say titanic_data, which is the name of my data set, and I'll say isnull(). This will basically tell me which values are null, and it will return a Boolean result.
So this basically checks for missing data, and your result will be in Boolean format, as in the result will be True or False. False means the value is not null and True means it is null. So let me just run this. Over here you can see the values as False or True: False is where the value is not null and True is where the value is null. Over here you can see that in the Cabin column the very first value is null, so we have to do something about this. And you can see that we have a large data set.
So the listing goes on and on, and instead we can look at the sum of it. We can actually print the number of passengers who have a NaN value in each column. So I'll say titanic_data.isnull() and I want the sum of it, so I'll say .sum(). This will basically print the number of passengers who have NaN values in each column. So we can see that we have missing values in the Age column, that is 177. Then we have the maximum value in the Cabin column, and we have very few in the Embarked column, that is 2. Here, if you don't want to look at these numbers, you can also plot a heat map and then analyze it visually, so let me just do that as well. I'll say sns.heatmap, I'll pass in the null check, and I'll say yticklabels equals False. Let's just run this. As we have already seen, there were three columns in which missing values were present. So this one is Age, and over here almost 20% of the Age column has a missing value. Then we have the Cabin column, where this is quite a large share, and then we have two values for the Embarked column as well. I can also add a cmap for colour coding, so I'll pass a cmap. If I do this, the graph becomes more attractive. Over here yellow stands for True, or you can say the values that are null. So here we have found that we have missing values for Age, we have a lot of missing values in the Cabin column, and we have very few, which are not even visible, in the Embarked column as well. To remove these missing values, you can either replace the values and put in some dummy values, or you can simply drop the column. So here let us pick the Age column. First, let me just plot a box plot, and we'll analyze it with the Age column. So I'll say sns.boxplot, I'll say x equals the passenger class, so it's Pclass, I'll say y equals Age, and the data set that I'll be using is titanic_data, so I'll say data equals titanic_data. You can see that the ages in first class and second class tend to be older than what we have in third class.
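The missing-value checks above can be sketched as follows. The six rows are made up to stand in for titanic_data, the per-class median age is printed because it is the centre line of each box in the box plot, and the helper name draw_missing_value_plots is hypothetical. The colour map viridis is an assumption: the session only says a cmap is passed and that yellow marks the null values.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for titanic_data (assumption, not real data).
titanic_data = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age": [38.0, 54.0, 27.0, 22.0, np.nan, np.nan],
    "Cabin": ["C85", "E46", None, None, None, None],
    "Embarked": ["C", "S", "S", "S", "Q", None],
})

# isnull() gives True where a value is missing; sum() counts the Trues per column.
missing_per_column = titanic_data.isnull().sum()
print(missing_per_column)

# Median age per passenger class: the centre line of each box in the box plot.
median_age_by_class = titanic_data.groupby("Pclass")["Age"].median()
print(median_age_by_class)

def draw_missing_value_plots(data):
    """Draw the null-value heat map and the age box plot (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure()
    # cmap="viridis" is an assumed choice; it shows True (null) cells in yellow.
    sns.heatmap(data.isnull(), yticklabels=False, cmap="viridis")
    plt.figure()
    sns.boxplot(x="Pclass", y="Age", data=data)
    plt.show()
```

Comparing the per-column counts with the heat map is a quick way to confirm that both views agree on which columns need cleaning.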
Well, that depends on experience, how much you earn, or there might be any number of reasons. So here we concluded that passengers who were travelling in class one and class two tend to be older than those we have in class 3. So we have found that we have some missing values in this column. Now, one way is to just drop the column, or you can simply fill in some values for them; this method is called imputation. Now, to perform data wrangling, or cleaning, let us first print the head of the data set. So I'll say titanic_data.head, and let's say I just want five rows. So here we have Survived, which is again categorical, so on this particular column I can apply logistic regression. This can be my y value, or the value that you need to predict. Then we have the passenger class. We have the name. Then we have the ticket number and the cabin, and we have already seen what the Cabin column looks like.
We have a lot of null values there, or you can say NaN values, which is quite visible as well. So first of all, we'll just drop this column. For dropping it, I'll say titanic_data and simply type in drop and the column which I need to drop, so I have to drop the Cabin column. I'll mention axis equals 1, and I'll set inplace to True. So now again I'll just print the head and let us see whether this column has been removed from the data set or not. So I'll say titanic_data.head. As you can see here, we don't have the Cabin column anymore. Now, you can also drop the NA values. So I'll say titanic_data.dropna to drop all the NA values, or you can say NaN, which is not a number, and I will say inplace equals True. So over here, let me again plot the heat map and see whether the values which were showing a lot of nulls before have been removed or not. So I'll say sns.heatmap, I'll pass in the data set with the isnull check, I'll say yticklabels equals False, and I don't want the colour bar, so again I'll say False. This will basically help me to check whether my null values have been removed from the data set or not. As you can see here, I don't have any null values, so it's entirely black. Now, you can actually look at the sum as well. So I'll just go above, copy this part, and use the sum function to calculate the sum. So here it tells me that the data set is clean, as in the data set does not contain any null value or any NaN value.
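The cleaning steps above might be sketched as follows, again on a few made-up rows standing in for titanic_data. Note that dropna removes every row that still has a missing value, so on the real data set it also shrinks the number of passengers.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for titanic_data (assumption, not real data).
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# Drop the mostly empty Cabin column; axis=1 means "column",
# and inplace=True changes titanic_data itself instead of returning a copy.
titanic_data.drop("Cabin", axis=1, inplace=True)

# Drop every row that still contains a missing value.
titanic_data.dropna(inplace=True)

# Every count should now be zero, which is what the all-dark heat map shows.
print(titanic_data.isnull().sum())
print("Rows left:", len(titanic_data.index))
```

Dropping Cabin first matters: if dropna ran while Cabin was still present, almost every row would be removed because most cabin values are missing.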