Now, this heterogeneity of layout and categorization cannot always satisfy an individual user's needs, so removing this heterogeneity and classifying the news articles according to user preference is a formidable task. Companies use web crawlers to extract useful text from the HTML pages of news articles, and each of these news articles is then tokenized; these tokens are used to determine the categories of the news. In order to achieve a better classification result, we remove the less significant words, which are the stop words, from the documents or articles, and then we apply the Naive Bayes classifier to classify the news content by category.
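To make this news-classification idea concrete, here is a minimal, purely illustrative sketch (the texts, categories and pipeline are my own assumptions, not part of the original example) of stop-word removal plus a Multinomial Naive Bayes classifier in scikit-learn:

```python
# Illustrative sketch (not from the transcript): classifying short news snippets
# with Naive Bayes, after tokenizing and removing English stop words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: (text, category)
train_texts = [
    "The team won the championship game last night",
    "Stock markets rallied as tech shares surged",
    "The new smartphone features a faster processor",
    "The striker scored twice in the final match",
]
train_labels = ["sports", "business", "technology", "sports"]

# CountVectorizer tokenizes the articles and drops English stop words;
# MultinomialNB then learns word-frequency based class probabilities.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Shares of the chip maker jumped after earnings"]))
```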
Now, this is by far one of the best examples of a Naive Bayes classifier: spam filtering. Naive Bayes classifiers are a popular statistical technique for email filtering. They typically use bag-of-words features to identify spam email, an approach commonly used in text classification as well. It works by correlating the use of tokens with spam and non-spam emails, and then Bayes' theorem, which I explained earlier, is used to calculate the probability that an email is or is not spam. Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of an individual user and gives low false-positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with its roots in the 1990s. Particular words have particular probabilities of occurring in spam and in legitimate email; for instance, most email users will frequently encounter words like "lottery" or "lucky draw" in spam email but will seldom see them in other emails. The filter doesn't know these probabilities in advance and must first be trained so it can build them up.
The user must manually indicate whether a new email is spam or not. For all the words in each training email, the filter adjusts the probability that each word will appear in spam or in legitimate email in its database. After training, the word probabilities (also known as the likelihood functions) are used to compute the probability that an email with a particular set of words belongs to either category. Each word in the email contributes to the email's spam probability; this contribution is called the posterior probability and is computed using Bayes' theorem. Then the email's spam probability is computed over all the words in the email, and if the total exceeds a certain threshold, say 95%, the filter will mark the email as spam.

Now, object detection is the process of finding instances of real-world objects such as faces, bicycles and buildings in images or video. Object detection algorithms typically use extracted features and learning algorithms to recognize instances of an object category, and here again Naive Bayes plays an important role in the categorization and classification of objects.

Next, the medical area: there are increasingly voluminous amounts of electronic data, which are becoming more and more complicated. The produced medical data has certain characteristics that make the analysis very challenging and attractive as well. Among all the different approaches, Naive Bayes is used; it is one of the most effective and efficient classification algorithms and has been successfully applied to many medical problems. An empirical comparison of Naive Bayes against five popular classifiers on medical data sets shows that Naive Bayes is well suited for medical applications and has high performance on most of the examined medical problems. Now, in the past, various statistical methods have been used for modeling in the area of disease diagnosis.
These methods require prior assumptions and are less capable of dealing with massive, complicated, nonlinear and dependent data. One of the main advantages of the Naive Bayes approach, which is appealing to physicians, is that all the available information is used to explain the decision. This explanation seems natural for medical diagnosis and prognosis; that is, it is very close to the way physicians diagnose patients.

Now, weather is one of the most influential factors in our daily life, to the extent that it may affect the economy of a country that depends on occupations like agriculture. Therefore, as a countermeasure to reduce the damage caused by uncertainty in weather behavior, there should be an efficient way to predict the weather. Weather forecasting has been a challenging problem for the meteorological department for years; even after all the technological and scientific advancement, the accuracy of weather prediction has never been sufficient, and even today this domain remains a research topic in which scientists and mathematicians are working to produce a model or an algorithm that will accurately predict the weather. A Bayesian-approach-based model is created where posterior probabilities are used to calculate the likelihood of each class label for an input data instance, and the one with the maximum likelihood is taken as the resulting output. Earlier we saw a small implementation of this algorithm as well, where we predicted whether we should play or not based on the data we had collected earlier.

Now, there is a Python library known as scikit-learn; it helps to build a Naive Bayes model in Python. There are three types of Naive Bayes models under the scikit-learn library. The first one is Gaussian: it is used in classification, and it assumes that the features follow a normal distribution. Next we have Multinomial: it is used for discrete counts. For example, say we have a text classification problem; here we take Bernoulli trials one step further, and instead of just noting whether a word occurs in the document,
we count how often the word occurs in the document; you can think of it as the number of times an outcome is observed in a given number of trials. And finally we have the Bernoulli type of Naive Bayes: the binomial model is useful if your feature vectors are binary, as in a bag-of-words model where the ones and the zeros indicate, respectively, the words that occur in the document and the words that do not. Based on your data set, you can choose any of the models discussed here: the Gaussian, the Multinomial or the Bernoulli.

So let's understand how this algorithm works and what the different steps are that one can take to create a Bayesian model and use Naive Bayes to predict the output. Here, to understand better, we are going to predict the onset of diabetes. This problem comprises 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as their age, the number of times pregnant, and blood test results. All the patients are women aged 21 and older, all the attributes are numeric, and the units vary from attribute to attribute. Each record has a class value that indicates whether the patient suffered an onset of diabetes within five years of the measurements; these are classified as 1 (onset) or 0 (no onset).

Now, I've broken the whole process down into the following steps. The first step is handling the data, in which we load the data from the CSV file and split it into training and test data sets. The second step is summarizing the data, in which we summarize the properties of the training data set so that we can calculate probabilities and make predictions. The third step is making a particular prediction: we use the summaries of the data set to generate a single prediction. After that, we generate predictions given a test data set and a summarized training data set. Then we evaluate the accuracy of the predictions made for the test data set as the percentage correct out of all the predictions made. And finally, we tie it all together and form
our own model of a Naive Bayes classifier. Now, the first thing we need to do is load our data. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module. We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. So let me show you how this can be implemented. For that, you need to install Python on your system and use the Jupyter Notebook or the Python shell. I'm using the Anaconda Navigator, which has all the things required to do programming in Python: we have JupyterLab, we have the Notebook, we have the Qt console, and we even have RStudio as well. So what you need to do is just install the Anaconda Navigator; it comes with Python preinstalled. The moment you click launch on the Jupyter Notebook, it takes you to the Jupyter home page on your local system, and here you can do programming in Python. So let me just rename this notebook as "Pima Indian Diabetes".
So first, we need to load the data set, so I'm creating a function loadCsv here. Before that, we need to import the csv, math and random modules. As you can see, I've created a loadCsv function which reads the Pima Indian diabetes .csv file using the csv.reader method, and then we convert every element of that data set into float: originally all the elements are strings, but we need to convert them into floats for our calculations.

Next, we need to split the data into a training data set that Naive Bayes can use to make predictions and a test data set that we can use to evaluate the accuracy of the model. We need to split the data set randomly into training and testing data sets, usually in a ratio of 70 to 30, but for this example I'm going to use 67 and 33. 70/30 is a common ratio for testing algorithms, so you can play around with this number. So this is our splitDataset function.

Now, the Naive Bayes model is comprised of a summary of the data in the training data set, and this summary is then used while making predictions. The summary of the training data involves the mean and the standard deviation of each attribute, by class value. For example, if there are two class values and seven numerical attributes, then we need a mean and a standard deviation for each of these seven attributes for each class value, which makes 14 attribute summaries. We can break the preparation of this summary down into the following subtasks: separating data by class, calculating the mean, calculating the standard deviation, summarizing the data set, and summarizing attributes by class. The first task is to separate the training data set instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of instances that belong to that class, and sorting the entire data set of instances into the appropriate lists. The separateByClass function does just that: as you can see, the function assumes that the last attribute is the class value, and it returns a map of class values to lists of data instances.
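The exact code isn't reproduced in this transcript, but a condensed sketch of the three helpers just described (loading the CSV, splitting it, and separating instances by class) might look like this:

```python
# Sketch of the data-handling helpers described above (illustrative, not the
# exact code from the video). Assumes the Pima Indian diabetes CSV exists
# locally with no header row.
import csv
import random

def load_csv(filename):
    """Read the CSV file and convert every value from string to float."""
    with open(filename) as f:
        return [[float(x) for x in row] for row in csv.reader(f) if row]

def split_dataset(dataset, split_ratio):
    """Randomly split the data into training and test sets, e.g. 67/33."""
    train_size = int(len(dataset) * split_ratio)
    copy = list(dataset)
    random.shuffle(copy)
    return copy[:train_size], copy[train_size:]

def separate_by_class(dataset):
    """Map each class value (the last attribute) to its list of instances."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated
```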
Next, we need to calculate the mean of each attribute for a class value. The mean is the central tendency of the data, and we use it as the middle of our Gaussian distribution when calculating the probabilities. So this is our function for the mean. We also need to calculate the standard deviation of each attribute for a class value. The standard deviation is calculated as the square root of the variance, and the variance is calculated as the average of the squared differences of each attribute value from the mean. One thing to note here is that we are using the n-1 method, which subtracts one from the number of attribute values when calculating the variance.

Now that we have the tools to summarize the data, for a given list of instances we can calculate the mean and standard deviation for each attribute. The summarize function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for each attribute. Next comes summarizing attributes by class: we pull it all together by first separating our training data set into instances grouped by class, then calculating the summaries for each attribute.

Now we are ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction. We can divide this whole method into four tasks: calculating the Gaussian probability density function, calculating class probabilities, making a prediction, and estimating the accuracy. To calculate the Gaussian probability density function, we use the Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation of the attribute estimated from the training data. As you can see, the parameters are x, the mean, and the standard deviation.
In the calculateProbability function, we calculate the exponent first and then the main division; this lets us fit the equation nicely into two lines. The next task is calculating the class probabilities: now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class. So now that we have calculated the class probabilities, it's time to finally make our first prediction. We can calculate the probability of the data instance belonging to each class value, look for the largest probability, and return the associated class. For that we use the predict function, which takes the summaries and the input vector, which is basically the set of attribute values being classified. Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test data; for that we use the getPredictions method. This method calculates the predictions based on the test data set and the summary of the training data set. The predictions can then be compared to the class values in our test data set, and the classification accuracy can be calculated as a ratio between 0 and 100 percent. The getAccuracy method calculates this accuracy ratio.

Finally, to sum it all up, we define our main function, in which we call all these methods which we defined earlier one by one to get the accuracy of the model we have created. As you can see, this is our main function: we have the file name, we define the split ratio, we have the data set and the training and test data sets, we use the splitDataset method, then the summarizeByClass function, and the getPredictions and getAccuracy methods as well. So guys, as you can see, the output shows that we are splitting the 768 rows into 514 training rows and 254 test rows, and the accuracy of this model is 68%. Now, we can play with the amount of training and test data to be used, so we can change the split ratio to, say, 70:30 or 80:20 to get a different accuracy. Suppose I change the split ratio from 0.67 to 0.8: as you can see, we get an accuracy of 62 percent, so splitting it at 0.67 gave us a better result, which was 68 percent. This is how you can implement a Gaussian Naive Bayes classifier; these are the step-by-step methods you need to follow when building the Naive Bayes classifier from scratch. But don't worry, we do not need to write this many lines of code to make a model; this is where scikit-learn comes into the picture. The scikit-learn library has a predefined class for Naive Bayes, which converts all of these lines of code into merely two or three lines. So let me just open another Jupyter notebook and name it "sklearn Naive Bayes".
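Again, the original notebook isn't reproduced here, but a condensed, illustrative sketch of the summarizing and prediction functions described above could look like this; it reuses the load/split/separate helpers sketched earlier:

```python
# Condensed, illustrative sketch of the remaining steps described above
# (summaries, Gaussian probability, prediction and accuracy).
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize_by_class(dataset):
    """Per class: (mean, stdev) for every attribute except the class column."""
    summaries = {}
    for class_value, rows in separate_by_class(dataset).items():
        columns = list(zip(*rows))[:-1]            # drop the class column
        summaries[class_value] = [(mean(col), stdev(col)) for col in columns]
    return summaries

def calculate_probability(x, mu, sigma):
    """Gaussian probability density function."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return (1 / (math.sqrt(2 * math.pi) * sigma)) * exponent

def predict(summaries, input_vector):
    """Pick the class whose combined attribute probabilities are largest."""
    best_label, best_prob = None, -1.0
    for class_value, class_summaries in summaries.items():
        prob = 1.0
        for i, (mu, sigma) in enumerate(class_summaries):
            prob *= calculate_probability(input_vector[i], mu, sigma)
        if prob > best_prob:
            best_prob, best_label = prob, class_value
    return best_label

def get_accuracy(test_set, summaries):
    """Percentage of test rows whose predicted class matches the true class."""
    correct = sum(1 for row in test_set if predict(summaries, row) == row[-1])
    return correct / float(len(test_set)) * 100.0
```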
Now, here we are going to use the most famous data set, which is the iris data set. The iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher, and based on Fisher's linear discriminant model this data set became a typical test case for many statistical classification techniques in machine learning. So here we are going to use the GaussianNB model, which is already available in sklearn. As I mentioned earlier, there are three types of Naive Bayes: Gaussian, Multinomial and Bernoulli. Here we are going to use the GaussianNB model, which is already present in the sklearn library, which is the scikit-learn library.

First of all, we need to import the sklearn datasets and metrics, and we also need to import GaussianNB. Once all these libraries are loaded, we need to load the data set, which is the iris data set. Next, we fit a Naive Bayes model to this data set. As you can see, we have very easily defined the model, GaussianNB, which contains all the programming I just showed you earlier: all the methods which take the input, calculate the mean and the standard deviation, separate the data by class, and finally make predictions and calculate the prediction accuracy. All of this comes under the GaussianNB class, which is already present in the sklearn library; we just need to fit it to the data set we have. Next, if we print the model, we see that it is the GaussianNB model. Then we make the predictions: the expected output is dataset.target, and the predicted output is obtained using the model's predict method, the model being GaussianNB. Now, to summarize the model we created, we calculate the confusion matrix and the classification report. So guys, as you can see in the classification report, we have a precision of 0.96, a recall of 0.96, the F1 score and the support, and finally, if we print our confusion matrix, you can see it gives us this output. So as you can see, using the GaussianNB method, just putting it in the model, fitting the model to a particular data set, and getting the desired output is so easy with the scikit-learn library. So guys, that's it. I hope you understood a lot about the Naive Bayes classifier: how it is used, where it is used, what the different steps involved in the classification technique are, and how scikit-learn makes all of those techniques very easy to implement on any data set we have.
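As a quick recap, the scikit-learn version walked through above boils down to just a few lines; this is a sketch based on the steps described, not the notebook itself:

```python
# Illustrative sketch of the scikit-learn workflow described above.
from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB

# Load the iris data set and fit a Gaussian Naive Bayes model to it.
dataset = datasets.load_iris()
model = GaussianNB()
model.fit(dataset.data, dataset.target)
print(model)

# Make predictions and summarize them with a classification report
# and a confusion matrix.
expected = dataset.target
predicted = model.predict(dataset.data)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```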
SVM, or support vector machine, is one of the most effective machine learning classifiers, and it has been used in various fields such as face recognition, cancer classification, and so on. Today's session is dedicated to how SVM works, the various features of SVM, and how it is used in the real world. So without any further delay, let's take a look at the agenda for today. We're going to begin the session with an introduction to machine learning and the different types of machine learning. Next we'll discuss what exactly support vector machines are, and then we'll move on and see how SVM works and how it can be used to classify linearly separable data. We'll also briefly discuss how nonlinear SVMs work, then look at a use case of SVM in colon cancer classification, and finally we'll end the session by running a demo where we'll use SVM to predict whether a patient is suffering from a heart disease or not. Okay, so that was the agenda; let's get started with our first topic.

So what is machine learning? Machine learning is the science of getting computers to act by feeding them data and letting them learn a few tricks on their own. We're not going to explicitly program the machine; instead, we're going to feed it data and let it learn. The key to machine learning is the data. Machines learn just like us humans: we humans need to collect information and data to learn, and similarly machines must also be fed data in order to learn and make decisions. Let's say that you want a machine to predict the value of a stock.
All right, in such situations you just feed the machine with relevant data, after which you develop a model which is used to predict the value of the stock. Now, one thing to keep in mind is that the more data you feed the machine, the better it will learn and the more accurate its predictions will be. Obviously, machine learning is not so simple: in order for a machine to analyze and get useful insights from data, it must process and study the data by running different algorithms on it. And today we'll be discussing one of the most widely used algorithms, called the support vector machine.

Now that you have a brief idea of what machine learning is, let's look at the different ways in which machines learn. First, we have supervised learning. In this type of learning the machine learns under guidance; that's why it's called supervised learning. At school, our teachers guided us and taught us; similarly, in supervised learning, machines learn by being fed labeled data, explicitly telling them: this is the input, and this is how the output must look. So guys, the teacher in this case is the training data. Next we have unsupervised learning. Here the data is not labeled and there is no guide of any sort; the machine must figure out the given data set and find hidden patterns in order to make predictions about the output. An example of unsupervised learning is an adult like you and me: we don't need a guide to help us with our daily activities; we figure things out on our own, without any supervision. That's exactly how unsupervised learning works. Finally, we have reinforcement learning. Let's say you were dropped off on an isolated island. What would you do? Initially you would panic and be unsure of what to do, where to get food from, how to live, and all of that, but after a while you would have to adapt: you must learn how to live on the island, adapt to the changing climate, and learn what to eat and what not to eat. You're basically following a hit-and-trial approach,
because you're new to the surroundings, and the only way to learn is through experience, and then to learn from that experience. This is exactly what reinforcement learning is: it is a learning method wherein an agent interacts with its environment by producing actions and discovers errors or rewards. Once it gets trained, it is ready to predict on the new data presented to it. In our case, the agent was you, basically stuck on the island, and the environment was the island.

Okay, now let's move on and see what the SVM algorithm is all about. So guys, SVM, or support vector machine, is a supervised learning algorithm which is mainly used to classify data into different classes. Unlike most algorithms, SVM makes use of a hyperplane which acts like a decision boundary between the various classes. In general, SVM can be used to generate multiple separating hyperplanes so that the data is divided into segments, and each of these segments will contain only one kind of data. It's mainly used for classification purposes, wherein you want to classify your data into two different segments depending on the features of the data. Now, before moving any further, let's discuss a few features of SVM. Like I mentioned earlier, SVM is a supervised learning algorithm; this means that SVM trains on a set of labeled data. SVM studies the labeled training data and then classifies any new input data depending on what it learned in the training phase. A main advantage of the support vector machine is that it can be used for both classification and regression problems: even though SVM is mainly known for classification, the SVR, which is the support vector regressor, is used for regression problems. So SVM can be used both for classification and for regression; this is one of the reasons why a lot of people prefer SVM, because it's a very good classifier and, along with that, it is also used for regression. Another feature is the SVM kernel functions: SVM can be used for classifying nonlinear data by using the kernel trick. The kernel trick basically means transforming your data into another dimension so that you can easily draw a hyperplane between the different classes of the data. Nonlinear data is basically data which cannot be separated with a straight line, so SVM can even be used on nonlinear data sets; you just have to use kernel functions to do this. All right, so guys, I hope you're all clear on the basic concepts of SVM. Now let's move on and look at how SVM works.

So guys, in order to understand how SVM works, let's consider a small scenario. For a second, pretend that you own a farm, and let's say you have a problem: you want to set up a fence to protect your rabbits from a pack of wolves. But where do you build your fence? One way to get around the problem is to build a classifier based on the positions of the rabbits and wolves in your pasture. So what I'm telling you is that you can classify the group of rabbits as one group and draw a decision boundary between the rabbits and the wolves.
All right, so if I do that and try to draw a decision boundary between the rabbits and the wolves, it looks something like this. Now you can clearly build a fence along this line. In simple terms, this is exactly how SVM works: it draws a decision boundary, which is a hyperplane, between any two classes in order to separate or classify them. Now, I know you're thinking: how do you know where to draw the hyperplane? The basic principle behind SVM is to draw a hyperplane that best separates the two classes, in our case the two classes of rabbits and wolves. So you start off by drawing a random hyperplane, and then you check the distance between the hyperplane and the closest data points from each class. These closest data points to the hyperplane are known as support vectors, and that's where the name comes from: support vector machine. So basically, the hyperplane is drawn based on these support vectors, and an optimum hyperplane will have a maximum distance from each of these support vectors. In other words, the hyperplane which has the maximum distance from the support vectors is the most optimal hyperplane, and this distance between the hyperplane and the support vectors is known as the margin. So to sum it up, SVM is used to classify data by using a hyperplane such that the distance between the hyperplane and the support vectors is maximum; basically, your margin has to be maximum. That way you know that you're actually separating your classes well, because the distance between the two classes is maximum.

Okay, now let's try to solve a problem. Let's say that I input a new data point and I want to draw a hyperplane such that it best separates the two classes. So I start off by drawing a hyperplane like this, and then I check the distance between the hyperplane and the support vectors; I'm trying to check if the margin is maximum for this hyperplane. But what if I draw a hyperplane which is like this? Now I'm going to check the support vectors over here, then check the distance from the support vectors, and with this hyperplane it's clear that the margin is more when you compare it to the previous one. So the reason I'm choosing this hyperplane is that the distance between the support vectors and the hyperplane is maximum in this scenario. So guys, this is how you choose a hyperplane: you basically have to make sure that the hyperplane has a maximum margin, so that it best separates the two classes. Okay, so far it was quite easy: our data was linearly separable, which means that you could draw a straight line to separate the two classes. But what will you do
if the data set is like this? You possibly can't draw a linear hyperplane like this; it doesn't separate the two classes at all. So what do you do in such situations? Earlier in the session I mentioned how a kernel can be used to transform data into another dimension that has a clear dividing margin between the classes of data. Kernel functions offer the user this option of transforming nonlinear spaces into linear ones. A nonlinear data set is one that you can't separate using a straight line; in order to deal with such data sets, you transform them into linear data sets and then use SVM on them. A simple trick would be to transform the two variables x and y into a new feature space involving a new variable called z. So far we were plotting our data in two-dimensional space, using only the x and y axes, so we had only those two variables, x and y. In order to deal with this kind of data, we transform the two variables x and y into a new feature space involving the new variable z; we're basically visualizing the data in three-dimensional space. When you transform the 2D space into a 3D space, you can clearly see a dividing margin between the two classes of data, and now you can go ahead and separate the two classes by drawing the best hyperplane between them; that's exactly what we discussed in the previous slides. So guys, why don't you try this yourself: try drawing the hyperplane which is the most optimal for these two classes. All right, so I hope you now have a good understanding of nonlinear SVMs.

Let's look at a real-world use case of support vector machines. SVM as a classifier has been used in cancer classification since the early 2000s. There was an experiment held by a group of professionals who applied SVM to colon cancer tissue classification. The data set consisted of about 2,000 transmembrane protein samples, and only about 50 to 200 gene samples were input into the SVM classifier. The samples input into the SVM classifier included both colon cancer tissue samples and normal colon tissue samples. The main objective of this study was to classify gene samples based on whether they are cancerous or not, so SVM was trained using the 50 to 200 samples in order to discriminate non-tumor from tumor specimens. The performance of the SVM classifier was very accurate even for this small data set: we had only 50 to 200 samples, and even then SVM was pretty accurate with its results. Not only that, its performance was compared to other classification algorithms like Naive Bayes, and in each case SVM outperformed Naive Bayes. So after this experiment it was clear that SVM classified the data more effectively and worked exceptionally well with small data sets.
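The heart-disease demo mentioned in the agenda isn't included in this transcript, so as a stand-in, here is a minimal kernel-SVM sketch on a synthetic, nonlinearly separable data set (entirely illustrative and not the original demo):

```python
# Illustrative sketch only: a kernel SVM on a synthetic, nonlinearly separable
# data set. This is not the heart-disease demo from the original session.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric classes: no straight line can separate them in 2D.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a separating hyperplane (with maximum margin) can be found.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("Support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))
```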
Now let's go ahead and understand what exactly unsupervised learning is. Sometimes the given data is unstructured and unlabeled, so it becomes difficult to classify the data into different categories. Unsupervised learning helps to solve this problem: this kind of learning is used to cluster the input data into classes on the basis of their statistical properties. For example, we can cluster different bikes based upon their speed limit, their acceleration, or the average mileage they give. Unsupervised learning is a type of machine learning algorithm used to draw inferences from data sets consisting of input data without labeled responses.

If you have a look at the workflow or process flow of unsupervised learning, the training data is a collection of information without any labels; we have the machine learning algorithm, and then we have the clustering model. What it does is distribute the data into different clusters, and if you then provide any new unlabeled data, it will make a prediction and find out to which cluster that particular data point belongs. One of the most important algorithms in unsupervised learning is clustering, so let's understand exactly what clustering is.
Clustering is basically the process of dividing the data set into groups consisting of similar data points. It means grouping objects based on the information found in the data describing the objects or their relationships. Clustering models focus on identifying groups of similar records and labeling records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics; in fact, we may not even know exactly how many groups to look for. These models are often referred to as unsupervised learning models, since there is no external standard by which to judge the model's classification performance; there are no right or wrong answers for these models. If we talk about why clustering is used: the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Sometimes partitioning is the goal, or the purpose of the clustering algorithm is to make sense of and extract value from large sets of structured and unstructured data. That is why clustering is used in industry.

If you have a look at the various use cases of clustering in industry: first of all, it's used in marketing, for discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls or customers who use the internet more than calls. It is also used by insurance companies, for example for identifying groups of policy holders with a high average claim rate. It is used in seismic studies, to define probable areas for oil or gas exploration based on seismic data. And it is used in recommendation systems: for recommending movies, for grouping Flickr photos, and by Amazon for recommending products by deciding which category a product lies in.

Basically, if we talk about clustering, there are three types. First of all, we have exclusive clustering, which is hard clustering: here an item belongs exclusively to one cluster, not several clusters, and the data point belongs exclusively to one cluster. An example of this is k-means clustering; k-means does this exclusive kind of clustering. Secondly, we have overlapping clustering, also known as soft clustering: here an item can belong to multiple clusters, and its degree of association with each cluster is shown; for example, we have fuzzy or c-means clustering, which is used for overlapping clustering. And finally we have hierarchical clustering: when two clusters have a parent-child relationship, or a tree-like structure, it is known as hierarchical clustering, as you can see from the parent-child kind of relationship in the clusters shown here. So let's understand what exactly k-means clustering is.
K-means clustering is an algorithm whose main goal is to group similar data points into a cluster. It is a process by which objects are classified into a predefined number of groups, so that they are as dissimilar as possible from one group to another but as similar as possible within each group. If you have a look at how the algorithm works: first of all, it starts by defining the number of clusters, which is k; then we find the centroids; we find the distance of each object to the centroids; then we group the points based on the minimum distance; we check whether the centroids have converged; if true, we have our clusters; if false, we recompute the centroids and repeat all of these steps again and again.

So let me show you how exactly clustering works with an example. First, we need to decide the number of clusters to be made; another important task here is how to decide the right number of clusters, and we'll get into that later. So first, let's assume that the number of clusters we have decided on is three. After that, we provide initial centroids for all the clusters, which is basically guessing, and the algorithm calculates the Euclidean distance of each point from each centroid and assigns the data point to the closest cluster. The Euclidean distance, as all of you know, is the square root of the sum of the squared differences of the coordinates. Next, the centroids are calculated again and we have our new clusters; for each data point, the distances from the points to the new centroids are calculated, and the points are again assigned to the closest cluster. Then we have the new centroids calculated, and these steps are repeated until the centroids stop changing, that is, until the new centroids are very close to the previous ones. Until the output gets repeated, or the outputs are very close, we do not stop this process: we keep on calculating the Euclidean distance of all the points to the centroids, then we calculate the new centroids, and that is how k-means clustering basically works. An important part here is to understand how to decide the value of k, or the number of clusters, because it does not make any sense if you do not know how many clusters you are going to make. So, to decide the number of clusters, we have the elbow method.
Let's assume, first of all, that we compute the sum of squared errors, which is the SSE, for some values of k, for example 2, 4, 6 and 8. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid; mathematically, it is the sum over all clusters of the squared distances between every point in a cluster and that cluster's centroid. If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because as the number of clusters increases, the clusters become smaller, so the distortion is also smaller. The idea of the elbow method is to choose the k at which the SSE stops decreasing abruptly. For example, if we have a look at the figure given here, we see that the best number of clusters is at the elbow: the graph changes abruptly after the number four, so for this particular example we're going to use four as the number of clusters (a short code sketch of this appears just below).

While working with k-means clustering, there are two key points to know. First of all, be careful about where you start: a good approach is to choose the first center at random, choose the second center far away from the first center, and similarly choose each next center as far away as possible from the closest of the already-chosen centers. The second idea is to do many runs of k-means, each with different random starting points, so that you get an idea of where exactly and how many clusters you need to make, where the centroids lie, and how the data is being grouped. Now, k-means is not exactly a perfect method, so let's understand the pros and cons of k-means clustering. Among the pros, we know that k-means is simple and understandable, it's the first method everyone learns, and the items are automatically assigned to clusters. If we have a look at the cons: first of all, one needs to define the number of clusters, which is a very heavy task; whether we have three, four or ten categories, if you do not know what the number of clusters is going to be, it's very difficult to guess. Next, all the items are forced into clusters, whether or not they actually belong to the cluster they are closest to; this again happens because of defining the wrong number of clusters or not being able to guess the correct number of clusters. And most of all, it is unable to handle noisy data and outliers, even though machine learning engineers and data scientists have to clean the data anyway.
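Going back to the elbow method described a moment ago, here is a minimal sketch of how the SSE (exposed by scikit-learn's KMeans as `inertia_`) could be plotted against k; the data here is generated purely for illustration:

```python
# Illustrative elbow-method sketch: plot SSE (inertia) against k on toy data.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # toy data

ks = range(1, 9)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    sse.append(km.inertia_)   # sum of squared distances to closest centroid

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()
```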
But then again, it comes down to the analysis being done and the method being used: typically people do not clean the data specifically for k-means clustering, and even if they do clean it, some noisy data and outliers remain, which affect the whole model. So that was all for k-means clustering.

What we're going to do now is use k-means clustering on a movie data set, so we have to find the number of clusters and divide the data accordingly. The use case is this: we have a data set of five thousand movies, and what we want to do is group the movies into clusters based on their Facebook likes. So guys, let's have a look at the demo. First of all, we import deepcopy, numpy, pandas and seaborn, the various libraries which we're going to use, and from matplotlib we use pyplot with the ggplot style. Next, we import the data set and look at its shape: it has 5,043 rows with 28 columns, and if we have a look at the head of the data set we can see those 5,043 data points. What we're going to do is plot the data points: among the data columns we have the number of faces in the poster, the cast total Facebook likes, the director Facebook likes, and so on. What we have done here is take the director Facebook likes and the actor-3 Facebook likes, so we now have 5,043 rows and two columns. Now, to use k-means from sklearn, we import KMeans from sklearn.cluster; remember guys, scikit-learn is a very important library in Python for machine learning. The number of clusters we're going to provide is five; again, the number of clusters depends on the SSE, so we would use the elbow method, but I'm not going to go into the details of that again (a condensed sketch of these steps follows below).
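A condensed sketch of the loading, fitting and plotting steps of this demo might look like the following; the file name and exact column names are assumptions based on the transcript:

```python
# Condensed, illustrative sketch of the movie-clustering demo described here.
# The CSV file name and column names are assumptions based on the transcript.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

plt.style.use("ggplot")

data = pd.read_csv("movie_metadata.csv")          # assumed file name
print(data.shape)                                 # e.g. (5043, 28)

# Keep the two features used in the demo.
X = data[["director_facebook_likes", "actor_3_facebook_likes"]].dropna().values

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)

# Scatter plot coloured by cluster label.
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels, palette="deep", legend="full")
plt.xlabel("director_facebook_likes")
plt.ylabel("actor_3_facebook_likes")
plt.show()
```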
So we fit the data into the KMeans model, and if we find the cluster centers for the k-means and print them, what we get is an array of five cluster centers; we can also print the labels of the k-means clusters. Next, we plot the data we have, coloured by the new clusters which we have found, and for this we're going to use seaborn. As you can see here, we have plotted the data on the grid, and you can see we have five clusters. What I would say is that cluster 3 and cluster 0 are very, very close, and that's exactly the point: the main challenge in k-means clustering is to define the number of centers, which is the k. As you can see here, the third cluster and the zeroth cluster are very close to each other, so they probably could have been one single cluster, and another disadvantage is that we do not know exactly how the points ought to be arranged; the data gets forced into whichever cluster it is closest to, which can skew our analysis a little. It works fine, but sometimes it might be tricky to apply k-means clustering well.

Now let's understand what exactly c-means clustering is. Fuzzy c-means is an extension of k-means, the popular simple clustering technique. Fuzzy clustering, also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster. K-means tries to find hard clusters, where each point belongs to one cluster, whereas fuzzy c-means discovers soft clusters: in a soft cluster, any point can belong to more than one cluster at a time, with a certain affinity value towards each. Fuzzy c-means assigns a degree of membership, which ranges from 0 to 1, for an object to a given cluster. There is a stipulation that the sum of the memberships of an object to all the clusters it belongs to must be equal to 1; so if the degree of membership of a particular point to two clusters is 0.6 and 0.4, adding them up gives 1. That is the logic behind fuzzy c-means, and this affinity is proportional to the distance from the point to the center of a cluster. Then again, we have the pros and cons of fuzzy c-means. First of all, it allows a data point to be in multiple clusters; it's a more natural representation of the behavior of genes, since genes are usually involved in multiple functions, so it is a very good type of clustering when we're talking about genes. If we talk about the cons: again, we have to define c, which is the number of clusters, same as k; next, we also need to determine the membership cutoff value, which is time-consuming; and the clusters are sensitive to the initial assignment of centroids,
so a slight change or deviation in the initial centers is going to result in a very different kind of output from fuzzy c-means. And one of the major disadvantages of c-means clustering is that it is a non-deterministic algorithm, so it does not always give you the same output. So that's that.

Now let's have a look at the third type of clustering, which is hierarchical clustering. Hierarchical clustering is an alternative approach which builds a hierarchy, either from the bottom up or from the top down, and does not require us to specify the number of clusters beforehand. The algorithm works like this: first of all, we put each data point in its own cluster, identify the two closest clusters and combine them into one cluster, and repeat the above step till all the data points are in a single cluster. There are two types of hierarchical clustering: one is agglomerative clustering and the other is divisive clustering. Agglomerative clustering builds the dendrogram from the bottom level, while divisive clustering starts with all the data points in one cluster, the root cluster. Again, hierarchical clustering also has its pros and cons. Among the pros, no assumption about a particular number of clusters is required, and it may correspond to meaningful taxonomies. Whereas if we talk about the cons: once a decision is made to combine two clusters, it cannot be undone, and one of the major disadvantages of hierarchical clustering is that it becomes very slow when we are talking about very large data sets; and nowadays, I think every industry is using large data sets and collecting large amounts of data, so hierarchical clustering is not always the best method to go for. So there's that.

Hello everyone, and welcome to this interesting session on the Apriori algorithm. Many of us have visited retail shops such as Walmart or Target for our household needs. Let's say that we are planning to buy a new iPhone from Target. What we would typically do is search for the model by visiting the mobile section of the store, then select the product and head towards the billing counter. But in today's world, the goal of the organization is to increase revenue. Can this be done by just pitching one
product at a time to the customer? The answer to this is clearly no; hence, organizations began mining data relating to frequently bought items. Market basket analysis is one of the key techniques used by large retailers to uncover associations between items. Examples could be: customers who purchase bread have a 60 percent likelihood of also purchasing jam, or customers who purchase laptops are more likely to purchase laptop bags as well. Retailers try to find associations between different items and products that can be sold together, which assists in the right product placement. Typically, it figures out what products are being bought together, and organizations can place products accordingly. For example, people who buy bread also tend to buy butter, and the marketing team at retail stores should target customers who buy bread and butter and provide an offer to them so that they buy a third item, say eggs. So if a customer buys bread and butter and sees a discount offer on eggs, he will be encouraged to spend more and buy the eggs. This is what market basket analysis is all about, and this is what we are going to talk about in this session: association rule mining and the Apriori algorithm.

Now, an association rule can be thought of as an if-then relationship. Just to elaborate on that, we come up with a rule: suppose an item A is bought by the customer; then the chance of item B being picked by the customer under the same transaction ID is found out. You need to understand here that it's not a causality; rather, it's a co-occurrence pattern that comes to the fore. There are two elements to this rule: the "if" and the "then". The "if" part is also known as the antecedent: this is an item, or a group of items, that is typically found in the item sets. The latter one is called the consequent: this comes along as an item bought with the antecedent, or the group of antecedents. Now, if we look at the image here, A → B means that if a person buys item A, then he will also, or will most probably, buy item B. The simple example I gave you about the bread, the butter and the eggs is just a small example; but what if you have thousands and thousands of items? If you go to any professional data scientist with that data, you can just imagine how much profit you can make if the data scientist provides you with the right associations and the right placement of the items; you can get a lot of insights. That is why association rule mining is a very good algorithm which helps the business make a profit. So let's see how this algorithm works.
Association rule mining is all about building rules, and we have just seen one rule: if you buy A, then there is a chance that you might buy B as well. This type of relationship, in which we find the relationship between two items, is known as single cardinality. But what if the customer who bought A and B also wants to buy C, or if a customer who bought A, B and C also wants to buy D? In these cases the cardinality increases, and we can have a lot of combinations of these items; if you have around 10,000 or more items, just imagine how many rules you are going to create for each product. That is why association rule mining has measures, so that we do not end up creating tens of thousands of rules. That is where the Apriori algorithm comes in, but before we get into the Apriori algorithm, let's understand the maths behind it.

There are three types of metrics which help to measure the association: support, confidence and lift. Support is the frequency of item A, or of the combination of items A and B: the number of transactions containing the item (or the item combination) divided by the total number of transactions. With this, we can filter out the items which are bought less frequently. Now, what does confidence tell us? Confidence tells us how often the items A and B occur together, given the number of times A occurs: confidence(A → B) = support(A and B together) / support(A). This also helps us solve other problems: if somebody is buying A and B together and not buying C, we can just rule out C at that point, so we do not need to analyze products which people rarely buy together. What we can do is define our minimum support and confidence according to the sales, and once we have set those values, we can put them into the algorithm, filter the data, and create the rules. But suppose even after filtering you still have five thousand rules for every item; that's practically impossible to act on. So for that we need the third measure, which is the lift. Lift is basically the strength of a rule: lift(A → B) = support(A and B together) / (support(A) × support(B)). Have a look at the denominator of this formula: it contains the independent support values of A and B, so it gives us the probability of A and B occurring together purely at random. There is obviously a lot of difference between a random co-occurrence and an association, and if the denominator of the lift is large, it means that the co-occurrence is more likely due to randomness rather than due to any association. So lift is the final verdict, which tells us whether we actually have to spend time on a particular rule or not.
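To make these three measures concrete, here is a small sketch that computes support, confidence and lift for one rule over a made-up list of transactions (the items and numbers are purely illustrative):

```python
# Illustrative sketch: computing support, confidence and lift for one rule
# (bread -> butter) over a small, made-up list of transactions.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return support(antecedent | consequent) / (support(antecedent) * support(consequent))

rule_from, rule_to = {"bread"}, {"butter"}
print("support   :", support(rule_from | rule_to))
print("confidence:", confidence(rule_from, rule_to))
print("lift      :", lift(rule_from, rule_to))
```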
Now, let's have a look at a simple example of association rule mining. Suppose we have a set of items A, B, C, D and E and a set of transactions T1, T2, T3, T4 and T5. As you can see here, we have the transactions T1 = {A, B, C}, T2 = {A, C, D}, T3 = {B, C, D}, T4 = {A, D, E} and T5 = {B, C, E}. What we generally do is create some association rules, such as A → D, C → A, A → C, and B, C → A. What this basically means is: if a person buys A, then he is most likely to buy D; if a person buys C, then he is most likely to buy A; and, looking at the last one, if a person buys B and C, he is most likely to buy the item A as well. If we calculate the support, confidence and lift using these rules, we get the values shown in the table here, with each rule and its support, confidence and lift.

Now let's discuss the Apriori algorithm. The Apriori algorithm uses frequent itemsets to generate the association rules, and it is based on the concept that a subset of a frequent itemset must also be a frequent itemset itself. This raises the question: what exactly is a frequent itemset? A frequent itemset is an itemset whose support value is greater than a threshold value. Just now we discussed that the marketing team, according to the sales, sets a minimum threshold value for the confidence as well as for the support; so a frequent itemset is an itemset whose support value is greater than the threshold value already specified. For example, if {A, B} is a frequent itemset, then A and B should also be frequent itemsets individually. Now, let's consider the following transactions to make things easier. Suppose we have transactions 1 to 5: T1 has {1, 3, 4}, T2 has {2, 3, 5}, T3 has {1, 2, 3, 5}, T4 has {2, 5} and T5 has {1, 3, 5}. The first step is to build a list of itemsets of size 1 by using this transactional data, and one thing to note here is that the minimum support count is given as 2. So the first step is to create itemsets of size 1 and calculate their support values.
So as you can see here, we have the table C1, in which we have the itemsets {1}, {2}, {3}, {4} and {5} and their support values. If you remember the formula of support, it was the frequency divided by the total number of transactions. As you can see, for the itemset {1} the support count is 3, since item 1 appears in T1, T3 and T5. The itemset {4}, on the other hand, has a support count of 1, as it occurs only once, in transaction T1; but the minimum support value is 2, which is why it's going to be eliminated. So we have the final table, table F1, in which we have the itemsets {1}, {2}, {3} and {5} with support values 3, 3, 4 and 4.

Now the next step is to create itemsets of size 2 and calculate their support values; all the combinations of the itemsets in F1 (the table from which {4} was discarded) are used for this iteration. So we get the table C2, with {1,2}, {1,3}, {1,5}, {2,3}, {2,5} and {3,5}. If we calculate the support here, we can see that the itemset {1,2} has a support of 1, which is again less than the specified threshold, so we're going to discard it; so in table F2 we have {1,3}, {1,5}, {2,3}, {2,5} and {3,5}. Again, we move forward and create the itemsets of size 3 and calculate their support values; all the combinations from the itemsets in F2 are used for this iteration. Before calculating the support values, let's perform pruning on the data set. What is pruning? After the combinations are made, we check each C3 candidate for a subset whose support is less than the minimum support value; that is what "a subset of a frequent itemset must also be frequent" means in practice. So if you look here, two of the candidate itemsets are {1,2,3} and {1,2,5}: for the first one, among the subsets of {1,2,3} we have {1,2}, which was discarded in the previous step, so we discard this whole itemset; the same goes for the second one, {1,2,5}, which also contains {1,2}, so we discard that too. That leaves us with only two itemsets, {1,3,5} and {2,3,5}, and their supports are 2 and 2. Now, if we create the table C4 using four elements, we are going to have only one itemset, {1,2,3,5}, and if you look at the transaction table, the items 1, 2, 3 and 5 appear together only once, so the support is 1; and since the support of everything in C4 is less than 2, we stop here and return to the previous itemsets, that is, C3. So the frequent itemsets are {1,3,5} and {2,3,5}. Now let's assume our minimum confidence value is 60 percent. For that,
we're going to generate all the non-empty subsets for each frequent itemset. For I = {1, 3, 5}, we get the subsets {1,3}, {1,5}, {3,5}, {1}, {3} and {5}; similarly, for {2, 3, 5} we get {2,3}, {2,5}, {3,5}, {2}, {3} and {5}. Now, the rule states that for every subset S of I, the rule "S → (I − S)", that is, S recommends the rest of I, is accepted only if the support of I divided by the support of S is greater than or equal to the minimum confidence value. Applying this to the itemsets of F3, we get Rule 1: {1,3} → {5}, meaning 1 and 3 give 5; the confidence is support({1,3,5}) / support({1,3}) = 2/3, which is 66%, greater than 60 percent, so Rule 1 is selected. If we come to Rule 2, {1,5} → {3}, it means that if we have 1 and 5, we are also going to have 3; calculating the confidence of this one, we have support({1,3,5}) / support({1,5}), which gives us 100 percent, so Rule 2 is selected as well. But if you have a look at Rule 5 or Rule 6 over here: if we select {3} → {1,5}, it means that if you have 3, you also get 1 and 5; the confidence for this comes out at 50 percent, which is less than the given 60 percent target, so we're going to reject this rule, and the same goes for Rule 6. One thing to keep in mind here is that although Rule 1 and Rule 5 look a lot alike, they are not the same: it really depends on what's on the left-hand side of the arrow and what's on the right-hand side; it's the if-then relationship that matters. I'm sure you can now understand what exactly these rules are and how to work with them.

So, let's see how we can implement the same in Python. For that, I'm going to create a new Python notebook; I'm going to use the Jupyter Notebook, but you're free to use any IDE; and I'm going to name it "apriori". We will be using the online transactional data of a retail store for generating association rules. Firstly, what we need to do is get the pandas and MLxtend libraries imported and read the file. As you can see here, we are using the Online Retail .xlsx file, and from MLxtend
we import apriori and association_rules; they all come under MLxtend. As you can see here, we have the invoice number, the stock code, the description, the quantity, the invoice date, the unit price, the customer ID and the country. Next, in this step, we do some data cleanup, which includes removing the spaces from some of the descriptions, dropping the rows that do not have invoice numbers, and removing the credit transactions, because those are of no use to us. As you can see from the output, we have about five hundred and thirty-two thousand rows with eight columns. After the cleanup, we need to consolidate the items into one transaction per row, with one column per product; for the sake of keeping the data set small, we are only looking at the sales for France, so we have excluded all the other sales. Now, there are a lot of zeros in the data, but we also need to make sure that any positive values are converted to 1 and anything less than or equal to zero is set to 0; after we encode it and check again, we still have 392 rows. Now that we have structured the data properly, in this step we generate the frequent itemsets that have a support of at least seven percent; this number is chosen so that we get close enough, and we then generate the rules with their corresponding support, confidence and lift. As you can see here, the minimum support is 0.07. What if we add further constraints on the rules, such as the lift being greater than 6 and the confidence being greater than 0.8? As you can see here, we then get the left-hand side and the right-hand side of each association rule, which are the antecedent and the consequent.
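Pieced together, the MLxtend workflow described above could look roughly like this sketch; the file name, column names and thresholds follow the description, but treat the details as illustrative rather than the exact notebook:

```python
# Illustrative sketch of the MLxtend market-basket workflow described above.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Load the online retail data and clean it up.
df = pd.read_excel("Online_Retail.xlsx")
df["Description"] = df["Description"].str.strip()
df.dropna(subset=["InvoiceNo"], inplace=True)
df["InvoiceNo"] = df["InvoiceNo"].astype(str)
df = df[~df["InvoiceNo"].str.contains("C")]       # drop credit transactions

# One transaction per row, one column per product (France only),
# then encode quantities as 0/1.
basket = (df[df["Country"] == "France"]
          .groupby(["InvoiceNo", "Description"])["Quantity"].sum()
          .unstack().fillna(0))
basket = basket.applymap(lambda q: 1 if q >= 1 else 0)

# Frequent itemsets with at least 7% support, then the association rules,
# filtered by lift and confidence.
frequent_itemsets = apriori(basket, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print(rules[(rules["lift"] >= 6) & (rules["confidence"] >= 0.8)])
```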
Thanks for reading!
If you have any questions, please let me know!