1. Pre Survey
2. Reinforce Mindshift
Your Why
Defining your why gives you more direction.
Once you have your first why, go a level deeper: ask yourself why that is your why, and come up with a second why.
Exercise
MindShift
Your Why
- Your Why can give you purpose and direction
Internal Feelings and Conversations
- Emotions – Choice
- Positive Self Talk
- Not hard or easy but unknown to known
Learning To Learn How To Learn
- Build Your Own World Model of What You Are Learning (why am I learning this and how do I apply it)
- Learn Functions not all Focus on Skill
- Problem Solving (what I know and don’t know; take what I don’t know and turn it into what I know)
Generalize
- How can what I am learning here apply somewhere else
- How can what I know somewhere else apply here
Exercise
Have Fun and Choose Powerfully
3. Rating Introduction
4. Build Small Program - 10 Steps (Categorization)
Introduction
If you don’t have your notebook from the previous Data Empowerment course, grab one from the GitHub repository https://github.com/vizmotion/notebooks. The notebook is called 10 Steps.ipynb.
Go through and run each of the cells to make sure that they are working. We will delve deeper into each step in this course.
Exercise
Have Fun and Choose Powerfully
5. Closer Look at 1 - 5 (example Categorization)
MindShift Minute
- Get comfortable with not knowing everything about a problem or the solution in the beginning. It is a journey to get to the end.
- In class we have various levels of experience in programming and technology. Another educational MindShift is to take ownership of your own education. If you are more advanced, push yourself with new resources and techniques; you can use other sources. If you are still working through the concepts, again push yourself to look at other resources. Sometimes seeing things explained a different way can make all the difference in the world.
- If you are shown a new solution there’s no use thinking, “How could I have known to do that?” You probably wouldn’t. But now that you have been introduced to it, it’s a new tool in your arsenal. Learn it and understand how to use it. You may have to look up the details but now you know it exists.
- Don’t stare at something for too long thinking that you should know the answer. Define what you don’t know and take action to figure it out (Google, ask someone, search for YouTube videos, design an experiment to test a hypothesis, etc.)
- You don’t have to fully understand why something works. Sometimes you just use it. You don’t have to be a mechanic to drive a car.
- The idea is to complete the homework for the lecture before the following lecture.
Introduction
You’re doing well to make it to this point. The exercises have been about being able to take part in a Machine Learning discussion. For those of you who were concerned with getting it right, don’t be. You have all done it right. You are thinking about how AI and Machine Learning work, and that is what the first set of exercises was all about.
The last lesson was an introduction to how a machine learning project would work. Now we’re going deeper into each section. Don’t worry if you didn’t understand all that happened in the last lesson. Part of embracing technology is being OK with not knowing everything. The next four lessons will fill in those gaps. You will create a new notebook to do another example with different data.
In this lesson, we will take a deeper dive into Problem Definition, Finding Data, Loading Data, Analyzing the Data, and finally Data Cleaning. We will be doing Python coding. Keep in mind that the goal is not to understand all of Python; that is beyond the scope of this class. However, you can and should use alternative articles, tutorials, and videos to continue to grow in your understanding. Sometimes it takes multiple viewings and different perspectives.
1. Problem Definition
Below is a Problem Definition/Solution Approach worksheet. Let’s go through it.
6 Steps to Defining a Machine Learning Problem
The first and most important step in any data science project is defining the problem. This article will take you through 6 steps to help you define a machine learning problem.
1.1. Explain the Business Problem(s) in English
As a starting point, describe the problem in informal language as you would describe it to a colleague. For example, “I need a machine-learning algorithm to tell me how many customers will buy my product.”
1.2. Explain the Benefits of the Solution
Identify why the problem needs to be solved and what you will achieve. In our example, perhaps you would seek insights about which products will have maximum sales. You can then allocate marketing resources according to the predicted sales for various products, generating more profit. This will give you an advantage over competitors who do not use machine learning for such decisions.
1.3. List All the Important Information About the Problem
To define a problem that delivers real results, start by listing the key information about the problem. Some basic things you should list are tasks, assumptions, and performance evaluators. Tasks are processes that a machine-learning algorithm should perform. In our example, the task is to predict the number of sales a product would have. Assumptions are rules of thumb and domain-specific information that are very helpful to get to the solution. Be extremely careful while defining assumptions; wrong assumptions may lead to a false solution that may not provide results as expected in the real world. Here, assumptions may be about conversion rate, return policy, etc. Performance evaluators are variables whose value represents the actual results of the project. For example, the total number of sales a product would have is one of the performance evaluators, another could be the total profit you make which you want your algorithm to calculate.
1.4. Identify the Data Needed to Solve the Problem
Once you have a preliminary list of tasks, assumptions, and performance evaluators, it is time to think of what data would be needed to solve your problem. Here in our example, to predict sales of a specific product you might need historical sales data with categorical and demographic details of customers. For example, a good dataset would contain the following information as a column for each sale made: Order date, Order amount, Product Name, Product Category, Product subcategory, Customer City, Customer State, Customer Country, Product Price, Product Cost, Discount, Profit and maybe more depending on your problem.
Another question that arises is how do you get access to this data? If you are working for a large company, then this data might be readily available from historical sales invoices, but if you are trying to figure out products to sell on your e-commerce store there may be no historical data. In that case, you should try to get the data from the US Department of Commerce website or any other open data source on the internet that is relevant to your problem. You may also need to pre-process or transform the data if the format required is not readily available.
1.5. Now Restate the Problem in One of the 3 Categories Below:
Types of Problems:
- Classification
Classification, a type of supervised learning, is a problem where it is required to identify to which class or category a new observation belongs. In our example, if you want to determine which products will cross a certain sales mark, that’s a classification problem.
- Regression
Regression is a type of problem where the conditional expectation of one variable needs to be estimated while keeping other variables fixed. In simpler words, regression is a prediction of a value under certain known conditions. Regression is mainly used for forecasting. If you want to predict the number of sales each product would make or predict the number of customers that would buy your product, that is a regression problem.
- Unsupervised – Data Mining
Unsupervised learning is a type of machine learning where there are no specific classifications or categorizations of observations, or there is no training data available. In such scenarios, the accuracy of an algorithm cannot be evaluated from outputs. Unsupervised data mining is used for clustering or pattern detection problems. Clustering is different from classification because there are no previously known categories or training data available.
1.6. Define What Success Looks Like
Make sure to list out the benefits you would gain by successfully implementing a machine learning project. Consider what will be fulfilled when the problem is solved. In our example, success could be making 1000 sales per month or gaining 1000 customers every month. If you don’t define success properly, chances are, you won’t capitalize on your project efficiently.
Problem Definition Example
- Problem: Ability to predict sales of product X in my 3 regions
- Benefits: It would allow me to better allocate marketing resources.
- Domain Knowledge: Sale price could be more in Region 1 than Region 2 and 3 resulting in more profits there.
- Data Needed:
What: Sales Date, Price, Cost, Discount, Profit, Region of Customer.
Where: I can get that data from our accounting system.
- Type of Problem: Since I want numbers it is a regression problem.
- Success: I would like to get 1000 sales in Region 1, more than 500 sales in Region 2 and 3.
Summary
Investing more time in defining all these aspects of the problem will eventually lead to successful implementations of machine learning projects. One of the best practices is looking at machine learning problems or projects similar to the one you are trying to solve. Similar problems can provide information about assumptions, algorithms, data transformations, and limitations of a machine learning model.
In this example, we are using a standard dataset that has been used to show the power of machine learning in medical practice. This data is from a study in 1992 and is made available on this site: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original). You can read it easily into a data frame by using this URL that we have provided for you: https://teammindshift.com/data/breast-cancer-wisconsin.csv
2. Getting Data
At this point, you have defined the problem with the help of the 6-step approach above. You may already know what dataset to use and where to get it but in case you don’t, here’s a quick guide to get the data you need.
What Data Do You Need/Want?
- The first question you should ask yourself is, “What data do you need to complete your machine learning project?” For example, if you were doing sales forecasts for your products, then you probably want sales dates and sales amounts or sales quantities as a minimum. If you can get additional fields like region, customer type, sales channel, product category, profit, discount, etc., that information may help with the forecast, provide more insights, and may even improve the accuracy of the forecast. It is advisable to spend time thinking about what data you would need.
- Once you have identified the data you need or want to use for your project, think about, “In what form, if available, would the data be easy as well as appropriate to use in your project?” The answer should look something like: “If data is available as a Microsoft Excel spreadsheet containing columns with names Date, Customer Region, Product Name, Product Category, Quantity, Sales Amount, Discount, Profit it would be very easy to use and I would have all the required information for further analysis.”
Possible Sources
The next step is to collect the required data for your project. There are possibly infinite sources of data in this data-driven global economy depending on your situation, but some of the sources are mentioned here for your reference.
- Internal Sources
a. ERP systems: ERP systems or Enterprise Resource Planning systems are the heart and soul of a lot of big companies. The major systems are SAP, JDA, etc. Usually, the IT department oversees the running of these systems and provides access to the data on an as-needed basis.
b. Financial/Ordering/CRM system: CRM software consolidates customer information and documents into a single database so business users can more easily access and manage it. Any admin or assigned user can access data from CRM software. Data is also available through invoicing software or order management systems that are used by your organization.
c. Other reports where the data is being used in your organization: Web analytics platforms such as Google Analytics, SQL databases, Excel spreadsheets, etc.
- External Sources
a. Government sources: Data.gov – Official US government website for open data, Commerce.gov – US Department of Commerce website, analytics.usa.gov – provides data about US government websites traffic, Healthdata.gov – For data related to healthcare, etc.
b. Associations or other Public Resources (May need to join or publicly available): Kaggle.com, GitHub, etc.
c. Places that sell data: Towerdata, Transunion, Acxiom, ID Analytics, etc.
- Newly Generated Data
a. Collect data using surveys, Google forms, etc.
b. Collect data using sensors or other mechanisms.
More About Getting The Data
Earlier, I mentioned defining the required format and fields of data. Now, what if the data is not available in that exact format, or some fields you really want are not there? One thing you must remember is that data is not always obvious, complete, or perfect. Sometimes you might have to make extrapolations, assumptions, calculations, or transformations as required. For example, you might find a dataset that doesn’t have revenue data, but the quantity sold reflects revenue very well. Similarly, the area of a house is the length times the width of the foundation, which can easily be calculated.
So never give up the project just because you couldn’t find the exact data you wanted, instead, think of ways to make whatever data is available more complete.
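For example, here is a minimal sketch of deriving a missing field from fields that are available; the column names and values are made up for illustration.

import pandas as pd

# hypothetical dataset: no "area" column, but length and width are available
houses = pd.DataFrame({'length': [40, 55, 30], 'width': [30, 25, 20]})

# derive the missing field from what we do have
houses['area'] = houses['length'] * houses['width']
print(houses)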
NOTE: Our Resource page lists some publicly available data resources.
3. Loading Data
Data can be stored in files with different delimiters. We have learned about CSV files, which have fields separated by commas (,), but there are also files where the fields are separated by tabs (\t), semicolons (;), spaces ( ), or any other character. You may have seen this in Excel when you tried to open a text file and it asked you about delimiters. Files can be read from a URL or from your local drive. Files on a local drive are referenced with a path such as C:/directory plus the file name; files from a URL usually start with http or https. Below, a file is read from a URL. In the homework, you will get more experience loading files. The data from the breast-cancer-wisconsin file is read into the data frame people. A data frame is a variable that stores data in a structured way, allowing you to manipulate that data as needed to accomplish your goals.
import pandas as pd

url = "https://teammindshift.com/data/breast-cancer-wisconsin.csv"
people = pd.read_csv(url, sep=",")
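For comparison, here is a small sketch of the same call with a different delimiter and with a local file. The tab-separated URL below is the one that appears later in the homework solution; the local path in the comment is only a made-up placeholder.

# the same dataset provided as a tab-separated file (note sep="\t")
people_tsv = pd.read_csv("https://warnermedia.teammindshift.com/data/breast-cancer-wisconsin.tsv", sep="\t")

# a file on your local drive is read the same way, only the path changes
# (this path is a made-up placeholder)
# people_local = pd.read_csv("C:/mydata/breast-cancer-wisconsin.csv", sep=",")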
4. Data Analysis
Now that you have loaded the data, we are going to analyze some aspects of the data. Some commands will allow you to quickly get an overview understanding of the data. We will discuss two of them and for homework, you will discover and use a third.
The first command is describe(). It gives you a quick statistical summary of the numeric columns in the data frame.
people.describe()
The next command is correlation, corr(), with which you can start to see, pairwise, whether there is a possible relationship between different fields.
people.corr()
You can see a strong correlation between the size and thickness fields and the class, but a very weak relationship between the class and the id. For more information about correlation, see https://en.wikipedia.org/wiki/Correlation_and_dependence
Finally, in the homework you will explore other commands that can give you insights into data. Yeah!!!
5. Data Cleaning
Cleaning a dataset means making sure the data is in a usable format for doing machine learning. This means that all data needs to be numeric and have values that are indicative of the problem you are trying to solve. That last condition requires some domain knowledge and common sense. For example, in the diabetes dataset used later in the homework, there are rows where a person’s blood pressure is recorded as 0. We know that is an error, so we need to replace that value with something reasonable.
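As a minimal sketch of that kind of fix (the tiny vitals data frame and the replacement value 90 below are made up for illustration), replacing impossible values might look like this:

import pandas as pd

# a tiny made-up example: a blood pressure of 0 is clearly an error
vitals = pd.DataFrame({'pres': [80, 0, 95, 0, 72]})

# replace the impossible 0 values with a reasonable number such as 90
vitals.loc[vitals.pres == 0, 'pres'] = 90
print(vitals)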
Next, we are going to switch datasets and use one that I made up to help you understand data cleaning.
Most of the manipulation that we want to do with a dataset will be in numeric form. This lesson will show you techniques to transform non-numeric data into a numeric form, remove data that is not important/cannot be used, and fix or update erroneous or misleading data.
# Data Set Meaning
# Variable,Definition
# X1,Interest Rate on the loan
# X2,A unique id for the loan.
# X3,A unique id assigned for the borrower.
# X4,Loan amount requested
# X5,Loan amount funded
# X6,Number of payments (36 or 60)
# X7,Age of signer
# X8,Answer to getting another loan

from pandas import read_csv
df = read_csv('https://teammindshift.com/data/DataCleaningLoanEx.csv')
This is a small dataset so you can see all of the values. Imagine you want to use this data to predict the rate at which this loan will be funded. In this dataset, we will see some techniques that can be used on much larger datasets.
df.describe()
Note that describe() only shows columns that are numbers.
df.describe(include='all')
Note that describe(include=’all’) shows all columns including non-numeric ones.
Notice that although X1 represents numbers, it has a % character making that column a string data type. Also, notice the columns that have NaN which means there is no information in them.
df['X6'].value_counts()
value_counts() – This will return the count of all the unique values in a field. Notice how this shows you all the text strings in the X6 fields.
Now, using our understanding of the data, we will start to make changes to the data frame. This is data cleaning.
# Drop a column
df = df.drop(["X2","X3"], axis=1)
print(df)
The columns X2 and X3 have nothing to do with predicting the loan funding rate so we will remove those columns from the dataset.
# Remove all rows that have NaN for field X4 by saving all not null lines
df = df[df.X4.notnull()]
print(df)
We are also going to remove all the rows that contain a placeholder character (the code below removes any row containing a *).
df = df[~(df == '*').any(axis='columns')]
The rows that don’t have a value are deleted as well for this demo. Notice that in each case a new data frame is formed containing just the rows that pass the filter, for example only the rows that have a value in column X4.
# remove everything not numbers from X1 values
df['X1'] = pd.to_numeric(df['X1'].str.replace(r'[^-\d.]', '', regex=True))
print(df)
This code will take all the characters that are not numbers out of the cell values and convert the values in the cell to a number.
# set NaN values in this column to max+1
maxX7 = df['X7'].max()
df['X7'].fillna(maxX7+1, inplace=True)
print(df)
This code will find the NaN values in the X7 column and set them to the maximum value + 1. In this case we can do this because, if there is no value, we are making the assumption that the person never had a default for as long as the data has been recorded. By choosing a number that is close to being relevant, we make it the highest number in the data set.
inplace – modifies the existing data frame instead of creating a new one.
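Here is a minimal illustration of the difference; the 0 below is just a placeholder fill value.

# without inplace, fillna returns a new Series and leaves the data frame unchanged
filled = df['X7'].fillna(0)

# with inplace=True, the values inside the existing data frame are modified directly
df['X7'].fillna(0, inplace=True)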
df['X6'].value_counts()
value_counts – shows you how many of each unique item are in the field. You can see that there is a typo of 36 mnths in the X6 field.
OK, now I want to replace the 36 mnths with 36 months.
df['X6'] = df['X6'].str.replace('36 mnths','36 months')
And finally I want to change the X6 values into 0 and 1
# transform text into numbers
mapto = {"36 months":1, "60 months":0}
df['X6'] = df['X6'].map(mapto)
print(df)
How would we change the no and yes to 1 and 0? Try it yourself, then compare with the solution below.
# transform text into numbers
mapto = {"yes":1, "no":0}
df['X8'] = df['X8'].map(mapto)
print(df)
Exercise
Have Fun and Choose Powerfully
6. Closer Look at Steps 6 - 9 (example Categorization)
MindShift Minute
In the face of fear, frustration, and self-doubt, you have to plunge ahead in the faith that it will get better, always looking to make more powerful choices. Some things that can help:
- Meditation – Mini Meditation
- Sleep – Good Night Sleep Benefits
- Breathing – Breathing Exercises
- Power Posing – Amy Cuddy
5. Data Cleaning – From Homework
This is the answer to parts of the homework. The code below:
- Reads in the tsv file
- Removes the “?” rows
- Drops the id column
- Prints out the top 10 entries in the Data Frame.
import pandas as pd

# reads in tsv file
url = "https://warnermedia.teammindshift.com/data/breast-cancer-wisconsin.tsv"
women = pd.read_csv(url, sep="\t")

# drop values that have ?
women = women[~(women == '?').any(axis='columns')].astype(int)

# drop column id
women.drop(['id'], axis=1, inplace=True)

# top 10 entries
women.head(10)
6. Data Visualization
Now that we have the data let’s start to look at it visually. Sometimes a picture is worth 1,000 words.
Creating a Box Plot for select DataFrame values
A box and whisker plot (or simply box plot for short) can reveal a lot of information in a very simple picture. Below we show a box plot of both the thickness and the size fields. If you are unfamiliar with the box plot then do some independent research such as searching for videos or definitions to help you understand. Here is one place to start: Wikipedia Box Plot
# subset box plot
# box plot comparison of thickness and size
women[['thickness','size']].boxplot();
Creating a Histogram
Another plot that can be helpful is a histogram. It allows you to see the distribution of the data in a field. Below we can see the histogram for all numeric fields with one command. Then, if we want to change the size of the image, we can use rcParams to change the figure size and show a histogram for just one field. There are more parameters for all the matplotlib functions, so feel free to look up the functions and play around with them.
# general histogram for all values
women.hist()
Change image size:
# Histogram with change in size for a specific value
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = [5, 5]
women['thickness'].hist()
Scatter Plot
Next, let’s look at a scatter plot where size and shape pairs are plotted. Understanding that there is some correlation between size and shape, I would expect to see some type of closeness centered around a line through the plot.
import matplotlib.pyplot as plt

x = women['size']
y = women['shape']
plt.scatter(x, y)
But when I do the scatter plot, it seems the points are fairly evenly distributed. Then I realize that some of those points on the plot are really multiple occurrences of those values. So it would help if each point that I see shows how many occurrences are at the same location through the size of the point. Hmm, how do I do that?
More Complex Scatter Plot
Well, since I’m not sure how to do that, I do a Google search with this phrase “repeating point locations increases point size in scatter plot for Python”. From that search, I find this entry: https://stackoverflow.com/questions/46700733/how-to-have-scatter-points-become-larger-for-higher-density-using-matplotlib
So I copy/paste the code in place, set the x and y variables to my values, and run the code to see what it yields. Voila, now I see a more distinct pattern around the middle of the scatter plot, where more of the points seem to follow a line from bottom left to top right. The larger points show me that more occurrences sit at that location. Did I need to understand the code to use it? No, yet it solved the problem that I had.
import matplotlib.pyplot as plt
from collections import Counter

x = women['size']
y = women['shape']

# count the occurrences of each point
c = Counter(zip(x,y))

# create a list of the sizes, here multiplied by 10 for scale
s = [10*c[(xx,yy)] for xx,yy in zip(x,y)]

# plot it
plt.scatter(x, y, s=s)
Ah, but some of you will want to understand how the code works. That’s a good instinct, and it helps generalize the solution. So with excitement and wonderment, I now start to dissect the code. The first thing is to figure out what the collections package does and how the Counter routine works. A search such as “collections module in Python” or “Python collections Counter” may help. Give it a try as an exercise to figure out how the code above works.
If you are stuck, here’s another helpful tip:
The trick is to identify what you know and investigate what you don’t. If you are clear about what you don’t know, then you can be more efficient at finding answers. So now let’s look into zip to discover how it works.
The nice thing about all of these functions is that we can not only read about them but also experiment to verify that our assumptions about what we are reading are right. So I can look at the variable c and see the result of using Counter on the zip output, look at s and see the result of the code that builds it, and so on.
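If you would like a small, self-contained experiment before going back to the women data, here is one way to see what zip and Counter are doing. The toy lists below are made up purely for illustration.

from collections import Counter

x = [1, 1, 2, 3]
y = [5, 5, 6, 7]

# zip pairs up corresponding elements from the two lists
print(list(zip(x, y)))   # [(1, 5), (1, 5), (2, 6), (3, 7)]

# Counter counts how many times each pair occurs
c = Counter(zip(x, y))
print(c)                 # Counter({(1, 5): 2, (2, 6): 1, (3, 7): 1})

# so the repeated point (1, 5) gets a larger size than the others
s = [10 * c[(xx, yy)] for xx, yy in zip(x, y)]
print(s)                 # [20, 20, 10, 10]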
7. Data Preparation
Now we’re ready to prepare the data for training the model. We split the data into a training subset, which the model uses to figure out how to make predictions, and a validation (test) subset. Once the model is trained, we use the held-back test data to verify the accuracy of the model. Note that if you were to use the training data to verify the accuracy, there is a risk of overfitting: the model might work really well on that particular set of training data but not well on other data.
We also need to separate the target, the data field that we want to categorize, and the attributes — all the fields that we want to use to make a decision about the category of the individual.
from sklearn import model_selection

array = women.values  # converts data frame into array of values
X = array[:,0:9]
Y = array[:,9]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
8 & 9. Machine Learning and Refinement
Now let’s train a Logistic Regression classifier and see how accurate a prediction we can develop.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))
Now it’s time to try out different techniques to train a model. You can see all the different model techniques at the scikit-learn website. They even have a system that will help you figure out what models to try for what situations. All the models below have additional parameters that you can tweak to refine your results but remember, only use the test data once you have finalized the complete model and you are ready to go. To this end, some people make three sets of data (training, testing, and final verification). The final verification data is used only one time at the end of the process once you are happy with your model.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

for name, model in models:
    model.fit(X_train, Y_train)
    predictions = model.predict(X_validation)
    print("\n\n** {} Validate Model on Test Data **".format(name))
    print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
    print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))
Note: when selecting different models to try I have found this page to be very helpful:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
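Earlier it was mentioned that some people make three sets of data (training, testing, and final verification). If you want to experiment with that idea, here is a minimal sketch of one way to do it with two calls to train_test_split, reusing the X, Y, and seed from the data preparation step above. The 20%/25% proportions and the variable names are only an example, and the final verification set should be used exactly once, at the very end.

from sklearn import model_selection

# first hold out a final verification set (20% of the data)
X_work, X_final, Y_work, Y_final = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=seed)

# then split what is left into training and test sets (here 75% / 25%)
X_train2, X_test2, Y_train2, Y_test2 = model_selection.train_test_split(
    X_work, Y_work, test_size=0.25, random_state=seed)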
10. Deploy Model
Finally, here is an example of how a set of numbers is used to make a prediction about which classification a new instance (a new row of measurements for the women dataset) belongs to.
# Train the model
model = RandomForestClassifier()
model.fit(X_train, Y_train)

# Get a prediction on the new values
newvalues = [[4,3,3,2,2,3,5,1,1]]
predictions = model.predict(newvalues)
print(predictions)
Exercise
Have Fun and Choose Powerfully
7. Rating Small Project
8. Clustering
MindShift Minute
The Art of Hanging In There – Everyone has a point at which they shut down. For some, when they see numbers and letters together, they tell their brain to stop thinking. This is the time when you have to engage your power to choose to hang in there. Although it may seem as if you are not understanding anything, I urge you to stay engaged. That is what enables you to go beyond where others have stopped. Continue to make powerful choices.
Understanding Homework Learning – Are you familiar with the following scenario? During class, everything looks great. You understand what the professor has done and you’re feeling confident. Then you go home to do the homework and it seems as if you haven’t learned anything. Don’t fret. This is normal and a part of the learning process. When you’re in class, information is coming at you that you don’t yet know; homework helps you figure out what you missed or misunderstood. This is why doing the homework is so important. It reaffirms what you know, points out the gaps in your knowledge, and allows you to continue to grow in your understanding of the subject matter.
Positive Talk: You’re all programmers now.
Lesson 5 Exercise Solution
There are only four minor changes you needed to make to the previous code with the Wisconsin breast cancer dataset. That code is shown with a #***** next to it.
##### 3. LOADING
# 1. Number of times pregnant
# 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
import pandas as pd
data = pd.read_csv("https://warnermedia.teammindshift.com/data/pima-indians-diabetes.csv") #******

######### 4. ANALYSIS
print(data.describe())

########## 5. Cleaning
# fill in all the blood pressure that is 0 with normal blood pressure of 90.
# There are 35 rows with blood pressure 0
data.loc[data.pres == 0, 'pres'] = 90 #*****

######### 6. Visualization
# box plot
# Box plot of all fields on one graph
import matplotlib.pyplot as plt
data.boxplot()

######### 7. Prep Data
from sklearn import model_selection
array = data.values  # converts data frame into array of values
X = array[:,0:8] #*****
Y = array[:,8] #*****
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

################ 8. Try Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))

################ 9. Compare with other models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

for name, model in models:
    model.fit(X_train, Y_train)
    predictions = model.predict(X_validation)
    print("\n\n** {} Validate Model on Test Data **".format(name))
    print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
    print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))
Clustering
Clustering is an unsupervised machine learning technique that involves the grouping of data points to find useful patterns. Clustering is different from classification. In classification, we are seeking to make predictions about new data instances by placing them into known categories, e.g. the women in the breast cancer dataset from Lesson 5. In clustering, we don’t know what the categories will be in advance. We are seeking to discover which groupings of data rows are similar to each other in some manner.
For example, if we had a dataset with viewership demographic information, we might want to identify groups of viewers with shared characteristics so that we could perhaps target marketing campaigns to some of those groups. We might think we know what some categories could be, such as by a sport, but we might discover some previously unknown similarities by doing cluster analysis.
In this lesson, we’ll use iris data as our example to demonstrate how to do cluster analysis from just the attributes to identify groupings that occur in the dataset. Note that we will run the analysis on data that contains measurements of various attributes of flowers but not the species.
We will look at how to determine the optimal number of clusters we should form. Once we have formed the clusters, we’ll see how well they correspond to the actual clusters of the data, in our case the target is the species of the flowers represented in the dataset. Note: for most data that you cluster you will not have the target available to check for validation. So keep in mind that this is an exercise in understanding clustering and how it works.

3. Read In the Dataset
As before, this is acquiring the data and reading it into the Python environment. This allows us to interactively view and manipulate this data in a DataFrame.
import pandas as pd

df = pd.read_csv("https://warnermedia.teammindshift.com/data/iris.csv")
4. Analysis
There may be more things that we want to do with a new dataset. With this dataset notice that the sepal-length, petal-length, and petal-width are highly correlated. Let’s use those three fields to do our clustering.
df.corr()
5. Visualization
We might also want to visualize the data in a plot. Notice again that the sepal-width seems to act differently than the other attributes. Data science includes exploration.
df.plot()
6. Cleaning
So here I am going to create an array of only the attributes of interest. I will also create an array of the species so I can use that at a later time.
x = df.iloc[:, [0, 2, 3]].values
y = df['species'].values
7. Preparation
Now we are looking for the optimal number of clusters, i.e. the minimum number of interesting groups. We will use the elbow method for this. We run k-means for a range of cluster counts and see where there is a bend in the curve of the within-cluster sum of squares (WCSS). This bend occurs where the WCSS stops decreasing significantly with each additional cluster. You can read more about WCSS on your own, but keep in mind you don’t need to understand it fully to use it.
# Finding the optimum number of clusters for k-means classification
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')  # within cluster sum of squares
plt.show()
8. Model
Finally, we run the k-means classifier. Note that there are other clustering algorithms, but this is the one we are using for now. With this classifier, every row will be placed into a class.
# Applying kmeans to the dataset / Creating the kmeans classifier
kmeans = KMeans(n_clusters = 3, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
9. Accuracy (not usually available)
Finally, to see how well this worked we can compare the classification that kmeans found with the classification that we previously know exists, i.e. which species. Usually, we don’t have this ability to evaluate but it’s good to know that at least for this data this method does pretty well with creating classes that make sense.
pd.crosstab(y_kmeans, y)
BONUS
This is a bonus to see a 3D graph of the different attributes used. It also shows the classifications that Kmeans made.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
%matplotlib inline

X = x
labels = y_kmeans

# Plotting
fig = plt.figure(1, figsize=(7,7))
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=48, azim=134)
ax.scatter(X[:, 2], X[:, 0], X[:, 1], c=labels.astype(float), edgecolor="k", s=50)
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
plt.title("K Means", fontsize=14)
Here is a 4-minute clip of the 20-minute video by Josh Kaufman on learning anything in 20 hours. I recommend watching the entire video in the resource section, but this clip also helps. Enjoy!
Exercise
Have Fun and Choose Powerfully
9. Rating Clustering
10. Visualization Closer Look
Visualization in Python
Visualization is the ability to take a set of data and display it in different ways to understand the nature of the system being measured.
The first thing is to read in the data. We are using data from the “Contraceptive Method Choice Data Set”. The location of the data is here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
Here is what the website says about the data:
Data Set Information:
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or do not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
Attribute Information:
1. Wife’s age (numerical)
2. Wife’s education (categorical) 1=low, 2, 3, 4=high
3. Husband’s education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife’s religion (binary) 0=Non-Islam, 1=Islam
6. Wife’s now working? (binary) 0=Yes, 1=No
7. Husband’s occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term
import pandas as pd

url = "http://learn.glulife.net/data/cmc.csv"
df = pd.read_csv(url)
df.head()
To allow graphical plots we are loading the matplotlib library. We will start out with a bar chart. Note that:
- value_counts() – counts the number of values
- sort_index() – sorts by the index that is being used
- plot.bar() – takes the data and plots it into a bar graph
The graph below represents the number of wives at a certain age in the data set.
import matplotlib.pyplot as plt

# bar charts
df['age'].value_counts().sort_index().plot.bar()
plt.show()
Now let’s move on to the single line visualization. The graph below represents the number of women at each education level in the data set. Try looking at just the data below the graph by removing the .plot() at the end of the statement.
# single line charts
df['wife_edu'].value_counts().sort_index().plot()
plt.show()
If you want to see two values in a line plot, keep the x-axis the same and create 2 data frames. Below you see the educational levels of the wives and the husbands. What do you notice in the comparison? Note that you can also view the data as a bar graph, which may be more useful in depicting information about the data depending on what you are looking for and what the data represents.
# multiple line/bar chart for comparison
wifeedu = df['wife_edu'].value_counts().sort_index()
husbandedu = df['husband_edu'].value_counts().sort_index()
newdf = pd.concat([wifeedu, husbandedu], axis=1)

newdf.plot.bar()
plt.show()

newdf.plot()
plt.show()
For more plots go to https://pandas.pydata.org/pandas-docs/stable/visualization.html.
We can now look at scatter plots. The first scatter plot below compares age with the wife’s education. The second compares the age of the women with the number of children they have. As you look at the charts, see what information you can derive from the visuals.
# scatter plot
df.plot.scatter(x='age', y='wife_edu')
plt.show()

df.plot.scatter(x='age', y='num_children')
plt.show()
To take the plot a little further, you can add a 3rd variable that is represented by the size of the plotted dot. See the chart below: it plots age vs the wife’s education, with the size of each dot representing the number of children.
# bubble plot
df.plot.scatter(x='age', y='wife_edu', s=df['num_children'])
plt.show()
Notice that in this plot the size of the dots is related to the number of children for each wife.
Now let’s look at the age represented as a histogram. You can see that most ages are between 25 and 30 years old. The histogram is useful for displaying how the data is distributed in the data set.
# column histogram
df.age.hist()
plt.show()
We can look at that same data as a density chart. Note that the peak is around 27.
# line histogram or density
df.age.plot.density()
plt.show()
We also have 2d area charts that can give an even different view of the data. Notice that you can stack the charts on top of one another or you can do the same thing with a bar graph.
# 2d Area chart
newdf.plot.area()
plt.show()

newdf.plot.area(stacked=False)
plt.show()
# stacked column plot
newdf.plot.bar(stacked=True)
plt.show()
You may want to see the percentage make-up of values. Here is the husband’s education broken down in a pie chart. Notice that over 60% of the husbands have their education at the high level.
# pie chart
newdf['husband_edu'].plot.pie()
plt.show()
newdf['husband_edu'].plot.pie(autopct='%.2f')
plt.show()
You may want to see the data just as a table. The tabulate library provides the ability to create a tabular representation of the data. Here is more information about tabulate (https://pypi.python.org/pypi/tabulate).
# Simple Tabular Grid
from tabulate import tabulate

print("")
print(tabulate(df[['age','wife_edu']].head(), headers='keys', tablefmt='psql'))
Finally, one last way we will discuss to compare data is through box plots. Below you see the wife’s education in comparison with the husband’s education. The box plot can give you quick insight into how wide the spread of the data is, where the median is, where the quartiles are, and how the datasets compare with each other.
# boxplot
df[['wife_edu','husband_edu']].boxplot()
newdf.boxplot()
plt.show()
For more visualization, you can explore the following sites.
- https://python-graph-gallery.com/
- https://matplotlib.org/
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
- https://pandas.pydata.org/pandas-docs/stable/visualization.html
Exercise
Have Fun and Choose Powerfully
11. Additional ML Techniques
MindShift Minute
Using other tools is good practice – Excel, text editor, Tableau, MySQL, etc.
Using additional sources is good practice – friends, Google, YouTube, books, experts, LinkedIn, etc.
Learn from examples – When you see something interesting, learn from it. Start to think about the implications of what you see. How can that apply to other things that you are working on or could be working on in the future?
Generalization – How I used these techniques for teaching someone to swim
Active Learning – As you learn new information, continue to put what you learn into your own map/understanding of the topic. Also, map what you are learning to how it can apply to what you are trying to accomplish, as well as what it might accomplish in another domain or area of interest.
Area Highlight
Glossary – you can suggest new terms that you have found and we will review them and add them to the list.
Blogs – New blog posts are being added. Check out the blog on visualization.
Completed Exercise View – on the courses page you can see your customized view
Little Extra on Exercise Completion – What do people think?
Lesson Introduction
You now have a good understanding of some of the basic ideas behind Data Science and Machine Learning. Take a moment to rejoice in how far you have come.
As with many new ideas, there is more to know but you have a good foundation to connect other information with. Today we are going to talk about two more modeling subjects: Feature Engineering and K Folding.
Feature Engineering
Feature Engineering includes several techniques that allow you to drop, ignore, modify, or create new data features for a given dataset. This allows you to use the most complete and relevant data in training your models as well as creating a richer environment for modeling. Below we will outline some common data issues and techniques for dealing with those issues.
Missing Data
Remove Rows and Columns
We may want to remove any column or row that has a large percentage of missing data. In the case below we set the threshold at 70% missing data in either row or column and we will delete that row or column. Missing data is data that has a None value or no value at all (blank).
import pandas as pd

threshold = .7
df = pd.DataFrame({'col1': [1,None,1,None], 'col2': [1,None,None,None], 'col3': [None,None,1,1], 'col4': [None,None,1,1]})
print("Original DataFrame\n{}\n\n".format(df))

# Dropping columns with missing value rate higher than threshold
df = df[df.columns[df.isnull().mean() < threshold]]
print("Columns Deleted (col2)\n{}\n\n".format(df))

# Dropping rows with missing value rate higher than threshold
df = df.loc[df.isnull().mean(axis=1) < threshold]
print("Row Deleted (Row 1)\n{}\n\n".format(df))
NOTICE: Examine the print statement above to see how it works. Now you have a new way of printing to add to your toolbox.
Fill in Missing Information
Below we want to replace missing data with either 0 in case 1 or the mean of the values that are there in case 2.
df1 = pd.DataFrame({'col1': [1,None,4,None]})
df2 = pd.DataFrame({'col1': [1,None,4,None]})

# Filling all missing values with 0
df1 = df1.fillna(0)

# Filling all missing values with the mean of the column
df2 = df2.fillna(df2.mean())

print("Fill 0s\n{}\n\n".format(df1))
print("Fill mean\n{}\n\n".format(df2))
Remove Outliers
Next, we might want to drop rows with values more than a number of standard deviations away from the mean. This is used to get rid of outliers.
# Dropping the outlier rows using the standard deviation
factor = 2
df = pd.DataFrame({'col1': [1,1,1,400,1,1,1,1,1,0,0,0,0,0,1,-400]})
upper_lim = df['col1'].mean() + df['col1'].std() * factor
lower_lim = df['col1'].mean() - df['col1'].std() * factor
print("DataFrame:\n{}\nUpper Limit: {} Lower Limit: {} Factor STD Away: {}".format(df, upper_lim, lower_lim, factor))

df = df[(df['col1'] < upper_lim) & (df['col1'] > lower_lim)]
print(df)
Encoding Category Data
Instead of substituting values in a column, you may want to create a separate column for each unique value. This is good when the unique values have no inherent order, i.e. no one value is greater than another.
df = pd.DataFrame({'col1': ['yellow','red','red','yellow','yellow']})

encoded_columns = pd.get_dummies(df['col1'])
df = df.join(encoded_columns).drop('col1', axis=1)
print(df)
Breaking Data Apart
You may have a column that you want to use to derive other columns for features that may be important. For example, the day of the week may be important in building a model: predicting traffic in a sports bar may be tied to the weekday rather than just the date. So you can derive additional fields from data that is already present in the DataFrame.
from datetime import date

df = pd.DataFrame({'date': ['01-01-2017', '04-12-2008', '23-06-1988', '25-08-1999', '20-02-1993']})

# Transform string to date
df['date'] = pd.to_datetime(df.date, format="%d-%m-%Y")

# Extracting Year
df['year'] = df['date'].dt.year

# Extracting Month
df['month'] = df['date'].dt.month

# Extracting passed years since the date
df['passed_years'] = date.today().year - df['date'].dt.year

# Extracting passed months since the date
df['passed_months'] = (date.today().year - df['date'].dt.year) * 12 + date.today().month - df['date'].dt.month

# Extracting the weekday name of the date
df['day_name'] = df['date'].dt.day_name()

print(df)
Kfolding
Kfolding (k-fold cross-validation) is one additional method for improving models: it tests multiple classifier techniques to see which ones perform the best on the dataset. The idea is to divide the training set of data into a number of same-size buckets. The system then sequentially sets aside each bucket to use as the test data after training on the other buckets. It records how well each classifier works on each bucket iteration and returns that to the user. The user can then see, on average, how well each classifier worked when training and testing on different subsets of the data. This helps to combat overfitting when choosing the best classifier.
This is the idea behind kfolding.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")

url = "https://warnermedia.teammindshift.com/data/iris.csv"
df = pd.read_csv(url)

# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]

# Get Training and Validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)

model = LogisticRegression()
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold)
print(cv_results)
import warnings
warnings.filterwarnings("ignore")

############################
import pandas as pd
import numpy as np

# Load dataset
url = "https://warnermedia.teammindshift.com/data/iris.csv"
df = pd.read_csv(url)

############################
# import sklearn model_selection code
from sklearn import model_selection

# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]

# Get Training and Validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)

#################################
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

##################################
names = []
models = []

# Spot Check Algorithms
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

results = []
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

for name, model in models:
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Exercise
Have Fun and Choose Powerfully
12. Rating More Techniques
13. Summary
Your Why
MindShift
Internal Feelings and Conversations
- Emotions – Choice
- Positive Self Talk
- Not hard or easy but unknown to known
Learning To Learn How To Learn
- Build Your Own World Model of What You Are Learning (why am I learning this and how do I apply it)
- Learn Functions not all Focus on Skill
- Problem Solving (what I know and don’t know; take what I don’t know and turn it into what I know)
Generalize
- How can what I am learning here apply somewhere else
- How can what I know somewhere else apply here
Data Science
Using Data Science to Solve Problems
- Self Learning Game
- Driverless Cars
Data Usage Progression
- Spreadsheets – Access to Data
- Dashboards – Discovery Patterns in Data
- Predictive – What May Happen
- Prescriptive – What to Do
- Automation – Correct for You
AI Techniques
- Visualization
- Machine Learning
- Deep Learning
- Other
Anatomy of a Data Science Project
- Data Gathering
- Data Understanding
- Data Cleaning
- Data Modeling
- System Deployment
Data Science Mindset
- Problem Solver
- Constant Learner
- Right Tool
- Know-How to Code
- Understand the Business
Bias and Ethics in a Data-Driven Society
- Data Acquisition
- Datasets
- Data-Driven Applications
Recognizing Data Science Projects
- Problem Definition – What Problems Exist
- If I had a Crystal Ball Could I solve the problem
- Finding Data – What Data could help with prediction
Describing a Data Science Project
- Describe Why This Project
- Define Problem Being Solved
- Outline Possible Solutions
- Define Rough Steps to Solving Problem
- Layout Possible Timeline
- Identify Stakeholders
Ten Steps for Machine Learning Project
- Define the Problem
- Identify Data Set
- Load Data Into Environment
- Analyze Data
- Clean Data
- Visualize Data
- Prepare Data For Machine Learning
- Try a Machine Learning Algorithm
- Refine the Machine Learning Algorithm
- Deploy
Here is how you answered the exercises. This gives you some initial information to go back to as you build out a plan for doing a simple data science project and think through some of the ethical issues you may have to deal with.
MindShift Review – Exercise
Small Project
Closer Look 1 – 5
Read in TSV:
Functions that will display rows
Function to get rid of a column:
Function to delete “?”
Closer Look 6 – 9
Clustering
Visualization
Visual Chart Code