
1. Pre Survey


2. Reinforce Mindshift

Your Why

Video Study Notes
Defining Your Why

Defining your why gives you more direction.
Once you have your first why, go a level deeper and ask yourself why that is your why, and come up with a second why.


Exercise


MindShift

Your Why

  • Your Why can give you purpose and direction

Internal Feelings and Conversations

  • Emotions – Choice
  • Positive Self Talk
  • Not hard or easy but unknown to known

Learning To Learn How To Learn

  • Build Your Own World Model of What You Are Learning (why am I learning this and how do I apply it)
  • Learn Functions not all Focus on Skill
  • Problem Solving (identify what I know and don't know, then turn what I don't know into what I know)

Generalize

  • How can what I am learning here apply somewhere else
  • How can what I know somewhere else apply here

Exercise

Have Fun and Choose Powerfully.

3. Rating Introduction


4. Build Small Program - 10 steps (Categorization)

Introduction

If you don't have your notebook from the previous Data Empowerment course, then grab a notebook from the GitHub repository https://github.com/vizmotion/notebooks. The notebook is called 10 Steps.ipynb.

Go through and run each of the cells to make sure that they are working. We will delve deeper into each step in this course.

Exercise

Have Fun and Choose Powerfully.

5. Closer Look at 1 - 5 (example Categorization)

MindShift Minute

  1. Get comfortable with not knowing everything about a problem or the solution in the beginning. It is a journey to get to the end.
  2. In class we have various levels of experience in programming and technology. Another educational MindShift is to take ownership of your own education. If you are more advanced, then push yourself with new resources and techniques. You can use other sources. If you are still working through the concepts, then again push yourself to look at other resources. Sometimes hearing things explained a different way can make all the difference in the world.
  3. If you are shown a new solution there’s no use thinking, “How could I have known to do that?” You probably wouldn’t. But now that you have been introduced to it, it’s a new tool in your arsenal. Learn it and understand how to use it. You may have to look up the details but now you know it exists.
  4. Don’t stare at something for too long thinking that you should know the answer. Define what you don’t know and take action to figure it out (Google, ask someone, search for YouTube videos, design an experiment to test a hypothesis, etc.)
  5. You don’t have to fully understand why something works. Sometimes you just use it. You don’t have to be a mechanic to drive a car.
  6. The idea is to complete the homework for the lecture before the following lecture.

Introduction

You're doing well to make it to this point. The exercises have been about being able to take part in a Machine Learning discussion. For those of you who were concerned with getting it right, don't be. You have all done it right. You are thinking about how AI and Machine Learning work, and that is what the first set of exercises was all about.

The last lesson was an introduction to how a machine learning project would work. Now we’re going deeper into each section. Don’t worry if you didn’t understand all that happened in the last lesson. Part of embracing technology is being OK with not knowing everything. The next four lessons will fill in those gaps. You will create a new notebook to do another example with different data.

In this lesson, we will take a deeper dive into Problem Definition, Finding Data, Loading Data, Analyzing the Data, and finally Data Cleaning. We will be doing Python coding. Keep in mind that the goal is not to understand all of Python; that is beyond the scope of this class. However, you can and should use additional articles, tutorials, and videos to continue to grow in your understanding. Sometimes it takes multiple viewings and different perspectives.

1. Problem Definition

Below is a Problem Definition/Solution Approach worksheet. Let’s go through it.

6 Steps to Defining a Machine Learning Problem

The first and most important step in any data science project is defining the problem. This article will take you through 6 steps to help you define a machine learning problem.

1.1. Explain the Business Problem(s) in English

As a starting point, describe the problem in informal language as you would describe it to a colleague. For example, “I need a machine-learning algorithm to tell me how many customers will buy my product.”

1.2. Explain the Benefits of the Solution

Identify why the problem needs to be solved and what you will achieve. In our example, perhaps you would seek insights about the products that will have maximum sales. You can then allot marketing resources according to the predicted sales for various products, generating more profit. This will help you have an advantage over competitors who do not use machine learning for such decisions.

1.3. List All the Important Information About the Problem

To define a problem that delivers real results, start by listing the key information about the problem. Some basic things you should list are tasks, assumptions, and performance evaluators. Tasks are processes that a machine-learning algorithm should perform. In our example, the task is to predict the number of sales a product would have. Assumptions are rules of thumb and domain-specific information that are very helpful to get to the solution. Be extremely careful while defining assumptions; wrong assumptions may lead to a false solution that may not provide results as expected in the real world. Here, assumptions may be about conversion rate, return policy, etc. Performance evaluators are variables whose value represents the actual results of the project. For example, the total number of sales a product would have is one of the performance evaluators, another could be the total profit you make which you want your algorithm to calculate.

1.4. Identify the Data Needed to Solve the Problem

Once you have a preliminary list of tasks, assumptions, and performance evaluators, it is time to think of what data would be needed to solve your problem. Here in our example, to predict sales of a specific product you might need historical sales data with categorical and demographic details of customers. For example, a good dataset would contain the following information as a column for each sale made: Order date, Order amount, Product Name, Product Category, Product subcategory, Customer City, Customer State, Customer Country, Product Price, Product Cost, Discount, Profit and maybe more depending on your problem.

Another question that arises is: how do you get access to this data? If you are working for a large company, then this data might be readily available from historical sales invoices, but if you are trying to figure out products to sell on your e-commerce store, there would be no historical data. In that case, you should try to get the data from the US Department of Commerce website or any other open data source on the internet that is relevant to your problem. You may also need to pre-process or transform the data if the format required is not readily available.

1.5. Now Restate the Problem in One of the 3 Categories Below:

Types of Problems:
  • Classification
    Classification, a type of supervised learning, is a problem where you need to identify to which class or category a new observation belongs. Here in our example, if you want to determine which products will cross a certain sales mark, that's a classification problem.
  • Regression
    Regression is a type of problem where the conditional expectation of one variable needs to be estimated while keeping other variables fixed. In simpler words, regression is a prediction of value under certain known conditions. Regression is mainly used for forecasting. If you want to predict the number of sales each product would make or predict the number of customers that would buy your product, that is a regression problem.
  • Unsupervised – Data Mining
    Unsupervised learning is a type of machine learning where there are no specific classifications or categorization of observations or there is no training data available. In such scenarios, the accuracy of an algorithm cannot be evaluated from outputs. Unsupervised data mining is used for a problem of clustering or pattern detection. Clustering is different from classification as there are no previously known categories or training data available.

1.6. Define What Success Looks Like

Make sure to list out the benefits you would gain by successfully implementing a machine learning project. Consider what will be fulfilled when the problem is solved. In our example, success could be making 1000 sales per month or gaining 1000 customers every month. If you don’t define success properly, chances are, you won’t capitalize on your project efficiently.

Problem Definition Example

  1. Problem: Ability to predict sales of product X in my 3 regions
  2. Benefits: It would allow me to better allocate marketing resources.
  3. Domain Knowledge: Sale price could be more in Region 1 than Region 2 and 3 resulting in more profits there.
  4. Data Needed:
    What: Sales Date, Price, Cost, Discount, Profit, Region of Customer.
    Where: I can get that data from our accounting system.
  5. Type of Problem: Since I want numbers it is a regression problem.
  6. Success: I would like to get 1000 sales in Region 1, more than 500 sales in Region 2 and 3.

Summary

Investing more time in defining all these aspects of the problem will eventually lead to successful implementations of machine learning projects. One of the best practices is looking at machine learning problems or projects similar to the one you are trying to solve. Similar problems can provide information about assumptions, algorithms, data transformations, and limitations of a machine learning model.

In this example, we are using a standard dataset that has been used to show the power of machine learning in medical practice. This data is from a study in 1992 and is made available on this site: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original). You can read it easily into your notebook by using this URL that we have provided for you: https://teammindshift.com/data/breast-cancer-wisconsin.csv

2. Getting Data

At this point, you have defined the problem with the help of the 6-step approach above. You may already know what dataset to use and where to get it but in case you don’t, here’s a quick guide to get the data you need.

What Data Do You Need/Want?

  1. The first question you should ask yourself is, “What data do you need to complete your machine learning project?” For example, if you were doing sales forecasts for your products, then you probably want sales dates and sales amounts or quantities as a minimum. If you can get additional fields like region, customer type, sales channel, product category, profit, discount, etc., that additional information may help you get more insights and may even improve the accuracy of the forecast. It is advisable to spend time thinking about what data you would need.
  2. Once you have identified the data you need or want to use for your project, think about, “In what form, if available, would the data be easy as well as appropriate to use in your project?” The answer should look something like: “If data is available as a Microsoft Excel spreadsheet containing columns with names Date, Customer Region, Product Name, Product Category, Quantity, Sales Amount, Discount, Profit it would be very easy to use and I would have all the required information for further analysis.”

Possible Sources

The next step is to collect the required data for your project. There are nearly infinite sources of data in this data-driven global economy, depending on your situation, but some sources are mentioned here for your reference.

  1. Internal Sources
    a. ERP systems: ERP systems or Enterprise Resource Planning systems are the heart and soul of a lot of big companies. The major systems are SAP, JDA, etc. Usually, the IT department oversees the running of these systems and provides access to the data on an as-needed basis.
    b. Financial/Ordering/CRM system: CRM software consolidates customer information and documents into a single database so business users can more easily access and manage it. Any admin or assigned user can access data from CRM software. Data is also available through invoicing software or order management systems that are used by your organization.
    c. Other reports where the data is being used in your organization: Web analytics platforms such as Google Analytics, SQL databases, Excel spreadsheets, etc.
  2. External Sources
    a. Government sources: Data.gov – Official US government website for open data, Commerce.gov – US Department of Commerce website, analytics.usa.gov – provides data about US government websites traffic, Healthdata.gov – For data related to healthcare, etc.
    b. Associations or other Public Resources (May need to join or publicly available): Kaggle.com, GitHub, etc.
    c. Places that sell data: Towerdata, Transunion, Axciom, ID Analytics, etc.
  3. Newly Generated Data
    a. Collect data using surveys, Google forms, etc.
    b. Collect data using sensors or other mechanisms.

More About Getting The Data

Earlier, I mentioned defining the required format and fields of data. Now, what if the data is not available in that exact format, or some fields you really want are not there? One thing you must remember is that data is not always obvious, complete, or perfect. Sometimes you might have to make extrapolations, assumptions, calculations, or transformations as required. For example, you may find a dataset that doesn't have revenue data, but the quantity sold reflects revenue very well. Similarly, the area of a house is the length times the width of the foundation, which can be easily calculated.

So never give up the project just because you couldn’t find the exact data you wanted, instead, think of ways to make whatever data is available more complete.
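
For instance, here is a minimal sketch of the foundation-area idea above, using made-up column names:

import pandas as pd

# hypothetical columns: derive the missing area field from fields you do have
houses = pd.DataFrame({'length': [30, 40], 'width': [20, 25]})
houses['area'] = houses['length'] * houses['width']
print(houses)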

NOTE: Our Resource page lists some publicly available data resources.

3. Loading Data

Data can be stored in files with different delimiters. We have learned about CSV files. These are files with the fields separated by commas (,), but there are also files where the fields are separated by tabs (\t), semicolons (;), spaces ( ), or any other character. You may have seen this in Excel when you have tried to open a text file and it asked you about delimiters. Files can be read from a URL or from your local drive. Files from the local drive are referenced with a path such as C:/directory along with the file name. Files from a URL usually start with http or https. Below is a file being read from a URL. In the homework, you will get more experience loading files. The data from the breast-cancer-wisconsin file is read into the data frame people. A data frame is a variable that stores data in a structured way, allowing you to manipulate that data as needed to accomplish your goals.

import pandas as pd

url = "https://teammindshift.com/data/breast-cancer-wisconsin.csv"

people = pd.read_csv(url, sep=",")
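
As a quick sketch, if the same data were stored tab-separated (the .tsv version of this file is used later in the course), only the separator argument would change:

# read a tab-separated version of the file; sep="\t" tells pandas to split on tabs
people_tsv = pd.read_csv("https://warnermedia.teammindshift.com/data/breast-cancer-wisconsin.tsv", sep="\t")
people_tsv.head()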

4. Data Analysis

Now that you have loaded the data, we are going to analyze some aspects of the data. Some commands will allow you to quickly get an overview understanding of the data. We will discuss two of them and for homework, you will discover and use a third.

The first command is describe(). It gives you a quick statistical overview of the data in the data frame.

people.describe()

The next command is correlation, corr(), with which you can start to see, pairwise, whether there is a possible relationship between different fields.

people.corr()

You can see a strong correlation between the size and thickness to the class but a very weak relationship between the class and the id. For more information about correlation see https://en.wikipedia.org/wiki/Correlation_and_dependence
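
If you mainly care about how each field relates to the target, you can pull out just that part of the correlation matrix. Here is a small sketch that assumes the target column in this dataset is named class:

# correlations of every field with the class column, strongest first (assumes a column named class)
people.corr()['class'].sort_values(ascending=False)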

Finally, you will explore other commands that can give you insights into data in the homework. Yeah!!!

5. Data Cleaning

Cleaning a dataset means making sure the data is in a usable format for doing machine learning. This means that all data needs to be numeric and have a value that makes sense for the problem that you are trying to solve. This last condition requires some domain knowledge and common sense. For example, if a medical dataset records a person with a blood pressure of 0, we know that is an error, so we need to replace that value with something reasonable.

Next, we are going to switch datasets and use one that I made up to help you understand data cleaning.

Most of the manipulation that we want to do with a dataset will be in numeric form. This lesson will show you techniques to transform non-numeric data into a numeric form, remove data that is not important/cannot be used, and fix or update erroneous or misleading data.

# Data Set Meaning
# Variable, Definition
# X1, Interest rate on the loan
# X2, A unique id for the loan
# X3, A unique id assigned to the borrower
# X4, Loan amount requested
# X5, Loan amount funded
# X6, Number of payments (36 or 60)
# X7, Age of signer
# X8, Answer to getting another loan

import pandas as pd
df = pd.read_csv('https://teammindshift.com/data/DataCleaningLoanEx.csv')

This is a small dataset so you can see all of the values. Imagine you want to use this data to predict the rate at which this loan will be funded. In this dataset, we will see some techniques that can be used on much larger datasets.

df.describe()

Note that describe() only shows columns that are numbers.

df.describe(include='all')

Note that describe(include='all') shows all columns including non-numeric ones.

Notice that although X1 represents numbers, it has a % character making that column a string data type. Also, notice the columns that have NaN which means there is no information in them.

df['X6'].value_counts()

value_counts() – This will return the count of all the unique values in a field. Notice how this shows you all the text strings in the X6 fields.

Now, using our understanding of the data, we will start to make changes to the data frame. This is data cleaning.

# Drop a column
df = df.drop(["X2","X3"], axis=1)
print(df)

The columns X2 and X3 have nothing to do with predicting the loan funding rate so we will remove those columns from the dataset.

# Remove all rows that have NaN for field X4 by saving all not null lines
df = df[df.X4.notnull()] 
print(df)

Notice that a new data frame is formed with just the rows that have a value in them for column X4.

Next, we are going to remove all the rows that have a * placeholder in any column.

df = df[~(df == '*').any(axis='columns')]

# remove everything not numbers from X1 values
df['X1'] = pd.to_numeric(df['X1'].str.replace(r'[^-\d.]', '', regex=True))
print(df)

This code will take all the characters that are not numbers out of the cell values and convert the values in the cell to a number.

# set NaN values in this column to max+1 values
maxX7 = df['X7'].max()
df['X7'].fillna(maxX7 + 1, inplace=True)
print(df)

This code will find the NaN values in the X7 column and set them to the maximum value plus 1. In this case we can do this because, if there is no value, we are assuming the person has never had one for as long as the data has been recorded. By using a number just above the largest value already in the data set, we keep the substituted value close to being relevant.

inplace=True – modifies the existing data frame instead of creating a new one.
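
Here is a tiny made-up frame to illustrate the difference between the two styles:

demo1 = pd.DataFrame({'X7': [1.0, None, 3.0]})
demo2 = demo1.copy()

demo1.fillna(99, inplace=True)   # modifies demo1 directly and returns None
demo2 = demo2.fillna(99)         # returns a new DataFrame that we assign back
print(demo1.equals(demo2))       # True: both end up with the same values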

df['X6'].value_counts()

value_counts() – shows you how many of each unique item is in the field. You can see that there is a typo, 36 mnths, in the X6 field.

OK, now I want to replace the 36 mnths with 36 months:

df['X6'] = df['X6'].str.replace('36 mnths','36 months')

And finally, I want to change the X6 values into 1s and 0s:

# transform text into numbers
mapto = {"36 months":1,"60 months":0}
df['X6'] = df['X6'].map(mapto)
print(df)

How would we change the no and yes values to 1 and 0? Here is one way:

# transform text into numbers
mapto = {"yes":1,"no":0}
df['X8'] = df['X8'].map(mapto)
print(df)

Finally, when a column has only a small set of distinct cell values, we can convert the items to numbers, with each unique string becoming a unique number. In the case above we have two values, so "36 months" becomes 1 and "60 months" becomes 0.
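
The same map() idea extends to any small set of categories. Here is a hedged illustration with a made-up column; pandas can even assign the numbers for you with factorize():

demo = pd.DataFrame({'grade': ['A', 'B', 'A', 'C']})
demo['grade_num'] = demo['grade'].map({'A': 0, 'B': 1, 'C': 2})  # explicit mapping
codes, uniques = pd.factorize(demo['grade'])                     # automatic numbering
print(demo)
print(codes, list(uniques))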

Exercise

Have Fun and Choose Powerfully.

6. Closer Look at step 6 - 9 (example Categorization)

MindShift Minute

In the face of fear, frustration, and self-doubt, you have to plunge ahead in the faith that it will get better, always looking to make more powerful choices. Some things that can help:

  1. Meditation – Mini Meditation
  2. Sleep – Good Night Sleep Benefits
  3. Breathing – Breathing Exercises
  4. Power Posing – Amy Cuddy

5. Data Cleaning – From Homework

This is the answer to parts of the homework. The code below:

  • Reads in the tsv file
  • Removes the “?” rows
  • Drops the id axis
  • Prints out the top 10 entries in the Data Frame.
import pandas as pd

# reads in tsv file #
url = "https://warnermedia.teammindshift.com/data/breast-cancer-wisconsin.tsv"
women = pd.read_csv(url, sep="\t")

# drop values that have ?
women = women[~(women == '?').any(axis='columns')].astype(int)

# drop column id
women.drop(['id'], axis=1, inplace=True)

#top 10 entries
women.head(10)

6. Data Visualization

Now that we have the data let’s start to look at it visually. Sometimes a picture is worth 1,000 words.

Creating a Box Plot for select DataFrame values

A box and whisker plot (or simply box plot for short) can reveal a lot of information in a very simple picture. Below we show a box plot of both the thickness and the size fields. If you are unfamiliar with the box plot then do some independent research such as searching for videos or definitions to help you understand. Here is one place to start: Wikipedia Box Plot

# subset box plot
# box plot comparison of thickness and size
women[['thickness','size']].boxplot();

Creating a Histogram

Another plot that can be helpful is a histogram. They allow you to see the distribution of the data in the field. Below we can see the histogram for all number fields with one command. Then if we want to change the size of the image we can use the rcParams to change the figure size and then show a histogram for just one field. There are more parameters for all the matplotlib functions so feel free to look up the functions and play around with them.

#general histogram for all values
women.hist()

Change image size:

# Histogram with change in size for a specific value
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [5, 5]
women['thickness'].hist()

Scatter Plot

Next, let’s look at a scatter plot where size and shape pairs are plotted. Understanding that there is some correlation between size and shape, I would expect to see some type of closeness centered around a line through the plot.

import matplotlib.pyplot as plt

x = women['size']
y = women['shape']
plt.scatter(x, y)

But when I do the scatter plot, it seems the points are fairly evenly distributed. Then I realize that some of those points on the plot are really multiple occurrences of those values. So it would help if each point that I see shows how many occurrences are at the same location through the size of the point. Hmm, how do I do that?

More Complex Scatter Plot

Well, since I’m not sure how to do that, I do a Google search with this phrase “repeating point locations increases point size in scatter plot for Python”. From that search, I find this entry: https://stackoverflow.com/questions/46700733/how-to-have-scatter-points-become-larger-for-higher-density-using-matplotlib

So I copy/paste the code in place, set the x and y variables to my values, and run the code to see what it yields. Voila, now I see a more distinct pattern around the middle of the scatter plot where more of the points seem to follow a linear line from left bottom to right top. The thicker points show me that more points are at the center location of the circle. Did I need to understand the code to use it? No, yet it solved the problem that I had.

import matplotlib.pyplot as plt
from collections import Counter

x=women['size']
y=women['shape']

# count the occurrences of each point
c = Counter(zip(x,y))
# create a list of the sizes, here multiplied by 10 for scale
s = [10*c[(xx,yy)] for xx,yy in zip(x,y)]

# plot it
plt.scatter(x, y, s=s)

Ah, but some of you will want to understand how the code works. That’s a good question to help generalize the solution. So with excitement and wonderment, I now start to dissect the code. The first thing is to figure out what the collections package does and how the Counter routine works. The possible search string “collection package in Python” or “collection package in Python counter” may help. Give it a try as an exercise to figure out how the code above works.

If you are stuck, here’s another helpful tip:

The trick is to identify what you know and investigate what you don’t. If you are clear about what you don’t know, then you can be more efficient at finding answers. So now let’s look into zip to discover how it works.

The nice thing about all of these functions is that we can not only read about them but we can experiment to verify that our assumptions about what we are reading are right. So I can look at the variable c and see the results of using Counter on the zip function. I can look at s and see the results of the code run for it, etc.
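
For example, a tiny made-up experiment makes it easy to see what Counter and zip are doing before applying them to the real columns:

from collections import Counter

xs = [1, 1, 2, 3]
ys = [5, 5, 6, 7]
c = Counter(zip(xs, ys))                          # counts how often each (x, y) pair occurs
print(c)                                          # Counter({(1, 5): 2, (2, 6): 1, (3, 7): 1})
s = [10 * c[(xx, yy)] for xx, yy in zip(xs, ys)]  # one size per point, scaled by 10
print(s)                                          # [20, 20, 10, 10]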

7. Data Preparation

Now we're ready to prepare the data for training the model. The training set is the subset of the data that we use to have the model figure out how to make predictions. Once we have the model trained, we use the remaining data, the validation set, to verify the accuracy of the model. Note that if you were to use the training data to verify the accuracy, there is a risk of overfitting, where the model works really well on that particular set of training data but not well on other data.

We also need to separate the target, the data field that we want to categorize, and the attributes — all the fields that we want to use to make a decision about the category of the individual.

from sklearn import model_selection

array = women.values # converts data frame into array of values
X = array[:,0:9]
Y = array[:,9]

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
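
As a quick, optional sanity check, you can print the shapes of the pieces to confirm the 80/20 split:

# roughly 80% of the rows end up in training and 20% in validation
print(X_train.shape, X_validation.shape, Y_train.shape, Y_validation.shape)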

8 & 9. Machine Learning and Refinement

Now let's train a Logistic Regression classifier and see how accurate a prediction we can develop.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))

Now it’s time to try out different techniques to train a model. You can see all the different model techniques at the scikit-learn website. They even have a system that will help you figure out what models to try for what situations. All the models below have additional parameters that you can tweak to refine your results but remember, only use the test data once you have finalized the complete model and you are ready to go. To this end, some people make three sets of data (training, testing, and final verification). The final verification data is used only one time at the end of the process once you are happy with your model.
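
If you do want that third, final verification set, one minimal sketch (using the X and Y arrays prepared in step 7, with illustrative proportions) is simply to split twice:

from sklearn import model_selection

# first hold out 10% of the data for a one-time final verification
X_temp, X_final, Y_temp, Y_final = model_selection.train_test_split(X, Y, test_size=0.10, random_state=7)
# then split the remainder into training and testing sets
X_train2, X_test2, Y_train2, Y_test2 = model_selection.train_test_split(X_temp, Y_temp, test_size=0.20, random_state=7)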

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

for name, model in models:
	model.fit(X_train, Y_train)
	predictions = model.predict(X_validation)
	print("\n\n** {} Validate Model on Test Data **".format(name))
	print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
	print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))

Note: when selecting different models to try I have found this page to be very helpful:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

10. Deploy Model

Finally, here is an example of how a set of attribute values is used to predict which classification a new instance, i.e. a new woman in the dataset, belongs to.

# Train the model
model = RandomForestClassifier()
model.fit(X_train, Y_train)
# Get a prediction on the previous values
newvalues = [[4,3,3,2,2,3,5,1,1]]
predictions = model.predict(newvalues)
print(predictions) 
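
Deploying usually also means keeping the trained model around so you don't have to retrain it every time. Here is a minimal sketch, assuming the joblib library is available and using a hypothetical file name:

import joblib

joblib.dump(model, "cancer_model.joblib")          # save the trained model to disk (hypothetical file name)
loaded_model = joblib.load("cancer_model.joblib")  # later, load it back
print(loaded_model.predict(newvalues))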

Exercise

Have Fun and Choose Powerfully.

7. Rating Small Project


8. Clustering

MindShift Minute

The Art of Hanging In There – Everyone has a point at which they shut down. For some, when they see numbers and letters together, they tell their brain to stop thinking. This is the time when you have to engage your power to choose to hang in there. Although it may seem as if you are not understanding anything, I urge you to stay engaged. That is the thing that enables you to go beyond where others have stopped. Continue to make powerful choices.

Understanding Homework Learning – Are you familiar with the following scenario? During class, everything looks great. You understand what the professor has done and you're feeling confident. Then you go home to do the homework and it seems as if you haven't learned anything. Don't fret. This is normal and a part of the learning process. When you're working in class, there is information coming at you that you don't catch; homework helps you to figure out the information that you missed or misunderstood. This is a good reason why doing the homework is so important. It helps to reaffirm what you know, point out the gaps in your knowledge, and allow you to continue to grow in your understanding of the subject matter.

Positive Talk: You’re all programmers now.

Lesson 5 Exercise Solution

There are only four minor changes you needed to make to the previous code with the Wisconsin breast cancer dataset. Each changed line is marked with a #***** next to it.

##### 3. LOADING
# 1. Number of times pregnant 
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test 
# 3. Diastolic blood pressure (mm Hg) 
# 4. Triceps skin fold thickness (mm) 
# 5. 2-Hour serum insulin (mu U/ml) 
# 6. Body mass index (weight in kg/(height in m)^2) 
# 7. Diabetes pedigree function 
# 8. Age (years) 
# 9. Class variable (0 or 1) 
import pandas as pd
data = pd.read_csv("https://warnermedia.teammindshift.com/data/pima-indians-diabetes.csv") #******

######### 4. ANALYSIS
print(data.describe())

########## 5. Cleaning
# fill in all the blood pressure values that are 0 with a normal blood pressure of 90.
# There are 35 rows with blood pressure 0
data.loc[data.pres == 0, 'pres'] = 90 #*****

######### 6. Visualization
# box plot
# Box plot of all fields on one graph
import matplotlib.pyplot as plt
data.boxplot()

######### 7. Prep Data
from sklearn import model_selection

array = data.values # converts data frame into array of values
X = array[:,0:8] #*****
Y = array[:,8] #*****

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

################ 8. Try Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))

################ 9. Compare with other models
from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

for name, model in models:
	model.fit(X_train, Y_train)
	predictions = model.predict(X_validation)
	print("\n\n** {} Validate Model on Test Data **".format(name))
	print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
	print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))

Clustering

Clustering is an unsupervised machine learning technique that involves grouping data points to find useful patterns. Clustering is different from classification. In classification, we seek to make predictions about new data instances by placing them into known categories, e.g. the women in the previous Lesson 5 dataset. In clustering, we don't know in advance what the categories will be. We are seeking to discover which rows of data are similar to each other in some manner.

For example, if we had a dataset with viewership demographic information, we might want to identify groups of viewers with shared characteristics so that we could perhaps target marketing campaigns to some of those groups. We might think we know what some categories could be, such as by a sport, but we might discover some previously unknown similarities by doing cluster analysis.

In this lesson, we’ll use iris data as our example to demonstrate how to do cluster analysis from just the attributes to identify groupings that occur in the dataset. Note that we will run the analysis on data that contains measurements of various attributes of flowers but not the species.

We will look at how to determine the optimal number of clusters we should form. Once we have formed the clusters, we'll see how well they correspond to the actual classes in the data; in our case the target is the species of the flowers represented in the dataset. Note: for most data that you cluster, you will not have the target available to check for validation, so keep in mind that this is an exercise in understanding clustering and how it works.


3. Read In the Dataset

As before, this is acquiring the data and reading it into the Python environment. This allows us to interactively view and manipulate this data in a DataFrame.

import pandas as pd
df = pd.read_csv("https://warnermedia.teammindshift.com/data/iris.csv")

4. Analysis

There may be more things that we want to do to analyze a new dataset. With this dataset, notice that sepal-length, petal-length, and petal-width are highly correlated. Let's use those three fields to do our clustering.

df.corr()

5. Visualization

We might also want to visualize the data in a plot. Notice again that the sepal-width seems to act differently than the other attributes. Data science includes exploration.

df.plot()

6. Cleaning

So here I am going to create an array of only the attributes of interest. I will also create an array of the species so I can use that at a later time.

x = df.iloc[:, [0, 2, 3]].values
y = df['species'].values

7. Preparation

Now we are looking for the optimal number of clusters, i.e. the minimum number of interesting groups. We will use the elbow method for this. We will run k-means with different numbers of clusters and see where there is a bend in the curve of the within-cluster sum of squares (WCSS). This bend occurs where the WCSS stops decreasing significantly as more clusters are added. You can read more about WCSS on your own, but keep in mind you don't need to fully understand it to use it.

#Finding the optimum number of clusters for k-means classification
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()

8. Model

Finally, we run k-means. Note that there are other clustering algorithms, but this is the one we are using for now. With it, all the rows will be placed into a cluster.

#Applying kmeans to the dataset / Creating the kmeans classifier
kmeans = KMeans(n_clusters = 3,random_state = 0)
y_kmeans = kmeans.fit_predict(x)
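
If you are curious where those clusters ended up, the fitted object also exposes the center of each cluster:

# one row per cluster center, in the same column order as the x array built above
print(kmeans.cluster_centers_)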

9. Accuracy (not usually available)

Finally, to see how well this worked, we can compare the clusters that k-means found with the classification that we already know exists, i.e. the species. Usually we don't have this ability to evaluate, but it's good to know that, at least for this data, this method does pretty well at creating groups that make sense.

pd.crosstab(y_kmeans, y)

BONUS

This is a bonus to see a 3D graph of the different attributes used. It also shows the classifications that Kmeans made.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
%matplotlib inline
from sklearn import datasets
X = x
labels = y_kmeans
#Plotting
fig = plt.figure(1, figsize=(7,7))
ax = Axes3D(fig, rect=[0, 0, 0.95, 1], elev=48, azim=134)
ax.scatter(X[:, 2], X[:, 0], X[:, 1],
          c=labels.astype(float), edgecolor="k", s=50)
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
plt.title("K Means", fontsize=14)

Here is a 4-minute clip of the 20-minute video by Josh Kaufman on learning anything in 20 hours. I recommend watching the entire video in the resource section, but this clip also helps. Enjoy!

Exercise

Have Fun and Choose Powerfully.

9. Rating Clustering


10. Visualization Closer Look

Visualization in Python

Visualization is the ability to take a set of data and display that data in different ways in order to understand the nature of the system being measured.

The first thing is to read in the data. We are using data from the “Contraceptive Method Choice Data Set”. The location of the data is here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.

Here is what the website says about the data:

Data Set Information:

This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or do not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.

Attribute Information:

1. Wife’s age (numerical)
2. Wife’s education (categorical) 1=low, 2, 3, 4=high
3. Husband’s education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife’s religion (binary) 0=Non-Islam, 1=Islam
6. Wife’s now working? (binary) 0=Yes, 1=No
7. Husband’s occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term

import pandas as pd
url = "http://learn.glulife.net/data/cmc.csv"
df = pd.read_csv(url)
df.head()

To allow graphical plots we are loading the matplotlib library. We will start out with a bar chart. Note that:
  • value_counts() – counts the number of values
  • sort_index() – sorts by the index that is being used
  • plot.bar() – takes the data and plots it as a bar graph

The graph below represents the number of wives at each age in the data set.

import matplotlib.pyplot as plt

# bar charts
df['age'].value_counts().sort_index().plot.bar()
plt.show()

Now let's move on to a single-line visualization. The graph below represents the number of women at each education level in the data set. Try looking at just the data behind the graph by removing the .plot() at the end of the statement.

# single line charts
df['wife_edu'].value_counts().sort_index().plot()
plt.show()

If you want to see two values in a line plot then keep the x-axis the same and create 2 data frames. Below you see the educational levels of wives and the husbands. What do you notice in the comparison below? Note that you can also view the data as a bar graph. This may be more useful in depicting information about the data depending on what you are looking for and what the data represents.

# multiple line/bar chart for comparison
wifeedu = df['wife_edu'].value_counts().sort_index()
husbandedu = df['husband_edu'].value_counts().sort_index()

newdf = pd.concat([wifeedu, husbandedu], axis=1)

newdf.plot.bar()
plt.show()

newdf.plot()
plt.show()


For more plots go to https://pandas.pydata.org/pandas-docs/stable/visualization.html.

We can now look at scatter plots. The first scatter plot below compares age with the wife's education. The second shows the age of the women against the number of children they have. As you look at the charts, see what information you can derive from the visuals.

# scatter plot
df.plot.scatter(x='age', y='wife_edu')
plt.show()


df.plot.scatter(x='age', y='num_children')
plt.show()


To take the plot a little further, you can add a third variable represented by the size of the plot dot. See the chart below: it plots age vs. the wife's education, with the size of each dot representing the number of children.

# bubble plot
df.plot.scatter(x='age', y='wife_edu', s=df['num_children'])
plt.show()

Notice that in this plot the size of the dots are related to the number of children for each wife.

Now let's look at the age represented as a histogram. You can see that most ages are between 25 and 30 years old. The histogram is useful for displaying how the data is distributed in the data set.

# column histogram
df.age.hist()
plt.show()

We can look at that same data as a density chart. Note that the peak is around 27.

# line histogram or density
df.age.plot.density()
plt.show()

We also have 2D area charts that can give yet another view of the data. Notice that you can stack the charts on top of one another, or you can do the same thing with a bar graph.

# 2d Area chart
newdf.plot.area()
plt.show()

newdf.plot.area(stacked=False)
plt.show()

# stacked column plot
newdf.plot.bar(stacked=True)
plt.show()

You may want to see the percentage makeup of values. Here is the husband's education broken down in a pie chart. Notice that over 60% of the husbands have their education at the high level.

# pie chart
newdf['husband_edu'].plot.pie()
plt.show()

newdf['husband_edu'].plot.pie(autopct='%.2f')
plt.show()

You may want to see the data just as a table. The tabulate library provides the ability to create a tabular representation of the data. Here is more information about tabulate (https://pypi.python.org/pypi/tabulate).

# Simple Tabular Grid
from tabulate import tabulate

print("")
print (tabulate(df[['age','wife_edu']].head(), headers='keys', tablefmt='psql'))


Finally, one last way we will discuss to compare data is through boxplots. Below you see the wife's education in comparison with the husband's education. The box plot can give you quick insight into how wide the spread of the data is, where the median is, where the quartiles are, and how each data set compares with the others.

# boxplot
df[['wife_edu','husband_edu']].boxplot()
newdf.boxplot()
plt.show()


For more visualization, you can explore the following sites.

  1. https://python-graph-gallery.com/
  2. https://matplotlib.org/
  3. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
  4. https://pandas.pydata.org/pandas-docs/stable/visualization.html

Exercise

Have Fun and Choose Powerfully.

11. Additional ML Techniques

MindShift Minute

Using other tools is good practice – Excel, text editor, Tableau, MySQL, etc.

Using additional sources is good practice – friends, Google, YouTube, books, experts, LinkedIn, etc.

Learn from examples – When you see something interesting, learn from it. Start to think about the implications of what you see. How can that apply to other things that you are working on or could be working on in the future?

Generalization – How I used these techniques for teaching someone to swim

Active Learning – As you learn new information, continue to put what you learn into your own map/understanding of the topic. Also, map what you are learning onto how it can apply to what you are trying to accomplish, as well as what it might accomplish in another domain or area of interest.

Area Highlight

Glossary – you can suggest new terms that you have found and we will review them and add them to the list.

Blogs – New blog posts are being added. Check out the blog on visualization.

Completed Exercise View – on the courses page you can see your customized view

Little Extra on Exercise Completion – What do people think?

Lesson Introduction

You now have a good understanding of some of the basic ideas behind Data Science and Machine Learning. Take a moment to rejoice in how far you have come.

As with many new ideas, there is more to know but you have a good foundation to connect other information with. Today we are going to talk about two more modeling subjects: Feature Engineering and K Folding.

Feature Engineering

Feature Engineering includes several techniques that allow you to drop, ignore, modify, or create new data features for a given dataset. This allows you to use the most complete and relevant data in training your models as well as creating a richer environment for modeling. Below we will outline some common data issues and techniques for dealing with those issues.

Missing Data

Remove Rows and Columns

We may want to remove any column or row that has a large percentage of missing data. In the case below we set the threshold at 70% missing data in either row or column and we will delete that row or column. Missing data is data that has a None value or no value at all (blank).

import pandas as pd

threshold = .7

df = pd.DataFrame({'col1': [1, None, 1, None],
                   'col2': [1, None, None, None],
                   'col3': [None, None, 1, 1],
                   'col4': [None, None, 1, 1]})

print("Original DataFrame\n{}\n\n".format(df))

df = df[df.columns[df.isnull().mean() < threshold]]  # drop columns whose missing-value rate is above the threshold
print("Columns Deleted (col2)\n{}\n\n".format(df))

df = df.loc[df.isnull().mean(axis=1) < threshold]  # drop rows whose missing-value rate is above the threshold
print("Row Deleted (Row 1)\n{}\n\n".format(df))

NOTICE: Examine the print statement above to see how it works. Now you have a new way of printing to add to your toolbox.

Fill in Missing Information

Below we want to replace missing data with either 0 in case 1 or the mean of the values that are there in case 2.

df1 = pd.DataFrame({'col1': [1, None, 4, None]})
df2 = pd.DataFrame({'col1': [1, None, 4, None]})
# Filling all missing values with 0
df1 = df1.fillna(0)
# Filling all missing values with the mean of the existing values
df2 = df2.fillna(df2.mean())

print("Fill 0s\n{}\n\n".format(df1))
print("Fill mean\n{}\n\n".format(df2))

Remove Outliers

Next, we might want to drop rows with values more than a number of standard deviations away from the mean. This is used to get rid of outliers.

#Dropping the outlier rows with standard deviation
factor = 2
df = pd.DataFrame({'col1': [1,1,1,400,1,1,1,1,1,0,0,0,0,0,1,-400]});

upper_lim = df['col1'].mean() + df['col1'].std() * factor
lower_lim = df['col1'].mean() - df['col1'].std() * factor

print("DataFrame:\n{}\nUpper Limit: {}  Lower Limit: {}  ({} standard deviations away)".format(df, upper_lim, lower_lim, factor))

df = df[(df['col1'] < upper_lim) & (df['col1'] > lower_lim)]
print(df)

Encoding Category Data

Instead of substituting numbers for the values in a column, you may want to create a separate column for each value. This works well when the unique values are categories with no inherent order or ranking.

df = pd.DataFrame({'col1': ['yellow','red','red','yellow','yellow']})
encoded_columns = pd.get_dummies(df['col1'])
df = df.join(encoded_columns).drop('col1', axis=1)
print(df)

Breaking Data Apart

You may have a column that you want to use to derive other columns for features that may be important. For example, the day of the week may matter in building a model: predicting traffic in a sports bar may be tied to the weekday and not just the date. So you can derive additional fields from data that is already present in the DataFrame.

from datetime import date

df = pd.DataFrame({'date': ['01-01-2017',
                            '04-12-2008',
                            '23-06-1988',
                            '25-08-1999',
                            '20-02-1993']})

#Transform string to date
df['date'] = pd.to_datetime(df.date, format="%d-%m-%Y")

#Extracting Year
df['year'] = df['date'].dt.year

#Extracting Month
df['month'] = df['date'].dt.month

#Extracting passed years since the date
df['passed_years'] = date.today().year - df['date'].dt.year

#Extracting passed months since the date
df['passed_months'] = (date.today().year - df['date'].dt.year) * 12 + date.today().month - df['date'].dt.month

#Extracting the weekday name of the date
df['day_name'] = df['date'].dt.day_name()

print(df)

Kfolding

Kfolding (k-fold cross-validation) is an additional method for evaluating models: it lets you test multiple classifier techniques to see which ones perform best on your dataset. The idea is to divide the training data into a number of equally sized buckets (folds). The system then sequentially sets aside each bucket to use as test data after training on the other buckets. It records how well each classifier works on each bucket iteration and returns that to the user. The user can then see how well each classifier worked on average across the different splits of the data. This helps to combat overfitting when choosing the best classifier.

This is the idea behind kfolding.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

import warnings
warnings.filterwarnings("ignore")

url = "https://warnermedia.teammindshift.com/data/iris.csv"

df = pd.read_csv(url)

# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]

# Get Training and Validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)

model = LogisticRegression()
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold)
print(cv_results)

############################

import pandas as pd
import numpy as np

# Load dataset
url = "https://warnermedia.teammindshift.com/data/iris.csv"

df = pd.read_csv(url)

############################
# import sklearn model_selection code
from sklearn import model_selection

# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]

# Get Training and Validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)
#################################

from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

##################################
names = []
models = []

# Spot Check Algorithms
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

results = []
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

for name, model in models:
	cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
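
One common way to compare these cross-validation results at a glance, reusing the plotting ideas from earlier lessons, is a box plot of the per-fold scores for each model:

import matplotlib.pyplot as plt

# one box per model, showing the spread of its 10 cross-validation scores
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.ylabel('Cross-validation accuracy')
plt.show()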

Exercise

Have Fun and Choose Powerfully.

12. Rating More Techniques


13. Summary

Your Why


MindShift

Internal Feelings and Conversations

  • Emotions – Choice
  • Positive Self Talk
  • Not hard or easy but unknown to known

Learning To Learn How To Learn

  • Build Your Own World Model of What You Are Learning (why am I learning this and how do I apply it)
  • Learn Functions not all Focus on Skill
  • Problem Solving (identify what I know and don't know, then turn what I don't know into what I know)

Generalize

  • How can what I am learning here apply somewhere else
  • How can what I know somewhere else apply here

Data Science

Using Data Science to Solve Problems

  • Self Learning Game
  • Driverless Cars

Data Usage Progression

  • Spreadsheets – Access to Data
  • Dashboards – Discovery Patterns in Data
  • Predictive – What May Happen
  • Prescriptive – What to Do
  • Automation – Correct for You

AI Techniques

  • Visualization
  • Machine Learning
  • Deep Learning
  • Other

Anatomy of a Data Science Project

  • Data Gathering
  • Data Understanding
  • Data Cleaning
  • Data Modeling
  • System Deployment

Data Science Mindset

  • Problem Solver
  • Constant Learner
  • Right Tool
  • Know-How to Code
  • Understand the Business

Bias and Ethics in a Data-Driven Society

  • Data Acquisition
  • Datasets
  • Data-Driven Applications

Recognizing Data Science Projects

  • Problem Definition – What Problems Exist
  • If I Had a Crystal Ball, Could I Solve the Problem?
  • Finding Data – What Data could help with prediction

Describing a Data Science Project

  • Describe Why This Project
  • Define Problem Being Solved
  • Outline Possible Solutions
  • Define Rough Steps to Solving Problem
  • Layout Possible Timeline
  • Identify Stakeholders

Ten Steps for Machine Learning Project

  1. Define the Problem
  2. Identify Data Set
  3. Load Data Into Environment
  4. Analyze Data
  5. Clean Data
  6. Visualize Data
  7. Prepare Data For Machine Learning
  8. Try a Machine Learning Algorithm
  9. Refine the Machine Learning Algorithm
  10. Deploy


14. Post survey


15. Rating Overall Course
