09. Data Science Project Overview – Content [261]

Introduction

We have now reached the point of doing some programming. This is when you may start to feel those emotions creeping in. If they are, remember the first lesson about choice. If they are not, good for you. This section takes you through developing a simple classification model, one step at a time, with an explanation of each step. Remember that you are not trying to learn everything about programming. The goal is to give you an overview of how these models are programmed by coding a very simple one.

Model Creation Steps

    1. Defining Problem
    2. Identifying Data Set
    3. Loading Data Into Environment
    4. Analyze Data
    5. Cleaning Data
    6. Visualizing Data
    7. Preparing Data For Machine Learning
    8. Trying a Machine Learning Algorithm
    9. Refining the Machine Learning Algorithm
    10. Deploying

1. Defining Problem

The purpose of this step is to be very clear about what your goals are, the type of data you will use, and the techniques you might apply. The project identification worksheet is a good tool for this step.

We have filled out the Worksheet below.


We want to classify the species of an iris sample based on its sepal and petal measurements.

We will use the gathered data to train a model that makes this prediction.

The three species are Iris setosa, Iris versicolor, and Iris virginica.
2. Identifying Data Set

https://warnermedia.teammindshift.com/data/iris.csv

https://en.wikipedia.org/wiki/Iris_flower_data_set – more information

 

3. Loading Data Into Environment

######################## READ IN DATA ############################

import pandas as pd
import numpy as np

# Load dataset
url = "https://warnermedia.teammindshift.com/data/iris.csv"

df = pd.read_csv(url)

4. Analyze Data
df.describe(include='all')
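
Beyond the summary statistics, a few other quick checks are often useful at this step: the size of the table, the first rows, missing values, and how many samples each species has. The following is a minimal sketch only. It uses scikit-learn's bundled copy of the iris data as a stand-in for the CSV, and the measurement column names are assumptions that may not match those in iris.csv.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the CSV: scikit-learn's bundled copy of the iris data.
# The measurement column names are assumptions for illustration.
iris = load_iris()
df_demo = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
df_demo["species"] = [str(iris.target_names[i]) for i in iris.target]

# Number of rows and columns
print(df_demo.shape)

# First few rows, to see what the values look like
print(df_demo.head())

# Count of missing values in each column
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Number of samples for each species
class_counts = df_demo["species"].value_counts()
print(class_counts)
```

If the species counts are very uneven, or a column has many missing values, that is worth knowing before moving on to cleaning.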
5. Cleaning Data
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df['species'] = LE.fit_transform(df['species'])

print(df)

# 0 - Setosa
# 1 - Versicolor
# 2 - Virginica
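
LabelEncoder assigns the integer codes in sorted order of the labels, and it remembers the mapping so that codes can be turned back into names later. The sketch below illustrates this with a short, hypothetical list of labels rather than the real 'species' column; the exact label spelling in iris.csv may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'species' column.
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the original labels; a label's position is its integer code.
print(list(encoder.classes_))

# The encoded values, in the same order as the input list.
print(list(codes))

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is what makes it possible to report a prediction as a species name instead of a number.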
6. Visualizing Data
df.boxplot()
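
A box plot shows the spread of each column on its own. To see how the species separate from one another, a scatter plot matrix colored by species can also help. This is a sketch under assumptions: it uses scikit-learn's bundled iris data as a stand-in for the CSV, the column names are illustrative, and it requires matplotlib to be installed.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the CSV; the column names are assumptions for illustration.
iris = load_iris()
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df_demo = pd.DataFrame(iris.data, columns=feature_names)
df_demo["species"] = iris.target  # already integer-coded: 0, 1, 2

# One scatter plot for each pair of measurements, colored by species code,
# with a histogram of each measurement on the diagonal.
axes = pd.plotting.scatter_matrix(
    df_demo[feature_names],
    c=df_demo["species"],
    figsize=(8, 8),
    diagonal="hist",
)
```

In a notebook the figure is displayed automatically. In a plain script, import matplotlib.pyplot and call its show() function afterwards.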
7. Preparing Data For Machine Learning
# import sklearn model_selection code
from sklearn import model_selection

# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]

# Get Training and Validation sets
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)

NOTE: Discuss overfitting
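
Overfitting means a model fits the training data so closely that it performs worse on data it has not seen. Holding out a validation set is what lets you check for this: if accuracy on the training set is much higher than on the validation set, the model is probably overfitting. The sketch below shows one way to make that comparison. It uses scikit-learn's bundled iris data as a stand-in for the CSV, and the max_iter value is an assumption chosen to let the solver converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the CSV: features and integer-coded species.
X_demo, y_demo = load_iris(return_X_y=True)

# Same split settings as above: 80% for training, 20% held out.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

# max_iter=1000 is an assumption to give the solver room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# score() returns the accuracy on the data it is given.
train_acc = clf.score(X_tr, y_tr)
val_acc = clf.score(X_val, y_val)

print("Training accuracy:   {:.3f}".format(train_acc))
print("Validation accuracy: {:.3f}".format(val_acc))
print("Gap:                 {:.3f}".format(train_acc - val_acc))
```

A small gap suggests the model generalizes reasonably well. With only 30 validation samples, a single misclassified sample changes the validation accuracy by more than three percentage points, so small differences should not be over-interpreted.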

8. Trying a Machine Learning Algorithm
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))
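
Accuracy and the confusion matrix summarize the model as a whole. It can also be useful to look at per-species results and to try the model on a single new flower. The following is a sketch under assumptions: it uses scikit-learn's bundled iris data as a stand-in for the CSV, max_iter=1000 is chosen to let the solver converge, and new_sample is a hypothetical set of measurements (sepal length, sepal width, petal length, petal width, in cm).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_val, y_tr, y_val = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Precision, recall and F1 score for each species on the validation set.
report = classification_report(
    y_val,
    clf.predict(X_val),
    labels=[0, 1, 2],
    target_names=iris.target_names,
)
print(report)

# Hypothetical measurements for one new flower.
new_sample = [[5.1, 3.5, 1.4, 0.2]]

# predict() returns the integer code; look up the matching species name.
predicted_code = clf.predict(new_sample)[0]
predicted_name = iris.target_names[predicted_code]

# predict_proba() returns one probability per species, in code order.
probabilities = clf.predict_proba(new_sample)[0]

print("Predicted species:", predicted_name)
print("Probabilities:", probabilities)
```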
9. Refining the Machine Learning Algorithm
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Try a different algorithm on the same training and validation sets
model = LinearDiscriminantAnalysis()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'])))
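
Comparing two algorithms on a single validation split can be misleading, because the result depends on which samples happened to land in that split. One common alternative is k-fold cross-validation, which repeats the train and evaluate cycle on several different splits and averages the scores. The sketch below compares the two models used in this section in that way. It is an illustration under assumptions: it uses scikit-learn's bundled iris data as a stand-in for the CSV, and the choices of 5 folds, random_state=7 and max_iter=1000 are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)

# The two algorithms tried in this section.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearDiscriminantAnalysis": LinearDiscriminantAnalysis(),
}

# 5 folds, each keeping the same proportion of every species.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

mean_scores = {}
for name, estimator in candidates.items():
    # One accuracy value per fold.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=folds, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print("{}: mean accuracy {:.3f} (std {:.3f})".format(name, scores.mean(), scores.std()))
```

In a full project, this comparison would normally be run on the training set only, so that the validation set stays untouched for a final check of the chosen model.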