# 09. Data Science Project Overview – Content 

## Introduction

We are now at the point of doing some programming. This is when you may start to feel those emotions creeping in. If they are, remember the first lesson about choice. If they are not, good for you. This section takes you through developing a simple classification model, one step at a time, with an explanation of each step. Remember, you are not trying to learn everything about programming. The aim is to give you an overview of how these models are programmed by coding a very simple one.

## Model Creation Steps

1. Defining Problem
2. Identifying Data Set
3. Analyze Data
4. Cleaning Data
5. Visualizing Data
6. Preparing Data For Machine Learning
7. Trying a Machine Learning Algorithm
8. Refining the Machine Learning Algorithm
9. Deploying

## 1. Defining Problem

In this step, be very clear about your goals, the type of data you will use, and the techniques you might apply. The project identification worksheet works well for this step.

We have filled out the worksheet below.

We want to classify the species of an iris sample based on its sepal and petal measurements.

We will use the gathered data to train a model that makes this prediction.

## 2. Identifying Data Set

https://warnermedia.teammindshift.com/data/iris.csv

```python
######################## READ IN DATA ############################

import pandas as pd
import numpy as np

url = "https://warnermedia.teammindshift.com/data/iris.csv"

# Load the CSV into a DataFrame; the later steps refer to it as df
df = pd.read_csv(url)
```
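To make the shape of the data concrete, here is a minimal sketch that parses a few rows with Python's standard `csv` module instead of downloading the file. The four measurement column names and the sample rows are assumptions for illustration; only the `species` column name comes from the code in this lesson.

```python
import csv
import io

# Hypothetical excerpt of an iris CSV: the measurement column names and
# the rows are illustrative assumptions, not the contents of the real file.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, converting measurements to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {name: float(value) for name, value in record.items() if name != "species"}
        row["species"] = record["species"]
        rows.append(row)
    return rows

rows = read_rows(SAMPLE_CSV)
print(len(rows), "rows;", "columns:", list(rows[0]))
```

Each row holds four numeric measurements plus one text label, which is the layout the later steps rely on.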
## 3. Analyze Data

Summarize every column to get a first feel for the data (counts, means, spread, and ranges):

`df.describe(include='all')`
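
For each numeric column, `describe` reports values such as the count, mean, standard deviation, minimum, and maximum. A minimal sketch of those calculations with the standard `statistics` module, using a handful of made-up petal lengths rather than the real data set:

```python
import statistics

# Hypothetical petal lengths in cm; illustrative values, not taken from the data set.
petal_lengths = [1.4, 1.5, 4.5, 4.7, 5.1, 6.0]

summary = {
    "count": len(petal_lengths),
    "mean": statistics.mean(petal_lengths),
    "std": statistics.stdev(petal_lengths),  # sample standard deviation
    "min": min(petal_lengths),
    "max": max(petal_lengths),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

A large gap between the minimum and maximum, or a standard deviation that is big relative to the mean, is a hint that the column separates the samples into distinct groups.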
## 4. Cleaning Data
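```python
from sklearn.preprocessing import LabelEncoder

# Convert the species names into integer codes the model can work with
LE = LabelEncoder()
df['species'] = LE.fit_transform(df['species'])

print(df)

# LabelEncoder assigns codes in alphabetical order of the labels:
# 0 - Setosa
# 1 - Versicolor
# 2 - Virginica
```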
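
What the encoder does can be sketched in plain Python: collect the distinct labels, sort them, and give each one its position as a code. The label strings below are assumed for illustration; the real file may spell them differently.

```python
# Hypothetical species labels; the spelling in the real file may differ.
species = ["versicolor", "setosa", "virginica", "setosa", "versicolor"]

# Sort the distinct labels so the codes follow alphabetical order.
classes = sorted(set(species))
label_to_code = {label: code for code, label in enumerate(classes)}

# Replace every label with its integer code.
encoded = [label_to_code[label] for label in species]

# Reverse lookup, for turning predicted codes back into names.
code_to_label = {code: label for label, code in label_to_code.items()}

print(label_to_code)
print(encoded)
```

Keeping the reverse lookup around is useful later, because the model's predictions come back as codes rather than species names.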
## 5. Visualizing Data

Draw a box plot of each numeric column to compare their ranges and spot outliers:

`df.boxplot()`
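
A box plot draws each column's quartiles: the box spans the first to third quartile, the line inside is the median, and points far outside the box are shown as outliers. A minimal sketch of those numbers, using made-up sepal widths and the common rule of 1.5 times the interquartile range for outliers (an assumption about the plotting defaults, so check the documentation for your version):

```python
import statistics

# Hypothetical sepal widths in cm; illustrative values only.
sepal_widths = [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4]

# Quartiles: the edges of the box (q1, q3) and the median line (q2).
q1, q2, q3 = statistics.quantiles(sepal_widths, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in sepal_widths if w < lower_fence or w > upper_fence]

print(f"q1={q1:.2f} median={q2:.2f} q3={q3:.2f} outliers={outliers}")
```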
## 6. Preparing Data For Machine Learning
```python
# import sklearn model_selection code
from sklearn import model_selection

# Split-out validation dataset
array = df.values
X = array[:,0:4]  # first four columns: the measurements (features)
Y = array[:,4]    # fifth column: the encoded species (target)

# Get Training and Validation sets (80% training, 20% validation)
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)
```
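
The idea behind the split can be sketched without any library: shuffle the row positions with a fixed seed, then hold back the last 20% for validation. This illustrates the concept only; it is not scikit-learn's actual implementation, and the helper name `split_indices` is made up for this example.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=7):
    """Return (train, validation) lists of row positions."""
    positions = list(range(n_rows))
    # A fixed seed makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(positions)
    n_validation = round(n_rows * test_size)
    cut = n_rows - n_validation
    return positions[:cut], positions[cut:]

# 150 rows is the size of the classic iris data set; assumed here for illustration.
train_idx, validation_idx = split_indices(150)
print(len(train_idx), "training rows,", len(validation_idx), "validation rows")
```

Shuffling before cutting matters because data files are often sorted by class, and an unshuffled split could leave a whole species out of the training set.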

NOTE: Discuss overfitting
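
Overfitting means a model has memorized its training data, so it scores well there but worse on data it has not seen. That is why the validation set above is kept out of training. A minimal sketch with a deliberately naive, hypothetical "memorizer" model and made-up measurements:

```python
# Hypothetical (petal_length, species) pairs; illustrative values only.
training = [(1.4, "setosa"), (1.5, "setosa"), (4.5, "versicolor"),
            (4.9, "versicolor"), (5.1, "virginica"), (6.0, "virginica")]
validation = [(1.3, "setosa"), (4.7, "versicolor"), (5.0, "versicolor"),
              (5.8, "virginica")]

# The "memorizer" stores every training example and only knows exact matches.
memory = dict(training)

def memorizer(petal_length):
    return memory.get(petal_length, "unknown")

def accuracy(model, data):
    """Fraction of samples in data that the model labels correctly."""
    correct = sum(1 for x, label in data if model(x) == label)
    return correct / len(data)

train_accuracy = accuracy(memorizer, training)
validation_accuracy = accuracy(memorizer, validation)
print(f"training accuracy:   {train_accuracy:.2f}")
print(f"validation accuracy: {validation_accuracy:.2f}")
```

The memorizer is perfect on the data it has seen and useless on anything new. Real models overfit less dramatically, but the gap between training and validation accuracy is the same warning sign.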

## 7. Trying a Machine Learning Algorithm
```python
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, Y_train)

# Predict the species of the held-out validation samples
predictions = model.predict(X_validation)

print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))
```
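
Both metrics are simple counts. Accuracy is the share of predictions that match the true label, and the confusion matrix counts how often each true class was predicted as each class. A minimal sketch with made-up labels, using integer class codes like those produced in the cleaning step:

```python
from collections import Counter

# Hypothetical true and predicted class codes (0, 1, 2); illustrative only.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1]

# Accuracy: fraction of positions where the prediction equals the truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: count each (true, predicted) pair.
pair_counts = Counter(zip(y_true, y_pred))
classes = sorted(set(y_true) | set(y_pred))
matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]

print(f"accuracy: {accuracy}")
for t, row in zip(classes, matrix):
    print(f"true {t}: {row}")
```

Numbers on the diagonal of the matrix are correct predictions; anything off the diagonal shows which classes the model confuses with each other.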
## 8. Refining the Machine Learning Algorithm
```python
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Try a different algorithm on the same training and validation sets
model = LinearDiscriminantAnalysis()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'])))
```
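
Refining comes down to scoring each candidate on the same validation data and keeping the one that does best. A minimal sketch of that comparison loop, using two hypothetical threshold rules in place of real models and made-up validation samples:

```python
# Hypothetical validation samples as (petal_length, class code); illustrative only.
validation = [(1.4, 0), (1.6, 0), (4.4, 1), (4.9, 1), (5.0, 2), (5.9, 2)]

def rule_a(petal_length):
    """Candidate A: cut points at 2.5 and 4.8 (assumed values)."""
    return 0 if petal_length < 2.5 else (1 if petal_length < 4.8 else 2)

def rule_b(petal_length):
    """Candidate B: same idea, second cut moved to 4.95 (assumed value)."""
    return 0 if petal_length < 2.5 else (1 if petal_length < 4.95 else 2)

def accuracy(model, data):
    """Fraction of samples in data that the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Score every candidate on the same data so the numbers are comparable.
scores = {name: accuracy(model, validation)
          for name, model in [("rule_a", rule_a), ("rule_b", rule_b)]}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The same loop works with real models: fit each one on the training set, score it on the validation set, and compare.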