10. Creating a Data Science Project – Content

In this section, you will create a predictive model. The goal is to give you a basic hands-on understanding of machine learning, not to achieve a highly accurate prediction. The process is a simple structure that you can extend with additional tools and techniques to increase the model's accuracy, but our focus here is on basic understanding. We will do the work in Python. If you are new to programming, here is some additional reading before you get started.

Programming Mindset Help for Beginners

7 Concepts to Help Programming In Any Language

The system we will be using is described below. It lets you run your code in a web environment, which relieves you of having to install anything on your machine. Watch the 3-minute video for an introduction to Colab by google.com.


This problem uses real data to give you a feel for how machine learning predicts an outcome by training a model on data. The project is very simple, but it is a skeleton for how machine learning projects work: you take the data and, based on its different characteristics, predict an outcome using machine learning techniques.

Hands-On Steps:

  1. Click on the Python environment, log in with or create a Gmail account (it’s free), then start up a Jupyter notebook
  2. Copy the code into a cell of the notebook then run it
  3. The code will:
    1. Load the needed libraries for the task at hand: the pandas library for data storage and manipulation and sklearn library for machine learning
    2. Read data into the environment. The data is read into a dataframe
    3. To give you an understanding of the data, we will use a function called describe. This gives you some statistics on the data as well as insight into the fields that exist. Look at the printout and note the range of the different fields of information.
    4. The data will now be prepared for machine learning. We will create a training set of data and then a data set to validate the accuracy of the model
    5. Create a classifier using RandomForestClassifier. There are many different classifiers you can use and parameters you can set for them, and each classifier has its strengths and weaknesses. Once the classifier is selected, we train the model on the training set and then check how it did on the validation set. We don’t want to tune the model to the validation set, so we use it only as a check and only at the end. We also don’t want the model to overfit to the training dataset, which is what the validation check helps reveal.
    6. To get further insight into our accuracy, we can create a confusion matrix that shows how we did with our predictions. For each correct answer, it tells us how many times we guessed each of the possible answers.
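
The steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the course dataset is not shown here, so scikit-learn's built-in iris data stands in for it (an assumption), and RandomForestClassifier is assumed to be the classifier intended. The variable names and the 80/20 split are also illustrative choices.

```python
# Minimal sketch of the hands-on steps. Assumption: the built-in iris data
# stands in for the course dataset, which is not shown in this section.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Read the data into a pandas dataframe
iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus a "target" column

# Summary statistics: note the range of each field
print(df.describe())

# Separate the characteristics (features) from the outcome to predict
X = df.drop(columns="target")
y = df["target"]

# Hold back 20% of the rows as a validation set (illustrative split size)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create the classifier and train it on the training set only
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Check the model once, at the end, against the validation set
predictions = model.predict(X_valid)
accuracy = accuracy_score(y_valid, predictions)
print("Validation accuracy:", accuracy)

# Rows are the correct answers, columns are the answers the model guessed
matrix = confusion_matrix(y_valid, predictions)
print(matrix)
```

In the confusion matrix, the counts on the diagonal are correct predictions; any count off the diagonal is a row's true class that was guessed as the column's class.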

Things to note:
# marks a comment; the rest of that line is not executed
Each comment describes the code beneath it
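
As a small illustration of how comments behave (the variable name here is made up for the example):

```python
# This whole line is a comment, so Python skips it.
total = 2 + 3  # Everything after the # on this line is skipped as well.
print(total)
```

Only the assignment and the print call are executed; the text after each # is there for the reader.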

Try the Exercise Below

Try the following to test your skills. You may have to look up how to do specific things in Python.