This lesson is still being designed and assembled (Pre-Alpha version)

Data Pre-Processing using Python

Steps In A Data Science Project

Overview

Time: 0 min
Objectives
  • Understand the steps in a data science project

What is Data Science?

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data scientists apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificially intelligent systems to perform tasks. These systems generate insights which analysts and business users can translate into tangible business value.

What are the steps in a Data Science Project?

The following are the steps that are generally followed in Data Science projects.

Step 1 : Obtain Data

In this step, we obtain the data that we need from available data sources.
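As a minimal sketch of this step, the snippet below reads tabular data into a pandas DataFrame. The CSV text and the column names are made up for illustration; in a real project the source would more likely be a file path, a URL, or a database query.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file or download.
csv_text = """name,horsepower,weight
car_a,130,3504
car_b,,3693
car_c,150,3436
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)   # number of rows and columns
print(cars.dtypes)  # data type inferred for each column
```

The empty horsepower field for car_b is read as a missing value (NaN), which is the kind of problem the next step deals with.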

Step 2 : Scrubbing Data

Once we obtain the data from various sources, we need to clean it, because analysis or modelling performed on unclean data gives results that are neither useful nor accurate. This step includes handling missing values, encoding data, and similar tasks. It is also known as “Data Pre-Processing”.

Step 3 : Explore Data

Once the data has been cleaned, we examine it and try to make sense of it. What does this data represent? What questions can be answered using this data? What needs to be predicted using this data? These are some of the questions answered in this step. We also try to identify significant patterns and trends in our data using data visualization.
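A first look at a dataset can be sketched with a few pandas calls. The table below is invented for illustration, and the column names are assumptions rather than part of any real dataset.

```python
import pandas as pd

# Made-up data used only for illustration.
cars = pd.DataFrame({
    "cylinders": [8, 8, 4, 4, 6],
    "horsepower": [130.0, 165.0, 95.0, 88.0, 105.0],
    "mpg": [18.0, 15.0, 24.0, 27.0, 21.0],
})

print(cars.head())        # first rows: what does each column represent?
print(cars.describe())    # count, mean, spread, and range per column
print(cars.isna().sum())  # missing values per column

# Average mpg per cylinder count: a first look at a possible trend.
mpg_by_cylinders = cars.groupby("cylinders")["mpg"].mean()
print(mpg_by_cylinders)
```

Summaries like these usually suggest which plots are worth drawing next.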

Step 4 : Model Data

In this stage, we use the cleaned data to train machine learning models. A trained model can then be used to predict the outcome when it is presented with a new data entry. For example, we can train a spam detector using the emails in an inbox; when a new email arrives, the trained model predicts whether it is spam or not.
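The spam example can be sketched with scikit-learn as follows. The four training emails and their labels are invented, and the choice of a bag-of-words representation with a naive Bayes classifier is only one possible approach, assumed here for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set (invented): 1 = spam, 0 = not spam.
emails = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
labels = [1, 1, 0, 0]

# Turn each email into word counts, then fit a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Present a new, unseen email to the trained model.
new_email = ["claim your free prize"]
prediction = model.predict(new_email)
print("spam" if prediction[0] == 1 else "not spam")
```

A real detector would need far more training data and a held-out test set to estimate how well it generalizes.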

Step 5 : Interpreting Data

In this step, we deliver the results that answer the business questions we asked when we first started the project, together with the actionable insights that we found through the data science process.

Key Points


Steps in Data Pre-Processing

Overview

Time: 0 min
Objectives
  • To understand the steps in data pre-processing

What is Data Pre-Processing?

Before we start exploring our data to extract insights from it, we need to process it to make it understandable. Pre-processing the raw data helps to organize, scale, clean, and simplify it so that it can be used to train machine learning algorithms.

The following are the steps in Data Pre-Processing:

Handling Missing Values

Handling missing values is an important step, as they can affect your model. It is important to identify the missing values and to know which values can replace them. Depending on what is missing, a decision is made on whether or not to keep the entries with missing data.
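Identifying missing values, and dropping the affected entries when that is the chosen option, might look like the sketch below. The table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up table with gaps in two columns.
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, np.nan, 27.0],
    "horsepower": [130.0, np.nan, 95.0, 88.0],
    "weight": [3504, 3693, 2372, 2130],
})

# Count the missing values in each column.
missing_per_column = cars.isna().sum()
print(missing_per_column)

# dropna returns a new DataFrame; the original is left unchanged.
complete_rows = cars.dropna()
print(len(cars), "rows before,", len(complete_rows), "rows after")
```

Here dropping removes half of the rows, which shows why discarding entries can be costly on small datasets.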

Instead of discarding entries, a better strategy is often to impute the missing values, that is, to infer them from the known part of the data. The SimpleImputer class in scikit-learn provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using a statistic (mean, median, or most frequent value) of the column in which they are located.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.impute import MissingIndicator


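# df is assumed to be an already-loaded DataFrame in which
# 'horsepower' is the only column that contains missing values
indicator = MissingIndicator(missing_values=np.nan)  # np.NaN was removed in NumPy 2.0
missing_mask = indicator.fit_transform(df)
missing_mask = pd.DataFrame(missing_mask, columns=['horsepower'])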


# replace the missing values in columns 1 to 6 with the mean of each column

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:, 1:7])
df.iloc[:, 1:7] = imputer.transform(df.iloc[:, 1:7])
df
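Because the snippet above relies on a DataFrame named df that is loaded elsewhere, here is a self-contained sketch of mean imputation on a small made-up table. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing value per column.
cars = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0],
    "weight": [3504.0, 3693.0, np.nan],
})

# Replace each NaN with the mean of the known values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(cars), columns=cars.columns)
print(filled)
```

The missing horsepower becomes 140.0 (the mean of 130 and 150) and the missing weight becomes 3598.5 (the mean of 3504 and 3693). Passing strategy="median" or strategy="most_frequent" instead selects the other statistics mentioned in the text.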

Standardization

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing each feature by its standard deviation. After standardization, each feature has a mean of zero and a standard deviation of one. For this task, we can use scikit-learn's StandardScaler.


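# X is assumed to be a DataFrame of features that still contains the text column 'car name'
# the default with_mean=True centers each feature, as described above
sc_X = StandardScaler()
X = sc_X.fit_transform(X.drop(['car name'], axis=1))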
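To check the definition on concrete numbers, the sketch below standardizes a small made-up array and compares the result with the formula z = (x - mean) / standard deviation, computed by hand for each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
features = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# The same computation written out: subtract the column mean,
# then divide by the column standard deviation.
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns hold the same values, because they differed only in scale.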

Normalization

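Normalization is the process of scaling individual samples (rows) so that each has unit norm. It is useful when an algorithm compares samples through dot products or other similarity measures, so that predictions depend on the relative weights of the features within a sample rather than on the overall magnitude of the sample.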

from sklearn.preprocessing import Normalizer

# rescale every row of X to unit norm (L2 by default)
nm = Normalizer()
x_sc = nm.fit_transform(X)
X = pd.DataFrame(x_sc)
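The effect is easier to see on a tiny made-up array. In this sketch each row is divided by its own Euclidean length, which is what Normalizer does with its default L2 norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Each row is one sample; the values are invented for illustration.
samples = np.array([[3.0, 4.0],
                    [1.0, 0.0],
                    [5.0, 12.0]])

normalized = Normalizer().fit_transform(samples)

# Every row should now have length 1.
row_norms = np.linalg.norm(normalized, axis=1)

print(normalized)
print(row_norms)
```

The first row [3, 4] has length 5, so it becomes [0.6, 0.8]. Unlike standardization, which works column by column, normalization works row by row.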

Encoding categorical features

Most scikit-learn estimators expect numerical input and cannot work directly with categorical values stored as strings. Therefore, it is important to convert categorical features to a numerical representation.

Label encoding converts the labels into numeric form. For example, if a column in the data set contains the values {Chicago, Urbana, Springfield}, they can be mapped to 0, 1, 2.
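As a small illustration of label encoding, the sketch below applies scikit-learn's LabelEncoder to the city names from the example. Note that LabelEncoder assigns codes in sorted order of the labels, so the exact numbers can differ from the order in which the values are listed.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["Chicago", "Urbana", "Springfield", "Urbana"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the labels in the order of their integer codes.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original labels.
restored = list(encoder.inverse_transform(codes))
print(restored)
```

LabelEncoder is documented as a tool for target labels; for input features, scikit-learn's OrdinalEncoder offers the same kind of mapping.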

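However, the dataset contains many car model names stored as strings, and these names have no natural order, whereas integer labels would suggest one. Therefore, we use the one-hot encoding technique, which creates one binary column per category.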

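from sklearn.preprocessing import OneHotEncoder

# df is assumed to be the loaded DataFrame that still contains the 'car name' column
# dtype=int replaces np.int, which was removed from NumPy; the output is sparse by default
onehot = OneHotEncoder(dtype=int)
nominals = pd.DataFrame(
    onehot.fit_transform(df[['car name']]).toarray())
nominals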
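A self-contained sketch of one-hot encoding on a made-up column shows what the resulting table looks like. The car names are invented, and the column is labeled with the categories the encoder learned.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column.
cars = pd.DataFrame({"car name": ["ford pinto", "honda civic", "ford pinto"]})

encoder = OneHotEncoder(dtype=int)

# The encoder returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(cars[["car name"]]).toarray()

# categories_[0] lists the categories found in the first (and only) column.
nominals = pd.DataFrame(encoded, columns=encoder.categories_[0])
print(nominals)
```

Each row contains a single 1 in the column of its category and 0 everywhere else, so no ordering between categories is implied.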

Discretization

Data discretization converts a large number of continuous data values into a small number of intervals (bins) so that evaluating and managing the data becomes easier. There are two forms of data discretization: supervised and unsupervised. Supervised discretization uses the class labels to decide where the bin boundaries go. Unsupervised discretization does not use class labels; the bins depend only on how the operation proceeds over the values themselves, for example by splitting them into equal-width intervals.

from sklearn.preprocessing import KBinsDiscretizer

# split every feature into 6 equal-width bins and one-hot encode the bin index
disc = KBinsDiscretizer(n_bins=6, encode='onehot', strategy='uniform')
disc.fit_transform(X)
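To see which bin each value lands in, the sketch below discretizes a single made-up feature into three equal-width bins. It uses encode="ordinal" so that the output is the bin index itself rather than a one-hot matrix. This is an unsupervised form of discretization, since no class labels are involved.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One made-up feature with values from 0 to 50.
values = np.array([[0.0], [10.0], [20.0], [30.0], [40.0], [50.0]])

disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(values)

# bin_edges_ holds the boundaries computed for each feature.
print(disc.bin_edges_[0])
print(bins.ravel())
```

With strategy="uniform" the range 0 to 50 is cut into three intervals of equal width, so 0 and 10 fall in bin 0, 20 and 30 in bin 1, and 40 and 50 in bin 2.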

Key Points


Visualizing data in Python

Overview

Time: 0 min
Objectives
  • To understand data visualization in Python

What do you need to visualize data in Python?

To graph data in Python, we can use the matplotlib library. Its pyplot module provides the plotting functions used in this lesson. To install matplotlib, run: pip install matplotlib

Types of plots in Python

We can plot the following types of graphs in Python:

Line Graph

A line graph connects each x-axis value and its corresponding y-axis value with a line.


import matplotlib.pyplot as plt

# x-axis values
x = [1, 2, 3, 4, 5]
# y-axis values (the square of each x value)
y = [1, 4, 9, 16, 25]

# plotting the points
plt.plot(x, y, color='green', linestyle='dashed', linewidth=3,
         marker='o', markerfacecolor='blue', markersize=12)

# naming the x-axis
plt.xlabel('x - axis: numbers')
# naming the y-axis
plt.ylabel('y - axis: Square(x)')
# plot title
plt.title('x^2')

# function to show the plot
plt.show()

Bar Graph


import matplotlib.pyplot as plt

# x-coordinates of left sides of bars
left = [1, 2, 3, 4, 5]

# heights of bars
height = [10, 24, 36, 40, 5]

# labels for bars
tick_label = ['one', 'two', 'three', 'four', 'five']

# plotting a bar chart
plt.bar(left, height, tick_label=tick_label,
        width=0.8, color=['red', 'green'])

# naming the x-axis
plt.xlabel('x - axis')
# naming the y-axis
plt.ylabel('y - axis')
# plot title
plt.title('My bar chart!')

# function to show the plot
plt.show()


Histogram


import matplotlib.pyplot as plt

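# observed ages (the histogram counts how many fall in each interval)
ages = [2, 5, 70, 40, 30, 45, 50, 45, 43, 40, 44,
        60, 7, 13, 57, 18, 90, 77, 32, 21, 20, 40]

# setting the range and no. of intervals
# (named age_range so that the built-in range is not shadowed)
age_range = (0, 100)
bins = 10

# plotting a histogram
plt.hist(ages, bins=bins, range=age_range, color='green',
         histtype='bar', rwidth=0.8)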

# x-axis label
plt.xlabel('age')
# frequency label
plt.ylabel('No. of people')
# plot title
plt.title('My histogram')

# function to show the plot
plt.show()

Scatter Plot



import matplotlib.pyplot as plt

# x-axis values
x = [1,2,3,4,5,6,7,8,9,10]
# y-axis values
y = [2,4,5,7,6,8,9,11,12,12]

# plotting points as a scatter plot
plt.scatter(x, y, label="stars", color="green",
            marker="*", s=30)

# x-axis label
plt.xlabel('x - axis')
# y-axis label
plt.ylabel('y - axis')
# plot title
plt.title('My scatter plot!')
# showing legend
plt.legend()

# function to show the plot
plt.show()


Pie Chart



import matplotlib.pyplot as plt

# defining labels
activities = ['eat', 'sleep', 'work', 'play']

# portion covered by each label
slices = [3, 7, 8, 6]

# color for each label
colors = ['r', 'y', 'g', 'b']

# plotting the pie chart
plt.pie(slices, labels=activities, colors=colors,
        startangle=90, shadow=True, explode=(0, 0, 0.1, 0),
        radius=1.2, autopct='%1.1f%%')

# plotting legend
plt.legend()

# showing the plot
plt.show()


Key Points