Steps In A Data Science Project

Overview

Time: 0 min

Objectives

Understand the steps in a data science project

What is Data Science?

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data scientists apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificially intelligent systems to perform tasks. These systems generate insights which analysts and business users can translate into tangible business value.

What are the steps in a Data Science Project?

The following are the steps that are generally followed in Data Science projects.

Step 1 : Obtain Data

In this step, we obtain the data that we need from available data sources.
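As a minimal sketch of this step, the snippet below reads a small CSV into a pandas DataFrame. The CSV text, its column names, and its values are made up for illustration and stand in for whatever data source a real project would use.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real data source;
# the columns and values are invented for this example
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
15.0,8,165,car b
24.0,4,,car c
"""

# read_csv accepts a file path, a URL, or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # number of rows and columns
print(df.isna().sum())  # number of missing values per column
```

The empty horsepower field in the last row is read as a missing value, which is what the scrubbing step then has to deal with.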

Step 2 : Scrubbing Data

Once we obtain the data from various sources, we need to clean it, because analysis or modelling performed on unclean data gives results that are neither useful nor accurate. This step includes handling missing values, encoding data, and so on. It is also known as “Data Pre-Processing”.

Step 3 : Explore Data

Once the data has been cleaned, we examine it and try to make sense of it. What does this data represent? What questions can be answered using this data? What needs to be predicted using this data? These are some of the questions answered in this step. We also try to identify significant patterns and trends in our data using data visualization.
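A quick numeric look at the data often comes before any plotting. The sketch below uses a small made-up DataFrame (the column names and values are assumptions, not the lesson's dataset) to show summary statistics and a grouped mean.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    'cylinders': [4, 4, 8, 8, 6],
    'mpg': [30.0, 28.0, 15.0, 14.0, 20.0],
})

# count, mean, std, min, quartiles and max of one column
summary = df['mpg'].describe()
print(summary)

# average mpg for each number of cylinders
mean_by_cyl = df.groupby('cylinders')['mpg'].mean()
print(mean_by_cyl)
```

In this toy data the grouped means already suggest a trend (more cylinders, lower mpg), which is the kind of pattern a plot would then make visible.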

Step 4 : Model Data

In this stage, we use the cleaned data to train machine learning models. The trained models can then be used to predict the outcome when a new data entry is presented to them. For example, train a spam detector using the mails in your inbox; when a new mail arrives, the trained model identifies whether it is spam or not.
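The spam example can be sketched with scikit-learn as follows. The four training mails and their labels are invented, and the choice of a bag-of-words representation with a naive Bayes classifier is only one possible approach, not necessarily what a real spam filter uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: 1 = spam, 0 = not spam
mails = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "lunch with the project team",
]
labels = [1, 1, 0, 0]

# turn each mail into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(mails)

# train the model on the labelled mails
model = MultinomialNB()
model.fit(X_train, labels)

# classify a new, unseen mail with the trained model
new_mail = ["claim your free prize"]
prediction = model.predict(vectorizer.transform(new_mail))[0]
print("spam" if prediction == 1 else "not spam")
```

The new mail must be transformed with the same fitted vectorizer that was used for training, so that its word counts line up with the columns the model has learned.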

Step 5 : Interpreting Data

In this step, we deliver the results that answer the business questions we asked when we first started the project, together with the actionable insights that we found through the data science process.

Key Points

Steps in Data Pre-Processing

Overview

Time: 0 min

Objectives

Understand the steps in data pre-processing

What is Data Pre-Processing?

Before we start exploring our data to extract insights from it, we need to process it to make it understandable. Preprocessing the raw data helps to organize, scale, clean, and simplify it so that it can be used to train machine learning algorithms.

The following are the steps in Data Pre-Processing:

Missing values
Standardization
Normalization
Encoding categorical features
Discretization

Handling Missing Values

Handling missing values is an important step, as they can affect your model. It is important to identify the missing values and to know which value they can be replaced with. Depending on the missing data, a decision is made regarding whether or not to keep entries with missing data.

Rather than dropping such entries, a better strategy is often to impute the missing values, that is, to infer them from the known part of the data. The SimpleImputer class in scikit-learn provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer

# df is assumed to be the car dataset, already loaded as a pandas DataFrame

# flag the missing entries; only columns that contain missing values
# are returned, which here is assumed to be just 'horsepower'
indicator = MissingIndicator(missing_values=np.nan)
indicator = indicator.fit_transform(df)
indicator = pd.DataFrame(indicator, columns=['horsepower'])

# replace the missing values in columns 1 to 6 by the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:, 1:7])
df.iloc[:, 1:7] = imputer.transform(df.iloc[:, 1:7])
df
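To see what mean imputation does without needing the lesson's dataset, here is a small self-contained sketch on a made-up array. The values are invented so that the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values; np.nan marks the missing entries
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan]])

# each missing entry is replaced by the mean of the known values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(data)

print(filled)
# the first column is filled with 2.0 (mean of 1 and 3),
# the second column with 15.0 (mean of 10 and 20)
```

Switching strategy to 'median', 'most_frequent', or 'constant' (together with fill_value) changes only the value that is filled in.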

Standardization

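Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing each feature by its standard deviation. After standardization, each feature has a mean of zero and a standard deviation of one. For this task, we can use StandardScaler. Note that the call below passes with_mean=False, which skips the centering step and only scales each feature to unit variance; use the default StandardScaler() to also remove the mean.

# X is assumed to be the feature DataFrame, including the 'car name' column
sc_X = StandardScaler(with_mean=False)
X = sc_X.fit_transform(X.drop(['car name'], axis=1))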
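The transformation described above is z = (x - mean) / std, applied per feature. The sketch below checks this on made-up values by comparing StandardScaler with its default settings against the same computation done by hand in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# one made-up feature with five samples
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# default StandardScaler: remove the mean, then divide by the standard deviation
scaled = StandardScaler().fit_transform(values)

# the same computation by hand; StandardScaler uses the population
# standard deviation, which is NumPy's default (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the feature has mean 0 and standard deviation 1 (up to floating-point error).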

Normalization

Normalization is the process of scaling individual samples (rows) so that each has unit norm. It is useful when the algorithm predicts based on relationships between data points, for example similarities computed from dot products.

from sklearn.preprocessing import Normalizer

# rescale every row of X to unit (L2) norm
nm = Normalizer()
x_sc = nm.fit_transform(X)
X = pd.DataFrame(x_sc)
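Unlike standardization, which works column by column, normalization works row by row. A minimal sketch on two made-up samples shows this: each row is divided by its own Euclidean (L2) length, which is the default norm of Normalizer.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# two made-up samples (rows) with two features each
samples = np.array([[3.0, 4.0],
                    [1.0, 0.0]])

# each row is divided by its own L2 norm
normalized = Normalizer().fit_transform(samples)
print(normalized)  # the first row becomes [0.6, 0.8] because its length is 5

# every row now has length 1
row_norms = np.linalg.norm(normalized, axis=1)
print(row_norms)
```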

Encoding categorical features

Most scikit-learn estimators expect numeric input and cannot work directly with categorical features stored as strings. Therefore, it is important to convert categorical features to a numerical representation.

Label encoding converts the labels into numeric form. For example, if a column in the data set contains the values {Chicago, Urbana, Springfield}, these can be mapped to 0, 1, 2.

However, the dataset contains many car model names, which are stored as strings and have no natural order, whereas label encoding would suggest one. Therefore, we use the one-hot encoding technique, which creates one binary column per distinct value.

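from sklearn.preprocessing import OneHotEncoder

# X is assumed to still contain the 'car name' column at this point
# the encoder returns a sparse matrix by default, so it is converted
# to a dense array before building the DataFrame
onehot = OneHotEncoder(dtype=int)
nominals = pd.DataFrame(
    onehot.fit_transform(X[['car name']]).toarray())
nominals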
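The difference between the two encodings can be seen on the city example from the text. This is a small sketch on a made-up column; note that LabelEncoder assigns codes in sorted order of the labels, so the exact numbers depend on the alphabetical order rather than on the order in which the values appear.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# made-up column using the cities from the example
cities = pd.DataFrame({'city': ['Chicago', 'Urbana', 'Springfield', 'Chicago']})

# label encoding: one integer per distinct value, assigned in sorted order
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(cities['city'])
print(list(label_encoder.classes_))  # ['Chicago', 'Springfield', 'Urbana']
print(list(codes))                   # [0, 2, 1, 0]

# one-hot encoding: one 0/1 column per distinct value
onehot = OneHotEncoder(dtype=int)
matrix = onehot.fit_transform(cities[['city']]).toarray()
print(matrix)
```

Each row of the one-hot matrix contains exactly one 1, in the column of that row's city, so no ordering between cities is implied.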

Discretization

Data discretization refers to a method of converting a large number of continuous data values into a smaller number of intervals (bins), so that the evaluation and management of the data become easier. There are two forms of data discretization: supervised and unsupervised. Supervised discretization uses the class labels to choose the intervals. Unsupervised discretization does not use the class labels; the intervals depend only on how the splitting proceeds, for example into equal-width or equal-frequency bins.

from sklearn.preprocessing import KBinsDiscretizer

# split each numeric feature into 6 equal-width bins and one-hot encode the bin index
disc = KBinsDiscretizer(n_bins=6, encode='onehot', strategy='uniform')
disc.fit_transform(X)
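To make the bins themselves visible, the sketch below discretizes a made-up feature into three equal-width bins. It uses encode='ordinal' instead of the 'onehot' setting above, purely so that the output is the bin index of each value; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# one made-up numeric feature
ages = np.array([[2.0], [15.0], [31.0], [48.0], [60.0]])

# three equal-width bins between the minimum (2) and the maximum (60);
# 'ordinal' returns the index of the bin each value falls into
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
bins = disc.fit_transform(ages)

print(disc.bin_edges_[0])  # edges at 2, about 21.3, about 40.7, and 60
print(bins.ravel())        # [0. 0. 1. 2. 2.]
```

This is an unsupervised discretization: the edges are computed from the values alone, without any class labels.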

Key Points

Visualizing data in Python

Overview

Time: 0 min

Objectives

Understand data visualization in Python

What do you need to visualize data in Python?

To graph data in Python, we use the matplotlib library. More specifically, its pyplot module is the most useful module for plotting data. To install matplotlib, type: pip install matplotlib

Types of plots in Python

We can plot the following graphs in Python:

Line graph
Bar graph
Histogram
Scatter plot
Pie chart

Line Graph

A line graph draws a line through the points given by the x-axis values and their corresponding y-axis values.

Define the x-axis and corresponding y-axis values as lists.
Plot them on the canvas using the .plot() function.
Give a name to the x-axis and y-axis using the .xlabel() and .ylabel() functions.
Give a title to your plot using the .title() function.
Finally, to view your plot, use the .show() function.

import matplotlib.pyplot as plt

x = [1,2,3,4,5]

y = [1,4,9,16,25]

# plotting the points
plt.plot(x, y, color='green', linestyle='dashed', linewidth = 3,
         marker='o', markerfacecolor='blue', markersize=12)

plt.xlabel('x - axis: numbers')

plt.ylabel('y - axis: Square(x)')

plt.title('x^2')

plt.show()

Bar Graph

The plt.bar() function is used to plot a bar chart.
The x-coordinates of the bars are passed along with the heights of the bars. By default, each bar is centered on its x-coordinate.
The bars are named on the x-axis by passing the tick_label argument.

import matplotlib.pyplot as plt

# x-coordinates of the bars (bar centers by default)
left = [1, 2, 3, 4, 5]

# heights of bars
height = [10, 24, 36, 40, 5]

# labels for bars
tick_label = ['one', 'two', 'three', 'four', 'five']

# plotting a bar chart
plt.bar(left, height, tick_label = tick_label,
        width = 0.8, color = ['red', 'green'])

# naming the x-axis
plt.xlabel('x - axis')
# naming the y-axis
plt.ylabel('y - axis')
# plot title
plt.title('My bar chart!')

# function to show the plot
plt.show()

Histogram

The plt.hist() function is used to plot a histogram.
The observed values (here, ages) are passed as a list; the frequencies are computed by the function.
The range is set by defining a tuple containing the min and max values.
The next step is to “bin” the range of values. Binning means dividing the entire range of values into a series of intervals and then counting how many values fall into each interval.

import matplotlib.pyplot as plt

# observed values
ages = [2,5,70,40,30,45,50,45,43,40,44,
        60,7,13,57,18,90,77,32,21,20,40]

# setting the range and no. of intervals
# (named value_range to avoid shadowing the built-in range)
value_range = (0, 100)
bins = 10

# plotting a histogram
plt.hist(ages, bins=bins, range=value_range, color = 'green',
         histtype = 'bar', rwidth = 0.8)

# x-axis label
plt.xlabel('age')
# y-axis (frequency) label
plt.ylabel('No. of people')
# plot title
plt.title('My histogram')

# function to show the plot
plt.show()
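The binning step described above can also be inspected without drawing anything. As a sketch, NumPy's histogram function performs the same kind of counting on the ages list and returns the counts and the interval edges directly.

```python
import numpy as np

ages = [2, 5, 70, 40, 30, 45, 50, 45, 43, 40, 44,
        60, 7, 13, 57, 18, 90, 77, 32, 21, 20, 40]

# divide the range 0..100 into 10 equal intervals and count the values in each
counts, edges = np.histogram(ages, bins=10, range=(0, 100))

print(edges)   # interval boundaries: 0, 10, 20, ..., 100
print(counts)  # [3 2 2 2 7 2 1 2 0 1]
```

Each interval includes its left edge and excludes its right edge, except the last one, which includes both. For example, the seven values from 40 to 49 form the tallest bar.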

Scatter Plot

The plt.scatter() function is used to plot a scatter plot.
Define the x-axis values and the corresponding y-axis values.
The marker argument sets the character used as a marker. Its size can be defined using the s parameter.

import matplotlib.pyplot as plt

# x-axis values
x = [1,2,3,4,5,6,7,8,9,10]
# y-axis values
y = [2,4,5,7,6,8,9,11,12,12]

# plotting points as a scatter plot
plt.scatter(x, y, label= "stars", color= "green",
            marker= "*", s=30)

# x-axis label
plt.xlabel('x - axis')
# y-axis label
plt.ylabel('y - axis')
# plot title
plt.title('My scatter plot!')
# showing legend
plt.legend()

# function to show the plot
plt.show()

Pie Chart

Plot a pie chart by using the plt.pie() method.
Define the labels using a list called activities.
Then, the portion covered by each label is defined using another list called slices.
The color for each label can also be defined using a list.

import matplotlib.pyplot as plt

# defining labels
activities = ['eat', 'sleep', 'work', 'play']

# portion covered by each label
slices = [3, 7, 8, 6]

# color for each label
colors = ['r', 'y', 'g', 'b']

# plotting the pie chart
plt.pie(slices, labels = activities, colors=colors,
        startangle=90, shadow = True, explode = (0, 0, 0.1, 0),
        radius = 1.2, autopct = '%1.1f%%')

# plotting legend
plt.legend()

# showing the plot
plt.show()
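The percentages printed on the wedges come from the autopct format string, which is applied to each slice's share of the total. The short sketch below reproduces that arithmetic in plain Python; it illustrates the calculation and is not matplotlib's own code.

```python
activities = ['eat', 'sleep', 'work', 'play']
slices = [3, 7, 8, 6]

# each wedge covers slice / total of the circle
total = sum(slices)

# apply the same format string that is passed as autopct
percentages = ['%1.1f%%' % (100 * s / total) for s in slices]

for activity, pct in zip(activities, percentages):
    print(activity, pct)
# eat 12.5%
# sleep 29.2%
# work 33.3%
# play 25.0%
```

The values in slices therefore do not need to add up to 100; only their proportions matter.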

Key Points

