Steps In A Data Science Project
Overview
Time: 0 min
Objectives
Understand the steps in a data science project
What is Data Science?
Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data scientists apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificially intelligent systems to perform tasks. These systems generate insights which analysts and business users can translate into tangible business value.
What are the steps in a Data Science Project?
The following are the steps that are generally followed in Data Science projects.
Step 1 : Obtain Data
In this step, we obtain the data that we need from available data sources.
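As a minimal sketch of this step, the snippet below reads a small CSV into a pandas DataFrame. The inline text stands in for a real file or URL, and the column names (city, temperature) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or a URL.
csv_text = """city,temperature
Chicago,21.5
Urbana,23.0
Springfield,22.1
"""

# read_csv accepts a path, a URL, or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)
print(data.head())
```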
Step 2 : Scrubbing Data
Once we obtain the data from various sources, we need to clean it, because performing analysis or modelling on unclean data gives results that are neither useful nor accurate. This step includes handling missing values, encoding data, and so on. This step is also known as “Data Pre-Processing”.
Step 3: Explore Data
Once the data has been cleaned, we examine it. In this step, we try to make sense of the data. What does this data represent? What questions can be answered using this data? What needs to be predicted using this data? These are some of the questions answered in this step. We also try to identify significant patterns and trends in our data using data visualization.
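A first look at the data might go roughly as follows. This is only a sketch on a made-up table (the city and temperature columns are hypothetical), using standard pandas calls to check types, missing values, summary statistics, and category counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
data = pd.DataFrame({
    "city": ["Chicago", "Urbana", "Chicago", "Springfield"],
    "temperature": [21.5, 23.0, np.nan, 22.1],
})

# Column types and the number of missing values per column
print(data.dtypes)
missing_per_column = data.isna().sum()
print(missing_per_column)

# Summary statistics for the numeric columns
print(data.describe())

# How often each category occurs
city_counts = data["city"].value_counts()
print(city_counts)
```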
Step 4 : Model Data
In this stage, we use the cleaned data to train machine learning models. The trained models can then be used to predict the outcome when a new data entry is presented to them. For example, we can train a spam detector using the mails in your inbox. When a new mail arrives, the trained model identifies whether it is spam or not.
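The spam detector idea can be sketched with scikit-learn as shown below. The mails and labels are invented for illustration, and the choice of a bag-of-words model with naive Bayes is an assumption, not something this lesson prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training mails; label 1 means spam, 0 means not spam.
train_mails = [
    "win a free prize now",
    "free money claim your prize",
    "cheap loans win cash now",
    "meeting agenda for monday",
    "lunch with the project team",
    "please review the attached report",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Turn each mail into word counts, then fit a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_mails, train_labels)

# Classify mails the model has not seen before.
new_mails = ["claim your free prize", "agenda for the team meeting"]
predictions = model.predict(new_mails)
print(predictions)  # [1 0]
```

A real detector would need far more training data and a held-out test set to measure how well it generalizes.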
Step 5 : Interpreting Data
In this step, we deliver the results to answer the business questions we asked when we first started the project, together with the actionable insights that we found through the data science process.
Key Points
Steps in Data Pre-Processing
Overview
Time: 0 min
Objectives
To understand the steps in data pre-processing
What is Data Pre-Processing?
Before we start exploring our data to extract insights from it, we need to process the data to make it understandable. Pre-processing the raw data helps to organize, scale, clean, and simplify it so that it can be used to train machine learning algorithms.
The following are the steps in Data Pre-Processing:
- Missing values
- Standardization
- Normalization
- Encoding categorical features
- Discretization
Handling Missing Values
Handling missing values is an important step, as they can affect your model. It is important to identify the missing values and to decide which value should replace them. Depending on how much data is missing, a decision is made on whether or not to keep the entries with missing data.
Rather than discarding such entries, a better strategy is often to impute the missing values, that is, to infer them from the known part of the data. The SimpleImputer class in scikit-learn provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or with a statistic (mean, median, or most frequent value) of the column in which they are located.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.impute import MissingIndicator
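# Flag the missing entries; df is assumed to be the DataFrame loaded earlier
indicator = MissingIndicator(missing_values=np.nan)
missing_flags = indicator.fit_transform(df)
# Assumes 'horsepower' is the only column that contains missing values
missing_flags = pd.DataFrame(missing_flags, columns=['horsepower'])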
#replacing the missing values by their mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:, 1:7])
df.iloc[:, 1:7] = imputer.transform(df.iloc[:, 1:7])
df
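To see the effect of mean imputation in isolation, here is a small self-contained sketch. The table is made up for illustration (the weight column and all values are hypothetical), so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "weight": [2000.0, 2500.0, np.nan],
})

# Count the missing values per column before imputing
print(toy.isna().sum())

# Replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The missing horsepower becomes 120.0 (the mean of 100 and 140) and the missing weight becomes 2250.0. Passing strategy="median" or strategy="most_frequent" switches to the other statistics mentioned above.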
Standardization
Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing each feature by its standard deviation. After standardizing the data, each feature has a mean of zero and a standard deviation of one. For this task, we can use StandardScaler.
# The default settings center each feature (with_mean=True) and scale it to unit variance
sc_X = StandardScaler()
X = sc_X.fit_transform(X.drop(['car name'], axis=1))
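The definition above can be checked on a small example. The sketch below uses a made-up array and compares StandardScaler with the formula written out by hand, (x - mean) / standard deviation, computed per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 3 samples, 2 features on different scales.
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 60.0]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(features)

# The same transformation by hand: subtract the column mean and
# divide by the column (population) standard deviation.
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
print(np.allclose(scaled, manual))
```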
Normalization
Normalization is the process of scaling individual samples (rows) to have unit norm. It is useful when an algorithm measures the similarity between pairs of samples, for example with a dot product or another kernel.
from sklearn.preprocessing import Normalizer
nm = Normalizer()
x_sc = nm.fit_transform(X)
X=pd.DataFrame(x_sc)
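Unlike standardization, which works column by column, normalization works row by row. A minimal sketch with made-up samples shows that every row ends up with an L2 norm of one.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; each row is one sample.
samples = np.array([[3.0, 4.0],
                    [1.0, 0.0],
                    [6.0, 8.0]])

# Scale each row so that its L2 (Euclidean) norm equals 1
normalized = Normalizer(norm="l2").fit_transform(samples)
row_norms = np.linalg.norm(normalized, axis=1)

print(normalized)
print(row_norms)
```

The rows [3, 4] and [6, 8] both become [0.6, 0.8]: normalization keeps the direction of a sample and discards its magnitude.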
Encoding categorical features
Most scikit-learn estimators expect numeric input and cannot work with string categories directly. Therefore, it is important to convert categorical features to a numerical representation.
Label encoding converts the labels into numeric form. For example, if a column in the data set contains the values {Chicago, Urbana, Springfield}, they can be mapped to 0, 1, 2.
However, such integer codes imply an ordering that does not exist among the categories. The dataset contains many car model names stored as strings, so we use the one-hot encoding technique instead, which creates one binary column per category.
from sklearn.preprocessing import OneHotEncoder
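# np.int was removed from NumPy; the built-in int works as the dtype.
# Encode 'car name' from df, since X now holds only the numeric features.
onehot = OneHotEncoder(dtype=int)
nominals = pd.DataFrame(
    onehot.fit_transform(df[['car name']])
    .toarray())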
nominals
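The two encodings can be compared side by side on the city example from the text. This is a sketch on a made-up column. Note that LabelEncoder assigns codes in sorted order of the labels, so the exact numbers differ from the illustrative 0, 1, 2 mapping given earlier.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical column of city names.
cities = pd.DataFrame({"city": ["Chicago", "Urbana", "Springfield", "Urbana"]})

# Label encoding: one integer per category, assigned in sorted label order
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(cities["city"])
print(list(label_encoder.classes_))  # ['Chicago', 'Springfield', 'Urbana']
print(codes)                         # [0 2 1 2]

# One-hot encoding: one binary column per category
onehot = OneHotEncoder(dtype=int)
onehot_matrix = onehot.fit_transform(cities[["city"]]).toarray()
print(onehot_matrix)
```

Each row of the one-hot matrix contains a single 1, in the column that belongs to its city.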
Discretization
Data discretization converts a large number of continuous data values into a smaller number of intervals (bins), so that the evaluation and management of the data become easier. There are two forms of data discretization: supervised and unsupervised. Supervised discretization uses the class labels to decide where the bin boundaries go. Unsupervised discretization ignores the class labels and chooses the boundaries from the feature values alone, for example with equal-width or equal-frequency bins. KBinsDiscretizer, used below, is an unsupervised method.
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(
n_bins=6, encode='onehot',strategy='uniform')
disc.fit_transform(X)
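With encode='onehot' the result is a sparse matrix, which is hard to read. As a sketch on made-up values, the variant below uses encode='ordinal' so that each value is replaced by the index of its bin, and prints the bin edges that the 'uniform' strategy chose.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical single feature with the values 0, 1, ..., 9.
values = np.arange(10, dtype=float).reshape(-1, 1)

# Three equal-width bins; 'ordinal' returns the bin index of each value
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bin_ids = disc.fit_transform(values)

print(disc.bin_edges_[0])  # edges at 0, 3, 6 and 9
print(bin_ids.ravel())     # 0 0 0 1 1 1 2 2 2 2 (as floats)
```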
Key Points
Visualizing data in Python
Overview
Time: 0 min
Objectives
To understand Visualization in python
What do you need to visualize data in python?
To graph data in Python, we can use the matplotlib library. More specifically, the pyplot module of this library is the most useful module for plotting data. To install it, type: pip install matplotlib
Types of plots in python
We can plot the following graphs in python
- line
- Bar Chart
- Histogram
- Scatter plot
- Pie-chart
Line Graph
A line graph draws a line through the points defined by the x-axis values and their corresponding y-axis values.
- Define the x-axis and corresponding y-axis values as lists.
- Plot them on canvas using .plot() function.
- Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
- Give a title to your plot using .title() function.
- Finally, to view your plot, we use .show() function.
import matplotlib.pyplot as plt
x = [1,2,3,4,5]
y = [1,4,9,16,25]
# plotting the points
plt.plot(x, y, color='green', linestyle='dashed', linewidth=3,
         marker='o', markerfacecolor='blue', markersize=12)
plt.xlabel('x - axis: numbers')
plt.ylabel('y - axis: Square(x)')
plt.title('x^2')
plt.show()
Bar Graph
- The plt.bar() function is used to plot a bar chart.
- The x-coordinates of the bars are passed along with the heights of the bars. By default, each x-coordinate marks the center of its bar.
- Name the x-axis positions by passing labels through the tick_label argument.
import matplotlib.pyplot as plt
# x-coordinates of the bars (bar centers by default)
left = [1, 2, 3, 4, 5]
# heights of bars
height = [10, 24, 36, 40, 5]
# labels for bars
tick_label = ['one', 'two', 'three', 'four', 'five']
# plotting a bar chart
plt.bar(left, height, tick_label = tick_label,
width = 0.8, color = ['red', 'green'])
# naming the x-axis
plt.xlabel('x - axis')
# naming the y-axis
plt.ylabel('y - axis')
# plot title
plt.title('My bar chart!')
# function to show the plot
plt.show()
Histogram
- plt.hist() is the function used to plot a histogram.
- The raw values (here, ages) are passed as a list; plt.hist() counts the frequencies itself.
- The range is set by defining a tuple containing the min and max values.
- The next step is to “bin” the range of values. Binning means dividing the entire range of values into a series of intervals and then counting how many values fall into each interval.
import matplotlib.pyplot as plt
# observed ages (raw values, not frequencies)
ages = [2,5,70,40,30,45,50,45,43,40,44,
        60,7,13,57,18,90,77,32,21,20,40]
# setting the range and no. of intervals
# (named value_range to avoid shadowing the built-in range)
value_range = (0, 100)
bins = 10
# plotting a histogram
plt.hist(ages, bins=bins, range=value_range, color = 'green',
         histtype = 'bar', rwidth = 0.8)
# x-axis label
plt.xlabel('age')
# frequency label
plt.ylabel('No. of people')
# plot title
plt.title('My histogram')
# function to show the plot
plt.show()
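The binning behind the histogram can be inspected without drawing anything. As a sketch, np.histogram (a NumPy function not otherwise used in this lesson) is applied to the same ages with the same range and number of bins, and returns the count in each interval.

```python
import numpy as np

ages = [2, 5, 70, 40, 30, 45, 50, 45, 43, 40, 44,
        60, 7, 13, 57, 18, 90, 77, 32, 21, 20, 40]

# Divide the range 0 to 100 into 10 equal intervals and count the values in each
counts, edges = np.histogram(ages, bins=10, range=(0, 100))

print(edges)   # 11 edges: 0, 10, 20, ..., 100
print(counts)  # [3 2 2 2 7 2 1 2 0 1]
```

The tallest bar in the plot corresponds to the interval from 40 to 50, which holds 7 of the 22 ages.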
Scatter Plot
- plt.scatter() function is used to plot a scatter plot.
- Define x and corresponding y-axis values.
- Marker argument is used to set the character to use as a marker. Its size can be defined using the s parameter.
import matplotlib.pyplot as plt
# x-axis values
x = [1,2,3,4,5,6,7,8,9,10]
# y-axis values
y = [2,4,5,7,6,8,9,11,12,12]
# plotting points as a scatter plot
plt.scatter(x, y, label= "stars", color= "green",
marker= "*", s=30)
# x-axis label
plt.xlabel('x - axis')
# y-axis label
plt.ylabel('y - axis')
# plot title
plt.title('My scatter plot!')
# showing legend
plt.legend()
# function to show the plot
plt.show()
Pie Chart
- Plot a pie chart by using plt.pie() method.
- Define the labels using a list called activities.
- Then, the portion covered by each label can be defined using another list called slices.
- Color for each label can also be defined using a list.
import matplotlib.pyplot as plt
# defining labels
activities = ['eat', 'sleep', 'work', 'play']
# portion covered by each label
slices = [3, 7, 8, 6]
# color for each label
colors = ['r', 'y', 'g', 'b']
# plotting the pie chart
plt.pie(slices, labels = activities, colors=colors,
startangle=90, shadow = True, explode = (0, 0, 0.1, 0),
radius = 1.2, autopct = '%1.1f%%')
# plotting legend
plt.legend()
# showing the plot
plt.show()
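The percentages printed on the wedges come from each slice's share of the total, formatted with the autopct pattern. The sketch below reproduces that arithmetic for the lists used above, without plotting.

```python
activities = ['eat', 'sleep', 'work', 'play']
slices = [3, 7, 8, 6]

# Each wedge covers slice / total of the circle
total = sum(slices)

# Apply the same format string that was passed as autopct
percent_labels = ['%1.1f%%' % (100 * s / total) for s in slices]

for activity, label in zip(activities, percent_labels):
    print(activity, label)
# eat 12.5%
# sleep 29.2%
# work 33.3%
# play 25.0%
```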
Key Points