This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to working with Data in Python

Setup

Overview

Time: 0 min
Objectives
  • Install Anaconda

  • Download the Python Jupyter notebook file

  • Open Jupyter notebooks

To participate in this workshop, you will need to have the Anaconda distribution of Python installed on your computer, or you may use Anaconda on the UIC virtual labs to launch Jupyter notebooks. You will also need to download the code file used for this workshop. See instructions below for these steps.

Code Files

Workshop Code:

Please download the following code to follow along with the workshop:
Python numpy pandas

Software setup

We will be using Jupyter Notebook for this workshop, which we install through the Anaconda distribution. Anaconda is an open-source Python distribution that bundles Python with many data science packages and the Anaconda Navigator, making it one of the easiest ways to get started with Python for data science.

Python

Python is a popular language for research computing, and great for general-purpose programming as well. Installing all of its research packages individually can be a bit difficult, so we recommend Anaconda, an all-in-one installer.

Regardless of how you choose to install it, please make sure you install Python version 3.x (e.g., 3.6 is fine).

We will teach Python using the Jupyter Notebook, a programming environment that runs in a web browser (Jupyter Notebook will be installed by Anaconda). For this to work you will need a reasonably up-to-date browser. The current versions of the Chrome, Safari and Firefox browsers are all supported (some older browsers, including Internet Explorer version 9 and below, are not).

  1. Open https://www.anaconda.com/products/individual#download-section with your web browser.
  2. Download the Anaconda for Windows installer with Python 3. (If you are not sure which version to choose, you probably want the 64-bit Graphical Installer Anaconda3-...-Windows-x86_64.exe)
  3. Install Python 3 by running the Anaconda Installer, using all of the defaults for installation except make sure to check Add Anaconda to my PATH environment variable.

Video Tutorial

  1. Open https://www.anaconda.com/products/individual#download-section with your web browser.
  2. Download the Anaconda Installer with Python 3 for macOS (you can either use the Graphical or the Command Line Installer).
  3. Install Python 3 by running the Anaconda Installer using all of the defaults for installation.

Video Tutorial

  1. Open https://www.anaconda.com/products/individual#download-section with your web browser.
  2. Download the Anaconda Installer with Python 3 for Linux.
    (The installation requires using the shell. If you aren't comfortable doing the installation yourself stop here and request help at the workshop.)
  3. Open a terminal window and navigate to the directory where the executable is downloaded (e.g., `cd ~/Downloads`).
  4. Type
    bash Anaconda3-
    and then press Tab to autocomplete the full file name. The name of the file you just downloaded should appear.
  5. Press Enter (or Return depending on your keyboard). You will follow the text-only prompts. To move through the text, press Spacebar. Type yes and press Enter (or Return) to approve the license. Press Enter (or Return) to approve the default location for the files. Type yes and press Enter (or Return) to prepend Anaconda to your PATH (this makes the Anaconda distribution the default Python).
  6. Close the terminal window.

Virtual Lab

If you would prefer not to install the software for this workshop on your computer, you may use the Virtual Lab service run by Technology Services. This allows you to use a virtual machine either from your web browser or from a desktop app installed on your computer. Overall you may have a better experience using it from the desktop app, but the browser should suffice for most workshops.

See browser instructions here
See desktop instructions here

Launching Jupyter on Anaconda

We can use Anaconda Navigator to access Jupyter and other tools (PyCharm, etc.) provided in Anaconda.

For Windows Users:

  1. Click Start
  2. Search and select Anaconda Navigator from the menu.
  3. Once the Navigator opens, select Jupyter Notebook from the tools available.
  4. Jupyter will open in a new tab in the browser.
  5. Navigate to the required destination.
  6. Click New -> Notebook.
  7. The new notebook opens.

For Mac Users:

  1. Click Launchpad and select Anaconda Navigator. Or, use Cmd+Space to open Spotlight Search and type “Navigator” to open the program.
  2. Once the Navigator opens, select Jupyter Notebook from the tools available.
  3. Jupyter will open in a new tab in the browser.
  4. Navigate to the required destination.
  5. Click New -> Notebook.
  6. The new notebook opens.

Key Points


Introduction

Overview

Time: 0 min
Objectives

What is Data?

Data, in their simplest form, are raw alphanumeric values obtained through different acquisition methods.

What is Information?

Information is created when data are processed, organized, or structured to provide context and meaning. Information is essentially processed data.

What is Knowledge?

Knowledge is what we know. Knowledge is unique to each individual and is the accumulation of past experience and insight that shapes the lens by which we interpret, and assign meaning to, information.

Data vs Information vs Knowledge

Why is Data important?

In today's world, data is everywhere. While one might not think twice about the data that flows around us, it is often helpful to collect and analyze it. Good data can help you:

  • Improve people's lives

  • Make informed decisions

  • Stop molehills from turning into mountains

  • Get the results you want

  • Find solutions to problems

  • Stop the guessing game

  • Be strategic in your approaches

How to get data to work for me?

Several programming languages make it easy to work with data. Python is one of the most popular choices among data scientists and data analysts. Working with data often involves statistical concepts such as the mean and the median, as well as plotting and understanding patterns and distributions in data. Python has a built-in module for basic statistics, and several libraries that help with cleaning and preprocessing data.
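As a minimal sketch of the statistical concepts mentioned above, the following uses Python's built-in statistics module. The list of quantities is a hypothetical sample made up for illustration.

```python
import statistics

# Hypothetical sample: quantities (in kg) of five pantry items
quantities = [10, 20, 5, 20, 15]

mean_quantity = statistics.mean(quantities)      # arithmetic average
median_quantity = statistics.median(quantities)  # middle value once sorted
mode_quantity = statistics.mode(quantities)      # most frequent value

print("mean:", mean_quantity)
print("median:", median_quantity)
print("mode:", mode_quantity)
```

The later sections use NumPy and Pandas, which offer similar calculations for arrays and tables.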

What is Numpy?

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

What is Pandas?

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

Key Points


Introduction and Installation of Pandas

Overview

Time: min
Objectives

What is Pandas?

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. Pandas allows us to analyze large data sets and draw conclusions based on statistical theory. Pandas can also clean messy data sets and make them readable and relevant.

Examples of questions answered by Pandas?

How to install Pandas?

The Anaconda distribution already includes Pandas. If you need to install it separately, type the command “pip install pandas” in your terminal, or “%pip install pandas” in a Jupyter notebook cell.

How to use Pandas?

To use pandas, we need to import it first. This can be done by typing the following line - “import pandas”

import pandas

stock_dict = {
  'Items': ["Sugar", "Salt", "Pepper"],
  'Quantity(Kgs)': [10, 20, 5]
}

stock_df = pandas.DataFrame(stock_dict)

print(stock_df)

Key Points


Pandas Series

Overview

Time: min
Objectives

What is a Series?

It is a one-dimensional array holding data of any type. It is similar to a column in a table.


import pandas as pd

quantity = [10, 20, 30]

quantity_series = pd.Series(quantity)

print(quantity_series)

Labels

By default the values are labeled with their index number. These labels can then be used to access the elements. For example: The first element of the series is accessed by quantity_series[0].
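A short sketch of the default labels, rebuilding the same three-value series so the example runs on its own:

```python
import pandas as pd

quantity_series = pd.Series([10, 20, 30])

# With no index argument, the labels are the positions 0, 1, 2
print(quantity_series.index)

print(quantity_series[0])  # value stored under label 0
print(quantity_series[2])  # value stored under label 2
```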

Custom Labels

We can index the elements in the series with our own labels.



quantity = [10, 20, 30]

quantity_series = pd.Series(quantity, index = ["Sugar", "Salt", "Pepper"])

print(quantity_series["Sugar"])

Key/Value Objects in a Series

Objects that follow the key:value format (dictionaries) can also be used to create a series. Each key becomes an index label and each value becomes the corresponding element. Passing the index argument keeps only the listed keys.


pantry={'eggs':10,'butter(tbsp)':20,'sugar(oz)':30,'salt(oz)':40}

pantry_series = pd.Series(pantry)
print(pantry_series)
pantry_series_slice = pd.Series(pantry, index = ['eggs', 'butter(tbsp)'])
print(pantry_series_slice)

Key Points


Dataframes in Pandas

Overview

Time: min
Objectives

What are Dataframes?

A DataFrame is a two-dimensional data structure. It is similar to a two-dimensional array, or a table with rows and columns.

import pandas as pd

data = {
  "quantity_available": [420, 380, 390],
  "quantity_needed_for_full_inventory": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

Accessing a row element

Pandas uses the .loc attribute to return one or more specified row(s).

NOTE: If you use loc with a single label to retrieve one row, the result is a Series. If you pass a list of labels to retrieve multiple rows, the result is a DataFrame.

print(df.loc[0]) # returns single row
print(df.loc[[0, 1]]) # returns multiple rows in a dataframe
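To make the difference in return types visible, the sketch below rebuilds the same small DataFrame and checks the type of each result. The variable names single_row and multiple_rows are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "quantity_available": [420, 380, 390],
    "quantity_needed_for_full_inventory": [50, 40, 45],
})

single_row = df.loc[0]          # one label -> Series
multiple_rows = df.loc[[0, 1]]  # list of labels -> DataFrame

print(type(single_row))
print(type(multiple_rows))
print(multiple_rows.shape)      # (rows, columns) of the selection
```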

Named Indexes

We can create custom labels by using the index argument while creating a data frame.

data = {
  "quantity_available": [420, 380, 390],
  "quantity_needed_for_full_inventory": [50, 40, 45]
}

#load data into a DataFrame object:
df_labeled = pd.DataFrame(data, index = ['Sugar', 'Salt', 'Pepper'])

print(df_labeled) 

Accessing Rows Through Named Indexes

The loc attribute can be used to access rows with the custom labels.

print(df_labeled.loc['Sugar'])
print(df_labeled.loc[['Sugar','Salt']])

Key Points


Working with Dataframes using functions in Pandas

Overview

Time: min
Objectives

View the Data?

The head() and tail() methods return a few rows of a data frame. The head() method returns rows from the beginning of the data frame, and tail() returns rows from the end. By default, each returns 5 rows. We can change the number of rows returned by passing the required number to the method as an argument.

data = {
  "quantity_available": [5, 10, 15, 20, 25, 30, 35, 12, 10, 10, 25, 15],
  "quantity_needed_for_full_inventory": [50, 40, 45, 60, 60, 80, 100, 50, 25, 50, 50, 30],
  "Name_of_item":['Sugar','Salt','Pepper','Butter','Flour','Baking Powder','eggs','Blue Cheese', 'Panko', 'Maple Syrup', 'Strawberry Preserve', 'Honey']
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df.head())
print(df.tail())

print(df.head(10))
print(df.tail(10))

Info()

The Pandas library provides a method called info(), which works on data frame objects. This method is used to get summary information about the data frame.

df.info()  # info() prints its summary directly and returns None, so no print() is needed

Interpreting the result

After the line naming the class, the RangeIndex and Data columns lines tell us the number of rows (RangeIndex) and the number of columns (Data columns).

Next, there is a table displayed. The “Column” column contains the names of the columns.

The “Non-Null Count” column tells us how many values in each column are not null (that is, how many rows contain a value in that column).

The “Dtype” column describes the data type of the values in the column.
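The same pieces of information can also be read individually. The sketch below assumes a hypothetical three-row inventory with one missing quantity, so that the non-null count differs between columns.

```python
import pandas as pd

# Hypothetical inventory; the missing quantity (None) is stored as NaN
df_missing = pd.DataFrame({
    "Name_of_item": ["Sugar", "Salt", "Pepper"],
    "quantity_available": [5, None, 15],
})

print(df_missing.shape)    # (number of rows, number of columns)
print(df_missing.count())  # non-null values in each column
print(df_missing.dtypes)   # data type of each column
```

Here the quantity column holds only two non-null values, and its data type becomes a floating-point type because NaN is a floating-point value.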

Key Points


Introduction to Numpy

Overview

Time: min
Objectives

What is Numpy?

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

Why Use NumPy?

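In Python we have lists that serve the purpose of arrays, but they are slow to process for large numerical workloads.

NumPy provides an array object called ndarray that is often quoted as being up to 50x faster than traditional Python lists. The actual speedup depends on the operation and the size of the data.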
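One rough way to see the difference on your own machine is to time the same operation on a list and on an ndarray. This is only a sketch: the array size and repeat count are arbitrary choices, and the measured ratio will vary by machine and operation, so it should not be read as confirming any particular speedup figure.

```python
import timeit

import numpy as np

n = 100_000
python_list = list(range(n))
numpy_array = np.arange(n)

# Double every element: a list comprehension versus one vectorized operation
list_seconds = timeit.timeit(lambda: [x * 2 for x in python_list], number=20)
array_seconds = timeit.timeit(lambda: numpy_array * 2, number=20)

print(f"list:    {list_seconds:.4f} s")
print(f"ndarray: {array_seconds:.4f} s")
```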

Installation of Numpy

Numpy can be installed by typing the following command - “pip install numpy”

How to use Numpy?

In order to use Numpy, we need to import the package first. This can be done by typing “import numpy”.


import numpy 

arr = numpy.array([1, 2, 3, 4, 5])

print(arr)

Key Points


Arrays in Numpy

Overview

Time: min
Objectives

Create a NumPy ndarray Object

We can create a NumPy ndarray object by using the array() function.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

Dimensions in Arrays

A dimension in arrays is the level of array depth (nested arrays).

0-D Arrays

0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.

import numpy as np

arr = np.array(42)

print(arr)

1-D Arrays

An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

2-D Arrays

An array that has 1-D arrays as its elements is called a 2-D array. These are often used to represent matrices or 2nd-order tensors.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr)

3-D arrays

An array that has 2-D arrays (matrices) as its elements is called a 3-D array. These are often used to represent 3rd-order tensors.

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr)

Higher Dimensional Arrays

An array can have any number of dimensions. The ndmin argument of array() sets the minimum number of dimensions, adding leading dimensions of length 1 where needed.

import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('number of dimensions :', arr.ndim)

Find the dimension of the Array

To find the number of dimensions of an array, we can print the ndim attribute of that array.

print(arr.ndim)
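As a quick check, the sketch below prints ndim for one array of each kind described in this section, together with its shape (the number of elements in each dimension). The dictionary name examples is chosen here for illustration.

```python
import numpy as np

examples = {
    "0-D": np.array(42),
    "1-D": np.array([1, 2, 3]),
    "2-D": np.array([[1, 2, 3], [4, 5, 6]]),
    "3-D": np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]]),
    "ndmin=5": np.array([1, 2, 3, 4], ndmin=5),
}

for name, array in examples.items():
    print(name, "ndim:", array.ndim, "shape:", array.shape)
```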

Key Points


Indexing and slicing Arrays

Overview

Time: min
Objectives

Access Array Elements

Array indexing is the same as accessing an array element. You can access an array element by referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first element has index 0, the second has index 1, and so on.

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[0])

Access 2-D Arrays

To access elements from 2-D arrays we can use comma-separated integers: the first gives the row index and the second gives the index of the element within that row.


import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('2nd element on 1st row: ', arr[0, 1])

Slicing arrays

Slicing in python means taking elements from one given index to another given index.

We pass a slice instead of an index, like this: [start:end]. The element at the end index is not included.

We can also define the step, like this: [start:end:step].

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

arr2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1:5]) # Slice elements from index 1 to index 5 (not included)
print(arr[4:]) # Slice elements from index 4 to the end of the array
print(arr[:4]) # Slice elements from the beginning to index 4 (not included)
print(arr[-3:-1]) # Slice from the third-to-last element up to the last element (not included)
print(arr[1:5:2]) # Return every other element from index 1 to index 5 (not included)
print(arr2[1, 1:4]) # From the second row, slice elements from index 1 to index 4 (not included)
print(arr2[0:2, 1:4]) # From both rows, slice index 1 to index 4 (not included)

Key Points


Important attributes in Numpy

Overview

Time: min
Objectives

Shape of an Array

The shape of an array is the number of elements in each dimension.


import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)

Reshaping arrays

Reshaping means changing the shape of an array.

By reshaping we can add or remove dimensions or change the number of elements in each dimension.

Reshape From 1-D to 2-D

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(4, 3)

print(newarr)

Reshape From 1-D to 3-D

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(2, 3, 2)

print(newarr)

NOTE: We can reshape into any shape, as long as the total number of elements is the same in both shapes.
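The sketch below illustrates this rule with the same 12 values. It also uses -1 for one dimension, which asks NumPy to work out that dimension from the number of elements. The flag reshape_failed is a name introduced here for illustration.

```python
import numpy as np

arr = np.arange(1, 13)  # the 12 values 1 through 12

grid = arr.reshape(3, 4)       # 3 * 4 = 12 elements, so this works
inferred = arr.reshape(2, -1)  # -1 asks NumPy to infer the second dimension
print(grid.shape)
print(inferred.shape)

try:
    arr.reshape(5, 3)          # 5 * 3 = 15 elements, but arr holds only 12
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("reshape failed:", error)
```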

Flattening the arrays

Flattening an array means converting a multidimensional array into a 1-D array.

We can use reshape(-1) to do this.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

newarr = arr.reshape(-1)

print(newarr)

Joining NumPy Arrays

Joining means putting the contents of two or more arrays into a single array. We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis. If axis is not explicitly passed, it is taken as 0.

import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.concatenate((arr1, arr2))

print(arr)

arr1 = np.array([[1, 2], [3, 4]])

arr2 = np.array([[5, 6], [7, 8]])

arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
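To see what the axis argument changes, the following sketch joins the same two 2-D arrays along each axis and compares the resulting shapes. The names stacked_rows and stacked_columns are chosen here for illustration.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

# axis=0 appends the rows of arr2 below the rows of arr1
stacked_rows = np.concatenate((arr1, arr2), axis=0)
# axis=1 appends the columns of arr2 to the right of arr1
stacked_columns = np.concatenate((arr1, arr2), axis=1)

print(stacked_rows.shape)
print(stacked_rows)
print(stacked_columns.shape)
print(stacked_columns)
```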

Splitting NumPy Arrays

Splitting is the reverse operation of joining. We use array_split() to split arrays, passing it the array we want to split and the number of splits.

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)
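The result of array_split() is a Python list of arrays. As a small sketch, the example below splits seven elements into three parts to show that the split still works when the elements cannot be divided evenly: the earlier parts receive the extra elements.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

parts = np.array_split(arr, 3)  # 7 elements cannot be split evenly in 3

print(type(parts))
for part in parts:
    print(part)
```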

Searching Arrays

You can search an array for a certain value, and return the indexes that get a match.

To search an array, use the where() function.

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

x = np.where(arr == 4)

print(x)
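The value returned by where() is a tuple containing one index array per dimension, so for a 1-D array the indexes are in its first item. The sketch below unpacks that tuple. The names matches, indexes, and even_indexes are chosen here for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

matches = np.where(arr == 4)  # tuple with one index array per dimension
indexes = matches[0]          # the index array for the only dimension

print(indexes)       # positions where the value is 4
print(arr[indexes])  # the matching values themselves

# Any condition can be used, for example finding the even values
even_indexes = np.where(arr % 2 == 0)[0]
print(even_indexes)
```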

Sorting Arrays

Sorting means putting elements in an ordered sequence.

An ordered sequence is any sequence whose elements follow an order, such as numeric or alphabetical, ascending or descending.

NumPy has a function called sort() that returns a sorted copy of a specified array.

import numpy as np

arr = np.array([3, 2, 0, 1])

print(np.sort(arr))
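A short sketch of a few variations: np.sort() leaves the original array unchanged, reversing the sorted result gives descending order, and strings are sorted alphabetically. The item names are hypothetical examples.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

sorted_copy = np.sort(arr)       # new sorted array; arr itself is unchanged
descending = np.sort(arr)[::-1]  # reverse the sorted copy for descending order

print(arr)
print(sorted_copy)
print(descending)

# Strings are sorted alphabetically
items = np.array(["salt", "pepper", "sugar"])
sorted_items = np.sort(items)
print(sorted_items)
```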

Key Points