This lesson is still being designed and assembled (Pre-Alpha version)

Text Mining using Python

Introduction

Overview

Time: min
Objectives

What is Text Mining?

Text Mining is the process of deriving meaningful information from natural language text.

The overall goal is to turn texts into data for analysis via the application of Natural Language Processing (NLP).

What is NLP?

Natural Language Processing (NLP) is a field of computer science and artificial intelligence that deals with human languages.

In other words, NLP is the component of text mining that performs a special kind of linguistic analysis, essentially helping a machine “read” text.

Key Points


Tokenization

Overview

Time: min
Objectives

What is Tokenization?

It is the process of breaking large text into smaller chunks called tokens. Formally, a token is a single entity that serves as a building block of a sentence or paragraph. Tokenization is sometimes as simple as splitting the text on white space.

import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # download the Punkt tokenizer models

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''

token = word_tokenize(text)  # split the text into a list of word tokens
token

We are using the Punkt tokenizer. This tokenizer uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and uses that model to split text into tokens.
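For comparison, here is a minimal sketch (not part of the original lesson code) contrasting naive whitespace splitting with Punkt's sentence tokenization; it reuses the text variable defined above:

from nltk.tokenize import sent_tokenize

# naive whitespace splitting keeps punctuation attached to the words
print(text.split()[:8])

# Punkt splits the same text into sentences
for sentence in sent_tokenize(text):
    print(sentence)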

Using tokenization to count words

from nltk.probability import FreqDist

frequencies = FreqDist(token)  # count how often each token occurs
print(frequencies)
frequencies
top_ten_words = frequencies.most_common(10)  # the ten most frequent tokens
top_ten_words
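FreqDist can also plot the distribution directly; a small sketch, assuming matplotlib is installed (FreqDist.plot relies on it):

frequencies.plot(10, cumulative=False)  # plot the ten most frequent tokens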

Key Points


Stemming

Overview

Time: min
Objectives

What is Stemming?

Stemming is the process of reducing tokens to their root forms; for example, “studying” and “studied” are reduced to a common stem. There are two commonly used stemming techniques in Python.

Of these, Lancaster stemming is the more aggressive: it has roughly twice as many rules as the Porter stemmer and tends to over-stem words.

Porter Stemming

from nltk.stem import PorterStemmer

pst = PorterStemmer()
stm = ["giving", "given", "gave"]
for word in stm:
    print(word + ":" + pst.stem(word))  # print each word next to its Porter stem

Lancaster Stemming

from nltk.stem import LancasterStemmer

lst = LancasterStemmer()
stm = ["giving", "given", "gave"]
for word in stm:
    print(word + ":" + lst.stem(word))  # print each word next to its Lancaster stem

Key Points


Lemmatization

Overview

Time: min
Objectives

What is Lemmatization?

It is the process of converting a word to its base form, and in that sense it resembles stemming.

However, the main difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form (its lemma), whereas stemming simply removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, for the word “caring”, stemming reduces it to “car”, whereas lemmatization reduces it to “care”, which is an actual word.

Lemmatization can be implemented in Python using the WordNet lemmatizer (NLTK), spaCy, TextBlob, or Stanford CoreNLP.

nltk.download('wordnet')   # WordNet data used by the lemmatizer
nltk.download('omw-1.4')   # Open Multilingual Wordnet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# pos="v" tells the lemmatizer to treat the word as a verb
print("caring :", lemmatizer.lemmatize("caring", pos="v"))
# compare with the Lancaster stemmer (lst) from the previous episode
print("caring :" + lst.stem("caring"))
print("corpora :", lemmatizer.lemmatize("corpora"))

Key Points


Stop Words

Overview

Time: min
Objectives

What are stop words?

Commonly used words like “the”, “a”, “at”, “for”, “above”, “on”, “is”, and “all” are called stop words. While processing text, we remove these words because they carry little meaning and have no significant effect on the analysis. This step depends highly on the language. NLTK provides a stopwords corpus that holds predefined stop-word lists for many languages.

from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')  # download the stop-word lists
stop_words = set(stopwords.words('english'))  # the English stop-word list

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''

tokens = word_tokenize(text.lower())
filtered_words = [x for x in tokens if x not in stop_words]  # keep only non-stop words
print(filtered_words)

Key Points


Part of speech tagging (POS)

Overview

Time: min
Objectives

What is POS tagging?

Through POS tagging, each token is assigned a part of speech (noun, verb, pronoun, adverb, etc.). This can be done in Python using taggers from NLTK, spaCy, TextBlob, Stanford CoreNLP, etc.

nltk.download('averaged_perceptron_tagger')  # the default NLTK POS-tagger model
text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''

tex = word_tokenize(text)
# tag the whole token list at once so the tagger can use surrounding context
for token, tag in nltk.pos_tag(tex):
    print(token, tag)

Key Points


Named Entity Recognition

Overview

Time: min
Objectives

What is Named Entity Recognition?

It is the process of detecting named entities in text, such as person names, locations, organizations, quantities, and monetary values.

nltk.download('maxent_ne_chunker')  # the named-entity chunker model
nltk.download('words')              # the word list the chunker relies on
text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''
from nltk import ne_chunk

token = word_tokenize(text)
tags = nltk.pos_tag(token)  # the chunker needs POS-tagged tokens
chunk = ne_chunk(tags)      # group tagged tokens into named-entity subtrees
chunk

Key Points


Chunking

Overview

Time: min
Objectives

What is Chunking?

Chunking groups words or tokens into larger units (chunks), such as noun phrases, based on their part-of-speech tags.

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''
token = word_tokenize(text)
tags = nltk.pos_tag(token)

# grammar: a noun phrase (NP) is an optional determiner (DT),
# any number of adjectives (JJ), followed by a noun (NN)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)

Key Points


Word Cloud

Overview

Time: min
Objectives

What is a Word Cloud?

A Word Cloud (or Tag Cloud) is a visual representation of text data in the form of tags, typically single words whose importance is indicated by their size and color. As unstructured text data continues to grow at an unprecedented rate, especially on social media, there is an ever-increasing need to analyze the massive amounts of text these systems generate. A Word Cloud is an excellent way to interpret text visually and to gain quick insight into the most prominent items in a given text by visualizing word frequency as a weighted list.

Word clouds are normally used to display the frequency with which words appear in a particular document or speech. More frequently used words appear larger in the word cloud, and frequency is assumed to reflect a term's importance in the context of the document.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# 'post' is assumed to be a pandas DataFrame with a text column named 'Post'
wordcloud = WordCloud(width=1000, height=500, stopwords=STOPWORDS,
                      background_color='white').generate(' '.join(post['Post']))
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Key Points


Sentiment Analysis

Overview

Time: min
Objectives

What is Sentiment Analysis?

Quantifying users' content, ideas, beliefs, and opinions is known as sentiment analysis. Users' online posts, blogs, tweets, and product feedback help businesses understand their target audience and innovate in products and services. Sentiment analysis helps in understanding people in a better and more accurate way. It is not limited to marketing; it can also be used in politics, research, and security.

There are two main approaches to performing sentiment analysis.

Lexicon-based: count the number of positive and negative words in the given text; whichever count is larger determines the sentiment of the text.

Machine-learning based: develop a classification model trained on a pre-labeled dataset of positive, negative, and neutral examples.
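The lesson does not include code for this episode, but as a minimal lexicon-based sketch, NLTK ships with the VADER analyzer, which scores text against a built-in sentiment lexicon:

import nltk
nltk.download('vader_lexicon')  # VADER's sentiment lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# the 'compound' score ranges from -1 (most negative) to +1 (most positive)
print(sia.polarity_scores("The Hub staff are wonderful and helpful."))
print(sia.polarity_scores("The printer is broken again and nobody has fixed it."))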

Key Points


Term Document Matrix

Overview

Time: min
Objectives

What is Term Document Matrix?

A term-document matrix maps a collection of n documents into the vector space model: rows correspond to terms, columns to documents, and each cell holds the frequency of a term in a document. In other words, it creates a numerical representation of the documents. Representing text as a numerical structure is a common starting point for text mining and analytics such as search and ranking, creating taxonomies, categorization, document similarity, and text-based machine learning.

import os
import textmining

# 'post_corpus' is assumed to be a list of document strings
tdm = textmining.TermDocumentMatrix()  # a class from the textmining library
for doc in post_corpus:
    tdm.add_doc(doc)  # update the matrix with each document
type(tdm)

os.chdir("../working")
tdm.write_csv("TDM_DataFrame.csv", cutoff=1)  # keep every term that occurs at least once

# the same steps wrapped in a reusable function
def build_matrix(document_list):
    print("building matrix...")
    tdm = textmining.TermDocumentMatrix()
    for doc in document_list:
        tdm.add_doc(doc)
    # write the term-document matrix to a CSV file
    tdm.write_csv(r'path\matrix.csv', cutoff=1)

df = pd.read_csv("TDM_DataFrame.csv")
df.head(20)
df.shape
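As an alternative sketch (a substitution, not part of the original lesson), scikit-learn's CountVectorizer builds the same information as a document-term matrix (the transpose of a term-document matrix):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["the hub provides technology", "the hub supports research"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix
df_counts = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(df_counts)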

Key Points