This lesson is still being designed and assembled (Pre-Alpha version)

Text Mining using Python

Introduction

Overview

Time: min
Objectives

What is Text Mining?

Text Mining is the process of deriving meaningful information from natural language text.

The overall goal is to turn texts into data for analysis via the application of Natural Language Processing (NLP).

What is NLP?

Natural Language Processing (NLP) is a field of computer science and artificial intelligence that deals with human languages.

In other words, NLP is the component of text mining that performs a special kind of linguistic analysis, essentially helping a machine “read” text.

Key Points


Tokenization

Overview

Time: min
Objectives

What is Tokenization?

It is the process of breaking large text into smaller chunks called tokens. Formally, a token is a single entity that serves as a building block of a sentence or paragraph. Tokenization is sometimes as simple as splitting the text on white space.

import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # download the Punkt tokenizer models

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''

token = word_tokenize(text)  # split the text into a list of word tokens
token

We are using the Punkt tokenizer. This tokenizer uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and uses that model to split text into tokens.
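For comparison, here is a minimal sketch (not part of the original lesson code) contrasting naive whitespace splitting with Punkt's sentence tokenization; it reuses the text variable defined above:

from nltk.tokenize import sent_tokenize

# naive whitespace splitting keeps punctuation attached to the words
print(text.split()[:8])

# Punkt splits the same text into sentences
for sentence in sent_tokenize(text):
    print(sentence)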

Using tokenization to count words

from nltk.probability import FreqDist

frequencies = FreqDist(token)  # count how often each token occurs
print(frequencies)
frequencies
top_ten_words = frequencies.most_common(10)  # the ten most frequent tokens
top_ten_words
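FreqDist can also plot the distribution directly; a small sketch, assuming matplotlib is installed (FreqDist.plot relies on it):

frequencies.plot(10, cumulative=False)  # plot the ten most frequent tokens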

Key Points


Stemming

Overview

Time: min
Objectives

What is Stemming?

Stemming is the process of reducing tokens to their root forms; for example, “studying” and “studied” are reduced to a common stem. There are two commonly used stemming techniques in Python.

Of these, Lancaster stemming is the more aggressive: it has roughly twice as many rules as the Porter stemmer and tends to over-stem words.

Porter Stemming

from nltk.stem import PorterStemmer

pst = PorterStemmer()
stm = ["giving", "given", "gave"]
for word in stm:
    print(word + ":" + pst.stem(word))  # print each word next to its Porter stem

Lancaster Stemming

from nltk.stem import LancasterStemmer

lst = LancasterStemmer()
stm = ["giving", "given", "gave"]
for word in stm:
    print(word + ":" + lst.stem(word))  # print each word next to its Lancaster stem

Key Points


Lemmatization

Overview

Time: min
Objectives

What is Lemmatization?

It is the process of converting a word to its base form, and in that sense it resembles stemming.

However, the main difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form (its lemma), whereas stemming simply removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, for the word “caring”, stemming reduces it to “car”, whereas lemmatization reduces it to “care”, which is an actual word.

Lemmatization can be implemented in Python using the WordNet lemmatizer (NLTK), spaCy, TextBlob, or Stanford CoreNLP.

nltk.download('wordnet')   # WordNet data used by the lemmatizer
nltk.download('omw-1.4')   # Open Multilingual Wordnet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# pos="v" tells the lemmatizer to treat the word as a verb
print("caring :", lemmatizer.lemmatize("caring", pos="v"))
# compare with the Lancaster stemmer (lst) from the previous episode
print("caring :" + lst.stem("caring"))
print("corpora :", lemmatizer.lemmatize("corpora"))

Key Points


Stop Words

Overview

Time: min
Objectives

What are stop words?

Commonly used words like “the”, “a”, “at”, “for”, “above”, “on”, “is”, and “all” are called stop words. While processing text, we remove these words because they carry little meaning and have no significant effect on the analysis. This step depends highly on the language. NLTK provides a stopwords corpus that holds predefined stop-word lists for many languages.

from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')  # download the stop-word lists
stop_words = set(stopwords.words('english'))  # the English stop-word list

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''

tokens = word_tokenize(text.lower())
filtered_words = [x for x in tokens if x not in stop_words]  # keep only non-stop words
print(filtered_words)

Key Points


Part of speech tagging (POS)

Overview

Time: min
Objectives

What is POS tagging?

Through POS tagging, each token is assigned a part of speech (noun, verb, pronoun, adverb, etc.). This can be done in Python using taggers from NLTK, spaCy, TextBlob, Stanford CoreNLP, etc.

nltk.download('averaged_perceptron_tagger')  # the default NLTK POS-tagger model
text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''

tex = word_tokenize(text)
# tag the whole token list at once so the tagger can use surrounding context
for token, tag in nltk.pos_tag(tex):
    print(token, tag)

Key Points


Named Entity Recognition

Overview

Time: min
Objectives

What is Named Entity Recognition?

It is the process of detecting named entities in text, such as person names, locations, organizations, quantities, and monetary values.

nltk.download('maxent_ne_chunker')  # the named-entity chunker model
nltk.download('words')              # the word list the chunker relies on
text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''
from nltk import ne_chunk

token = word_tokenize(text)
tags = nltk.pos_tag(token)  # the chunker needs POS-tagged tokens
chunk = ne_chunk(tags)      # group tagged tokens into named-entity subtrees
chunk

Key Points


Chunking

Overview

Time: min
Objectives

What is Chunking?

Chunking groups words or tokens into larger units (chunks), such as noun phrases, based on their part-of-speech tags.

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students, 
staff and faculty with digital scholarship and humanities experimental research and instruction. 
The Hub provides technology, data and individual consultations to encourage creative, 
innovative and non-traditional research and development. '''
token = word_tokenize(text)
tags = nltk.pos_tag(token)

# grammar: a noun phrase (NP) is an optional determiner (DT),
# any number of adjectives (JJ), followed by a noun (NN)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)

Key Points


Word Cloud

Overview

Time: min
Objectives

What is a Word Cloud?

A Word Cloud (or Tag Cloud) is a visual representation of text data in the form of tags, typically single words whose importance is indicated by their size and color. As unstructured text data continues to grow at an unprecedented rate, especially on social media, there is an ever-increasing need to analyze the massive amounts of text these systems generate. A Word Cloud is an excellent way to interpret text visually and to gain quick insight into the most prominent items in a given text by visualizing word frequency as a weighted list.

Word clouds are normally used to display the frequency with which words appear in a particular document or speech. More frequently used words appear larger in the word cloud, and frequency is assumed to reflect a term's importance in the context of the document.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# 'post' is assumed to be a pandas DataFrame with a text column named 'Post'
wordcloud = WordCloud(width=1000, height=500, stopwords=STOPWORDS,
                      background_color='white').generate(' '.join(post['Post']))
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Key Points


Sentiment Analysis

Overview

Time: min
Objectives

What is Sentiment Analysis?

Quantifying users' content, ideas, beliefs, and opinions is known as sentiment analysis. Users' online posts, blogs, tweets, and product feedback help businesses understand their target audience and innovate in products and services. Sentiment analysis helps in understanding people in a better and more accurate way. It is not limited to marketing; it can also be used in politics, research, and security.

There are two main approaches to performing sentiment analysis.

Lexicon-based: count the number of positive and negative words in the given text; whichever count is larger determines the sentiment of the text.

Machine-learning based: develop a classification model trained on a pre-labeled dataset of positive, negative, and neutral examples.
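The lesson does not include code for this episode, but as a minimal lexicon-based sketch, NLTK ships with the VADER analyzer, which scores text against a built-in sentiment lexicon:

import nltk
nltk.download('vader_lexicon')  # VADER's sentiment lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# the 'compound' score ranges from -1 (most negative) to +1 (most positive)
print(sia.polarity_scores("The Hub staff are wonderful and helpful."))
print(sia.polarity_scores("The printer is broken again and nobody has fixed it."))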

Key Points


Term Document Matrix

Overview

Time: min
Objectives

What is Term Document Matrix?

A term-document matrix maps a collection of n documents into the vector space model: rows correspond to terms, columns to documents, and each cell holds the frequency of a term in a document. In other words, it creates a numerical representation of the documents. Representing text as a numerical structure is a common starting point for text mining and analytics such as search and ranking, creating taxonomies, categorization, document similarity, and text-based machine learning.

import os
import textmining

# 'post_corpus' is assumed to be a list of document strings
tdm = textmining.TermDocumentMatrix()  # a class from the textmining library
for doc in post_corpus:
    tdm.add_doc(doc)  # update the matrix with each document
type(tdm)

os.chdir("../working")
tdm.write_csv("TDM_DataFrame.csv", cutoff=1)  # keep every term that occurs at least once

# the same steps wrapped in a reusable function
def build_matrix(document_list):
    print("building matrix...")
    tdm = textmining.TermDocumentMatrix()
    for doc in document_list:
        tdm.add_doc(doc)
    # write the term-document matrix to a CSV file
    tdm.write_csv(r'path\matrix.csv', cutoff=1)

df = pd.read_csv("TDM_DataFrame.csv")
df.head(20)
df.shape
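As an alternative sketch (a substitution, not part of the original lesson), scikit-learn's CountVectorizer builds the same information as a document-term matrix (the transpose of a term-document matrix):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["the hub provides technology", "the hub supports research"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix
df_counts = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(df_counts)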

Key Points