
Tokenization

Overview

Objectives

What is Tokenization?

Tokenization is the process of breaking a large piece of text into smaller chunks. Formally, a token is a single entity that serves as a building block of a sentence or paragraph. Tokenization is sometimes as simple as splitting the text on white space.
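For example, Python's built-in str.split() does exactly that. A minimal sketch (the example sentence is our own, not from the lesson data):

# Naive tokenization: split on whitespace only
sentence = "The Hub provides technology, data and individual consultations."
print(sentence.split())
# ['The', 'Hub', 'provides', 'technology,', 'data', 'and', 'individual', 'consultations.']

Notice that punctuation stays attached to the words ('technology,', 'consultations.'). This is one reason real tokenizers, such as NLTK's word_tokenize used below, do more than split on whitespace.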

import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models (only needed once)
nltk.download('punkt')

text = '''The UIC Library Digital Scholarship Hub is a facility that is available to support students,
staff and faculty with digital scholarship and humanities experimental research and instruction.
The Hub provides technology, data and individual consultations to encourage creative,
innovative and non-traditional research and development.'''

# Split the text into word-level tokens
tokens = word_tokenize(text)
tokens

We are using the Punkt tokenizer. Punkt uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and applies that model to split text into tokens.
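Punkt is primarily a sentence-boundary model, so the same machinery can also split text into sentences rather than words. A minimal sketch reusing the text variable defined above (sent_tokenize is NLTK's Punkt-backed sentence tokenizer):

from nltk.tokenize import sent_tokenize

# Split the text into sentences using the Punkt model
sentences = sent_tokenize(text)
sentences
# Returns a list containing the two sentences of our example text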

Using tokenization to count words

from nltk.probability import FreqDist

# Count how often each token appears
frequencies = FreqDist(tokens)
print(frequencies)

# The ten most frequent tokens and their counts
top_ten_words = frequencies.most_common(10)
top_ten_words
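Because word_tokenize treats punctuation marks as tokens, commas and periods can crowd out real words in the counts. A small optional extension (our own addition, not part of the original lesson code) that keeps only alphabetic tokens, lowercased, before counting:

# Filter out punctuation and normalize case before counting
words = [t.lower() for t in tokens if t.isalpha()]
word_frequencies = FreqDist(words)
word_frequencies.most_common(10)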

Key Points