Term Document Matrix
Overview
Time: min
Objectives
What is Term Document Matrix?
A term-document matrix maps a collection of n documents into the vector space model: each document becomes a numerical vector, so the whole collection becomes a matrix. Representing text as a numerical structure is a common starting point for text mining and analytics such as search and ranking, taxonomy creation, categorization, document similarity, and text-based machine learning.
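As a toy illustration of this mapping, the sketch below builds a raw-count term-document matrix from a hypothetical three-document corpus using only the standard library (the documents and terms here are made up for the example):

```python
from collections import Counter

# Hypothetical corpus of three short documents
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

# Vocabulary: every distinct term across the corpus
vocab = sorted({term for doc in docs for term in doc.split()})

# Term-document matrix: one row per document, one column per term;
# each entry is a raw term count (zero = term absent from the document)
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Real libraries weight these raw counts (for example with tf-idf, below) rather than using them directly.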
- An entry in the matrix is the “weight” of a term in a document; zero means the term has no significance in that document or simply does not occur in it.
- A typical weighting scheme is tf-idf: w = tf × idf.
- Term Frequency (TF): terms that occur more often in a document are more indicative of its topic. Term frequency is usually normalized by document length: TF = (number of times term t appears in the document) / (total number of terms in the document).
- Inverse Document Frequency (IDF): terms that appear in many different documents are less indicative of any one document’s topic. IDF = log10(total number of documents / number of documents containing term t).
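The two formulas above can be implemented directly. This is a minimal sketch over a hypothetical tokenized corpus (the function names and documents are illustrative, not part of any library):

```python
import math

# Hypothetical corpus: three documents, already tokenized
docs = [
    ["data", "mining", "data"],
    ["text", "mining"],
    ["data", "science"],
]

def tf(term, doc):
    # TF = (count of term in the document) / (total terms in the document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF = log10(total documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    # The combined weight w = tf * idf
    return tf(term, doc) * idf(term, docs)

# "data" occurs twice in the first document and in 2 of the 3 documents
print(tf_idf("data", docs[0], docs))
```

Note that a term occurring in every document gets idf = log10(1) = 0, so tf-idf deliberately zeroes out terms that carry no discriminating power.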
import os
import textmining  # third-party library providing TermDocumentMatrix

tdm = textmining.TermDocumentMatrix()  # create an empty term-document matrix
for doc in post_corpus:
    tdm.add_doc(doc)  # update the matrix with each document
type(tdm)

os.chdir("../working")
tdm.write_csv("TDM_DataFrame.csv", cutoff=1)  # keep terms occurring in at least 1 document
def buildMatrix(document_list):
    print("building matrix...")
    tdm = textmining.TermDocumentMatrix()
    for doc in document_list:
        tdm.add_doc(doc)
    # write the term-document matrix to a CSV file
    tdm.write_csv(r'path\matrix.csv', cutoff=1)
import pandas as pd

df = pd.read_csv("TDM_DataFrame.csv")  # load the term-document matrix
df.head(20)  # first 20 documents (rows)
df.shape     # (number of documents, number of terms)
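Once the matrix is in a DataFrame (one row per document, one column per term), column sums give corpus-wide term frequencies. The sketch below uses a small hand-built DataFrame shaped like the CSV output, since the actual file contents depend on your corpus:

```python
import pandas as pd

# Hypothetical term-document matrix: rows are documents, columns are terms
df = pd.DataFrame(
    {"data": [2, 0, 1], "mining": [1, 1, 0], "text": [0, 1, 0]}
)

# Corpus-wide frequency of each term (column sums), most frequent first
term_totals = df.sum().sort_values(ascending=False)
print(term_totals)
```

The same one-liner works on the DataFrame loaded from TDM_DataFrame.csv, which is a quick sanity check that the matrix was built as expected.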
Key Points