Term Document Matrix
Overview
Time: min
Objectives
What is Term Document Matrix?
A term-document matrix maps a collection of n documents into the vector space model: each document becomes a numerical vector, so the whole collection becomes a matrix. Representing text as a numerical structure is a common starting point for text mining and analytics such as search and ranking, taxonomy creation, categorization, document similarity, and text-based machine learning.
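As a toy illustration of this mapping, the sketch below builds a raw-count term-document matrix from a hypothetical three-document corpus using only the standard library (the documents and terms here are made up for the example):

```python
from collections import Counter

# Hypothetical corpus of three short documents
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

# Vocabulary: every distinct term across the corpus
vocab = sorted({term for doc in docs for term in doc.split()})

# Term-document matrix: one row per document, one column per term;
# each entry is a raw term count (zero = term absent from the document)
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Real libraries weight these raw counts (for example with tf-idf, below) rather than using them directly.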
- An entry in the matrix is the “weight” of a term in a document; zero means the term has no significance in that document or simply does not occur in it.
- A typical weighting scheme is tf-idf: w = tf × idf.
- Term Frequency (TF): terms that occur more often in a document are more indicative of its topic. Term frequency is usually normalized by document length: TF = (number of times term t appears in the document) / (total number of terms in the document).
- Inverse Document Frequency (IDF): terms that appear in many different documents are less indicative of any one document’s topic. IDF = log10(total number of documents / number of documents containing term t).
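The two formulas above can be implemented directly. This is a minimal sketch over a hypothetical tokenized corpus (the function names and documents are illustrative, not part of any library):

```python
import math

# Hypothetical corpus: three documents, already tokenized
docs = [
    ["data", "mining", "data"],
    ["text", "mining"],
    ["data", "science"],
]

def tf(term, doc):
    # TF = (count of term in the document) / (total terms in the document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF = log10(total documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    # The combined weight w = tf * idf
    return tf(term, doc) * idf(term, docs)

# "data" occurs twice in the first document and in 2 of the 3 documents
print(tf_idf("data", docs[0], docs))
```

Note that a term occurring in every document gets idf = log10(1) = 0, so tf-idf deliberately zeroes out terms that carry no discriminating power.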
import os
import textmining  # third-party library providing TermDocumentMatrix

tdm = textmining.TermDocumentMatrix()  # create an empty term-document matrix
for doc in post_corpus:
    tdm.add_doc(doc)  # update the matrix with each document
type(tdm)

os.chdir("../working")
tdm.write_csv("TDM_DataFrame.csv", cutoff=1)  # keep terms occurring in at least 1 document
def buildMatrix(document_list):
    print("building matrix...")
    tdm = textmining.TermDocumentMatrix()
    for doc in document_list:
        tdm.add_doc(doc)
    # write the term-document matrix to a CSV file
    tdm.write_csv(r'path\matrix.csv', cutoff=1)
import pandas as pd

df = pd.read_csv("TDM_DataFrame.csv")  # load the term-document matrix
df.head(20)  # first 20 documents (rows)
df.shape     # (number of documents, number of terms)
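Once the matrix is in a DataFrame (one row per document, one column per term), column sums give corpus-wide term frequencies. The sketch below uses a small hand-built DataFrame shaped like the CSV output, since the actual file contents depend on your corpus:

```python
import pandas as pd

# Hypothetical term-document matrix: rows are documents, columns are terms
df = pd.DataFrame(
    {"data": [2, 0, 1], "mining": [1, 1, 0], "text": [0, 1, 0]}
)

# Corpus-wide frequency of each term (column sums), most frequent first
term_totals = df.sum().sort_values(ascending=False)
print(term_totals)
```

The same one-liner works on the DataFrame loaded from TDM_DataFrame.csv, which is a quick sanity check that the matrix was built as expected.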
Key Points