Text Mining


CSCI-866-001: Data Mining & Knowledge Discovery



Lecturer: Dr. Sothea HAS

šŸ—ŗļø Content

  • Introduction & Motivation

  • Text Preprocessing

  • Text Transformation

  • Feature Selection

  • Data Mining/Pattern Recovery

  • Evaluation/Interpretation

  • Applications

Introduction & Motivation

Introduction & Motivation

Text Mining

  • Text Mining: exploration techniques for extracting meaningful patterns and insights from unstructured text šŸ—’ļø.

Introduction & Motivation

Text Mining Process

  • Text Mining often involves the following steps: text preprocessing, text transformation, feature selection, data mining/pattern recovery, and evaluation/interpretation.
  • Running example: a sample of the Spam Mails Dataset used throughout this lecture:

      label  text                                               label_num
1566  ham    Subject: hpl nom for march 30 , 2001\r\n( see ...  0
1988  spam   Subject: online pharxmacy 80 % off all meds\r\...  1
1235  ham    Subject: re : nom / actual volume for april 17...  0

Text Preprocessing

Text Preprocessing

  • Text preprocessing: the process of cleaning and normalizing raw text data for analysis.
  • It’s essential for all NLP pipelines:
    • Voice recognition systems
    • Search engines
    • ML model training…
  • Ex: Consider the 1st & 2nd emails from the Spam Mails Dataset:
Code
data['text'][0]
"Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes ."
Code
data['text'][1]
'Subject: hpl nom for january 9 , 2001\r\n( see attached file : hplnol 09 . xls )\r\n- hplnol 09 . xls'

Text Preprocessing

Tokenization

"Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes ."
  • Tokenization: Breaking texts into smaller components called tokens, which can be
    • characters: 'want' \(\to\) ['w', 'a', 'n', 't'].
    • words: 'Data mining' \(\to\) ['Data', 'mining'].
    • sub-words: 'Unknown' \(\to\) ['Un', 'known']…
Code
from nltk.tokenize import word_tokenize # For tokenization
text1 = word_tokenize(data['text'][0])
print(text1)
print(f"Number of tokens = {len(text1)}")
['Subject', ':', 'enron', 'methanol', ';', 'meter', '#', ':', '988291', 'this', 'is', 'a', 'follow', 'up', 'to', 'the', 'note', 'i', 'gave', 'you', 'on', 'monday', ',', '4', '/', '3', '/', '00', '{', 'preliminary', 'flow', 'data', 'provided', 'by', 'daren', '}', '.', 'please', 'override', 'pop', "'", 's', 'daily', 'volume', '{', 'presently', 'zero', '}', 'to', 'reflect', 'daily', 'activity', 'you', 'can', 'obtain', 'from', 'gas', 'control', '.', 'this', 'change', 'is', 'needed', 'asap', 'for', 'economics', 'purposes', '.']
Number of tokens = 68
  • This is different from str.split().
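  • A quick comparison (a minimal sketch): word_tokenize separates punctuation and contractions into their own tokens, while str.split() only splits on whitespace.
Code
sample = "please override pop's daily volume."
print(sample.split())         # ['please', 'override', "pop's", 'daily', 'volume.']
print(word_tokenize(sample))  # ['please', 'override', 'pop', "'s", 'daily', 'volume', '.']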

Text Preprocessing

Remove Stopwords

['Subject', ':', 'enron', 'methanol', ';', 'meter', '#', ':', '988291', 'this', 'is', 'a', 'follow', 'up', 'to', 'the', 'note', 'i', 'gave', 'you', 'on', 'monday', ',', '4', '/', '3', '/', '00', '{', 'preliminary', 'flow', 'data', 'provided', 'by', 'daren', '}', '.', 'please', 'override', 'pop', "'", 's', 'daily', 'volume', '{', 'presently', 'zero', '}', 'to', 'reflect', 'daily', 'activity', 'you', 'can', 'obtain', 'from', 'gas', 'control', '.', 'this', 'change', 'is', 'needed', 'asap', 'for', 'economics', 'purposes', '.']
Number of tokens = 68
  • Stopwords: common words that carry little or no meaningful information (a, an, the, of, to…). They should be removed.
  • Common stopwords for many languages can be found in nltk.corpus.stopwords.
  • After dropping stopwords, the remaining tokens are often joined back into a single string.
Code
import nltk
from nltk.corpus import stopwords

# Download required NLTK data (only needed once)
# nltk.download('stopwords')
# nltk.download('punkt')

def remove_stopwords(text=None, tokens=None, language='english'):
    # Tokenize the text into words (or use the tokens provided)
    if text is not None:
        words = word_tokenize(text)
    elif tokens is not None:
        words = tokens
    else:
        raise ValueError("Either 'text' or 'tokens' must be provided!")

    # Get the set of stopwords for the specified language
    stop_words = set(stopwords.words(language))

    # Keep alphabetic tokens that are not stopwords (drops punctuation and numbers)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]

    # Join the filtered tokens back into a string
    filtered_text = ' '.join(filtered_words)

    return filtered_text, filtered_words
filtered_text, filtered_text_split = remove_stopwords(tokens=text1)
print(filtered_text)
Subject enron methanol meter follow note gave monday preliminary flow data provided daren please override pop daily volume presently zero reflect daily activity obtain gas control change needed asap economics purposes

Text Preprocessing

Stemming & Lemmatization

Subject enron methanol meter follow note gave monday preliminary flow data provided daren please override pop daily volume presently zero reflect daily activity obtain gas control change needed asap economics purposes
  • Stemming: removes prefixes and suffixes from words, reducing them to their stem or root form.
  • Ex: 'purposes' \(\to\) 'purpos' & 'happily' \(\to\) 'happili'…
Code
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer(language="english")
stemmed_words = ' '.join([stemmer.stem(word) for word in filtered_text_split]).capitalize()
print(stemmed_words)
Subject enron methanol meter follow note gave monday preliminari flow data provid daren pleas overrid pop daili volum present zero reflect daili activ obtain gas control chang need asap econom purpos
  • āš ļø It sometimes does not return a real word.
  • Lemmatization: converts words to their dictionary base form (lemma).
  • Ex: 'Happily' \(\to\) 'Happy' & 'boxes' \(\to\) 'box'…

Text Preprocessing

Stemming & Lemmatization

Code
print(stemmed_words)
Subject enron methanol meter follow note gave monday preliminari flow data provid daren pleas overrid pop daili volum present zero reflect daili activ obtain gas control chang need asap econom purpos
  • āš ļø It sometimes does not return a real word.
  • Lemmatization: converts words to their dictionary base form (lemma).
  • Ex: 'Happily' \(\to\) 'Happy' & 'boxes' \(\to\) 'box'…
  • Lemmatized text:
Code
import spacy   # to perform lemmatization

lemmatizer = spacy.load("en_core_web_sm")
doc = lemmatizer(filtered_text)
normalized_text = ' '.join([token.lemma_ for token in doc])
print(normalized_text.capitalize())
Subject enron methanol meter follow note give monday preliminary flow datum provide daren please override pop daily volume presently zero reflect daily activity obtain gas control change need asap economic purpose

Text Preprocessing

Normalization

  • Lemmatized text:
Subject enron methanol meter follow note give monday preliminary flow datum provide daren please override pop daily volume presently zero reflect daily activity obtain gas control change need asap economic purpose
  • Normalization: standardizing the tokens into a consistent format (see the sketch below), including
    • Case normalization: "Text" \(\to\) "text"
    • Non-alphabetic characters: "gmail.com" \(\to\) "gmaildotcom"
    • Hyphenated words: "co-worker" \(\to\) "coworker"…
  • Normalized text:
Code
print(normalized_text)
subject enron methanol meter follow note give monday preliminary flow datum provide daren please override pop daily volume presently zero reflect daily activity obtain gas control change need asap economic purpose
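  • A minimal normalization sketch for the rules above (a hypothetical helper using simple string replacements; real pipelines usually need more careful, task-specific rules):
Code
import re

def normalize(text):
    text = text.lower()               # case normalization: "Text" -> "text"
    text = text.replace('.', 'dot')   # non-alphabetic characters: "gmail.com" -> "gmaildotcom"
    text = text.replace('-', '')      # hyphenated words: "co-worker" -> "coworker"
    return re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace

print(normalize("My Co-worker uses Gmail.com"))  # my coworker uses gmaildotcom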

Text Preprocessing

Preprocessed text

  • Normalized text:
subject enron methanol meter follow note give monday preliminary flow datum provide daren please override pop daily volume presently zero reflect daily activity obtain gas control change need asap economic purpose
  • Question: Why is Text Preprocessing important?
  • Answer: It reduces noise (stopwords, punctuation…) and allows us to more accurately
    • Access the information embedded within the text
    • Graphically represent the text
    • Encode/embed it into numerical values
    • Perform cross-document comparison…
  • These will be covered in the next section!

Text Transformation

Text Transformation

  • Consider some examples:
    1. This is a very good movie.
    2. This movie is boring.
    3. This movie is better than the previous movie.
    4. The previous movie is better than this movie.
  • These are easily understood by us humans.
  • However, a computer cannot process them directly (not even the preprocessed version).
  • Text Transformation: transforms (preprocessed) text into numerical vectors.
  • We will explore some Text Transformation techniques:
    • Bag-of-Words (BoW)
    • Text Embedding Models

Text Transformation

Bag-of-Words (BoW)

  • Bag-of-Words (BoW): a fundamental technique in Text Transformation.
  • Some common types:
    • Binary Term Occurrence: 0-1 vector
    • Term Occurrence: counts of terms
    • Term Frequency: proportion of terms
    • Term Frequency-Inverse Document Frequency (TF-IDF): weights each term according to how rare it is across documents.

Text Transformation

Bag-of-Words (BoW)

  • Consider our 4 sentences:
    1. This is a very good movie.
    2. This movie is boring.
    3. This movie is better than the previous movie.
    4. The previous movie is better than this movie.

Binary Term Occurrence (BTO)

  • Lemmatize each sentence and represent it as a binary (0-1) row vector according to whether the lemma (column) belongs to the sentence or not.
Code
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

def create_binary_lemma_occurrence_table(sentences):
    """
    Convert sentences to binary lemma occurrence representation.
    
    Args:
        sentences: List of sentences
    
    Returns:
        DataFrame with binary lemma occurrence representation
    """
    
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Preprocess sentences: tokenize, lemmatize, and clean
    processed_sentences = []
    for sentence in sentences:
        # Tokenize and convert to lowercase
        tokens = word_tokenize(sentence.lower())
        
        # Remove punctuation and keep only alphabetic words
        tokens = [token for token in tokens if token.isalpha()]
        
        # Lemmatize each token
        lemmas = [lemmatizer.lemmatize(token) for token in tokens]
        processed_sentences.append(lemmas)
    
    # Get all unique lemmas across all sentences
    all_lemmas = set()
    for lemmas in processed_sentences:
        all_lemmas.update(lemmas)
    
    # Sort lemmas alphabetically for consistent ordering
    all_lemmas = sorted(list(all_lemmas))
    
    # Create binary occurrence matrix
    binary_matrix = []
    for lemmas in processed_sentences:
        lemma_set = set(lemmas)  # Convert to set for O(1) lookup
        binary_row = [1 if lemma in lemma_set else 0 for lemma in all_lemmas]
        binary_matrix.append(binary_row)
    
    # Create DataFrame with lemmas as columns and sentences as rows
    df = pd.DataFrame(binary_matrix, columns=all_lemmas)
    df.index = [f"Sentence {i+1}" for i in range(len(sentences))]
    
    return df, all_lemmas, processed_sentences

# Define the sentences
sentences = [
    "This is a very good movie.",
    "This movie is boring.",
    "This movie is better than the previous movie.",
    "The previous movie is better than this movie."
]

# Create the binary lemma occurrence table
binary_table, lemmas, sentences_ = create_binary_lemma_occurrence_table(sentences)
binary_table
a better boring good is movie previous than the this very
Sentence 1 1 0 0 1 1 1 0 0 0 1 1
Sentence 2 0 0 1 0 1 1 0 0 0 1 0
Sentence 3 0 1 0 0 1 1 1 1 1 1 0
Sentence 4 0 1 0 0 1 1 1 1 1 1 0

Text Transformation

Bag-of-Words (BoW)

  • Consider our 4 sentences:
    1. This is a very good movie.
    2. This movie is boring.
    3. This movie is better than the previous movie.
    4. The previous movie is better than this movie.

Term Occurrence (TO)

  • Lemmatize each sentence and represent it as a row vector using the number of occurrences of each lemma (column) in that sentence.
Code
def create_lemma_occurrence_table(sentences, lemmas = None, processed_sentences = None):
    """
    Convert sentences to binary lemma occurrence representation.
    
    Args:
        sentences: List of sentences
    
    Returns:
        DataFrame with binary lemma occurrence representation
    """

    if lemmas is None and processed_sentences is None:
        # Initialize lemmatizer
        lemmatizer = WordNetLemmatizer()
        
        # Preprocess sentences: tokenize, lemmatize, and clean
        processed_sentences_ = []
        for sentence in sentences:
            # Tokenize and convert to lowercase
            tokens = word_tokenize(sentence.lower())
            
            # Remove punctuation and keep only alphabetic words
            tokens = [token for token in tokens if token.isalpha()]
            
            # Lemmatize each token
            lemmas = [lemmatizer.lemmatize(token) for token in tokens]
            processed_sentences_.append(lemmas)
        
        # Get all unique lemmas across all sentences
        all_lemmas = set()
        for lemmas in processed_sentences_:
            all_lemmas.update(lemmas)

        # Sort lemmas alphabetically for consistent ordering
        all_lemmas = sorted(all_lemmas)
    else:
        all_lemmas = lemmas
        processed_sentences_ = processed_sentences
    
    # Create term frequency matrix
    tf_matrix = []
    for i, lemmas in enumerate(processed_sentences_):
        lemma_counts = {}
        for lemma in lemmas:
            lemma_counts[lemma] = lemma_counts.get(lemma, 0) + 1

        # Create frequency row for all lemmas
        tf_row = [lemma_counts.get(lemma, 0) for lemma in all_lemmas]
        tf_matrix.append(tf_row)
    
    # Create DataFrame with lemmas as columns and sentences as rows
    df = pd.DataFrame(tf_matrix, columns=all_lemmas)
    df.index = [f"Sentence {i+1}" for i in range(len(sentences))]
    
    return df

# Define the sentences
sentences = [
    "This is a very good movie.",
    "This movie is boring.",
    "This movie is better than the previous movie.",
    "The previous movie is better than this movie."
]

# Create the term occurrence table
tf_table = create_lemma_occurrence_table(sentences, lemmas, sentences_)
tf_table
a better boring good is movie previous than the this very
Sentence 1 1 0 0 1 1 1 0 0 0 1 1
Sentence 2 0 0 1 0 1 1 0 0 0 1 0
Sentence 3 0 1 0 0 1 2 1 1 1 1 0
Sentence 4 0 1 0 0 1 2 1 1 1 1 0

Text Transformation

Bag-of-Words (BoW)

  • Consider our 4 sentences:
    1. This is a very good movie.
    2. This movie is boring.
    3. This movie is better than the previous movie.
    4. The previous movie is better than this movie.

Term Frequency (TF)

  • Lemmatize each sentence and represent it as a row vector using the proportion of occurrences of each lemma (column) in that sentence (its count divided by the sentence length).
Code
import numpy as np  # needed below for rounding the frequency rows

def create_lemma_freq_table(sentences, lemmas = None, processed_sentences = None):
    if lemmas is None and processed_sentences is None:
        # Initialize lemmatizer
        lemmatizer = WordNetLemmatizer()
        
        # Preprocess sentences: tokenize, lemmatize, and clean
        processed_sentences_ = []
        for sentence in sentences:
            # Tokenize and convert to lowercase
            tokens = word_tokenize(sentence.lower())
            
            # Remove punctuation and keep only alphabetic words
            tokens = [token for token in tokens if token.isalpha()]
            
            # Lemmatize each token
            lemmas = [lemmatizer.lemmatize(token) for token in tokens]
            processed_sentences_.append(lemmas)
        
        # Get all unique lemmas across all sentences
        all_lemmas = set()
        for lemmas in processed_sentences_:
            all_lemmas.update(lemmas)

        # Sort lemmas alphabetically for consistent ordering
        all_lemmas = sorted(all_lemmas)
    else:
        all_lemmas = lemmas
        processed_sentences_ = processed_sentences
    
    # Create term frequency matrix
    tf_matrix = []
    for i, lemmas in enumerate(processed_sentences_):
        lemma_counts = {}
        for lemma in lemmas:
            lemma_counts[lemma] = lemma_counts.get(lemma, 0) + 1

        # Create frequency row for all lemmas
        tf_row = [lemma_counts.get(lemma, 0) for lemma in all_lemmas]
        s = sum(tf_row)
        tf_matrix.append(np.round(np.array(tf_row)/s,2))
    
    # Create DataFrame with lemmas as columns and sentences as rows
    df = pd.DataFrame(tf_matrix, columns=all_lemmas)
    df.index = [f"Sentence {i+1}" for i in range(len(sentences))]
    
    return df

# Define the sentences
sentences = [
    "This is a very good movie.",
    "This movie is boring.",
    "This movie is better than the previous movie.",
    "The previous movie is better than this movie."
]

# Create the term frequency table
tf_table = create_lemma_freq_table(sentences, lemmas, sentences_)
tf_table
a better boring good is movie previous than the this very
Sentence 1 0.17 0.00 0.00 0.17 0.17 0.17 0.00 0.00 0.00 0.17 0.17
Sentence 2 0.00 0.00 0.25 0.00 0.25 0.25 0.00 0.00 0.00 0.25 0.00
Sentence 3 0.00 0.12 0.00 0.00 0.12 0.25 0.12 0.12 0.12 0.12 0.00
Sentence 4 0.00 0.12 0.00 0.00 0.12 0.25 0.12 0.12 0.12 0.12 0.00

Text Transformation

Bag-of-Words (BoW)

  • Consider our 4 sentences:
    1. This is a very good movie.
    2. This movie is boring.
    3. This movie is better than the previous movie.
    4. The previous movie is better than this movie.

Term Frequency-Inverse Document Frequency (TF-IDF)

  • \(\text{TF-IDF}(\color{blue}{w},\color{red}{s})=\text{TF}(\color{blue}{w},\color{red}{s})\times\log\big(\text{total no. of snt.}\,/\,\text{no. of snt. containing }\color{blue}{w}\big)\).
Code
def create_lemma_freq_table(sentences, lemmas = None, processed_sentences = None):
    if lemmas is None and processed_sentences is None:
        # Initialize lemmatizer
        lemmatizer = WordNetLemmatizer()
        
        # Preprocess sentences: tokenize, lemmatize, and clean
        processed_sentences_ = []
        for sentence in sentences:
            # Tokenize and convert to lowercase
            tokens = word_tokenize(sentence.lower())
            
            # Remove punctuation and keep only alphabetic words
            tokens = [token for token in tokens if token.isalpha()]
            
            # Lemmatize each token
            lemmas = [lemmatizer.lemmatize(token) for token in tokens]
            processed_sentences_.append(lemmas)
        
        # Get all unique lemmas across all sentences
        all_lemmas = set()
        for lemmas in processed_sentences_:
            all_lemmas.update(lemmas)

        # Sort lemmas alphabetically for consistent ordering
        all_lemmas = sorted(all_lemmas)
    else:
        all_lemmas = lemmas
        processed_sentences_ = processed_sentences
    
    idf_ = {}
    for lemma in all_lemmas:
        s = sum([1 if lemma in sentence else 0 for sentence in processed_sentences_])
        idf_[lemma] = np.log(len(processed_sentences_)/s)
    
    # Create term frequency matrix
    tf_matrix = []
    for i, lemmas in enumerate(processed_sentences_):
        lemma_counts = {}
        for lemma in lemmas:
            lemma_counts[lemma] = lemma_counts.get(lemma, 0) + 1

        # Create frequency row for all lemmas
        tf_row = [lemma_counts.get(lemma, 0) for lemma in all_lemmas]
        s = sum(tf_row)
        tf_matrix.append(np.round(np.array(list(idf_.values()))*np.array(tf_row)/s,2))

    # Create DataFrame with lemmas as columns and sentences as rows
    df = pd.DataFrame(tf_matrix, columns=all_lemmas)
    df.index = [f"Sentence {i+1}" for i in range(len(sentences))]
    
    return df

# Define the sentences
sentences = [
    "This is a very good movie.",
    "This movie is boring.",
    "This movie is better than the previous movie.",
    "The previous movie is better than this movie."
]

# Create the TF-IDF table
tf_table = create_lemma_freq_table(sentences, lemmas, sentences_)
tf_table
a better boring good is movie previous than the this very
Sentence 1 0.23 0.00 0.00 0.23 0.0 0.0 0.00 0.00 0.00 0.0 0.23
Sentence 2 0.00 0.00 0.35 0.00 0.0 0.0 0.00 0.00 0.00 0.0 0.00
Sentence 3 0.00 0.09 0.00 0.00 0.0 0.0 0.09 0.09 0.09 0.0 0.00
Sentence 4 0.00 0.09 0.00 0.00 0.0 0.0 0.09 0.09 0.09 0.0 0.00

šŸ”‘ A large TF-IDF value means the word is frequent in that sentence but rare across the other sentences.
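  • For comparison, scikit-learn provides a ready-made TF-IDF implementation. Its defaults (smoothed IDF, L2 normalization) differ from the plain formula above, so the numbers will not match exactly; a minimal sketch, assuming a recent scikit-learn version:
Code
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()           # defaults: smooth_idf=True, norm='l2'
X = vectorizer.fit_transform(sentences)  # the 4 example sentences
tfidf_sklearn = pd.DataFrame(X.toarray().round(2),
                             columns=vectorizer.get_feature_names_out(),
                             index=[f"Sentence {i+1}" for i in range(len(sentences))])
tfidf_sklearn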

Text Transformation

Text Embedding & Pretrained Models


  • The following code computes the first 10 embedding dimensions of the previous example sentences with a lightweight SentenceTransformers model.
Code
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L12-v1')

sentences = [
    "This is a very good movie.",
    "This movie is boring.",
    "This movie is better than the previous movie.",
    "The previous movie is better than this movie."
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

df_embed = {}
for i, embedding in enumerate(embeddings):
    df_embed['Sentence ' + str(i+1)] = embedding[:10]
df_embed = pd.DataFrame(df_embed, index = [str(d) for d in range(1,11)])
df_embed

Feature Selection

Feature Selection

  • After Text Transformation, each text/document can be treated as a row observation (a feature vector).
  • Not all features are useful, especially in preprocessed text (lemmas, normalized lemmas…).
  • Terms that are too frequent or too infrequent are often not helpful!
Code
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df / max_df accept an absolute count (int) or a proportion of documents (float)
vectorizer = TfidfVectorizer(min_df=2, max_df=0.95)  # example values
embeddings = vectorizer.fit_transform(sentences)
  • Part of Speech (POS) tags can be helpful for feature selection (see the sketch after this list).
  • What we have learned so far can also be used:
    • Feature importances
    • PCA-based methods
    • Visualization…
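  • A minimal sketch of POS-based filtering (assuming NLTK's default tagger; here only nouns and adjectives are kept):
Code
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# nltk.download('averaged_perceptron_tagger')  # needed once; resource name may vary by NLTK version

tokens = word_tokenize("The previous movie is better than this movie.")
tagged = pos_tag(tokens)  # e.g. [('The', 'DT'), ('previous', 'JJ'), ('movie', 'NN'), ...]
keep_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS'}  # noun and adjective tags
selected = [word for word, tag in tagged if tag in keep_tags]
print(selected)  # e.g. ['previous', 'movie', 'better', 'movie']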

Data Mining/Pattern Recovery

Pattern Recovery

  • The common methods of text mining include:
    • Text classification: spam vs. non-spam, positive vs. negative reviews…
    • Cluster Analysis: clustering similar news…
  • Again, we can apply the ML methods we have learned to the embedded text (see the sketch below):
    • NBC, Logistic Regression
    • KNN, Decision Trees
    • SVM, Random Forest, Adaboost, XGBoost…
  • There is one major problem: which similarity measures are suitable for comparing text-embedding vectors?
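  • As an illustration, a minimal spam-classification sketch on the Spam Mails Dataset introduced earlier (assuming the data DataFrame with 'text' and 'label_num' columns; TF-IDF features fed into a Naive Bayes classifier):
Code
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label_num'], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")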

Pattern Recovery

Jaccard Coefficient

  • It’s a popular similarity measure for asymmetric binary vectors.
  • If \(S_i\) and \(S_j\) are two binary embedding vectors, then \[J(S_i,S_j)=\frac{\text{No. of 1-1 matches}}{\text{No. of all matches EXCEPT for 0-0 matches}}.\]
  • Ex: Given 3 sentences with the following binary embedding vectors:
0 1 2 3 4 5 6 7 8 9
S1 0 1 0 1 0 0 1 0 0 1
S2 0 0 1 1 0 1 1 1 0 0
S3 1 0 0 1 1 1 1 0 0 1

One has

  • \(J(S_1,S_2)=2/7.\)
  • \(J(S_2,S_3)=3/8.\)
  • \(J(S_1,S_3)=3/7.\)
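  • A minimal sketch verifying these values in Python:
Code
S1 = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]
S2 = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
S3 = [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]

def jaccard(a, b):
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1-1 matches
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # 0-0 matches (excluded)
    return m11 / (len(a) - m00)

print(jaccard(S1, S2), jaccard(S2, S3), jaccard(S1, S3))  # 2/7, 3/8, 3/7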

Pattern Recovery

Cosine Similarity

  • It’s a popular similarity measure for weighted document vectors such as TF or TF-IDF.
  • If \(S_i\) and \(S_j\) are two TF or TF-IDF vectors, then \[\cos(S_i,S_j)=\frac{\langle S_i,S_j\rangle}{\|S_i\|\|S_j\|},\text{ where }\begin{cases}\langle S_i,S_j\rangle=\sum_{k=1}^n(S_i)_{k}(S_j)_{k}\\ \|S\|=\sqrt{\sum_{k=1}^nS_k^2} \end{cases}.\]
  • Ex: Given 4 sentences with the following TF-IDF vectors:
a better boring good is movie previous than the this very
S1 0.23 0.00 0.00 0.23 0.0 0.0 0.00 0.00 0.00 0.0 0.23
S2 0.00 0.00 0.35 0.00 0.0 0.0 0.00 0.00 0.00 0.0 0.00
S3 0.00 0.09 0.00 0.00 0.0 0.0 0.09 0.09 0.09 0.0 0.00
S4 0.00 0.09 0.00 0.00 0.0 0.0 0.09 0.09 0.09 0.0 0.00
  • Cosine similarity on TF-IDF:

  • \(\cos(S_1,S_2)=\) 0.0.

  • \(\cos(S_2,S_3)=\) 0.0.

  • \(\cos(S_3,S_4)=\) 1.0.
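  • A minimal sketch computing these values with scikit-learn (using the TF-IDF table tf_table computed earlier):
Code
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sim = cosine_similarity(tf_table.values)  # 4 x 4 pairwise cosine similarity matrix
print(np.round(sim, 2))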

Summary

  • Main challenge in text mining: preprocessing and vectorization, so that well-known Data Mining algorithms can be applied.

  • There are lots of alternative techniques, so you need to experiment to find out which ones work well for your use case.

  • The focus has shifted from bag-of-words approaches to embeddings.

  • Text mining can be tricky, but acceptable results are often easy to achieve.

  • Make sure to choose a suitable token type for your task.

🄳 It’s party time šŸ„‚