EDA on Text data
by Hasan
- This is part of a series; the first, second, and third parts are linked. In this part I will write about text data.
Feature extraction from text
- Bag of words
- Embeddings (~word2vec)
1. Bag of words
1.1. CountVectorizer
- the text is split into words and the number of occurrences of each word is counted
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Very very sunny day", "The cow jumped over the moon"]  # example corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer sklearn versions
print(X.toarray())
- We may need some post-processing: models such as KNN and neural networks are sensitive to the scale of the features, so the counts should be normalized. TF-IDF is one way to do this.
1.2. TfidfVectorizer
- TF-IDF uses not the raw term frequency but a normalized frequency.
- Term frequency (normalize each row so that a document's counts sum to 1):
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf
- Inverse document frequency (down-weight words that occur in many documents):
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))
x = x * idf
sklearn.feature_extraction.text.TfidfVectorizer
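A minimal runnable sketch of the manual computation above (the two-sentence corpus is just an example; note that sklearn's TfidfVectorizer applies a smoothed idf and L2 normalization by default, so its output differs slightly from this hand-rolled version):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Very very sunny day", "The cow jumped over the moon"]
x = CountVectorizer().fit_transform(corpus).toarray().astype(float)

# Term frequency: each row (document) is normalized to sum to 1
x = x / x.sum(axis=1)[:, None]

# Inverse document frequency: log of (number of docs / number of docs containing the word)
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))
x = x * idf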
1.3. N-grams
- Use not only single words but sequences of n consecutive words
sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,2))
# the analyzer parameter switches between word and character n-grams
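A quick sketch of what ngram_range and analyzer do (the example sentence is made up):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # word unigrams and bigrams
vec.fit(["very sunny day"])
print(vec.get_feature_names_out())
# ['day' 'sunny' 'sunny day' 'very' 'very sunny']

char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))  # character trigrams instead of words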
Text Preprocessing
Before applying any bag of words we need to preprocess the text: lowercase it, remove punctuation and stopwords, apply stemming or lemmatization, etc. The conventional preprocessing steps are:
- Tokenization -> Very very sunny day -> [Very, very, sunny, day]
- Lowercasing -> [Very, very, sunny, day] -> [very, very, sunny, day] -> CountVectorizer from sklearn does this automatically (lowercase=True by default)
- Removing punctuation
- Removing stopwords -> [The cow jumped over the moon] -> [cow, jumped, moon]
- Articles and prepositions
- Very common words
- The NLTK library can be used (it ships stopword lists)
- sklearn.feature_extraction.text.CountVectorizer(max_df)
- max_df is a document-frequency threshold; words occurring in a larger fraction of documents than this are removed (see the sketch below)
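A small sketch combining both options (the corpus and the 0.9 threshold are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cow jumped over the moon", "The moon is very bright"]
# Drop built-in English stopwords, plus any word occurring in more than 90% of documents
vec = CountVectorizer(stop_words='english', max_df=0.9)
vec.fit(corpus)
print(vec.get_feature_names_out())
# ['bright' 'cow' 'jumped'] -- 'moon' appears in every document, so max_df removes it too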
Stemming/Lemmatization
- Stemming
- [democracy, democratic, democratization] -> [democr]
- [Saw] -> [s]
- Lemmatization
- [democracy, democratic, democratization] -> [democracy]
- [Saw, sawing, sawed] -> [see or saw] depending on the context
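A short sketch with NLTK (assumes the wordnet corpus has been fetched via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
# Stemming crudely chops suffixes: the three words collapse to similar stems
print([stemmer.stem(w) for w in ["democracy", "democratic", "democratization"]])

lemmatizer = WordNetLemmatizer()
# Lemmatization looks words up in a dictionary; the part-of-speech tag decides the lemma
print(lemmatizer.lemmatize("saw", pos="v"))  # verb (past tense) -> 'see'
print(lemmatizer.lemmatize("saw", pos="n"))  # noun (the tool) -> 'saw'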
Summary of Bag of Words Pipeline
- Preprocessing: lowercasing, removing punctuation, removing stopwords, stemming/lemmatization
- N-grams help to capture local context
- Post-processing: TF-IDF
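The whole pipeline fits in a single vectorizer; a minimal sketch (the parameter values are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,        # lowercasing (on by default)
    stop_words='english',  # stopword removal
    ngram_range=(1, 2),    # unigrams + bigrams for local context
)
# Stemming/lemmatization is not built in; it can be plugged in via the preprocessor/tokenizer arguments
X = vectorizer.fit_transform(["Very very sunny day", "The cow jumped over the moon"])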
2. Embeddings
Word2vec
- Vector representation of words and text
- Each word is represented as a dense vector, learned in a sophisticated way, typically with 100 dimensions or more.
- Words used in similar contexts will have similar vectors, e.g. king and queen.
- Addition and subtraction of vectors also carry meaning: king - man + woman ≈ queen
- Several implementations of word embeddings
- Word2vec
- GloVe
- FastText
- Sentence/document embeddings
- Doc2vec
- Depending on the situation we can use word or sentence embeddings; in practice, try both and keep the one that works better.
- All the preprocessing steps can be applied to the text before applying word2vec.
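A minimal training sketch using the gensim library (assuming gensim >= 4; the tiny corpus is only for illustration -- meaningful embeddings need large corpora or pretrained vectors):

from gensim.models import Word2Vec

sentences = [["very", "sunny", "day"], ["the", "cow", "jumped", "over", "the", "moon"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vec = model.wv["moon"]  # the 100-dimensional vector for one word
# With a model trained on enough text, analogies work via vector arithmetic:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])  # ~ 'queen'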
Comparison of Bag of Words and Word2vec
- Bag of words
- Very large, sparse vectors
- The meaning of each value in the vector is known (a word count or frequency)
- Word2vec
- Relatively small vectors
- Values of the vector can be interpreted only in some cases
- Words with similar meaning will have similar embeddings
Next post can be found here