Hasan's Post

19 November 2022

EDA on Text data

by Hasan

  1. Bag of words

  2. Embeddings (~word2vec)

1. Bag of words

1.1. CountVectorizer

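from sklearn.feature_extraction.text import CountVectorizer

# corpus: a list of strings, one per document (example data for illustration)
corpus = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # vocabulary; get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())  # dense counts, one row per document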

1.2. TfidfVectorizer

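import numpy as np

# x: dense document-term count matrix of shape (n_docs, n_terms)
tf = 1 / x.sum(axis=1)[:, None]  # reciprocal of each row sum
x = x * tf  # term frequencies: each row now sums to 1
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))  # log(N / document frequency)
x = x * idf
# scikit-learn provides this as sklearn.feature_extraction.text.TfidfVectorizer
# (by default it uses a smoothed idf and L2 row normalisation, so values differ)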
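
The TF and IDF steps above can be checked by hand on a tiny example. The following is a minimal sketch in plain Python, not the scikit-learn implementation: the count matrix `counts` and the helper `tfidf` are hypothetical names introduced for illustration, and the sketch assumes every document has at least one word and every term occurs in at least one document.

```python
import math

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns).
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

def tfidf(counts):
    """Row-normalised term frequency times unsmoothed idf, log(N / df)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    result = []
    for row in counts:
        total = sum(row)  # row sum turns raw counts into term frequencies
        result.append([(c / total) * idf[j] for j, c in enumerate(row)])
    return result

weights = tfidf(counts)
```

The second term occurs in every document, so its idf is log(3/3) = 0 and its weight vanishes in every row. The third term occurs in only one document and receives the largest idf.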

1.3 N-grams

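sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 2))
# ngram_range=(1, 2) extracts unigrams and bigrams; the analyzer parameter
# ("word", "char" or "char_wb") selects word or character n-grams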
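
To make concrete what an n-gram range of (1, 2) produces, here is a small sketch in plain Python. The function `word_ngrams` is a hypothetical helper, and the whitespace tokenizer is a simplifying assumption: CountVectorizer's default tokenization differs (for example, it drops single-character tokens such as "a").

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return word n-grams for every n in the inclusive range."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("not a good movie")
print(grams)
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

The bigrams "not a" and "a good" preserve some of the word order that single-word counts discard.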

Text Preprocessing

Summary of the bag-of-words pipeline

  1. Preprocessing: lowercasing, removing punctuation, removing stopwords, stemming/lemmatization
  2. N-grams: capture local context, i.e. word order within short windows
  3. Postprocessing: TF-IDF weighting
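
The preprocessing step of the pipeline above could look roughly like the following. This is a sketch under stated assumptions: `STOPWORDS` is a deliberately tiny, hypothetical stopword list, `preprocess` is an illustrative helper, only ASCII punctuation is removed, and stemming/lemmatization is left out because it normally relies on a dedicated library.

```python
import string

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess("The plot of the movie is great, and the acting is superb!")
print(tokens)
# ['plot', 'movie', 'great', 'acting', 'superb']
```

The resulting token list is what the n-gram and TF-IDF steps would then operate on.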

2. Embeddings

Word2vec

Comparison of bag of words and Word2vec

Next post can be found here
