In order to use textual data for predictive modelling, the text must first be split into individual words or tokens, optionally discarding unwanted ones such as punctuation or stop words. This process is called tokenization. The tokens then need to be encoded as integers or floating-point values for use as inputs to machine learning algorithms. This process is called feature extraction (or vectorization).

Scikit-learn’s CountVectorizer converts a collection of text documents to a matrix of term/token counts. It also enables pre-processing of the text data prior to generating the vector representation, which makes it a highly flexible feature representation module for text.
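To make the pre-processing point concrete, here is a minimal sketch comparing the default settings with the built-in English stop-word list. The two toy documents and the variable names are made up for illustration, and the exact contents of the stop-word list may differ between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, invented for this illustration
docs = ["The cat sat on the mat", "The dog sat on the log"]

# default settings: lowercase the text and keep tokens of two or more characters
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# same corpus, but filter out common English stop words before counting
stop_vec = CountVectorizer(stop_words="english")
stop_vec.fit(docs)
stop_vocab = sorted(stop_vec.vocabulary_)

print(default_vocab)
print(stop_vocab)
```

The learned vocabulary shrinks once stop words such as "the" are filtered out, while content words such as "cat" remain.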

About Scikit-Learn’s vectorizers

As you know, machines, as advanced as they may be, are not capable of understanding words and sentences in the same manner as humans do. To make a corpus of documents more palatable for computers, it must first be converted into some numerical structure. There are a few techniques used to achieve that, but in this post, I’m going to focus on Vector Space models, a.k.a. Bag-of-Words (BoW) models.

Bag-of-Words is a very intuitive approach to this problem. The method consists of:

1. Splitting the documents into tokens by following some sort of pattern.

2. Assigning a weight to each token proportional to the frequency with which it shows up in the document and/or corpus.

3. Creating a document-term matrix with each row representing a document and each column addressing a token.
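The three steps above can be sketched by hand in plain Python. This is only an illustration of the idea, not how Scikit-Learn implements it; the toy documents, the regular expression, and the variable names are assumptions chosen for the example.

```python
import re
from collections import Counter

# toy corpus, invented for this illustration
docs = ["the cat sat", "the cat saw the dog"]

# step 1: split each document into tokens (here: runs of 2+ word characters)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# step 2: weight each token by its raw count within the document
counts = [Counter(tokens) for tokens in tokenized]

# step 3: build the document-term matrix, one row per document, one column per token
vocabulary = sorted(set(token for tokens in tokenized for token in tokens))
matrix = [[doc_counts[token] for token in vocabulary] for doc_counts in counts]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in matrix:
    print(row)     # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

Word order is discarded along the way, which is where the "bag" in Bag-of-Words comes from.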

The vectorizer objects provided by Scikit-Learn are quite reliable right out of the box. They allow us to perform all the above steps at once efficiently, and even apply preprocessing and rules regarding the number and frequency of tokens. To top it all off, they come in three different flavors (there are other versions like DictVectorizer, but they are not that common):

Count Vectorizer: The most straightforward one. It counts the number of times a token shows up in the document and uses this value as its weight.
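The rules regarding the number and frequency of tokens mentioned earlier map onto the min_df, max_df and max_features parameters of CountVectorizer. Below is a minimal sketch on a made-up three-document corpus; the documents and thresholds are picked purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, invented for this illustration
docs = [
    "apple banana cherry",
    "apple banana",
    "apple date",
]

# keep only tokens that appear in at least 2 documents
frequent = CountVectorizer(min_df=2).fit(docs)
frequent_vocab = sorted(frequent.vocabulary_)

# drop tokens that appear in more than 70% of the documents
not_too_common = CountVectorizer(max_df=0.7).fit(docs)
not_too_common_vocab = sorted(not_too_common.vocabulary_)

# keep only the 2 tokens with the highest total count in the corpus
top_two = CountVectorizer(max_features=2).fit(docs)
top_two_vocab = sorted(top_two.vocabulary_)

print(frequent_vocab)        # ['apple', 'banana']
print(not_too_common_vocab)  # ['banana', 'cherry', 'date']
print(top_two_vocab)         # ['apple', 'banana']
```

An integer for min_df or max_df is read as an absolute document count, while a float is read as a proportion of the documents.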

Python code:

# import pandas and sklearn’s CountVectorizer class
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# create a dataframe from a word matrix
def wm2df(wm, feat_names):
    # create an index label for each row (Doc0, Doc1, ...)
    doc = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    # convert the sparse matrix to a dense array and label rows and columns
    df = pd.DataFrame(data=wm.toarray(), index=doc, columns=feat_names)
    return df

# set of documents
sentence = ['This is the first sentence in our corpus followed by one more sentence to demonstrate Bag of words',
            'This is the second sentence in our corpus with a FEW UPPER CASE WORDS and Few Title Case Words',
            'The most straightforward one, it counts the number of times a token shows up in the document and uses this value as its weight.']

# instantiate the vectorizer object
cvec = CountVectorizer(lowercase=False)

# convert the documents into a document-term matrix
wm = cvec.fit_transform(sentence)

# retrieve the terms found in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tokens = cvec.get_feature_names_out()

# create a dataframe from the matrix
wm2df(wm, tokens)
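The example above passes lowercase=False, so tokens that differ only in case end up in separate columns. As a small sketch of that effect, the snippet below fits two vectorizers on a made-up corpus, one case-sensitive and one with the default lowercasing. The documents and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, invented for this illustration: same words, different casing
docs = ["Few words", "FEW WORDS", "few words"]

# lowercase=False keeps 'Few', 'FEW' and 'few' as distinct tokens
case_sensitive = CountVectorizer(lowercase=False).fit(docs)
case_sensitive_vocab = sorted(case_sensitive.vocabulary_)

# the default (lowercase=True) folds them into a single token
case_folded = CountVectorizer().fit(docs)
case_folded_vocab = sorted(case_folded.vocabulary_)

print(case_sensitive_vocab)  # ['FEW', 'Few', 'WORDS', 'few', 'words']
print(case_folded_vocab)     # ['few', 'words']
```

Whether to keep case depends on the task: it can carry signal (acronyms, emphasis), but it also spreads counts for the same word over several columns.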

