Text Analytics - Package Creation for Pre Processing

Gopathi Suresh Kumar
4 min read · Feb 20, 2021

Natural Language Processing (NLP)

Natural Language Processing (NLP) is a subfield of linguistics, computer science, information technology, and artificial intelligence that gives machines the ability to read, understand, speak, and derive meaning from human languages.

NLP is among the hottest topics in the field of data science. Companies are investing heavily in research in this field, many people are learning NLP and its applications to build a career around it, and businesses are looking for ways to integrate it into their products.

Applications / Use cases of NLP :

1. Search Autocorrect and Autocomplete
2. Language Translator
3. Social Media Monitoring
4. Chatbots
5. Survey Analysis
6. Targeted Advertising
7. Voice Assistants
8. Grammar Checkers
9. Email Filtering, and many more

Text Preprocessing :

To preprocess your text simply means to bring it into a form that is predictable and analyzable for your task. A task here is a combination of an approach and a domain. For example, extracting top keywords with TF-IDF (the approach) from Tweets (the domain) is a task. Pre-processing corresponds to data preparation, one of the six phases of the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology.
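
To make the "TF-IDF on Tweets" task concrete, here is a minimal sketch using only the Python standard library. The sample tweets, the function names, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; library implementations often apply smoothing and normalization on top of this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hashtags and mentions kept)."""
    return re.findall(r"[#@]?\w+", text.lower())


def tfidf_top_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency, unsmoothed.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Made-up sample tweets (assumption, for illustration only).
tweets = [
    "Loving the new phone camera, the camera is amazing",
    "The new update drained my battery",
    "Battery life on the new phone is great",
]
for tweet, keywords in zip(tweets, tfidf_top_keywords(tweets)):
    print(keywords, "<-", tweet)
```

Words that occur in every tweet, such as "the" and "new", get a score of zero, which is why they never show up as keywords.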

In NLP, text preprocessing is the first step in the process of building a model. The various text preprocessing steps are:

  1. Tokenization
  2. Lower casing
  3. Stop words removal
  4. Stemming
  5. Lemmatization
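
The five steps above can be sketched with the standard library alone. Everything here is a simplified assumption: the stop word list and lemma lookup are tiny hand-made examples, and `naive_stem` is a toy suffix rule, not a real stemming algorithm such as Porter. It is only meant to show what each step does to the text.

```python
import re

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "are", "a", "an", "and", "in", "on", "were"}

# Toy lemma lookup (assumption); real lemmatizers use a vocabulary and context.
LEMMAS = {"mice": "mouse", "was": "be", "better": "good", "running": "run"}


def tokenize(text):
    """Split raw text into word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)


def lowercase(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a few common English suffixes (toy rule, not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Map a token to its dictionary form if the lookup knows it."""
    return LEMMAS.get(token, token)


text = "The mice were running in the gardens"
tokens = remove_stop_words(lowercase(tokenize(text)))
stems = [naive_stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]
print("tokens:", tokens)
print("stems: ", stems)
print("lemmas:", lemmas)
```

Note how the toy stemmer turns "running" into "runn" while the lemma lookup returns the real word "run". That is the practical difference between the two steps.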

As part of this package creation, we aimed to make data cleansing and exploratory analysis of text data easier for users by bundling various functionalities into a single function. The idea is to extend these functionalities to perform sentiment analysis of any given text data.

1. Basic Functionalities:

a. Unique Words count: Number of Uniquely Identified words.

b. Different Characters count

c. Count of stop words

d. System special characters

e. Tokenization: The benefit of Tokenization is that it gets the text into a format that’s easier to convert to raw numbers, which can actually be used for processing. It’s a natural first step when analyzing text data.

f. Number of hashtags (a new column is created to store the hashtags)

g. Number of numeric characters

h. Number of uppercase words (uppercase words are often used to express emphasis or sentiment)

i. Number of emojis
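
A rough sketch of how these basic counts could be computed for a single piece of text is shown below. The function name `basic_text_stats`, the small stop word list, the emoji ranges, and the reading of "different characters" as distinct characters are all assumptions for illustration, not the package's actual code.

```python
import re
import string

# Small illustrative stop word list (assumption).
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "this"}

# Rough emoji matcher covering common Unicode emoji blocks (an approximation).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def basic_text_stats(text):
    """Return simple exploratory counts for one piece of text."""
    # Strip surrounding punctuation so "GREAT!!!" is counted as the word "GREAT".
    cleaned = [w.strip(string.punctuation) for w in text.split()]
    cleaned = [w for w in cleaned if w]
    return {
        "unique_words": len({w.lower() for w in cleaned}),
        "distinct_characters": len(set(text)),
        "stop_words": sum(w.lower() in STOP_WORDS for w in cleaned),
        "special_characters": sum(ch in string.punctuation for ch in text),
        "tokens": cleaned,
        "hashtags": re.findall(r"#\w+", text),
        "numeric_characters": sum(ch.isdigit() for ch in text),
        "uppercase_words": sum(w.isupper() and len(w) > 1 for w in cleaned),
        "emojis": len(EMOJI_PATTERN.findall(text)),
    }


# Made-up sample text; \U0001F600 is the grinning face emoji.
sample = "This phone is GREAT!!! Battery lasts 2 days \U0001F600 #happy #NLP"
stats = basic_text_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With a pandas DataFrame, the same function could be applied row by row, with the `hashtags` entry stored in its own column.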

2. Pre-Processing Functionalities:

a. Lower casing: Converts the text to lowercase.

b. Punctuation removal (remember, hashtags are already stored in a separate column by the previous function)

c. Stop words removal: Stop words removal can be easily done by removing words that are in a pre-defined list. An important thing to note is that there is no universal list of stop words. As such, the list is often created from scratch and tailored to the application being worked on.

d. Standardizing text using look up dictionary

e. Spelling correction: Spelling corrections are suggested based on a standard dictionary.

f. Topic Modeling: Essentially, it’s a form of dimensionality reduction, since we’re reducing a large amount of text data down to a much smaller number of topics. Topic modeling can be useful in a number of data science scenarios.

g. Frequent words: Based on user input ’n’, a word cloud of the n most frequently used words is given as output.

h. Rare words: Based on user input ’n’, a word cloud of the n rarest words is given as output.

i. Replacing emojis with text

j. Stemming: Stemming is the process of reducing words to their root form. The purpose is to map words that are spelled slightly differently due to context, but have the same meaning, to the same token for processing. Different stemming algorithms are offered as optional arguments from which the user can choose.

k. Lemmatization: Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it brings context to the words, so it links words with similar meaning to one dictionary form (the lemma).
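
Several of the cleaning steps above (punctuation removal, lookup-based standardization, emoji replacement, and frequent/rare word selection) can be sketched as follows. The lookup dictionary, the emoji mapping, and all function names are hypothetical examples, and the word cloud drawing itself is left out: the functions only return the words that would feed it.

```python
import string
from collections import Counter

# Hypothetical lookup dictionary for standardizing informal spellings.
LOOKUP = {"u": "you", "gr8": "great", "pls": "please", "thx": "thanks"}

# Hypothetical emoji-to-text mapping; a real package would need a fuller table.
EMOJI_TEXT = {"\U0001F600": "grinning_face", "\U0001F44D": "thumbs_up"}


def remove_punctuation(text):
    """Delete all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def replace_emojis(text):
    """Replace each known emoji with a descriptive word."""
    for emoji, name in EMOJI_TEXT.items():
        text = text.replace(emoji, f" {name} ")
    return text


def clean(text):
    """Lowercase, strip punctuation, replace emojis, then standardize tokens."""
    # Punctuation goes first so the "_" in emoji names is preserved.
    text = remove_punctuation(text.lower())
    text = replace_emojis(text)
    return [LOOKUP.get(token, token) for token in text.split()]


def frequent_words(tokens, n):
    """Return the n most frequent words."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def rare_words(tokens, n):
    """Return the n least frequent words (ties broken alphabetically)."""
    counts = Counter(tokens)
    return sorted(counts, key=lambda w: (counts[w], w))[:n]


# Made-up sample text; \U0001F44D is the thumbs up emoji.
sample = "Thx, this phone is gr8! U will love this phone \U0001F44D"
tokens = clean(sample)
print(tokens)
print("frequent:", frequent_words(tokens, 2))
print("rare:    ", rare_words(tokens, 2))
```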

Python/PIP Package Creation :

1. Register Yourself

The Python community maintains a repository for open source packages, similar to npm. If you want to make your package publicly accessible, you can upload it to PyPI. So, first of all, register yourself on PyPI: https://pypi.org/account/register/.

I am assuming you have a GitHub account where you will upload your package code. If you want to keep your package private you can skip this step. We will use README.md directly from your GitHub project for this tutorial and it will be used as documentation of your package.
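
One common way to reuse README.md as the package documentation is to pass its contents to setuptools as the long description. The sketch below only builds the keyword arguments; the package name, version, description, and URL are placeholders, not the real values for this package.

```python
from pathlib import Path


def build_metadata(readme_path="README.md"):
    """Collect keyword arguments for setuptools.setup(), using the README as long description."""
    readme = Path(readme_path)
    # Fall back to an empty description if the README is not present.
    long_description = readme.read_text(encoding="utf-8") if readme.exists() else ""
    return {
        "name": "textprep-sketch",  # hypothetical package name
        "version": "0.0.1",  # placeholder version
        "description": "Text pre-processing helpers",  # placeholder summary
        "long_description": long_description,
        "long_description_content_type": "text/markdown",
        "url": "https://github.com/<your-user>/<your-repo>",  # placeholder URL
        "python_requires": ">=3.7",
    }


# In a real setup.py this would typically be followed by:
#     from setuptools import setup, find_packages
#     setup(packages=find_packages(), **build_metadata())
metadata = build_metadata()
print(sorted(metadata))
```

Setting `long_description_content_type` to `text/markdown` tells PyPI to render the README as Markdown on the project page.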

2. Checking the Required Tools

Make sure you have Python and pip installed on your system. Install the required packages:
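
The exact list of packages is not shown here, so the check below assumes a typical packaging toolchain of setuptools, wheel, and twine. It is a small sketch that reports which of these are available in the current environment.

```python
import importlib.util
import sys

# Assumed toolchain: the original list of packages is not shown in this article.
REQUIRED_TOOLS = ("pip", "setuptools", "wheel", "twine")


def check_tools(names=REQUIRED_TOOLS):
    """Map each tool name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Python", ".".join(map(str, sys.version_info[:3])))
for name, available in check_tools().items():
    print(f"{name}: {'installed' if available else 'missing'}")

# Anything reported missing can usually be installed with:
#     python -m pip install --upgrade setuptools wheel twine
```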
