NLP – Text Processing and Regular Expressions
This article is an introduction to Natural Language Processing (NLP). It covers regular expressions and basic text pre-processing techniques.
What is NLP?
Natural Language Processing is the art of extracting important information from unstructured text.
We humans differ from animals in that we use structured language, which is present all around us – in computers, books, magazines, hoardings, social media, etc. NLP, then, can be seen as a way to train computers to understand human language and extract important information from it. Some common applications are:
- Speech Recognition
- Language Translation
- Summary Creators
- Sentiment Analysis, etc.
Regular Expression in NLP
Regular expressions are sequences of characters that define a search pattern in a text dataset. They are widely used in NLP applications for string-related tasks such as searching for a pattern in a string or replacing it.
Types of Regular Expressions
- Digits (1/2/100)
- Alphabets (a/v/t)
- Any character (+/?/<)
- Set of digits etc
- search – finds the first occurrence of a pattern anywhere in a string
- match – checks for a match only at the beginning of a string
- findall – finds all occurrences of a pattern in a given string
- split – splits the text at every match of the given regular expression
- sub – searches for a pattern in a string and replaces it
Using Regular Expressions in Python
The examples below show how to use regular expressions in Python.
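A minimal sketch of the five `re` functions listed above, using Python's built-in `re` module (the sample string and patterns are illustrative):

```python
import re

line = "NLP is fun, NLP is useful"

# search – find the first location where the pattern matches
m = re.search(r"NLP", line)
print(m.start())  # 0

# match – match only at the beginning of the string
print(bool(re.match(r"NLP", line)))  # True

# findall – all non-overlapping matches
print(re.findall(r"NLP", line))  # ['NLP', 'NLP']

# split – split the string at every match of the pattern
print(re.split(r",\s*", line))  # ['NLP is fun', 'NLP is useful']

# sub – search for a pattern and replace it
print(re.sub(r"NLP", "Natural Language Processing", line))
```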
Text Processing in NLP
Because natural language processing works with text-based datasets, the raw text must first be converted into a form that machine learning algorithms can understand.
Understanding the basic terms used in text pre-processing:
- Ngrams – combinations of N consecutive words.
- Corpus – Corpus is the collection of text-based documents.
- Tokens – Tokens are smaller units of a text object, for example a word, a phrase, or an Ngram. The structure of a token is: <Prefix> Morpheme <Suffix>
These units can be ordered by size as:
Tokens < Sentences < Paragraphs < Documents < Corpus
To understand Ngrams, consider the sentence "I hate my cousin brother":
Unigrams (n = 1) – I, hate, my, cousin, brother
Bigrams (n = 2) – I hate, hate my, my cousin, cousin brother
Trigrams (n = 3) – I hate my, hate my cousin, my cousin brother
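The Ngrams above can be generated with a short sliding-window helper (a sketch; the `ngrams` name here is our own, though NLTK ships an equivalent `nltk.ngrams`):

```python
def ngrams(tokens, n):
    # slide a window of size n over the token list and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I hate my cousin brother".split()
print(ngrams(tokens, 1))  # ['I', 'hate', 'my', 'cousin', 'brother']
print(ngrams(tokens, 2))  # ['I hate', 'hate my', 'my cousin', 'cousin brother']
print(ngrams(tokens, 3))  # ['I hate my', 'hate my cousin', 'my cousin brother']
```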
Techniques used for Text processing:
- Tokenization is the process of breaking/splitting a text document/object into small tokens (parts), which can be letters, digits, symbols, special characters, etc.
Types of Tokenization-
- White Space Tokenization
- When the entire text object is split on white space, it is called a White Space Tokenizer.
- Example – “I miss my grandmother”
It is tokenized as – “I”, “miss”, “my”, “grandmother”
- Regular Expression Tokenizer
- When a regular expression is used to extract tokens from a text object, it is called a Regular Expression Tokenizer.
- Example – “Football, Cricket, Golf Tennis”
re.split(r'[;,\s]+', line)
Tokens – “Football”, “Cricket”, “Golf”, “Tennis”
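Both tokenizers above can be sketched in a few lines of plain Python (`str.split` for white space, `re.split` for the regular-expression version; note the `+` so that a comma followed by a space does not produce empty tokens):

```python
import re

text = "Football, Cricket, Golf Tennis"

# White Space Tokenizer: split on runs of whitespace
print(text.split())  # ['Football,', 'Cricket,', 'Golf', 'Tennis']

# Regular Expression Tokenizer: split on semicolons, commas, or whitespace
tokens = [t for t in re.split(r"[;,\s]+", text) if t]
print(tokens)  # ['Football', 'Cricket', 'Golf', 'Tennis']
```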
- As mentioned above, morphemes appear in the structure of a token. A morpheme is simply the base form of a given word.
- Example – "Antibacterial" : Anti + Bacteria + al
- By definition, Normalization is a technique to reduce a token to its morpheme.
- This technique is mostly used for text cleaning.
Types of Normalization-
- Stemming – An elementary, rule-based process that removes inflectional suffixes from a token. The output of stemming is the stem form of the token.
- "Laugh" is the stem word of the tokens "Laughed", "Laughing", "Laugh", "Laughs".
- Token = Stem word + suffix(es/s/ing/ed)
- One disadvantage of stemming is that it can sometimes produce non-meaningful words.
- Example- “His eyes are always opened”
After stemming- “Hi eye are always open”
- Lemmatization is the process of reducing a token to its lemma in a systematic manner.
- It uses vocabulary, grammatical relations, and part-of-speech tags to do this.
- Example: "be" is used as the lemma of "are", "is", and "am".
- If a token is used as a noun, it is left unchanged; if it is used as a verb, it is converted to its lemma.
Implementing NLP techniques in Python
The following code shows how to implement lemmatization, stemming, and tokenization in Python. It also shows how to work with a corpus.
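As a simplified, self-contained sketch of these steps (a real project would typically use NLTK's `PorterStemmer` and `WordNetLemmatizer`; the suffix rules and lemma table below are illustrative only):

```python
import re

# A toy corpus: a collection of text-based documents
corpus = [
    "His eyes are always opened",
    "She laughed and is laughing still",
]

def tokenize(text):
    # regular-expression tokenizer: keep alphabetic words, lowercased
    return re.findall(r"[A-Za-z]+", text.lower())

def stem(token):
    # crude rule-based stemmer: strip common inflectional suffixes
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# tiny lookup table standing in for a vocabulary-based lemmatizer
LEMMAS = {"is": "be", "are": "be", "am": "be", "eyes": "eye", "laughed": "laugh"}

def lemmatize(token):
    return LEMMAS.get(token, token)

for doc in corpus:
    tokens = tokenize(doc)
    print("tokens:", tokens)
    print("stems: ", [stem(t) for t in tokens])
    print("lemmas:", [lemmatize(t) for t in tokens])
```

Note how the stemmer reproduces the weakness mentioned earlier: "opened" becomes "open", but "always" is mangled into "alway", while the lemmatizer maps "are" and "is" to the meaningful base form "be".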
Natural language processing involves many algorithms, so it can be difficult to learn, but it is used for a wide range of tasks, such as:
- Creating Chatbots for quick customer responses
- Named Entity Recognition
- Reducing words to their smallest (root) form
- Extracting important information from a particular text dataset