
Library - Natural Language Toolkit

Lesson #520 - Natural Language Toolkit - Tokenizing Text


What is Tokenizing?

Tokenizing may be defined as the process of breaking a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.

As we know, NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots, voice systems, and so on. To build them, it becomes essential to understand the pattern in the text. The tokens mentioned above are very useful in finding and understanding these patterns. We can consider tokenization as the base step for other recipes such as stemming and lemmatization.
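As a quick, naive illustration of the idea using plain Python string methods (NLTK's proper tokenizers, shown below, handle punctuation and edge cases far better) −

paragraph = "NLTK is a toolkit. It helps with text processing."

# A sentence is a token in a paragraph: naive split on '. '
sentences = paragraph.split(". ")

# A word is a token in a sentence: naive split on whitespace
words = sentences[0].split()

print(sentences)   # ['NLTK is a toolkit', 'It helps with text processing.']
print(words)       # ['NLTK', 'is', 'a', 'toolkit']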

NLTK package

nltk.tokenize is the package provided by the NLTK module to achieve the process of tokenization.
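Note that word_tokenize() and sent_tokenize(), used below, rely on NLTK's pre-trained Punkt models. A minimal sketch of the one-time setup (the resource name is 'punkt'; very recent NLTK releases may instead ask for 'punkt_tab') −

import nltk

# Download the Punkt tokenizer models once; later calls find them in the local NLTK data directory
nltk.download('punkt')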

Tokenizing sentences into words
Splitting a sentence into words, or creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of the various functions/modules provided by the nltk.tokenize package.

word_tokenize module
The word_tokenize module is used for basic word tokenization. The following example will use this module to split a sentence into words.

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']


TreebankWordTokenizer Class
The word_tokenize module used above is basically a wrapper function that calls the tokenize() method on an instance of the TreebankWordTokenizer class. It will give the same output as we get when using the word_tokenize() module for splitting sentences into words. Let us see the same example implemented above −

Example
First, we need to import the Natural Language Toolkit (nltk).
import nltk

Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −
from nltk.tokenize import TreebankWordTokenizer

Next, create an instance of TreebankWordTokenizer class as follows −
Tokenizer_wrd = TreebankWordTokenizer()

Now, input the sentence you want to convert to tokens −
Tokenizer_wrd.tokenize(
'Tutorialspoint.com provides high quality technical tutorials for free.'
)

Output
[
'Tutorialspoint.com', 'provides', 'high', 'quality',
'technical', 'tutorials', 'for', 'free', '.'
]


Complete implementation example
Let us see the complete implementation example below −
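This sketch simply combines the imports, the tokenizer instance, and the tokenize() call from the steps above −

import nltk
from nltk.tokenize import TreebankWordTokenizer

# Create the tokenizer instance and split the sentence into word tokens
Tokenizer_wrd = TreebankWordTokenizer()
print(Tokenizer_wrd.tokenize(
   'Tutorialspoint.com provides high quality technical tutorials for free.'
))

Output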
[
'Tutorialspoint.com', 'provides', 'high', 'quality',
'technical', 'tutorials', 'for', 'free', '.'
]


WordPunctTokenizer Class
An alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example −
Example
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" I can't allow you to go home early")

Output
['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
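For comparison, a small sketch of the Treebank-style word_tokenize() shown earlier on the same sentence: it handles the contraction differently, keeping "n't" together instead of splitting around the apostrophe −

from nltk.tokenize import word_tokenize

# Treebank-style tokenization of the same sentence, for comparison
word_tokenize("I can't allow you to go home early")

Output
['I', 'ca', "n't", 'allow', 'you', 'to', 'go', 'home', 'early']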


Tokenizing text into sentences

An obvious question that comes to mind is: if we have a word tokenizer, why do we need a sentence tokenizer, or why do we need to tokenize text into sentences at all? Suppose we need to count the average number of words per sentence; how can we do this? To accomplish this task, we need both sentence tokenization and word tokenization.
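As a small sketch of that task, using sent_tokenize() (introduced just below) together with word_tokenize(), and assuming the Punkt models are installed −

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."

# First split the text into sentences, then split each sentence into word tokens
sentences = sent_tokenize(text)
words_per_sentence = [len(word_tokenize(s)) for s in sentences]

# Average number of word tokens per sentence
print(sum(words_per_sentence) / len(sentences))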

Let us understand the difference between sentence and word tokenizers with the help of the following simple example −
Example
import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. 
It is going to be a simple example."
sent_tokenize(text>
Output
[
"Let us understand the difference between sentence & word tokenizer.",
'It is going to be a simple example.'
]