Library - Natural Language Toolkit

Lesson Description

Lesson - #532 Training Tokenizer & filtering stopwords

#### Why train your own sentence tokenizer?

An important question is: if NLTK already has a default sentence tokenizer, why would we need to train one ourselves? The answer lies in the nature of NLTK's default sentence tokenizer, which is a general-purpose tokenizer. It works well on ordinary text, but it may not be a good choice for nonstandard text or for text with unusual formatting. To tokenize such text and get the best results, we should train our own sentence tokenizer.

Implementation example

For this example, we will use the webtext corpus. The text file we will use from this corpus is formatted as dialogue, as shown below:

```plaintext
Guy: How old are you?
Hipster girl: You know, I never answer that question. Because to me, it's about how mature you are, you know? I mean, a fourteen year old could be more mature than a twenty-five year old, right? I'm sorry, I just never answer that question.
Guy: But, uh, you're older than eighteen, right?
Hipster girl: Oh, yeah.
```

We have saved this text file under the name training\_tokenizer. NLTK provides a class named **PunktSentenceTokenizer** that can be trained on raw text to produce a custom sentence tokenizer. We can get raw text either by reading in a file or from an NLTK corpus using the **raw()** method.

Let us look at the example below to understand this better.

First, import the **PunktSentenceTokenizer** class from the **nltk.tokenize** package:

```plaintext
from nltk.tokenize import PunktSentenceTokenizer
```

Now, import the **webtext** corpus from the **nltk.corpus** package:

```plaintext
from nltk.corpus import webtext
```

Next, use the **raw()** method to get the raw text from the **training\_tokenizer.txt** file as follows:

```plaintext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
```

Now, create an instance of **PunktSentenceTokenizer**, which is trained on the text passed to its constructor, and print the tokenized sentences from the text file as follows:

```plaintext
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])
```

Output

```plaintext
White guy: So, do you have any plans for this evening?
print(sents_1[1])
Output:
Asian girl: Yeah, being angry!
print(sents_1[670])
Output:
Guy: A hundred bucks?
print(sents_1[675])
Output:
Girl: But you already have a Big Mac...
```

#### What are stopwords?

Stopwords are common words that appear in text but contribute little to the meaning of a sentence. Such words are of little use for information retrieval or natural language processing. Typical examples of stopwords are 'the' and 'a'.

#### NLTK stopwords corpus

The Natural Language Toolkit comes with a stopwords corpus containing word lists for many languages. Let us understand its usage with the help of the following example.

First, import the stopwords corpus from the **nltk.corpus** package:

```plaintext
from nltk.corpus import stopwords
```

Now, we will use the stopwords for the English language:

```plaintext
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]
```

Output

```plaintext
['I', 'writer']
```

```plaintext
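# Sketch, not part of the original lesson: NLTK's English stopword list is
# all lowercase, which is why the capitalised 'I' survives the filter above.
# Lowercasing each word before the membership test makes the filter
# case-insensitive.
# ASSUMPTION: `sample_stops` is a small hand-written stand-in for
# set(stopwords.words('english')), so this runs without the NLTK data files.
# `sample_words` and `filtered_words` are illustrative names only.
sample_stops = {'i', 'am', 'a', 'the'}
sample_words = ['I', 'am', 'a', 'writer']
filtered_words = [w for w in sample_words if w.lower() not in sample_stops]
print(filtered_words)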