Library - Natural Language Toolkit


Lesson Description

Lesson - #537 Corpus Readers and Custom Corpora

What is a corpus?

A corpus is a large, structured collection of machine-readable texts that were produced in a natural communicative setting. The word corpora is the plural of corpus. A corpus can be derived in several ways:

  • From text that was originally electronic
  • From transcripts of spoken language
  • From optical character recognition, etc.
Corpus representativeness, corpus balance, sampling and corpus size are the factors that play an important role when designing a corpus. Some of the most popular corpora for NLP tasks are TreeBank, PropBank, VerbNet and WordNet.

How to build a custom corpus?

When we downloaded NLTK, we also installed the NLTK data package, so it is already present on our computer. On Windows, we'll assume that this data package is installed at C:\nltk_data, and on Linux, Unix and Mac OS X, we'll assume that it is installed at /usr/share/nltk_data.

In the following Python recipe, we will create custom corpora. They must be placed inside one of the paths defined by NLTK so that NLTK can find them. To avoid conflict with the official NLTK data package, let us create a custom nltk_data directory in our home directory.
import os, os.path
path = os.path.expanduser('~/nltk_data')
if not os.path.exists(path):
    os.mkdir(path)

Now, let us check whether we have the nltk_data directory in our home directory:
import nltk.data
print(path in nltk.data.path)


As the result is True, the nltk_data directory in our home directory is one of the paths that NLTK searches.

Now we will create a wordlist file named wordfile.txt, put it in a folder named corpus inside the nltk_data directory (~/nltk_data/corpus/wordfile.txt), and load it by using nltk.data.load:
import nltk.data
print(nltk.data.load('corpus/wordfile.txt', format='raw'))

b'Instructional exercises\n'
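
The steps above can be tried without touching the home directory. The following is a minimal sketch under stated assumptions: it uses a temporary directory in place of ~/nltk_data and appends it to nltk.data.path by hand, which the lesson itself does not do. The variable names are illustrative only.

```python
import os
import tempfile

import nltk.data

# Assumption: a throwaway directory stands in for ~/nltk_data.
with tempfile.TemporaryDirectory() as data_root:
    corpus_dir = os.path.join(data_root, "corpus")
    os.makedirs(corpus_dir, exist_ok=True)

    # Write the one-line wordlist file used in the lesson.
    file_path = os.path.join(corpus_dir, "wordfile.txt")
    with open(file_path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("Instructional exercises\n")

    # nltk.data.path is an ordinary list of directories that NLTK searches.
    nltk.data.path.append(data_root)
    try:
        # format="raw" returns the file content as bytes;
        # cache=False keeps this demo out of NLTK's resource cache.
        raw_bytes = nltk.data.load(
            "corpus/wordfile.txt", format="raw", cache=False
        )
    finally:
        # Restore the search path so the demo leaves no side effects.
        nltk.data.path.remove(data_root)

print(raw_bytes)
```

Removing the directory from the search path afterwards is a design choice for this demonstration; a real project would normally keep its data directory on the path.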

Corpus readers

NLTK provides various CorpusReader classes. We will cover them in the following Python recipes.

Creating wordlist corpus
NLTK has the WordListCorpusReader class, which provides access to a file containing a list of words. For the following Python recipe, we need to create a wordlist file, which can be a CSV or a plain text file. For example, we have created a file named 'list' that contains the following data:
On the web
Free
Instructional exercises

Now let us instantiate the WordListCorpusReader class to produce the list of words from our file 'list'.
from nltk.corpus.reader import WordListCorpusReader
reader_corpus = WordListCorpusReader('.', ['list'])
print(reader_corpus.words())

['On the web', 'Free', 'Instructional exercises']
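
As a self-contained variant of the recipe above, the sketch below writes the wordlist file itself before reading it. It assumes a temporary directory as the corpus root instead of the current directory; the file name 'list' and its content follow the lesson.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as root:
    # One entry per line; each line becomes one item in the word list.
    list_path = os.path.join(root, "list")
    with open(list_path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("On the web\nFree\nInstructional exercises\n")

    reader_corpus = WordListCorpusReader(root, ["list"])
    # words() reads the file and returns its non-blank lines as a list.
    words = reader_corpus.words()

print(words)
```

Note that the reader splits on lines, not on whitespace, so a multi-word line such as 'On the web' stays a single entry.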

Creating POS tagged word corpus

NLTK has the TaggedCorpusReader class, with which we can create a POS tagged word corpus. POS tagging is the process of identifying the part-of-speech tag for a word.

One of the simplest formats for a tagged corpus is of the form 'word/tag', as in the following excerpt from the Brown corpus:
The/at-tl expense/nn and/cc time/nn involved/vbn are/ber
astronomical/jj ./.

In the above excerpt, each word has a tag that denotes its POS. For example, vb refers to a verb.
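
To see how a single 'word/tag' token breaks down, the short sketch below uses NLTK's str2tuple helper on the excerpt. This helper is not part of the original recipe; it is shown here only to illustrate the format, with the excerpt joined onto one line for convenience.

```python
from nltk.tag import str2tuple

# The Brown excerpt from the lesson, joined onto a single line.
excerpt = (
    "The/at-tl expense/nn and/cc time/nn involved/vbn are/ber "
    "astronomical/jj ./."
)

# str2tuple splits each token at the last '/' and upper-cases the tag.
tagged = [str2tuple(token) for token in excerpt.split()]

print(tagged[:3])
```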

Now let us instantiate the TaggedCorpusReader class to produce POS tagged words from the file 'list.pos', which contains the above excerpt.
from nltk.corpus.reader import TaggedCorpusReader
reader_corpus = TaggedCorpusReader('.', r'.*\.pos')
print(reader_corpus.tagged_words())

[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
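
The recipe above can also be run end to end with a file created on the fly. This is a sketch under the assumption that a temporary directory serves as the corpus root; it additionally calls words() and fileids(), which the lesson does not show, to illustrate what else the reader exposes.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# The Brown excerpt from the lesson, on two lines as shown there.
excerpt = (
    "The/at-tl expense/nn and/cc time/nn involved/vbn are/ber\n"
    "astronomical/jj ./.\n"
)

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as root:
    pos_path = os.path.join(root, "list.pos")
    with open(pos_path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(excerpt)

    # The second argument is a regular expression matching corpus files.
    reader_corpus = TaggedCorpusReader(root, r".*\.pos")
    fileids = reader_corpus.fileids()

    # The reader returns lazy views, so materialize them while the
    # underlying file still exists.
    tagged_words = list(reader_corpus.tagged_words())
    plain_words = list(reader_corpus.words())

print(fileids)
print(tagged_words[:3])
print(plain_words[:3])
```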

Creating Chunked phrase corpus

NLTK has the ChunkedCorpusReader class, with which we can create a chunked phrase corpus. A chunk is a short phrase in a sentence.

For example, we have the following excerpt from the tagged treebank corpus:
[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN
about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.

In the above excerpt, each chunk is a noun phrase, while the words that are not in brackets are part of the sentence tree but not part of any noun phrase subtree.

Now let us instantiate the ChunkedCorpusReader class to produce chunked phrases from the file 'list.chunk', which contains the above excerpt.
from nltk.corpus.reader import ChunkedCorpusReader
reader_corpus = ChunkedCorpusReader('.', r'.*\.chunk')
print(reader_corpus.chunked_words())

[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ...]
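
A runnable version of this recipe might look like the sketch below. It assumes a temporary directory as the corpus root and writes the sentence on a single line, because the reader's default sentence tokenizer treats each line as a separate sentence. The filtering of noun phrases at the end is an illustrative addition, not part of the lesson.

```python
import os
import tempfile

from nltk.corpus.reader import ChunkedCorpusReader
from nltk.tree import Tree

# The treebank excerpt from the lesson, kept on one line so that it is
# read as a single sentence.
excerpt = (
    "[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN "
    "about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.\n"
)

# Assumption: a temporary directory is used as the corpus root.
with tempfile.TemporaryDirectory() as root:
    chunk_path = os.path.join(root, "list.chunk")
    with open(chunk_path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(excerpt)

    reader_corpus = ChunkedCorpusReader(root, r".*\.chunk")
    # chunked_words() yields Tree objects for bracketed chunks and plain
    # (word, tag) tuples for tokens outside any chunk.
    chunked_words = list(reader_corpus.chunked_words())

# Keep only the bracketed chunks, which are noun phrases by default.
noun_phrases = [item for item in chunked_words if isinstance(item, Tree)]
first_np_words = [word for word, tag in noun_phrases[0].leaves()]

print(len(noun_phrases))
print(first_np_words)
```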