Python NLTK - Tokenize Text to Words or Sentences


Tokenize Text to Words or Sentences

In Natural Language Processing, tokenization is the process of breaking a given text into smaller units called tokens, such as words or sentences.

Assuming that the given text contains paragraphs, it can be broken down into sentences or words. NLTK provides tokenization at two levels: word level and sentence level.

To tokenize a given text into words with NLTK, you can use the word_tokenize() function. To tokenize a given text into sentences, you can use the sent_tokenize() function.


Syntax - word_tokenize() & sent_tokenize()

The following is the syntax of the word_tokenize() function.

nltk.word_tokenize(text)

where text is the string to tokenize.

The following is the syntax of the sent_tokenize() function.

nltk.sent_tokenize(text)

Both word_tokenize() and sent_tokenize() return a Python list of strings, one string per token.

To use word_tokenize() or sent_tokenize() in your program, the punkt package must be available. You can download it manually in advance, or programmatically with the nltk.download() function before calling the tokenize functions.

In the following examples, we will use the second method and call nltk.download() in the program.


Example 1: NLTK Word Tokenization - nltk.word_tokenize()

In the following example, we use word_tokenize() to tokenize the given text into words.

Python Program

import nltk

# nltk tokenizer requires punkt package
# download if not downloaded or not up-to-date
nltk.download('punkt')

# input text
sentence = """Today morning, Arthur didn't feel very good."""

# tokenize into words
tokens = nltk.word_tokenize(sentence)

# print tokens
print(tokens)

Output

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PE\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
['Today', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

If you have already downloaded the NLTK packages, you can omit the nltk.download('punkt') call. If you keep the call and run the program again, you will see the following messages from nltk_data.

Output

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PE\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Today', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

The second time, the package is not downloaded again unless it is out of date.
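If you would rather skip the download call and its [nltk_data] messages when the package is already present, you can check for the resource first. The following is a minimal sketch, not part of the original examples: ensure_resource() and its parameter names are hypothetical, while nltk.data.find() and the quiet argument of nltk.download() are existing NLTK calls. The defaults assume the punkt resource used in this tutorial; newer NLTK releases may look for a resource named punkt_tab instead, in which case you would pass 'tokenizers/punkt_tab' and 'punkt_tab'.

```python
import nltk


def ensure_resource(resource="tokenizers/punkt", package="punkt"):
    """Download `package` only if `resource` is not installed yet.

    Hypothetical helper for illustration. Returns True if the
    resource is available after the call.
    """
    try:
        # find() raises LookupError when the resource is missing
        # from every directory listed in nltk.data.path
        nltk.data.find(resource)
        return True
    except LookupError:
        # quiet=True suppresses the [nltk_data] progress messages
        return bool(nltk.download(package, quiet=True))
```

Calling ensure_resource() once at the start of a program, before word_tokenize() or sent_tokenize(), would then take the place of the unconditional nltk.download('punkt') call.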


Example 2: NLTK Sentence Tokenization - nltk.sent_tokenize()

In the following example, we use sent_tokenize() to tokenize the given text into sentences.

Python Program

import nltk

# nltk tokenizer requires punkt package
# download if not downloaded or not up-to-date
nltk.download('punkt')

# input text
sentence = """Today morning, Arthur felt very good.

The time is ticking.
"""

# tokenize into sentences
tokens = nltk.sent_tokenize(sentence)

# print tokens
print(tokens)

Output

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PE\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
['Today morning, Arthur felt very good.', 'The time is ticking.']
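The two functions can also be combined: first split the text into sentences, then split each sentence into words. The following is a minimal sketch under that assumption. tokenize_document() and its parameters are hypothetical names, not part of NLTK; the tokenizers are passed as arguments only so that they can be swapped out, and the defaults are the two NLTK functions covered in this tutorial.

```python
import nltk


def tokenize_document(text,
                      sent_tokenizer=nltk.sent_tokenize,
                      word_tokenizer=nltk.word_tokenize):
    """Return a list of sentences, each given as a list of word tokens.

    Hypothetical helper for illustration. With the default tokenizers,
    the punkt package must already be downloaded.
    """
    # split the text into sentences, then each sentence into words
    return [word_tokenizer(sentence) for sentence in sent_tokenizer(text)]
```

With the punkt package installed, calling tokenize_document() on the input text of Example 2 should return a nested list with one inner list of word tokens per sentence.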

Summary

In this NLTK Tutorial of Python Examples, we learned how to tokenize text into sentences and how to tokenize a sentence into words.

