📅  Last modified: 2023-12-03 14:46:00.904000             🧑  Author: Mango
NLTK stands for Natural Language Toolkit, a Python library for processing natural language. One of its core functionalities is tokenization: the process of splitting text into individual words or other meaningful chunks.
To install NLTK, run the following command in your terminal:

```
pip install nltk
```
After installation, you need to download the tokenizer data (the `punkt` package) by running the following from within Python:

```python
import nltk
nltk.download('punkt')
```
To tokenize a text into words, you can use the `word_tokenize()` function, which converts a sentence or a paragraph into a list of words:

```python
from nltk.tokenize import word_tokenize

text = "This is an example of NLTK tokenizer"
tokens = word_tokenize(text)
print(tokens)
```
The output will be:

```
['This', 'is', 'an', 'example', 'of', 'NLTK', 'tokenizer']
```
There is also a `sent_tokenize()` function that can be used to split a paragraph into sentences:

```python
from nltk.tokenize import sent_tokenize

text = "This is the first sentence. This is the second sentence."
sentences = sent_tokenize(text)
print(sentences)
```
The output will be:

```
['This is the first sentence.', 'This is the second sentence.']
```
NLTK also provides other tokenizers, such as `TweetTokenizer` for tokenizing social media text and `RegexpTokenizer` for tokenizing based on regular expressions.
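A brief sketch of how these two tokenizers differ in practice (the sample strings and the `\w+` pattern are illustrative choices, not the only options):

```python
from nltk.tokenize import TweetTokenizer, RegexpTokenizer

# TweetTokenizer keeps hashtags, @mentions, and emoticons as single tokens
tweet_tok = TweetTokenizer()
print(tweet_tok.tokenize("Loving #NLTK :) @friend check it out!"))

# RegexpTokenizer splits text using a pattern you supply; here,
# runs of word characters, which drops the punctuation entirely
regexp_tok = RegexpTokenizer(r'\w+')
print(regexp_tok.tokenize("This is an example of NLTK tokenizer."))
# → ['This', 'is', 'an', 'example', 'of', 'NLTK', 'tokenizer']
```

Note that `word_tokenize()` would split `#NLTK` into `#` and `NLTK`, which is why `TweetTokenizer` is the better choice for social media text.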
In this tutorial, we have explored the basic usage of NLTK's tokenizers. With this knowledge, you can apply tokenization to various natural language processing tasks such as text classification, sentiment analysis, and part-of-speech tagging.