📅  Last modified: 2023-12-03 15:15:28.904000             🧑  Author: Mango
Hazm is a Python library for natural language processing in Persian. It includes components for tokenization, stemming, lemmatization, part-of-speech tagging, shallow parsing (chunking), and dependency parsing.
To install Hazm, simply use pip:
pip install hazm
Tokenization is the process of splitting text into words or other tokens. Hazm provides a fast and comprehensive word tokenizer for Persian.
from hazm import word_tokenize
# tokenize a Persian sentence; the half-space (ZWNJ) inside رفته‌ام is preserved
text = 'من به مدرسه رفته‌ام.'
tokens = word_tokenize(text)
print(tokens)
Output:
['من', 'به', 'مدرسه', 'رفته\u200cام', '.']
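Hazm also exposes a sent_tokenize helper for splitting text into sentences. Below is a minimal sketch combining it with word_tokenize; the two example sentences are made up for illustration.
from hazm import sent_tokenize, word_tokenize
# split a short paragraph into sentences, then each sentence into word tokens
text = 'من به مدرسه رفته‌ام. هوا خوب است.'
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))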
Stemming is the process of reducing words to their stems, the base or root forms of the words. Hazm provides a lightweight, suffix-stripping stemmer for Persian.
from hazm import Stemmer
stemmer = Stemmer()
# strip the plural suffix «ها» to recover the stem
word = 'روزها'
stemmed_word = stemmer.stem(word)
print(stemmed_word)
Output:
'روز'
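Since the stemmer works on single words, applying it to running text just means mapping it over the tokens. A small sketch reusing word_tokenize from above; the example sentence is made up for illustration.
from hazm import Stemmer, word_tokenize
stemmer = Stemmer()
# stem every token of a sentence; tokens with no known suffix pass through unchanged
tokens = word_tokenize('روزها و شب‌ها می‌گذرند.')
print([stemmer.stem(token) for token in tokens])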
Lemmatization is the process of reducing words to their dictionary or canonical form. Hazm provides this through its Lemmatizer class.
from hazm import Lemmatizer
lemmatizer = Lemmatizer()
# map the plural noun «پسران» (boys) back to its singular lemma
word = 'پسران'
lemmatized_word = lemmatizer.lemmatize(word)
print(lemmatized_word)
Output:
'پسر'
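For verbs, the lemmatizer returns a past#present stem pair rather than a single surface form. A minimal sketch; the commented output follows Hazm's documentation and may vary by version.
from hazm import Lemmatizer
lemmatizer = Lemmatizer()
# a conjugated verb maps to its past and present stems joined by '#'
print(lemmatizer.lemmatize('می‌روم'))   # e.g. 'رفت#رو'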
Part-of-speech (POS) tagging is the process of assigning a grammatical category to each word in a text. Hazm provides a statistical POS tagger that relies on a pre-trained model, which must be downloaded separately from the pip package.
from hazm import POSTagger, word_tokenize
# path to the downloaded pre-trained tagger model (adjust to where you saved it)
tagger = POSTagger(model='postagger.model')
text = 'بهار به خود آمد.'
tagged_text = tagger.tag(word_tokenize(text))
print(tagged_text)
Output:
[('بهار', 'N'), ('به', 'P'), ('خود', 'PRO'), ('آمد', 'V')]
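Because the tagger's output is just a list of (word, tag) tuples, downstream filtering is plain Python. For example, keeping only the noun tokens from tagged_text above (the tag names assumed here follow the output shown and may differ between model versions):
# keep only the tokens tagged as nouns
nouns = [word for word, tag in tagged_text if tag.startswith('N')]
print(nouns)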
Shallow parsing (chunking) is the process of identifying phrases in a sentence, such as noun phrases and verb phrases. Hazm provides a Chunker that, like the POS tagger, works on top of a pre-trained model and takes POS-tagged tokens as input.
from hazm import Chunker
# the chunker also needs its own pre-trained model file
chunker = Chunker(model='chunker.model')
text = 'من به مدرسه رفته‌ام.'
tagged_text = tagger.tag(word_tokenize(text))  # reuse the tagger from the previous example
tree = chunker.parse(tagged_text)
print(tree)
Output:
[Tree('NP', [('من', 'PRO')]), Tree('PP', [('به', 'P'), ('مدرسه', 'N')]), Tree('VP', [('رفته\u200cام', 'V')])]
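For a flatter, human-readable view of the chunks, Hazm also provides a tree2brackets helper. A minimal sketch, assuming the tree produced above:
from hazm import tree2brackets
# render the chunk tree as bracketed phrases, e.g. '[من NP] [به مدرسه PP] [رفته‌ام VP]'
print(tree2brackets(tree))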
Dependency parsing is the process of identifying the grammatical relationships between words in a sentence. Hazm provides a DependencyParser that combines the POS tagger and lemmatizer with a pre-trained parsing model.
from hazm import DependencyParser, Lemmatizer
# the parser is built from the tagger, a lemmatizer, and its own pre-trained model/resources
parser = DependencyParser(tagger=tagger, lemmatizer=Lemmatizer())
text = 'من به مدرسه رفته‌ام.'
graph = parser.parse(word_tokenize(text))
print(graph.tree())
Output:
(ROOT
(sent
(NP (PRO من))
(VP (V رفته‌ام) (PP (P به) (NP (N مدرسه)))) (. .)))
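Putting the pieces together, a typical Hazm workflow tokenizes, tags, chunks, and parses a sentence in sequence. The sketch below simply chains the objects from the earlier examples and assumes the pre-trained model files have been downloaded to the paths shown.
from hazm import word_tokenize, POSTagger, Chunker, DependencyParser, Lemmatizer
# assumed local paths to the downloaded Hazm models
tagger = POSTagger(model='postagger.model')
chunker = Chunker(model='chunker.model')
parser = DependencyParser(tagger=tagger, lemmatizer=Lemmatizer())
sentence = 'من به مدرسه رفته‌ام.'
tokens = word_tokenize(sentence)          # tokenization
tagged = tagger.tag(tokens)               # part-of-speech tagging
chunks = chunker.parse(tagged)            # shallow parsing
dependencies = parser.parse(tokens)       # dependency parsing
print(tagged, chunks, dependencies.tree(), sep='\n')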
Hazm is a powerful natural language processing library for Persian text. It includes a comprehensive set of features for tokenization, stemming, lemmatization, part-of-speech tagging, shallow parsing, and dependency parsing. Hazm is easy to install and use, and can be a valuable tool for programmers working with Persian language data.