📅  Last modified: 2023-12-03 15:15:28.904000             🧑  Author: Mango
Hazm is a Python library for natural language processing in Persian. It includes components for tokenization, stemming, lemmatization, part-of-speech tagging, shallow parsing (chunking), and dependency parsing.
To install Hazm, simply use pip:
pip install hazm
Tokenization is the process of splitting text into words or other tokens. Hazm provides a fast and comprehensive word tokenizer for Persian.
from hazm import word_tokenize
# tokenize a Persian sentence; the half-space (ZWNJ) inside رفته‌ام is preserved
text = 'من به مدرسه رفته‌ام.'
tokens = word_tokenize(text)
print(tokens)
Output:
['من', 'به', 'مدرسه', 'رفته\u200cام', '.']
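Hazm also exposes a sent_tokenize helper for splitting text into sentences. Below is a minimal sketch combining it with word_tokenize; the two example sentences are made up for illustration.
from hazm import sent_tokenize, word_tokenize
# split a short paragraph into sentences, then each sentence into word tokens
text = 'من به مدرسه رفته‌ام. هوا خوب است.'
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))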
Stemming is the process of reducing words to their stems, the base or root forms of the words. Hazm provides a lightweight, suffix-stripping stemmer for Persian.
from hazm import Stemmer
stemmer = Stemmer()
# strip the plural suffix «ها» to recover the stem
word = 'روزها'
stemmed_word = stemmer.stem(word)
print(stemmed_word)
Output:
'روز'
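Since the stemmer works on single words, applying it to running text just means mapping it over the tokens. A small sketch reusing word_tokenize from above; the example sentence is made up for illustration.
from hazm import Stemmer, word_tokenize
stemmer = Stemmer()
# stem every token of a sentence; tokens with no known suffix pass through unchanged
tokens = word_tokenize('روزها و شب‌ها می‌گذرند.')
print([stemmer.stem(token) for token in tokens])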
Lemmatization is the process of reducing words to their dictionary or canonical form. Hazm provides this through its Lemmatizer class.
from hazm import Lemmatizer
lemmatizer = Lemmatizer()
# map the plural noun «پسران» (boys) back to its singular lemma
word = 'پسران'
lemmatized_word = lemmatizer.lemmatize(word)
print(lemmatized_word)
Output:
'پسر'
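For verbs, the lemmatizer returns a past#present stem pair rather than a single surface form. A minimal sketch; the commented output follows Hazm's documentation and may vary by version.
from hazm import Lemmatizer
lemmatizer = Lemmatizer()
# a conjugated verb maps to its past and present stems joined by '#'
print(lemmatizer.lemmatize('می‌روم'))   # e.g. 'رفت#رو'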
Part-of-speech (POS) tagging is the process of assigning a grammatical category to each word in a text. Hazm provides a statistical POS tagger that relies on a pre-trained model, which must be downloaded separately from the pip package.
from hazm import POSTagger, word_tokenize
# path to the downloaded pre-trained tagger model (adjust to where you saved it)
tagger = POSTagger(model='postagger.model')
text = 'بهار به خود آمد.'
tagged_text = tagger.tag(word_tokenize(text))
print(tagged_text)
Output:
[('بهار', 'N'), ('به', 'P'), ('خود', 'PRO'), ('آمد', 'V')]
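Because the tagger's output is just a list of (word, tag) tuples, downstream filtering is plain Python. For example, keeping only the noun tokens from tagged_text above (the tag names assumed here follow the output shown and may differ between model versions):
# keep only the tokens tagged as nouns
nouns = [word for word, tag in tagged_text if tag.startswith('N')]
print(nouns)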
Shallow parsing (chunking) is the process of identifying phrases in a sentence, such as noun phrases and verb phrases. Hazm provides a Chunker that, like the POS tagger, works on top of a pre-trained model and takes POS-tagged tokens as input.
from hazm import Chunker
# the chunker also needs its own pre-trained model file
chunker = Chunker(model='chunker.model')
text = 'من به مدرسه رفته‌ام.'
tagged_text = tagger.tag(word_tokenize(text))  # reuse the tagger from the previous example
tree = chunker.parse(tagged_text)
print(tree)
Output:
[Tree('NP', [('من', 'PRO')]), Tree('PP', [('به', 'P'), ('مدرسه', 'N')]), Tree('VP', [('رفته\u200cام', 'V')])]
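For a flatter, human-readable view of the chunks, Hazm also provides a tree2brackets helper. A minimal sketch, assuming the tree produced above:
from hazm import tree2brackets
# render the chunk tree as bracketed phrases, e.g. '[من NP] [به مدرسه PP] [رفته‌ام VP]'
print(tree2brackets(tree))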
Dependency parsing is the process of identifying the grammatical relationships between words in a sentence. Hazm provides a DependencyParser that combines the POS tagger and lemmatizer with a pre-trained parsing model.
from hazm import DependencyParser, Lemmatizer
# the parser is built from the tagger, a lemmatizer, and its own pre-trained model/resources
parser = DependencyParser(tagger=tagger, lemmatizer=Lemmatizer())
text = 'من به مدرسه رفته‌ام.'
graph = parser.parse(word_tokenize(text))
print(graph.tree())
Output:
(ROOT
(sent
(NP (PRO من))
(VP (V رفته‌ام) (PP (P به) (NP (N مدرسه)))) (. .)))
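Putting the pieces together, a typical Hazm workflow tokenizes, tags, chunks, and parses a sentence in sequence. The sketch below simply chains the objects from the earlier examples and assumes the pre-trained model files have been downloaded to the paths shown.
from hazm import word_tokenize, POSTagger, Chunker, DependencyParser, Lemmatizer
# assumed local paths to the downloaded Hazm models
tagger = POSTagger(model='postagger.model')
chunker = Chunker(model='chunker.model')
parser = DependencyParser(tagger=tagger, lemmatizer=Lemmatizer())
sentence = 'من به مدرسه رفته‌ام.'
tokens = word_tokenize(sentence)          # tokenization
tagged = tagger.tag(tokens)               # part-of-speech tagging
chunks = chunker.parse(tagged)            # shallow parsing
dependencies = parser.parse(tokens)       # dependency parsing
print(tagged, chunks, dependencies.tree(), sep='\n')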
Hazm is a powerful natural language processing library for Persian text. It includes a comprehensive set of features for tokenization, stemming, lemmatization, part-of-speech tagging, shallow parsing, and dependency parsing. Hazm is easy to install and use, and can be a valuable tool for programmers working with Persian language data.