📜  Readability Index in Python (NLP)

📅  Last modified: 2022-05-13 01:54:54.525000             🧑  Author: Mango


Readability is the ease with which a reader can understand a written text. In natural language, the readability of a text depends on its content (the complexity of its vocabulary and syntax). It focuses on the words we choose and on how we put them into sentences and paragraphs so that readers can comprehend them.

The main purpose of writing is to convey information that both the author and the reader find valuable. If we fail to convey that information, the effort is wasted. To engage readers, it is essential to give them information they are happy to keep reading and can understand clearly. The content therefore needs to be easy enough to read and to comprehend. Various difficulty scales are available, each with its own formula for determining difficulty.

This article illustrates several traditional readability formulas that can be used to evaluate readability scores. In natural language processing, it is sometimes necessary to analyse words and sentences in order to determine the difficulty of a text. A readability score is usually a grade level that rates a given text by its difficulty. It helps an author improve a text so that it can be understood by a wider audience, which makes the content more engaging.

Various available methods/formulas for determining a readability score:

  1. The Dale-Chall formula
  2. The Gunning Fog formula
  3. The Fry readability graph
  4. McLaughlin's SMOG formula
  5. The FORCAST formula
  6. Readability and newspaper readership
  7. The Flesch scores

Read more about the available readability formulas in the source linked at the end of this article.

Implementations of the readability formulas are shown below.

The Dale-Chall formula:

Applying the formula:
  1. Select several 100-word samples throughout the text.
  2. Compute the average sentence length in words (divide the number of words by the number of sentences).
  3. Compute the percentage of words NOT on the Dale-Chall word list, which contains 3,000 easy words.
  4. Compute this equation:

Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
Here,
PDW = Percentage of difficult words not on the Dale–Chall word list.
ASL = Average sentence length
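As a quick arithmetic sketch of the equation above (the helper name `dale_chall_raw_score` is hypothetical, not from any library; the 3.6365 constant is applied conditionally here, matching the adjusted-score convention used in the full implementation later in this article):

```python
def dale_chall_raw_score(pdw, asl):
    """Dale-Chall score from precomputed inputs.

    pdw: percentage of words NOT on the Dale-Chall easy-word list
    asl: average sentence length in words
    """
    score = 0.1579 * pdw + 0.0496 * asl
    # The 3.6365 adjustment applies when more than 5% of the
    # words are difficult (adjusted-score convention).
    if pdw > 5:
        score += 3.6365
    return score

# e.g. 10% difficult words, 15 words per sentence on average
print(round(dale_chall_raw_score(10, 15), 4))  # → 5.9595
```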

The Gunning Fog formula

Grade level= 0.4 * ( (average sentence length) + (percentage of Hard Words) )
Here, Hard Words = words with more than two syllables.
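Since the grade level is just a weighted sum of the two quantities, it can be sketched directly (the helper name `gunning_fog_grade` is hypothetical):

```python
def gunning_fog_grade(asl, pct_hard_words):
    """Gunning Fog grade from precomputed inputs.

    asl: average sentence length in words
    pct_hard_words: percentage of words with three or more syllables
    """
    return 0.4 * (asl + pct_hard_words)

# e.g. 14 words per sentence, 10% hard words
print(round(gunning_fog_grade(14, 10), 2))  # → 9.6
```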

The SMOG formula

SMOG grading = 3 + √(polysyllable count).
Here, polysyllable count = number of words of more than two syllables in a 
sample of 30 sentences.
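The simple form of the grading above is a square root plus a constant, which can be checked with a one-line helper (the name `smog_grade` is hypothetical; the implementation later in this article uses the more precise 1.043/3.1291 variant instead):

```python
import math

def smog_grade(polysyllable_count):
    """Simple SMOG grading from a polysyllable count.

    polysyllable_count: number of words of three or more
    syllables in a 30-sentence sample.
    """
    return 3 + math.sqrt(polysyllable_count)

print(smog_grade(25))  # → 8.0
```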

The Flesch formula

Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
Here,
ASL = average sentence length (number of words divided by number of sentences)
ASW = average word length in syllables (number of syllables divided by number of words)
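Given ASL and ASW, the Reading Ease score is direct arithmetic (the helper name `flesch_score` is hypothetical; the full text-based implementation appears later in this article):

```python
def flesch_score(asl, asw):
    """Flesch Reading Ease from precomputed inputs.

    asl: average sentence length (words per sentence)
    asw: average word length in syllables (syllables per word)
    """
    return 206.835 - (1.015 * asl) - (84.6 * asw)

# e.g. 15 words per sentence, 1.5 syllables per word
print(round(flesch_score(15, 1.5), 2))  # → 64.71
```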

Advantages of readability formulas
1. Readability formulas measure the grade level a reader must have in order to read a given text, giving the author much-needed information for reaching the target audience.
2. You know in advance whether your target audience will be able to understand your content.
3. They are easy to use.
4. Readable text attracts a larger audience.

Disadvantages of readability formulas
1. Because there are many readability formulas, there is a growing chance of large variation in the results for the same text.
2. Applying mathematics to literature is not always a good idea.
3. They cannot measure the complexity of a word or phrase in order to pinpoint where you need to correct it.

Python
import spacy
from textstat.textstat import textstatistics, legacy_round
 
# Splits the text into sentences, using
# Spacy's sentence segmentation which can
# be found at https://spacy.io/usage/spacy-101
def break_sentences(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)
 
# Returns Number of Words in the text
def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len(sentence)
    return words
 
# Returns the number of sentences in the text
def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)
 
# Returns average sentence length
def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length
 
# Textstat is a python package, to calculate statistics from
# text to determine readability,
# complexity and grade level of a particular corpus.
# Package can be found at https://pypi.python.org/pypi/textstat
def syllables_count(word):
    return textstatistics().syllable_count(word)
 
# Returns the average number of syllables per
# word in the text
def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)
 
# Return total Difficult Words in a text
def difficult_words(text):
     
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # Find all words in the text
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
 
    # Difficult words are those that are not stop words
    # and have two or more syllables; they are collected
    # in a set, so each unique word is counted once
    diff_words_set = set()
     
    for word in words:
        syllable_count = syllables_count(word)
        if word not in nlp.Defaults.stop_words and syllable_count >= 2:
            diff_words_set.add(word)
 
    return len(diff_words_set)
 
# A word is polysyllabic if it has three or more syllables;
# this function returns the number of all such words
# present in the text
def poly_syllable_count(text):
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
     
 
    for word in words:
        syllable_count = syllables_count(word)
        if syllable_count >= 3:
            count += 1
    return count
 
 
def flesch_reading_ease(text):
    """
        Implements Flesch Formula:
        Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
        Here,
          ASL = average sentence length (number of words
                divided by number of sentences)
          ASW = average word length in syllables (number of syllables
                divided by number of words)
    """
    FRE = 206.835 - float(1.015 * avg_sentence_length(text)) -\
          float(84.6 * avg_syllables_per_word(text))
    return legacy_round(FRE, 2)
 
 
def gunning_fog(text):
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade
 
 
def smog_index(text):
    """
        Implements SMOG Formula / Grading
        SMOG grading = 3 + √(polysyllable count).
        Here,
           polysyllable count = number of words of more
          than two syllables in a sample of 30 sentences.
    """
 
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30*(poly_syllab / sentence_count(text)))**0.5) \
                + 3.1291
        return legacy_round(SMOG, 1)
    else:
        return 0
 
 
def dale_chall_readability_score(text):
    """
        Implements the Dale-Chall formula:
        Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
        Here,
            PDW = Percentage of difficult words.
            ASL = Average sentence length
    """
    words = word_count(text)
    if words == 0:
        return 0

    # Number of words not termed as difficult words
    count = words - difficult_words(text)

    # Percentage of words not on the difficult word list
    per = float(count) / float(words) * 100

    # diff_words stores the percentage of difficult words
    diff_words = 100 - per

    raw_score = (0.1579 * diff_words) + \
                (0.0496 * avg_sentence_length(text))

    # If the percentage of difficult words is greater than 5%, then
    # Adjusted Score = Raw Score + 3.6365,
    # otherwise Adjusted Score = Raw Score
    if diff_words > 5:
        raw_score += 3.6365

    return legacy_round(raw_score, 2)


Source:
https://en.wikipedia.org/wiki/Readability