Readability Index in Python (NLP)
Readability is the ease with which a reader can understand a written text. In natural language, the readability of a text depends on its content: the complexity of its vocabulary and syntax. It comes down to the words we choose and how we arrange them into sentences and paragraphs for the reader to comprehend.
The main purpose of writing is to convey information that both the author and the reader find valuable. If we fail to communicate that information, the effort is wasted. To engage readers, it is essential to give them information they are happy to keep reading and can understand clearly. Content should therefore be easy enough to read and to understand for its intended audience. Various readability scales exist, each with its own formula for determining difficulty.
This article demonstrates several traditional readability formulas that can be used to evaluate readability scores. In natural language processing, it is sometimes necessary to analyse the words and sentences of a text to determine its difficulty. A readability score is usually a grade-level rating that scores a text according to its difficulty. It helps authors improve a text so it can be understood by a wider audience, which makes the content more engaging.
Various methods/formulas available for determining readability scores:
- The Dale-Chall Formula
- The Gunning Fog Formula
- The Fry Readability Graph
- McLaughlin's SMOG Formula
- The FORCAST Formula
- Readability and Newspaper Readership
- Flesch Scores
Many more readability formulas exist; see the Wikipedia article on readability listed under Sources below.
Implementations of several of these formulas are shown below.
The Dale-Chall Formula:
Applying the formula:
1. Select several 100-word samples throughout the text.
2. Compute the average sentence length in words (divide the number of words by the number of sentences).
3. Compute the percentage of words NOT on the Dale-Chall word list of 3,000 easy words.
4. Compute this equation:
Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
Here,
PDW = Percentage of difficult words not on the Dale-Chall word list.
ASL = Average sentence length
(The constant 3.6365 is an adjustment that is added only when the percentage of difficult words exceeds 5%; otherwise the raw score is just the first two terms.)
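For illustration (the numbers here are made up, not taken from a real text): with PDW = 10 and ASL = 12, the score is 0.1579*10 + 0.0496*12 + 3.6365 = 1.579 + 0.5952 + 3.6365 ≈ 5.81, since PDW is above 5% and the adjustment applies.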
The Gunning Fog Formula:
Grade level= 0.4 * ( (average sentence length) + (percentage of Hard Words) )
Here, Hard Words = words with more than two syllables.
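For example (illustrative numbers): a passage with an average sentence length of 12 words in which 10% of the words are hard gives Grade level = 0.4 * (12 + 10) = 8.8, i.e. roughly a ninth-grade reading level.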
The SMOG Formula:
SMOG grading = 3 + √(polysyllable count).
Here, polysyllable count = number of words of more than two syllables in a
sample of 30 sentences.
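For example, if a 30-sentence sample contains 49 polysyllabic words, then SMOG grading = 3 + √49 = 3 + 7 = 10.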
The Flesch Formula:
Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
Here,
ASL = average sentence length (number of words divided by number of sentences)
ASW = average word length in syllables (number of syllables divided by number of words)
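For example (illustrative numbers): with ASL = 12 and ASW = 1.5, Reading Ease score = 206.835 - (1.015 × 12) - (84.6 × 1.5) = 206.835 - 12.18 - 126.9 ≈ 67.76. Higher scores indicate text that is easier to read.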
Advantages of readability formulas:
1. Readability formulas measure the grade level a reader must have to read a given text, providing the author with much-needed information about reaching the target audience.
2. You know in advance whether your target audience will be able to understand your content.
3. They are easy to use.
4. Readable text attracts a larger audience.
Disadvantages of readability formulas:
1. With so many readability formulas in use, there is a growing chance of wide variation among the scores assigned to the same text.
2. Applying mathematics to literature is not always a good idea.
3. They cannot measure the complexity of a particular word or phrase, so they cannot pinpoint where the text needs correcting.
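As an aside, the textstat package used below also ships ready-made versions of several of these metrics, which is a quick way to sanity-check a hand-rolled implementation. A minimal sketch (the input string is arbitrary):
Python
import textstat

text = "The quick brown fox jumps over the lazy dog. " * 10
print(textstat.flesch_reading_ease(text))
print(textstat.gunning_fog(text))
print(textstat.smog_index(text))
print(textstat.dale_chall_readability_score(text))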
Python
import spacy
from textstat.textstat import textstatistics, legacy_round

# Splits the text into sentences, using
# spaCy's sentence segmentation, which is
# described at https://spacy.io/usage/spacy-101
def break_sentences(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)

# Returns the number of words in the text
def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words

# Returns the number of sentences in the text
def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)

# Returns the average sentence length
def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length

# Textstat is a Python package for calculating statistics from
# text to determine the readability, complexity and grade level
# of a particular corpus.
# The package can be found at https://pypi.python.org/pypi/textstat
def syllables_count(word):
    return textstatistics().syllable_count(word)

# Returns the average number of syllables per
# word in the text
def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)

# Returns the number of difficult words in the text
def difficult_words(text):
    nlp = spacy.load('en_core_web_sm')
    # Find all words in the text
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
    # Difficult words are those with 2 or more syllables
    # that are not common stop words
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in nlp.Defaults.stop_words and syllable_count >= 2:
            diff_words_set.add(word)
    return len(diff_words_set)

# A word is polysyllabic if it has 3 or more syllables;
# this function returns the number of all such words
# present in the text
def poly_syllable_count(text):
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]
    for word in words:
        syllable_count = syllables_count(word)
        if syllable_count >= 3:
            count += 1
    return count

def flesch_reading_ease(text):
    """
    Implements the Flesch Formula:
    Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
    Here,
    ASL = average sentence length (number of words
          divided by number of sentences)
    ASW = average word length in syllables (number of syllables
          divided by number of words)
    """
    FRE = 206.835 - float(1.015 * avg_sentence_length(text)) - \
          float(84.6 * avg_syllables_per_word(text))
    return legacy_round(FRE, 2)

def gunning_fog(text):
    per_diff_words = difficult_words(text) / word_count(text) * 100
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade

def smog_index(text):
    """
    Implements the SMOG Formula / Grading:
    SMOG grading = 3 + √(polysyllable count)
    Here,
    polysyllable count = number of words of more
    than two syllables in a sample of 30 sentences.
    """
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5) \
               + 3.1291
        return legacy_round(SMOG, 1)
    else:
        return 0

def dale_chall_readability_score(text):
    """
    Implements the Dale-Chall Formula:
    Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
    Here,
    PDW = Percentage of difficult words.
    ASL = Average sentence length
    """
    words = word_count(text)
    # Number of words not counted as difficult
    count = words - difficult_words(text)
    if words > 0:
        # Percentage of words not on the difficult word list
        per = float(count) / float(words) * 100
        # diff_words stores the percentage of difficult words
        diff_words = 100 - per
        raw_score = (0.1579 * diff_words) + \
                    (0.0496 * avg_sentence_length(text))
        # If the percentage of difficult words is greater than 5%, then:
        # Adjusted Score = Raw Score + 3.6365,
        # otherwise Adjusted Score = Raw Score
        if diff_words > 5:
            raw_score += 3.6365
        return legacy_round(raw_score, 2)
    return 0
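A quick usage sketch of the functions defined above (the sample paragraph is arbitrary; note that smog_index returns 0 for texts shorter than 3 sentences):
Python
if __name__ == '__main__':
    # Arbitrary sample text; any English paragraph works.
    text = ("Readable writing keeps the audience engaged. "
            "Short sentences help. Familiar vocabulary usually "
            "produces friendlier readability scores than long, "
            "polysyllabic academic prose.")
    print("Flesch Reading Ease:", flesch_reading_ease(text))
    print("Gunning Fog:", gunning_fog(text))
    print("SMOG Index:", smog_index(text))
    print("Dale-Chall Score:", dale_chall_readability_score(text))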
Source:
https://en.wikipedia.org/wiki/Readability