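NLP - Expanding Contractions in Text Processing
Text preprocessing is a crucial step in NLP. It means cleaning text data so that it is in a form that can be analyzed and used for prediction. In this article, we discuss contractions and how to handle them in text.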
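What are contractions?
Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.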
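Nowadays everything is moving online, and we communicate with others more and more through text messages or posts on social media such as Facebook, Instagram, Whatsapp, Twitter, and LinkedIn. With so many people to talk to, we rely on abbreviations and shortened forms of words when we write.
For example: I'll be there within 5 min. Aren't you there? Did I text about smthng? I'd like to see you near d park.
In English, contractions are often formed by dropping vowels from words. Expanding contractions helps standardize text, and it is useful when working with Twitter data and product reviews, because these words play an important role in sentiment analysis.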
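How to expand contractions?
1. Using the contractions library
First, install the library. You can try it out on Google Colab, where installation is very smooth.
Using pip: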
!pip install contractions
In a Jupyter notebook:
import sys
!{sys.executable} -m pip install contractions
Code 1: Expanding contractions using the contractions library
Python3
# import the library
import contractions
# text containing contractions
text = '''I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.'''
# list that collects the expanded tokens
expanded_words = []
for word in text.split():
    # contractions.fix expands the shortened words
    expanded_words.append(contractions.fix(word))
# join the tokens back into a single string
expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)
Output:
Original text: I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.
Expanded_text: I will be there within 5 min. should not you be there too?
I would love to see you there my dear. it is awesome to meet new friends.
we have been waiting for this day for so long.
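Expanding contractions before building word vectors helps reduce dimensionality, because a contraction and its expanded form no longer appear as separate entries in the vocabulary.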
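To make the dimensionality point concrete, here is a minimal sketch that counts unique tokens before and after expansion. The mini-corpus `docs` and the two-entry `mapping` are made up for illustration, and a plain dictionary lookup stands in for the contractions library so that the snippet runs with the standard library alone.

```python
# Hypothetical mini-corpus: each idea appears once contracted and once expanded.
docs = ["I do not like it", "I don't like it", "it is fine", "it's fine"]

# Illustrative two-entry mapping (an assumption, not the library's table).
mapping = {"don't": "do not", "it's": "it is"}


def vocabulary(documents):
    """Return the set of lowercased, whitespace-separated tokens."""
    return {tok for doc in documents for tok in doc.lower().split()}


def expand(doc):
    """Replace each token found in the mapping with its expanded form."""
    return " ".join(mapping.get(tok.lower(), tok) for tok in doc.split())


vocab_before = vocabulary(docs)
vocab_after = vocabulary([expand(d) for d in docs])

print(len(vocab_before), len(vocab_after))  # prints: 9 7
```

With a bag-of-words style representation, each vocabulary entry is one dimension, so in this toy case the feature space shrinks from 9 to 7 columns.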
Code 2: Simply using contractions.fix to expand the whole text.
Python3
text = '''She'd like to know how I'd done that!
She's going to the park and I don't think I'll be home for dinner.
Theyre going to the zoo and she'll be home for dinner.'''
contractions.fix(text)
Output:
'she would like to know how I would done that!
she is going to the park and I do not think I will be home for dinner.
they are going to the zoo and she will be home for dinner.'
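Contractions can also be handled with other techniques, such as dictionary mapping, or with the pycontractions library. Refer to the pycontractions documentation for more information: https://pypi.org/project/pycontractions/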
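The dictionary-mapping approach could look roughly like the sketch below. The names `CONTRACTION_MAP` and `expand_contractions` are hypothetical, and the mapping is a small hand-picked sample, not a complete table. The sketch assumes straight apostrophes (') in the input.

```python
import re

# Small illustrative mapping (an assumption); a practical table is much larger.
CONTRACTION_MAP = {
    "i'll": "I will",
    "i'd": "I would",
    "it's": "it is",
    "we've": "we have",
    "don't": "do not",
    "shouldn't": "should not",
    "she'll": "she will",
}

# One alternation over all keys, longest first, matched case-insensitively.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace every known contraction in text with its expanded form."""

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Keep a leading capital, e.g. "Shouldn't" -> "Should not".
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions("I'll be there. Shouldn't you be there too?"))
# prints: I will be there. Should not you be there too?
```

A fixed mapping cannot resolve ambiguous forms: "I'd" may mean "I would" or "I had", and the table above always picks the first.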