Python NLTK | tokenize.regexp()
With the help of the NLTK tokenize.regexp() module, we can extract tokens from a string with a regular expression using the RegexpTokenizer() method.
Syntax : tokenize.RegexpTokenizer()
Return : a list of tokens extracted with the regular expression
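The gaps parameter controls how the pattern is interpreted: with gaps=False (the default), the pattern matches the tokens themselves; with gaps=True, it matches the separators between them. A minimal sketch of both modes (the input string here is our own illustration):

# import RegexpTokenizer() method from nltk
from nltk.tokenize import RegexpTokenizer

# gaps=False (default): the pattern describes the tokens themselves
word_tk = RegexpTokenizer(r'\w+')
print(word_tk.tokenize("I love Python!"))
# ['I', 'love', 'Python']

# gaps=True: the pattern describes the separators between tokens,
# so punctuation stays attached to the neighbouring word
gap_tk = RegexpTokenizer(r'\s+', gaps=True)
print(gap_tk.tokenize("I love Python!"))
# ['I', 'love', 'Python!']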
Example #1:
In this example, we use the RegexpTokenizer() method to extract a stream of tokens with the help of a regular expression.
# import RegexpTokenizer() method from nltk
from nltk.tokenize import RegexpTokenizer
# Create a reference variable for Class RegexpTokenizer
tk = RegexpTokenizer(r'\s+', gaps=True)
# Create a string input
gfg = "I love Python"
# Use tokenize method
geek = tk.tokenize(gfg)
print(geek)
Output :
['I', 'love', 'Python']
Example #2:
# import RegexpTokenizer() method from nltk
from nltk.tokenize import RegexpTokenizer
# Create a reference variable for Class RegexpTokenizer
tk = RegexpTokenizer(r'\s+', gaps=True)
# Create a string input
gfg = "Geeks for Geeks"
# Use tokenize method
geek = tk.tokenize(gfg)
print(geek)
Output :
['Geeks', 'for', 'Geeks']
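Beyond tokenize(), a RegexpTokenizer also provides span_tokenize(), which yields the (start, end) character offsets of each token rather than the token strings themselves. A short sketch reusing the string from Example #2:

# import RegexpTokenizer() method from nltk
from nltk.tokenize import RegexpTokenizer

tk = RegexpTokenizer(r'\s+', gaps=True)
gfg = "Geeks for Geeks"

# span_tokenize() yields (start, end) offsets into the original string
for start, end in tk.span_tokenize(gfg):
    print((start, end), gfg[start:end])
# (0, 5) Geeks
# (6, 9) for
# (10, 15) Geeks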