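Python offers several ways to compare strings. The main ones are:
- Using regular expressions
- Simple comparison
- Using difflib
A very simple alternative is the fuzzywuzzy library, which returns a score out of 100 indicating how similar two strings are. This article covers how to get started with the fuzzywuzzy library.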
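FuzzyWuzzy is a Python library for string matching. Fuzzy string matching is the process of finding strings that approximately match a given pattern. Under the hood, it uses Levenshtein distance to calculate the differences between sequences.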
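To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm. The function name `levenshtein` is a hypothetical helper for illustration; it is not part of fuzzywuzzy, and the library's scores are similarity ratios out of 100 rather than raw edit counts.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the previous prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Deleting "for" takes three single-character edits.
print(levenshtein("geeksforgeeks", "geeksgeeks"))
```

A smaller distance means more similar strings; a ratio-style score turns this around so that larger values mean a closer match.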
FuzzyWuzzy was developed and open-sourced by SeatGeek, a service that finds sports and concert tickets. Their original use case is discussed on their blog.
Requirements of fuzzywuzzy:
- Python 2.4 or higher
- python-Levenshtein
Install via pip:
pip install fuzzywuzzy
pip install python-Levenshtein
How to use this library?
First, import the modules:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Simple ratio usage:
fuzz.ratio('geeksforgeeks', 'geeksgeeks')
87
# Exact match
fuzz.ratio('GeeksforGeeks', 'GeeksforGeeks')
100
fuzz.ratio('geeks for geeks', 'Geeks For Geeks')
80
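As a rough illustration of what the simple ratio measures, the sketch below computes a comparable 0 to 100 score with the standard library's `difflib.SequenceMatcher`. The helper name `simple_ratio` is hypothetical, and this is an approximation under the assumption that python-Levenshtein is not installed; it is not the library's own code.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Hypothetical helper: difflib similarity scaled to the 0-100 range."""
    # ratio() returns 2 * matches / (len(s1) + len(s2)) as a float in [0, 1].
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


print(simple_ratio("geeksforgeeks", "geeksgeeks"))
print(simple_ratio("GeeksforGeeks", "GeeksforGeeks"))
# The comparison is case-sensitive, so differing capitals lower the score.
print(simple_ratio("geeks for geeks", "Geeks For Geeks"))
```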
fuzz.partial_ratio("geeks for geeks", "geeks for geeks!")
100
# There is an exclamation mark in the second string, but the
# rest matches exactly, so the partial score is still 100
fuzz.partial_ratio("geeks for geeks", "geeks geeks")
64
# The score is lower because the first string has an extra
# token in the middle of the string.
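One way to picture a partial ratio is as the best score the shorter string achieves against any same-length slice of the longer one. The sketch below implements that idea with `difflib`; `partial_ratio_sketch` is a hypothetical name, and the exhaustive sliding window is a simplification rather than the strategy fuzzywuzzy itself uses to pick candidate slices.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against slices of the longer."""
    shorter, longer = sorted((s1, s2), key=len)
    best = 0.0
    # Slide a window of len(shorter) across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The shorter string appears verbatim inside the longer one.
print(partial_ratio_sketch("geeks for geeks", "geeks for geeks!"))
# No slice of the longer string matches the shorter one exactly.
print(partial_ratio_sketch("geeks for geeks", "geeks geeks"))
```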
Next, the token sort ratio and the token set ratio:
# Token Sort Ratio
fuzz.token_sort_ratio("geeks for geeks", "for geeks geeks")
100
# This gives 100 because every word is the same, irrespective of position
# Token Set Ratio
fuzz.token_sort_ratio("geeks for geeks", "geeks for for geeks")
88
fuzz.token_set_ratio("geeks for geeks", "geeks for for geeks")
100
# The score is 100 in the second case because token_set_ratio
# treats duplicate words as a single word.
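The two token-based ratios can be sketched with the standard library as follows. The function names are hypothetical, tokenization here is a plain lowercase whitespace split, and the code is an approximation of the idea (sort the tokens, or compare the shared tokens against each side's leftovers), not the library's implementation.

```python
from difflib import SequenceMatcher


def _ratio(s1: str, s2: str) -> int:
    """difflib similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    # Sort the words so that word order no longer affects the score.
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return _ratio(sorted1, sorted2)


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    # Using sets collapses duplicate words into a single token.
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    combined1 = (common + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (common + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


print(token_sort_ratio_sketch("geeks for geeks", "for geeks geeks"))
print(token_sort_ratio_sketch("geeks for geeks", "geeks for for geeks"))
print(token_set_ratio_sketch("geeks for geeks", "geeks for for geeks"))
```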
Now suppose we have a list of options and want to find the closest match or matches. For that we can use the process module:
query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)]
# If we want only the top one
process.extractOne(query, choices)
('g. for geeks', 95)
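Conceptually, extracting matches amounts to scoring every choice against the query and sorting by score. The sketch below shows that pattern with a hypothetical `extract_sketch` helper and a plain `difflib` scorer. Because the scorer here is an assumption and differs from the library's default (WRatio), the ranking and scores it produces can differ from the library output above.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Hypothetical scorer: difflib similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


query = "geeks for geeks"
choices = ["geek for geek", "geek geek", "g. for geeks"]
ranked = extract_sketch(query, choices)
print(ranked)     # all choices, ordered by score
print(ranked[0])  # the single best match under this scorer
```

Passing a different function as `scorer` changes which choice wins, which is why the choice of ratio matters when ranking candidates.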
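There is also a ratio commonly referred to as WRatio. It is sometimes preferable to the simple ratio because WRatio handles letter case and some other parameters, such as punctuation.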
fuzz.WRatio('geeks for geeks', 'Geeks For Geeks')
100
fuzz.WRatio('geeks for geeks!!!','geeks for geeks')
100
# whereas the simple ratio gives a lower score for the same pair
fuzz.ratio('geeks for geeks!!!','geeks for geeks')
91
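Part of the reason the two WRatio examples score 100 is that the strings are normalized before comparison. The sketch below illustrates only that normalization step, assuming it consists of replacing non-word characters with spaces, lowercasing, and trimming. The names `normalize` and `normalized_ratio` are hypothetical, and the real WRatio additionally combines several weighted ratios, which this sketch does not attempt.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Replace non-word characters with spaces, lowercase, and trim."""
    return re.sub(r"\W", " ", text).lower().strip()


def normalized_ratio(s1: str, s2: str) -> int:
    """Simple 0-100 ratio computed on the normalized strings."""
    a, b = normalize(s1), normalize(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Case differences and trailing punctuation vanish after normalization.
print(normalized_ratio("geeks for geeks", "Geeks For Geeks"))
print(normalized_ratio("geeks for geeks!!!", "geeks for geeks"))
```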
Full code
# Python 3 code showing all the ratios together.
# Make sure the fuzzywuzzy module is installed.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
s1 = "I love GeeksforGeeks"
s2 = "I am loving GeeksforGeeks"
print("FuzzyWuzzy Ratio:", fuzz.ratio(s1, s2))
print("FuzzyWuzzy PartialRatio:", fuzz.partial_ratio(s1, s2))
print("FuzzyWuzzy TokenSortRatio:", fuzz.token_sort_ratio(s1, s2))
print("FuzzyWuzzy TokenSetRatio:", fuzz.token_set_ratio(s1, s2))
print("FuzzyWuzzy WRatio:", fuzz.WRatio(s1, s2))
# Using the process module to rank a list of choices
query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks']
print("List of ratios:")
print(process.extract(query, choices))
print("Best among the above list:", process.extractOne(query, choices))
Output:
FuzzyWuzzy Ratio: 84
FuzzyWuzzy PartialRatio: 85
FuzzyWuzzy TokenSortRatio: 84
FuzzyWuzzy TokenSetRatio: 86
FuzzyWuzzy WRatio: 84
List of ratios:
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)]
Best among the above list: ('g. for geeks', 95)
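The FuzzyWuzzy library is built on top of the difflib library, and python-Levenshtein is used to speed it up. This makes it a convenient option for fuzzy string matching in Python.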