FuzzyWuzzy Python Library

Last modified: 2021-10-22 03:36:52             Author: Mango

Python offers many ways to compare strings. Some of the main ones are:

  1. Using regular expressions
  2. Simple comparison
  3. Using difflib
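
To make the three approaches listed above concrete, here is a minimal sketch that uses only the standard library. The sample strings and variable names are illustrative assumptions, not part of any library.

```python
import re
from difflib import SequenceMatcher

a = "geeks for geeks"
b = "Geeks For Geeks"

# 1. Regular expression: case-insensitive full match against the escaped text
regex_match = re.fullmatch(re.escape(a), b, flags=re.IGNORECASE) is not None

# 2. Simple comparison: exact equality, then equality after case folding
exact_match = a == b
casefold_match = a.casefold() == b.casefold()

# 3. difflib: a similarity ratio between 0.0 and 1.0
similarity = SequenceMatcher(None, a, b).ratio()

print(regex_match, exact_match, casefold_match, round(similarity, 2))
```

The first two approaches only answer yes or no, while difflib returns a graded score, which is the idea that fuzzy matching builds on.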

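A very simple alternative is the fuzzywuzzy library, which returns a score out of 100 as a similarity index: the higher the score, the closer the two strings are, and 100 means they are considered equal. This article covers how to get started with the fuzzywuzzy library.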

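FuzzyWuzzy is a Python library for string matching. Fuzzy string matching is the process of finding strings that approximately match a given pattern. At its core, it uses Levenshtein distance to calculate the differences between sequences.
FuzzyWuzzy was developed and open-sourced by SeatGeek, a service that finds sports and concert tickets. Their original use case is discussed on their blog.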
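
Since Levenshtein distance is the central idea here, a short pure-Python sketch may help. It is a textbook dynamic-programming version written for illustration, not the code that fuzzywuzzy or python-Levenshtein actually runs; the function name `levenshtein` is an assumption of this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("geeksforgeeks", "geeksgeeks"))
```

A smaller distance means the strings are closer; similarity scores such as the ones shown later in this article are normalised versions of this kind of measure.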

    Requirements for fuzzywuzzy
  • Python 2.4 or higher
  • python-Levenshtein

Install via pip:

pip install fuzzywuzzy
pip install python-Levenshtein

How to use the library

First, import these modules:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Simple ratio usage:

fuzz.ratio('geeksforgeeks', 'geeksgeeks')
87

# Exact match
fuzz.ratio('GeeksforGeeks', 'GeeksforGeeks')
100

# ratio is case-sensitive, so the three capital letters lower the score
fuzz.ratio('geeks for geeks', 'Geeks For Geeks')
80

# The second string only adds an exclamation mark; the first string
# appears in it unchanged, so the partial score is 100
fuzz.partial_ratio("geeks for geeks", "geeks for geeks!")
100

# The score is lower because the first string has an extra
# token ("for") in the middle
fuzz.partial_ratio("geeks for geeks", "geeks geeks")
64
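
To show roughly what these two scores measure, here is a sketch built on the standard library's difflib. The names `simple_ratio` and `partial_ratio_sketch` are hypothetical, and the sliding-window search is a simplifying assumption: the library chooses candidate alignments differently, so its partial scores may not match this sketch in every case.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity of the two full strings, scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against any same-length
    window of the longer string (brute force)."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 100 if not longer else 0
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(simple_ratio("geeksforgeeks", "geeksgeeks"))
print(partial_ratio_sketch("geeks for geeks", "geeks for geeks!"))
```

The partial score ignores text outside the best-matching window, which is why a trailing "!" does not reduce it.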

Next, the token sort ratio and the token set ratio:

# Token Sort Ratio
# This gives 100 because every word is the same, irrespective of position
fuzz.token_sort_ratio("geeks for geeks", "for geeks geeks")
100

# Token Set Ratio
fuzz.token_sort_ratio("geeks for geeks", "geeks for for geeks")
88
fuzz.token_set_ratio("geeks for geeks", "geeks for for geeks")
100
# The score is 100 in the second case because token_set_ratio
# treats duplicate words as a single word.
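
The difference between the two token-based scores can be sketched with difflib as follows. The function names are hypothetical, and splitting on whitespace after lowercasing is a simplifying assumption about preprocessing, so results may differ from the library on strings with punctuation.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_sketch(s1: str, s2: str) -> int:
    """Sort the words of each string, then compare the rebuilt strings."""
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return _ratio(sorted1, sorted2)


def token_set_sketch(s1: str, s2: str) -> int:
    """Compare the shared words against each string's unique words,
    so duplicates and word order stop mattering."""
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(words1 & words2))
    only1 = " ".join(sorted(words1 - words2))
    only2 = " ".join(sorted(words2 - words1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


print(token_sort_sketch("geeks for geeks", "geeks for for geeks"))
print(token_set_sketch("geeks for geeks", "geeks for for geeks"))
```

Sorting keeps duplicate words, so the extra "for" still costs points; converting to sets drops it, which is why the set-based score reaches 100.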

Now suppose we have a list of options and want to find the closest match. For that we can use the process module:

query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks']

# Get a list of matches ordered by score (the default limit is 5)
process.extract(query, choices)
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)]

# If we want only the top one
process.extractOne(query, choices)
('g. for geeks', 95)
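
Conceptually, extracting the best matches means scoring every choice and sorting by score. The sketch below shows that idea with a plain difflib ratio as the scorer; `extract_sketch` is a hypothetical name, and because the library's default scorer is WRatio rather than a plain ratio, the scores and ordering here differ from the output shown above.

```python
from difflib import SequenceMatcher


def plain_score(a: str, b: str) -> int:
    """Case-insensitive difflib similarity, scaled to 0-100."""
    matcher = SequenceMatcher(None, a.lower(), b.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, limit=5, scorer=plain_score):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


query = "geeks for geeks"
choices = ["geek for geek", "geek geek", "g. for geeks"]
print(extract_sketch(query, choices))
print(extract_sketch(query, choices, limit=1)[0])
```

Passing the scorer as a parameter keeps the ranking logic separate from the similarity measure, so a different ratio can be swapped in without touching the sorting code.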

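There is one more ratio, known as WRatio. It is sometimes better to use WRatio than the simple ratio, because WRatio also handles differences in letter case and a few other factors, such as punctuation.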

fuzz.WRatio('geeks for geeks', 'Geeks For Geeks')
100
fuzz.WRatio('geeks for geeks!!!','geeks for geeks')
100
# whereas the simple ratio gives a lower score for the same pair
fuzz.ratio('geeks for geeks!!!','geeks for geeks')
91
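
One reason the two scores above differ is that the strings are normalised before comparison. The sketch below illustrates only that normalisation step (lowercasing and replacing non-alphanumeric characters with spaces); the name `normalise` and the ASCII-only pattern are assumptions of this example, and WRatio itself combines several ratios in ways not reproduced here.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and replace runs of non-alphanumeric ASCII with one space."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", " ", text)
    return cleaned.lower().strip()


def ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


raw = ratio("geeks for geeks!!!", "geeks for geeks")
cleaned = ratio(normalise("geeks for geeks!!!"), normalise("geeks for geeks"))
print(raw, cleaned)
```

After normalisation the trailing "!!!" disappears and case no longer matters, so the cleaned pair compares as identical.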

Complete code

# Python 3 code showing all the ratios together;
# make sure the fuzzywuzzy module is installed

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s1 = "I love GeeksforGeeks"
s2 = "I am loving GeeksforGeeks"
print("FuzzyWuzzy Ratio: ", fuzz.ratio(s1, s2))
print("FuzzyWuzzy PartialRatio: ", fuzz.partial_ratio(s1, s2))
print("FuzzyWuzzy TokenSortRatio: ", fuzz.token_sort_ratio(s1, s2))
print("FuzzyWuzzy TokenSetRatio: ", fuzz.token_set_ratio(s1, s2))
print("FuzzyWuzzy WRatio: ", fuzz.WRatio(s1, s2), '\n\n')

# Using the process module
query = 'geeks for geeks'
choices = ['geek for geek', 'geek geek', 'g. for geeks']
print("List of ratios: ")
print(process.extract(query, choices), '\n')
print("Best among the above list: ", process.extractOne(query, choices))

Output:

FuzzyWuzzy Ratio:  84
FuzzyWuzzy PartialRatio:  85
FuzzyWuzzy TokenSortRatio:  84
FuzzyWuzzy TokenSetRatio:  86
FuzzyWuzzy WRatio:  84 


List of ratios: 
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)] 

Best among the above list:  ('g. for geeks', 95)

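The FuzzyWuzzy library is built on top of the difflib library, and python-Levenshtein is used to speed it up. This makes it one of the better options for string matching in Python.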