如何使用Python对 Pandas Dataframe 列进行模糊匹配？

先决条件： FuzzyWuzzy

在本教程中，我们将学习如何使用Python对 Pandas DataFrame 列进行模糊匹配。模糊匹配是一个过程，它让我们识别不准确的匹配，但在我们的目标项目中找到给定的模式。模糊匹配是搜索引擎的基础。这就是为什么当我们在任何浏览器中键入搜索查询时，我们会收到许多推荐或建议。

使用的功能

pd.DataFrame(dict)：将Python字典转换为pandas数据帧
dataframe['column_name'].tolist()：将pandas数据框的特定列转换为Python中的项目列表
append():将项目追加到列表中
process.extract（查询，选择，限制）：函数附带fuzzywuzzy库的处理模块中提取从匹配给定的查询，其在选择列表中的项目。提取的最接近选项的数量由我们设置的限制决定。
process.extractOne(query, choice, scorer)：从选择列表中提取与给定查询匹配的唯一最接近的匹配项， scorer是可选参数，使其使用特定的计分器，如 fuzz.token_sort_ratio, fuzz.token_set_ratio
fuzz.ratio：根据Levenshtein距离计算两个字符串的相似度
fuzz.partial_ratio：为了计算最小的字符串之间的部分字符串比对长字符串的所有n个长度为子串
fuzz.token_sort_ratio：计算每个字符串token排序后的相似度
fuzz.token_set_ratio：它试图排除字符串的差异，在Python计算三个特定子字符串集的比率后返回最大比率

例子

示例 1：（基本方法）

首先，我们将创建两个字典。然后我们将其转换为pandas数据帧并创建两个空列表用于存储匹配项，如下所示：

Python3

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas
  
dict1 = {'name': ["aparna", "pankaj", 
                  "sudhir", "Geeku"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "sudhir c", "Geek", "abc"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the
# matches later
mat1 = []
mat2 = []
  
# printing the pandas dataframes
dframe1.show()
dframe2.show()

Python3

list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80

Python3

# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
  
dframe1.show()

Python3

# iterating through the closest
# matches to filter out the
# maximum closest match
for j in dframe1['matches']:
    for k in j:
        
        if k[1] >= threshold:
            p.append(k[0])
              
    mat2.append(",".join(p))
    p = []
      
# storing the resultant matches 
# back to dframe1
dframe1['matches'] = mat2
  
dframe1.show()

Python3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["mango", "coco", "choco", "peanut", "apple"]}
dict2 = {'name': ["mango fruit", "coconut", "chocolate",
                  "mangoes", "chocos", "peanuts", "appl"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1, 
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to list
# of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 82
threshold = 82
  
# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    for k in j:
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching:")
dframe1

Python3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir",
                  "Geeku", "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "geeks geeks"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches 
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to 
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.ratio():")
dframe1

Python3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir", 
                  "Geeku", "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c", 
                  "geeks geeks"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches 
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to 
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(
      i, list2, scorer=fuzz.partial_ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.partial_ratio:")
dframe1

Python3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir", 
                  "Geeku", "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "Geek"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to 
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(
      i, list2, scorer=fuzz.token_sort_ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back 
# to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.token_sort_ratio:")
dframe1

Python3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "Geeku", 
                  "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "geeks for for geeks", 
                  "geeks for geeks", "Geek"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column 
# to list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(
      i, list2, scorer=fuzz.token_set_ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back 
# to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
dframe1

dframe1：

dframe2：

然后我们将使用 tolist()函数将数据帧转换为列表。
我们采用阈值 = 80，这样模糊匹配仅在字符串彼此接近至少 80% 以上时发生。

蟒蛇3

list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80

输出：

然后我们将遍历 list1 项以从 list2 中提取最接近的匹配项。
这里我们使用处理模块中的 process.extract()函数来提取元素。
Limit=2 意味着它将提取两个最接近的元素及其准确率，如果我们现在打印它，那么我们可以看到比率值。
然后我们将每个最接近的匹配项附加到列表 mat1
并将匹配列表存储在第一个数据框中的“匹配”列下，即 dframe1

蟒蛇3

# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
  
dframe1.show()

输出：

然后我们将再次遍历外循环中的匹配列，并在内循环中遍历每组匹配项
k[1] >= threshold 意味着它只会选择那些阈值大于或等于 80 的项目并将它们附加到列表 p 中。
如果特定列项目有多个匹配项，则使用“,”.join()函数将匹配项用逗号分隔，并将其附加到列表 mat2。我们再次将列表 p 设置为空，用于存储第一个数据帧列中下一行项目的匹配项。
然后我们将结果最接近的匹配存储回 dframe1 以获得我们的最终输出。

蟒蛇3

# iterating through the closest
# matches to filter out the
# maximum closest match
for j in dframe1['matches']:
    for k in j:
        
        if k[1] >= threshold:
            p.append(k[0])
              
    mat2.append(",".join(p))
    p = []
      
# storing the resultant matches 
# back to dframe1
dframe1['matches'] = mat2
  
dframe1.show()

输出：

示例 2：

在本示例中，步骤与示例一相同。唯一的区别是特定行项目有多个匹配项，例如“芒果”和“巧克力”。我们设置阈值=82 以提高模糊匹配精度。

蟒蛇3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["mango", "coco", "choco", "peanut", "apple"]}
dict2 = {'name': ["mango fruit", "coconut", "chocolate",
                  "mangoes", "chocos", "peanuts", "appl"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1, 
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to list
# of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 82
threshold = 82
  
# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    for k in j:
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching:")
dframe1

输出：

现在我们将使用 process.extractOne() 方法只匹配两个数据帧之间最接近的。在此方法中，我们将应用不同的模糊匹配函数，如下所示：

示例 3：使用 fuzz.ratio()

蟒蛇3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir",
                  "Geeku", "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "geeks geeks"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches 
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to 
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.ratio():")
dframe1

输出：

示例 4：使用 fuzz.partial_ratio()

蟒蛇3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir", 
                  "Geeku", "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c", 
                  "geeks geeks"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches 
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to 
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract 
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(
      i, list2, scorer=fuzz.partial_ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.partial_ratio:")
dframe1

输出：

示例 5：使用 fuzz.token_sort_ratio()

蟒蛇3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir", 
                  "Geeku", "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "Geek"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column to 
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(
      i, list2, scorer=fuzz.token_sort_ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back 
# to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.token_sort_ratio:")
dframe1

输出：

示例 6：使用 fuzz.token_set_ratio()

蟒蛇3

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
  
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "Geeku", 
                  "geeks for geeks"]}
  
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "geeks for for geeks", 
                  "geeks for geeks", "Geek"]}
  
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
  
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
  
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
  
# converting dataframe column 
# to list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
  
# taking the threshold as 80
threshold = 80
  
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(
      i, list2, scorer=fuzz.token_set_ratio))
dframe1['matches'] = mat1
  
# iterating through the closest matches
# to filter out the maximum closest match
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
  
  
# storing the resultant matches back 
# to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
dframe1

输出：