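How to do fuzzy matching on a Pandas DataFrame column using Python?
Prerequisite: FuzzyWuzzy
In this tutorial, we will learn how to do fuzzy matching on a Pandas DataFrame column using Python. Fuzzy matching is a process that lets us find matches that are not exact but still contain a given pattern in the target item. Fuzzy matching is the basis of search engines, which is why we get many recommendations or suggestions as we type a search query in a browser.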
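Functions used
- pd.DataFrame(dict): converts a Python dictionary to a pandas DataFrame
- dataframe['column_name'].tolist(): converts a particular column of a pandas DataFrame into a Python list of items
- append(): appends an item to a list
- process.extract(query, choices, limit): a function from the process module of the fuzzywuzzy library that extracts the items in the list of choices that best match the given query. The number of closest choices extracted is determined by the limit we set.
- process.extractOne(query, choices, scorer): extracts the single closest match to the given query from the list of choices. scorer is an optional parameter that makes it use a particular scorer, such as fuzz.token_sort_ratio or fuzz.token_set_ratio
- fuzz.ratio: computes the similarity of two strings based on the Levenshtein distance
- fuzz.partial_ratio: compares the shorter string with the substrings of the longer string that have the same length n as the shorter one, and returns the best score
- fuzz.token_sort_ratio: computes the similarity after sorting the tokens in each string
- fuzz.token_set_ratio: tries to rule out differences between the strings; it computes the ratio on three particular sets of substrings and returns the maximum ratio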
Examples
Example 1: (Basic approach)
- First, we create two dictionaries. Then we convert them to pandas DataFrames and create two empty lists for storing the matches later, as shown below:
Python3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
dict1 = {'name': ["aparna", "pankaj",
                  "sudhir", "Geeku"]}
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "sudhir c", "Geek", "abc"]}
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
# empty lists for storing the
# matches later
mat1 = []
mat2 = []
# printing the pandas dataframes
print(dframe1)
print(dframe2)
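- Then we convert the 'name' column of each DataFrame to a list using the tolist() function.
- We take threshold = 80, so that two strings count as a fuzzy match only when their similarity score is at least 80 out of 100.
Python3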
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
# taking the threshold as 80
threshold = 80
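- Then we iterate through the items of list1 to extract their closest matches from list2.
- Here we use the process.extract() function from the process module to extract the elements.
- limit=2 means that it extracts the two closest elements together with their similarity scores; if we print the result now, we can see the score values.
- Then we append each set of closest matches to the list mat1
- and store the list of matches under the 'matches' column of the first DataFrame, i.e. dframe1.
Python3
# iterating through list1 to extract
# its closest matches from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
print(dframe1)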
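- Then we iterate over the 'matches' column again in the outer loop, and over each set of matches in the inner loop.
- k[1] >= threshold means that only those matches whose score is greater than or equal to the threshold of 80 are selected and appended to the list p.
- If a particular column item has more than one match, the ",".join() function joins the matches with commas, and the result is appended to the list mat2. We then set the list p to empty again, so that it can store the matches for the next row item of the first DataFrame column.
- Then we store the resulting closest matches back to dframe1 to get our final output.
Python3
# temporary list for the matches
# of the current row
p = []
# iterating through the closest
# matches to keep only those that
# reach the threshold
for j in dframe1['matches']:
    for k in j:
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches
# back to dframe1
dframe1['matches'] = mat2
print(dframe1)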
Example 2:
In this example, the steps are the same as in Example 1. The only difference is that some row items have more than one match, for example 'mango' and 'choco'. We set threshold = 82 to make the fuzzy matching stricter.
Python3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# creating the dictionaries
dict1 = {'name': ["mango", "coco", "choco", "peanut", "apple"]}
dict2 = {'name': ["mango fruit", "coconut", "chocolate",
                  "mangoes", "chocos", "peanuts", "appl"]}
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
# empty lists for storing the matches later
mat1 = []
mat2 = []
p = []
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
# converting dataframe column to list
# of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
# taking the threshold as 82
threshold = 82
# iterating through list1 to extract
# its closest matches from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
# iterating through the closest matches
# to keep only those that reach the threshold
for j in dframe1['matches']:
    for k in j:
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching:")
print(dframe1)
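Now we will use the process.extractOne() method to match only the single closest item between the two DataFrames. With this method, we will apply different fuzzy matching functions, as follows: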
Example 3: Using fuzz.ratio()
Python3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir",
                  "Geeku", "geeks for geeks"]}
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "geeks geeks"]}
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
# converting dataframe column to
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
# taking the threshold as 80
threshold = 80
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.ratio))
dframe1['matches'] = mat1
# iterating through the closest matches
# to keep only those that reach the threshold
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.ratio():")
print(dframe1)
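As a rough picture of what fuzz.ratio() measures, the sketch below computes a comparable score with difflib.SequenceMatcher from the Python standard library. This is a stand-in written for illustration (the helper name `simple_ratio` is made up), not fuzzywuzzy's own code, and its scores may differ slightly from fuzz.ratio() depending on which backend fuzzywuzzy uses.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    # SequenceMatcher.ratio() is 2 * M / T, where M is the number of
    # matching characters and T is the total length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(simple_ratio("sudhir", "sudhir c"))
print(simple_ratio("pankaj", "pankaj"))
```

For "sudhir" and "sudhir c" there are 6 matching characters out of 14 in total, so the first call gives round(100 * 12 / 14) = 86, which clears the threshold of 80.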
Example 4: Using fuzz.partial_ratio()
Python3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir",
                  "Geeku", "geeks for geeks"]}
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "geeks geeks"]}
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
# converting dataframe column to
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
# taking the threshold as 80
threshold = 80
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(
        i, list2, scorer=fuzz.partial_ratio))
dframe1['matches'] = mat1
# iterating through the closest matches
# to keep only those that reach the threshold
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.partial_ratio:")
print(dframe1)
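The idea behind fuzz.partial_ratio() can be sketched with the standard library as well: slide a window the length of the shorter string over the longer one and keep the best score. This follows the description in the list of functions at the top of this tutorial; fuzzywuzzy's real implementation picks its candidate windows differently, so treat `simple_partial_ratio` as an illustrative stand-in only.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a, b):
    """Best score of the shorter string against every equally long
    window of the longer string, as an integer from 0 to 100."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    n = len(shorter)
    best = 0.0
    # Compare the shorter string with each window of length n.
    for start in range(len(longer) - n + 1):
        window = longer[start:start + n]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(simple_partial_ratio("sudhir", "sudhir c"))
print(simple_partial_ratio("geeks for geeks", "geeks geeks"))
```

The first call scores 100 because the window "sudhir" inside "sudhir c" is identical to the query, whereas a plain ratio on the full strings would be lower.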
Example 5: Using fuzz.token_sort_ratio()
Python3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "sudhir",
                  "Geeku", "geeks for geeks"]}
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "for geeks geeks", "sudhir c",
                  "Geek"]}
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
# converting dataframe column to
# list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
# taking the threshold as 80
threshold = 80
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(
        i, list2, scorer=fuzz.token_sort_ratio))
dframe1['matches'] = mat1
# iterating through the closest matches
# to keep only those that reach the threshold
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back
# to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using fuzz.token_sort_ratio:")
print(dframe1)
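Why "geeks for geeks" can match "for geeks geeks" under fuzz.token_sort_ratio() is easier to see in a small sketch. The helper below, `simple_token_sort_ratio`, is a made-up standard-library approximation of the idea (lowercase, split into words, sort, compare); it is not fuzzywuzzy's implementation, which also strips punctuation.

```python
from difflib import SequenceMatcher


def simple_token_sort_ratio(a, b):
    """Score two strings after lowercasing them and sorting their words."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


# Both strings become "for geeks geeks" once their words are sorted.
print(simple_token_sort_ratio("geeks for geeks", "for geeks geeks"))
print(simple_token_sort_ratio("Geeku", "Geek"))
```

Since the sorted forms of the first pair are identical, the score is 100 regardless of the original word order.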
Example 6: Using fuzz.token_set_ratio()
Python3
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# creating the dictionaries
dict1 = {'name': ["aparna", "pankaj", "Geeku",
                  "geeks for geeks"]}
dict2 = {'name': ["aparn", "arup", "Pankaj",
                  "geeks for for geeks",
                  "geeks for geeks", "Geek"]}
# converting to pandas dataframes
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
# printing the pandas dataframes
print("First dataframe:\n", dframe1,
      "\nSecond dataframe:\n", dframe2)
# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
# taking the threshold as 80
threshold = 80
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(
        i, list2, scorer=fuzz.token_set_ratio))
dframe1['matches'] = mat1
# iterating through the closest matches
# to keep only those that reach the threshold
for j in dframe1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back
# to dframe1
dframe1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
print(dframe1)
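The "three particular sets of substrings" behind fuzz.token_set_ratio() can be sketched with the standard library too. The helper names `_ratio` and `simple_token_set_ratio` are made up for this illustration, and the code is an approximation of the idea under the assumption that words are split on whitespace, not fuzzywuzzy's own implementation.

```python
from difflib import SequenceMatcher


def _ratio(a, b):
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_set_ratio(a, b):
    """Score two strings by their sets of words, ignoring repeats and order."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    # Three strings: the shared words, and the shared words followed by
    # the words that appear in only one of the inputs.
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return max(_ratio(common, combined_a),
               _ratio(common, combined_b),
               _ratio(combined_a, combined_b))


# The repeated "for" disappears once the words are turned into sets.
print(simple_token_set_ratio("geeks for geeks", "geeks for for geeks"))
```

Both inputs reduce to the same word set {"for", "geeks"}, so the score is 100, which is why "geeks for for geeks" can tie with the exact string "geeks for geeks" under this scorer.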