How to merge multiple TSV files by a common key using Python Pandas?

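Data is the foundation of any analysis, and it has to be prepared before it can be analyzed. Sometimes the data we need is spread across several files that must be combined first. In this article, we merge multiple TSV (tab-separated values) files on a common key. This can be done with the merge function of the pandas Python library, which combines DataFrames on a shared key column.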

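Approach:

Import the pandas library.
Read the first two TSV files and merge them with pd.merge(), setting the “on” parameter to the common column present in both files. Store the result in a new DataFrame named “Output_df”.
Store the names of the remaining files in a list.
Loop over these file names, reading each file in turn and merging it into the “Output_df” DataFrame.
Save “Output_df” to a TSV file.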
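The steps above can be sketched as one reusable function. This is a minimal sketch, not the article's own code: the function name merge_tsv_files, the sample file names a.tsv, b.tsv and c.tsv, and their contents are all hypothetical, and the files are written to a temporary directory only so that the snippet runs on its own. Seeding the result with the first file and looping over the rest is equivalent to merging the first two files and then looping over the remainder.

```python
import os
import tempfile

import pandas as pd


def merge_tsv_files(paths, key, how="inner"):
    """Merge the TSV files in `paths` on the shared column `key`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Read the first file; it seeds the running result.
    output_df = pd.read_csv(paths[0], sep="\t")
    # Read the remaining files one by one and merge each into the result.
    for path in paths[1:]:
        tsv = pd.read_csv(path, sep="\t")
        output_df = pd.merge(output_df, tsv, on=key, how=how)
    return output_df


# Hypothetical sample data, written to a temporary directory for the demo.
samples = {
    "a.tsv": "ID\tName\n1\tAnn\n2\tBob\n",
    "b.tsv": "ID\tCity\n1\tOslo\n2\tRome\n",
    "c.tsv": "ID\tAge\n1\t30\n2\t41\n",
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(text)
        paths.append(path)

    merged = merge_tsv_files(paths, key="ID")

    # Save the merged result as a TSV file, without the index column.
    out_path = os.path.join(tmp, "Output.tsv")
    merged.to_csv(out_path, sep="\t", index=False)
    with open(out_path) as handle:
        written = handle.read()

print(merged)
```

Passing how="outer" instead of the default would keep rows whose key is missing from some of the files.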

Example 1:

In this example, we merge the TSV files using an inner join. Four TSV files are used for this example, as listed below.

Files used: Customer.tsv, Account.tsv, Branch.tsv, Loan.tsv

Python3

# Import pandas library
import pandas as pd
 
# Read the first two TSV files using the '\t' separator
tsv1 = pd.read_csv("Documents/Customer.tsv", sep='\t')
tsv2 = pd.read_csv("Documents/Account.tsv", sep='\t')
 
# store the result in Output_df dataframe.
# Here common column is 'ID' column
Output_df = pd.merge(tsv1, tsv2, on='ID',
                     how='inner')
 
# store remaining file names in list
tsv_files = ["Branch.tsv", "Loan.tsv"]
 
# One by one read tsv files and merge with
# 'Output_df' dataframe and again store
# the final result in Output_df
for i in tsv_files:
    path = "Documents/"+i
    tsv = pd.read_csv(path, sep='\t')
    Output_df = pd.merge(Output_df, tsv,
                         on='ID', how='inner')
 
# Now store the 'Output_df'
# in tsv file 'Output.tsv'
Output_df.to_csv("Documents/Output.tsv",
                 sep="\t", header=True,
                 index=False)

Output:

Output.tsv
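To make the effect of the inner join concrete, here is a small sketch on in-memory data. The two tables and their values are hypothetical stand-ins, not the contents of the actual Customer.tsv or Loan.tsv files; io.StringIO is used only so that the snippet runs without any files on disk.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two TSV files; only ID 2 appears in both.
customer_tsv = "ID\tName\n1\tAnn\n2\tBob\n"
loan_tsv = "ID\tLoan\n2\t5000\n3\t700\n"

customer = pd.read_csv(io.StringIO(customer_tsv), sep="\t")
loan = pd.read_csv(io.StringIO(loan_tsv), sep="\t")

# An inner join keeps only the rows whose key exists in both tables.
inner = pd.merge(customer, loan, on="ID", how="inner")
print(inner)
```

With an inner join, every additional file merged in the loop can only keep or shrink the set of IDs, so an ID missing from any one file is absent from the final output.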

Example 2:

In this example, we merge the TSV files using an outer join. Four TSV files are used for this example, as listed below.

Files used: Course.tsv, Teacher.tsv, Credits.tsv, Marks.tsv

Python3

# Import pandas library
import pandas as pd
 
# Read the first two TSV files using the '\t' separator
tsv3 = pd.read_csv("Documents/Course.tsv", sep='\t')
tsv4 = pd.read_csv("Documents/Teacher.tsv", sep='\t')
 
# Store the result in the Output_df2 dataframe.
# Here the common column is the 'Course_ID' column
Output_df2 = pd.merge(tsv3, tsv4, on='Course_ID', how='outer')
 
# store remaining file names in list
tsv_files = ["Credits.tsv", "Marks.tsv"]
 
# One by one read tsv files and merge with
# 'Output_df2' dataframe and again store
# the final result in 'Output_df2'
for i in tsv_files:
    path = "Documents/"+i
    tsv = pd.read_csv(path, sep='\t')
    Output_df2 = pd.merge(Output_df2, tsv,
                          on='Course_ID', how='outer')
 
 
# Now store 'Output_df2' in the tsv file 'Output_outer.tsv'
# Here we replace NaN values with NA
Output_df2.to_csv("Documents/Output_outer.tsv", sep="\t",
                  header=True, index=False, na_rep="NA")

Output:

Output_outer.tsv
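As a small illustration of what the outer join and the na_rep argument do, the following sketch uses in-memory data. The two tables and their values are hypothetical and are not the contents of the actual Course.tsv or Marks.tsv files; the result is written to a string buffer instead of a file so that the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two TSV files; only C2 appears in both.
course_tsv = "Course_ID\tCourse\nC1\tMath\nC2\tArt\n"
marks_tsv = "Course_ID\tMarks\nC2\t88\nC3\t71\n"

course = pd.read_csv(io.StringIO(course_tsv), sep="\t")
marks = pd.read_csv(io.StringIO(marks_tsv), sep="\t")

# An outer join keeps every key from either table; cells with no
# matching row on the other side are filled with missing values.
outer = pd.merge(course, marks, on="Course_ID", how="outer")

# Write tab-separated text, replacing missing values with "NA".
buffer = io.StringIO()
outer.to_csv(buffer, sep="\t", index=False, na_rep="NA")
outer_text = buffer.getvalue()
print(outer_text)
```

Unlike the inner join, no Course_ID is dropped here: C1 has no marks and C3 has no course name, so each of those rows carries one NA in the written text.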