Matching 2 CSV Files (According to Columns)
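I've read several threads here and on other sites, but I still can't figure out how to do this.
Basically, I have two input files (assume they are CSV files). Each file has 2 columns, id and col_name. The difference between the 2 files is the naming convention used in col_name: one file is lowercase, the other is camelCase.
I'd like to know how to match the two files on col_name and then create an output file with 4 columns: id, col_name, col_name, id.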
Input1.csv
1 _id
2 rawrequest
3 rawresponse
4 products
5 _deleted
6 enterpriseid
7 source
8 transactionuid
9 type
10 isotransactiontype
11 status
12 terminalid
13 merchantid
14 merchantname
15 settlementbatchid
16 errordescription
17 referencetransactions
18 createdat
19 updatedat
20 __v
Input2.csv
101 _id
102 rawRequest
103 rawResponse
104 products
105 _deleted
106 enterpriseid
107 source
108 transactionUid
109 type
110 isoTransactionType
111 status
112 terminalId
113 merchantId
114 merchantName
115 settlementBatchId
116 errorDescription
117 referenceTransactions
118 createdAt
119 updatedAt
120 __v
Expected output:
1 _id _id 101
2 rawrequest rawRequest 102
3 rawresponse rawResponse 103
4 products products 104
5 _deleted _deleted 105
6 enterpriseid enterpriseid 106
7 source source 107
8 transactionuid transactionUid 108
9 type type 109
10 isotransactiontype isoTransactionType 110
11 status status 111
12 terminalid terminalId 112
13 merchantid merchantId 113
14 merchantname merchantName 114
15 settlementbatchid settlementBatchId 115
16 errordescription errorDescription 116
17 referencetransactions referenceTransactions 117
18 createdat createdAt 118
19 updatedat updatedAt 119
20 __v __v 120
Code I've tried:
import pandas as pd
csv1 = pd.read_csv("Input1.csv")
csv2 = pd.read_csv("Input2.csv")
# Method 1
merge_data = csv1.merge(csv2, on = 'col_name')
merge_data.to_csv("output.csv", index = False)
# Method 2
merge = pd.merge(csv1, csv2, how="outer")
merge.to_csv("output1.csv", index = False)
# Method 3
import csv
with open('Input1.csv', 'r') as csv_file1:
    csv_reader = csv.reader(csv_file1)
    with open('output2.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file)
        for line in csv_reader:
            csv_writer.writerow(line)
with open('Input2.csv', 'r') as csv_file2:
    csv_reader2 = csv.reader(csv_file2)
    with open('output2.csv', 'a') as new_file:
        csv_writer = csv.writer(new_file)
        for line in csv_reader2:
            csv_writer.writerow(line)
Output (from the code):
output.csv
id_x,col_name,id_y
1,_id,101
4,products,104
5,_deleted,105
7,source,107
9,type,109
11,status,111
20,__v,120
25,subtype,125
29,amount,129
output1.csv
id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
6,enterpriseid
7,source
8,transactionuid
9,type
10,isotransactiontype
11,status
12,terminalid
13,merchantid
14,merchantname
15,settlementbatchid
16,errordescription
17,referencetransactions
18,createdat
19,updatedat
20,__v
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
106,enterpriseId
107,source
108,transactionUid
109,type
110,isoTransactionType
111,status
112,terminalId
113,merchantId
114,merchantName
115,settlementBatchId
116,errorDescription
117,referenceTransactions
118,createdAt
119,updatedAt
120,__v
output2.csv
id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
6,enterpriseid
7,source
8,transactionuid
9,type
10,isotransactiontype
11,status
12,terminalid
13,merchantid
14,merchantname
15,settlementbatchid
16,errordescription
17,referencetransactions
18,createdat
19,updatedat
20,__v
id,col_name
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
106,enterpriseId
107,source
108,transactionUid
109,type
110,isoTransactionType
111,status
112,terminalId
113,merchantId
114,merchantName
115,settlementBatchId
116,errorDescription
117,referenceTransactions
118,createdAt
119,updatedAt
120,__v
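I don't see anywhere in your code where you convert the col_name values to lowercase, but you definitely need to do that to get matches; for example, rawrequest is not equal to rawRequest.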
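To make the point concrete, here is a minimal sketch of why Method 1 only returned a handful of rows. The two small frames are built in memory from a few of the rows in the question (the names left and right are just for illustration), so this is not the full data.

```python
import pandas as pd

# A few rows taken from the question, built in memory for illustration
left = pd.DataFrame({"id": [1, 2, 3],
                     "col_name": ["_id", "rawrequest", "rawresponse"]})
right = pd.DataFrame({"id": [101, 102, 103],
                      "col_name": ["_id", "rawRequest", "rawResponse"]})

# merge compares keys exactly, so only "_id" is found in both frames
exact = left.merge(right, on="col_name")
print(exact)

# Plain string comparison shows the same thing
print("rawrequest" == "rawRequest")          # False
print("rawrequest" == "rawRequest".lower())  # True
```

The overlapping id column is what produces the id_x and id_y headers seen in output.csv.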
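I got your desired output as follows.
First, since you mention they're CSVs with the column names id and col_name, I'm assuming your input files actually look like this: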
Input1.csv:
id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
...
Input2.csv:
id,col_name
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
...
After saving those files, I did:
import pandas as pd
csv1 = pd.read_csv("Input1.csv")
csv2 = pd.read_csv("Input2.csv")
# Method 1
print(csv2.head())
# Make a copy of csv2 col_name
csv2['col_name_original'] = csv2['col_name']
# Convert csv2 col_name to lowercase
csv2['col_name'] = csv2['col_name'].apply(lambda x: x.lower())
# Reorder csv2 columns
csv2 = csv2[['col_name', 'col_name_original', 'id']]
# Merge on col_name
merge_data = csv1.merge(csv2, on='col_name')
# Rename columns of resulting datasheet
merge_data.columns = ['id_', 'col_name', 'col_name', 'id_']
# Save merged data
merge_data.to_csv("output.csv", index=False)
Output:
id_,col_name,col_name,id_
1,_id,_id,101
2,rawrequest,rawRequest,102
3,rawresponse,rawResponse,103
4,products,products,104
5,_deleted,_deleted,105
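As a variation on the steps above, here is a rough sketch that leaves both original col_name columns untouched and merges on a temporary lowercase key instead. The key column, the _1/_2 suffixes, and the in-memory sample data are my own assumptions for illustration; with real files you would pass the file names to read_csv as before.

```python
import io
import pandas as pd

# In-memory stand-ins for Input1.csv and Input2.csv (first rows only)
input1 = io.StringIO("id,col_name\n1,_id\n2,rawrequest\n3,rawresponse\n")
input2 = io.StringIO("id,col_name\n101,_id\n102,rawRequest\n103,rawResponse\n")
csv1 = pd.read_csv(input1)
csv2 = pd.read_csv(input2)

# Temporary lowercase key in both frames; the original names stay as they are
csv1["key"] = csv1["col_name"].str.lower()
csv2["key"] = csv2["col_name"].str.lower()

# Merge on the key; suffixes keep the overlapping columns distinguishable
merged = csv1.merge(csv2, on="key", suffixes=("_1", "_2"))

# Select the four columns in the requested order and drop the helper key
result = merged[["id_1", "col_name_1", "col_name_2", "id_2"]]
print(result.to_csv(index=False))
```

Keeping distinct header names (rather than two columns both called col_name) makes the output easier to read back into pandas later; rename them before saving if you need the exact headers from the question.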
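Explanation of the parts you asked about:
Lambda functions
A lambda is basically a shorthand way of defining a function. The usual way to define a function that converts to lowercase in Python would be:
def convert_to_lowercase(x):
    return x.lower()
We can instead write lambda x: x.lower(), which defines the function in place so it can be passed directly to the apply method. Lambda expressions also exist in other programming languages, such as JavaScript.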
https://www.w3schools.com/python/python_lambda.asp
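A quick sketch showing that the two forms behave the same. The name to_lower and the sample list are only there for the demonstration; normally you would not assign a lambda to a name, you would pass it inline.

```python
# Regular function definition
def convert_to_lowercase(x):
    return x.lower()

# Equivalent lambda, bound to a name here only so it can be compared
to_lower = lambda x: x.lower()

names = ["rawRequest", "terminalId", "__v"]
print([convert_to_lowercase(n) for n in names])
print([to_lower(n) for n in names])
```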
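apply
When you call this method on a DataFrame column with a function as the argument, it returns the result of applying that function to every element in the column (in this case, it applies lambda x: x.lower() to each value in the column, and the assignment then overwrites col_name row by row with the results). If you're familiar with the concept, it is equivalent to map.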
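For comparison, a small sketch of three ways to lowercase a column that should give the same values: apply with a lambda, map with the built-in str.lower, and the vectorized .str accessor. The sample Series is made up from names in the question.

```python
import pandas as pd

col = pd.Series(["rawRequest", "terminalId", "createdAt"])

via_apply = col.apply(lambda x: x.lower())  # function applied to each element
via_map = col.map(str.lower)                # same idea, using Series.map
via_str = col.str.lower()                   # pandas string accessor

print(via_apply.tolist())
print(via_map.tolist())
print(via_str.tolist())
```

None of these change col itself; the column is only overwritten when you assign the result back, as in csv2['col_name'] = ... above.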