Compare 2 Excel files in Python by keeping 1 column of 1 sheet fixed and comparing it with the same column in another file
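We have 2 Excel files, one with 7.5k records and the other with 7k records. We need to compare the data by keeping one specific column of one sheet fixed and comparing it against the other sheet.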
For example, sheet1:

Emp_ID | Name | Phone | Address
-------|------|-------|--------
1      | A    | 123   | ABC
2      | B    | 456   | CBD
3      | C    | 789   | S

For example, sheet2:

Emp_ID | Name | Phone | Address
-------|------|-------|--------
1      | A    | 123   | ABC
3      | C    | 789   | S

The Python comparison should be keyed on Emp_ID, and Emp_ID=2 should be reported as missing when Emp_ID is passed as the argument while executing the script.
I am trying to do this with the xlrd module, but it only compares cell by cell instead of freezing one column and then comparing that row against the other Excel file.

    import xlrd
    from itertools import zip_longest

    def compareexcel(oldSheet, newSheet):
        rowb2 = xlrd.open_workbook(oldSheet)
        rowb1 = xlrd.open_workbook(newSheet)
        sheet1 = rowb1.sheet_by_index(0)
        sheet2 = rowb2.sheet_by_index(0)
        # Position-based: row N of one sheet is compared with row N of the other
        for rownum in range(max(sheet1.nrows, sheet2.nrows)):
            # Only compare rows that exist in both sheets (avoids an IndexError)
            if rownum < sheet1.nrows and rownum < sheet2.nrows:
                row_rb1 = sheet1.row_values(rownum)
                row_rb2 = sheet2.row_values(rownum)
                for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                    if c1 != c2:
                        print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))

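I have written a function that searches for the column value in the other sheet, and the comparison in the compare function is then based on the result of that search.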

    import xlrd
    from itertools import zip_longest

    def search(sheet2, s):
        # Return the index of the row whose first column equals s, or None
        for row in range(sheet2.nrows):
            if s == sheet2.cell(row, 0).value:
                return row
        return None

    def compare(oldPerPaxSheet, newPerPaxSheet):
        rb1 = xlrd.open_workbook(oldPerPaxSheet)
        rb2 = xlrd.open_workbook(newPerPaxSheet)
        sheet1 = rb1.sheet_by_index(0)
        sheet2 = rb2.sheet_by_index(0)
        for rownum in range(max(sheet1.nrows, sheet2.nrows)):
            if rownum < sheet1.nrows:
                row_rb1 = sheet1.row_values(rownum)
                print("row_rb1 : ", row_rb1)
                # Look up this row's key (first column) in the other sheet
                search_str = sheet1.cell(rownum, 0).value
                r = search(sheet2, search_str)
                if r is not None:
                    row_rb2 = sheet2.row_values(r)
                    for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                        if c1 != c2:
                            print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))
                else:
                    print("Row does not exist in the other sheet")
            else:
                print("Row {} missing".format(rownum+1))

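One way to avoid scanning the second sheet once per row is to index each sheet by its key column first and then compare by key. The following is only a sketch of that idea on plain lists of row values (the shape `sheet.row_values(...)` returns); the helper names `index_rows` and `compare_rows` and the `key_col` parameter are made up for illustration, and the sample rows are taken from the tables above with one phone value altered on purpose so that a difference shows up. It assumes the key values are unique within each sheet.

```python
from itertools import zip_longest


def index_rows(rows, key_col=0):
    """Map each row's key (e.g. Emp_ID) to the full row. Hypothetical helper."""
    return {row[key_col]: row for row in rows}


def compare_rows(old_rows, new_rows, key_col=0):
    """Return (missing_in_new, missing_in_old, cell_differences)."""
    old_by_key = index_rows(old_rows, key_col)
    new_by_key = index_rows(new_rows, key_col)

    # Keys present on one side only
    missing_in_new = sorted(old_by_key.keys() - new_by_key.keys())
    missing_in_old = sorted(new_by_key.keys() - old_by_key.keys())

    # For keys present on both sides, compare the rows cell by cell
    differences = []
    for key in sorted(old_by_key.keys() & new_by_key.keys()):
        pairs = zip_longest(old_by_key[key], new_by_key[key])
        for colnum, (c1, c2) in enumerate(pairs):
            if c1 != c2:
                differences.append((key, colnum, c1, c2))
    return missing_in_new, missing_in_old, differences


# Sample data rows (header excluded); the 780 is a deliberate change.
old_rows = [[1, "A", 123, "ABC"], [2, "B", 456, "CBD"], [3, "C", 789, "S"]]
new_rows = [[1, "A", 123, "ABC"], [3, "C", 780, "S"]]

missing_in_new, missing_in_old, differences = compare_rows(old_rows, new_rows)
print(missing_in_new)  # [2]
print(missing_in_old)  # []
print(differences)     # [(3, 2, 789, 780)]
```

Each differences entry is (key, zero-based column, old value, new value). With 7k to 7.5k rows the dictionary lookup keeps the whole comparison roughly linear in the number of rows, whereas calling a linear search for every row is quadratic.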
You can do this quite easily with pandas.read_excel.
I'll make 2 DataFrames with Emp_ID as the index.
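
    import pandas as pd
    # excel_filename, old_sheet and new_sheet are placeholders for your own file and sheet names.
    # The keyword is sheet_name in current pandas (older releases spelled it sheetname).
    sheets = pd.read_excel(excel_filename, sheet_name=[old_sheet, new_sheet], index_col=0)
    sheet1 = sheets[old_sheet]
    sheet2 = sheets[new_sheet]
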
I added some rows to make the differences clearer.

    sheet1

            Name  Phone Address
    Emp_ID
    1          A    123     ABC
    2          B    456     CBD
    3          C    789       S
    5          A    123     ABC

    sheet2

            Name  Phone Address
    Emp_ID
    1          A    123     ABC
    3          C    789       S
    4          D     12       A
    5          E    123     ABC

Calculating the missing Emp_ID values then becomes very easy:

    missing_in_1 = set(sheet2.index) - set(sheet1.index)
    missing_in_2 = set(sheet1.index) - set(sheet2.index)
    missing_in_1, missing_in_2

    ({4}, {2})

So sheet1 lacks Emp_ID 4, which is present in sheet2, and sheet2 is missing 2, as expected.
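As an alternative to converting to Python sets, pandas indexes have a `difference` method that stays inside pandas and returns another index, which can then be passed to `.loc` to pull the full missing rows. A small sketch, rebuilding the two example frames in memory rather than reading them from Excel, and assuming Emp_ID is unique in each sheet:

```python
import pandas as pd

# In-memory stand-ins for the two sheets shown above
sheet1 = pd.DataFrame(
    {"Name": ["A", "B", "C", "A"],
     "Phone": [123, 456, 789, 123],
     "Address": ["ABC", "CBD", "S", "ABC"]},
    index=pd.Index([1, 2, 3, 5], name="Emp_ID"),
)
sheet2 = pd.DataFrame(
    {"Name": ["A", "C", "D", "E"],
     "Phone": [123, 789, 12, 123],
     "Address": ["ABC", "S", "A", "ABC"]},
    index=pd.Index([1, 3, 4, 5], name="Emp_ID"),
)

# Index.difference returns the labels in the caller that are absent from the argument
missing_in_1 = sheet2.index.difference(sheet1.index)
missing_in_2 = sheet1.index.difference(sheet2.index)

print(missing_in_1.tolist())  # [4]
print(missing_in_2.tolist())  # [2]

# The resulting index can be used to select the complete rows that are missing
rows_missing_in_2 = sheet1.loc[missing_in_2]
print(rows_missing_in_2)
```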
Then, to look for the differences, we do an inner join of the 2 sheets:

    combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True, suffixes=('_1', '_2'))
    combined

           Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
    Emp_ID
    1           A      123       ABC      A      123       ABC
    3           C      789         S      C      789         S
    5           A      123       ABC      E      123       ABC

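If you would rather get the missing and the shared Emp_ID values from a single merge, `pd.merge` also accepts `how="outer"` together with `indicator=True`, which adds a `_merge` column saying whether each key came from the left frame only, the right frame only, or both. This is a sketch of that variant on in-memory copies of the example sheets, not a replacement for the inner join above:

```python
import pandas as pd

# In-memory stand-ins for the two sheets shown above
sheet1 = pd.DataFrame(
    {"Name": ["A", "B", "C", "A"],
     "Phone": [123, 456, 789, 123],
     "Address": ["ABC", "CBD", "S", "ABC"]},
    index=pd.Index([1, 2, 3, 5], name="Emp_ID"),
)
sheet2 = pd.DataFrame(
    {"Name": ["A", "C", "D", "E"],
     "Phone": [123, 789, 12, 123],
     "Address": ["ABC", "S", "A", "ABC"]},
    index=pd.Index([1, 3, 4, 5], name="Emp_ID"),
)

# Outer join keeps every Emp_ID; the _merge column records where each one was found
outer = pd.merge(
    sheet1, sheet2,
    how="outer", left_index=True, right_index=True,
    suffixes=("_1", "_2"), indicator=True,
)
status = outer["_merge"].astype(str)

only_in_1 = status[status == "left_only"].index.tolist()
only_in_2 = status[status == "right_only"].index.tolist()
in_both = sorted(status[status == "both"].index.tolist())

print(only_in_1)  # [2]
print(only_in_2)  # [4]
print(in_both)    # [1, 3, 5]
```

Rows that exist on one side only have NaN in the other side's columns, so any value comparison should still be restricted to the rows marked both.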
and iterate over the columns of sheet1 to look for differences, saving them in a dict:
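
    differences = {}
    for column in sheet1.columns:
        # Boolean Series: True where the two sheets disagree in this column
        diff = combined[column+'_1'] != combined[column+'_2']
        if diff.any():
            # Keep the Emp_ID values of the rows that differ
            differences[column] = list(combined[diff].index)
    differences

    {'Name': [5]}
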
If you want the full rows for each difference, change the last line of the loop to differences[column] = combined[diff]

    differences

    {'Name':
            Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
    Emp_ID
    5            A      123       ABC      E      123       ABC}

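One thing to watch with `!=` on real spreadsheets: empty cells are read as NaN, and NaN != NaN is True, so two cells that are blank in both sheets would be flagged as different. A possible way around this is to build one boolean mask over the shared rows and exclude cells that are missing on both sides. The sketch below assumes unique Emp_ID values and the same column names in both sheets; the variable names `left`, `right`, `mask` and `changed` are just illustrative.

```python
import pandas as pd

# In-memory stand-ins for the two sheets shown above
sheet1 = pd.DataFrame(
    {"Name": ["A", "B", "C", "A"],
     "Phone": [123, 456, 789, 123],
     "Address": ["ABC", "CBD", "S", "ABC"]},
    index=pd.Index([1, 2, 3, 5], name="Emp_ID"),
)
sheet2 = pd.DataFrame(
    {"Name": ["A", "C", "D", "E"],
     "Phone": [123, 789, 12, 123],
     "Address": ["ABC", "S", "A", "ABC"]},
    index=pd.Index([1, 3, 4, 5], name="Emp_ID"),
)

# Restrict both frames to the Emp_ID values they share, in the same column order
common = sheet1.index.intersection(sheet2.index)
left = sheet1.loc[common]
right = sheet2.loc[common, sheet1.columns]

# A cell counts as changed when the values differ and are not both missing
mask = left.ne(right) & ~(left.isna() & right.isna())

# Collect (Emp_ID, column, value in sheet1, value in sheet2) for every changed cell
changed = [
    (emp_id, column, left.at[emp_id, column], right.at[emp_id, column])
    for column in mask.columns
    for emp_id in mask.index[mask[column].to_numpy()].tolist()
]
print(changed)  # [(5, 'Name', 'A', 'E')]
```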