如何使用 Python 比较 Excel 中的多个列？

Question

我有一个 excel sheet，其值如下：

ProductID	Reference	Component used	DESCRIPTION
5325646	PD-42#23	ASDGE34#W-AL	HOUSING PLATE
4563756	ITA-COPPER	RAW AL	WASHER
4345432	ASDGE34#W-AL	ZK23456	3MM X 2MM X 10MM COPPER BAR
1245456	ASDGE34#W-AL	ZK66634	1MM X 2MM X 9MM COPPER BAR

我基本上想要做的是将“组件”中的任何内容与“参考”匹配，如果它们成真，就像在这个例子中一样，ASDGE34#W-AL 匹配“组件”和“参考”，所以我想复制详细信息并将其移动到“描述”之后。我想要达到的最终结果是这样的。

ProductID	Reference	Component used	DESCRIPTION	Child Component ID	DESCRIPTION
5325646	PD-42#23	ASDGE34#W-AL	HOUSING PLATE	ZK23456 & ZK654534	3MM X 2MM X 10MM COPPER BAR &1MM X 2MM X 9MM COPPER BAR
4563756	ITA-COPPER	ZAAGD0545	ALUMINIUM BAR
5656565	0788866-7#D	YN435611	DUSTER

有人可以帮忙吗？我一直在使用 Python 尝试在“使用的组件”和“参考”列上迭代一个循环，看看有什么匹配但到目前为止没有运气。

Answer 1

我有做类似事情的代码。我已将其更改为适合您的问题。

产品存储在 2 个单独的 dict list 中，通过 Reference 和 Component used 访问.您可以使用 Component used 值轻松地遍历 products_by_reference 字典，以获取具有匹配参考的产品列表。

请注意，此实现假设每一列 header 是唯一的，因为每一行都存储为字典，其中每一列 header 用作访问相应值的键，因此如果header 使用了两次，第一个值将被覆盖。因此，我在编写 header 列后手动将 header Child DESCRIPTION 重命名为 DESCRIPTION .

import os
from collections import defaultdict
import openpyxl

# Load the existing spreadsheet.
wb = openpyxl.load_workbook(filename="Products.xlsx")
ws = wb.worksheets[0]

# Style used for the header row
bold_font = openpyxl.styles.Font(bold=True)
grey_fill = openpyxl.styles.fills.PatternFill("solid", fgColor="F8F9F9")

# Store the header column names and numbers.
# Key = column name, value = column number. e.g. "ProductID": 1
header_columns = {}
for column_num in range(1, ws.max_column + 1):
    column_name = ws.cell(row=1, column=column_num).value
    header_columns[column_name] = column_num


# Read in all the values, starting after the header row.
# Store each product as a dict, and store every product in 2 separate 
# dict of lists, with the keys being the reference and component used.
products_by_reference = defaultdict(list)
products_by_component = defaultdict(list)
for row_num in range(2, ws.max_row + 1):
    product = {}
    for column_name, column_num in header_columns.items():
        product[column_name] = ws.cell(row=row_num, column=column_num).value
    
    reference = product["Reference"]
    component = product["Component used"]
    products_by_reference[reference].append(product)
    products_by_component[component].append(product)

# Iterate over the products by component, and copy and join all the 
# components used and descriptions to the relevant product's dict.
for component, products in products_by_component.items():
    # Use dicts here instead of sets if preserving the order is important.
    child_components = set()
    child_descriptions = set()
    for product in products_by_reference[component]:
        child_components.add(product["Component used"])
        child_descriptions.add(product["DESCRIPTION"])
    
    child_components_string = " & ".join(child_components)
    child_descriptions_string = " & ".join(child_descriptions)
    for product in products:
        product["Child Component ID"] = child_components_string
        product["Child DESCRIPTION"] = child_descriptions_string
    
    # products_by_reference will be used later for the new spreadsheet, and 
    # we don't want to include any products that have a reference which is a 
    # component used by another product, since they have now been merged with 
    # that other product.
    del products_by_reference[component]

# Write the products to a new spreadsheet.
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Products"

# Freeze the header row.
ws.freeze_panes = "A2"

# Add the extra headers.
column_num = len(header_columns)
header_columns["Child Component ID"] = column_num + 1
header_columns["Child DESCRIPTION"] = column_num + 2

# Write the headers.
for column_num, column_name in enumerate(header_columns, start=1):
    cell = ws.cell(row=1, column=column_num, value=column_name)
    cell.font = bold_font
    cell.fill = grey_fill

# This implementation doesn't allow there to be duplicate headers, as the 
# headers are used as keys to retrieve values from the product dicts. So the 
# the last header is manually changed from "Child DESCRIPTION" to 
# "DESCRIPTION" after it has been written to the sheet.
cell.value = "DESCRIPTION"

# Flatten the dictionary of lists, and write every product to the spreadsheet.
products = [product for products_list in products_by_reference.values() for product in products_list]
for row_num, product in enumerate(products, start=2):
    for column_name, column_num in header_columns.items():
        cell = ws.cell(row=row_num, column=column_num, value=product.get(column_name))

# Save the spreadsheet to file, then launch it.
filename = "Merged products.xlsx"
wb.save(filename)
os.startfile(filename)

如何使用 Python 比较 Excel 中的多个列？

How do you compare multiple columns in Excel using Python?

python

excel

automation