Is there a way to read multiple Excel tabs/sheets from a single xlsx into multiple dataframes, with each dataframe named after its sheet?

Question

I'm not good at Python, so please forgive the question, but I need to create a function that does the following:

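Create multiple dataframes from the multiple Excel tabs/sheets in a single xlsx file, each named after its sheet.
It should concatenate the values of the columns and check that there are no duplicate values.
If a concatenated value is duplicated, another column should report that as yes/no.
Then all the dataframes should be written to a separate workbook as different worksheets. The values in () are the column names, shown for clarity.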

Example:

sheet1

(a) (b) (c) (d)
a1  b1  c1  d1
a2  b2  c2  d2

Result:

(c) (d) (concat) (is duplicate)
c1  d1  c1_d1     no
c2  d2  c2_d2     no

sheet2

(a) (b) (e) (f)
a3  b3  e1  f1
a4  b4  e1  f1
a5  b5  e2  f2
a6  b6  e4  f4
a7  a8  e4  f5

Result:

(e) (f) (concat) (has duplicate)
e1 f1 e1_f1 yes
e2 f2 e2_f2 no
e4 f4 e4_f4 no
e4 f5 e4_f5 no

Answer 1

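First, to read an Excel file that contains multiple sheets, use the pandas ExcelFile class.

For example: xls = pd.ExcelFile(filepath)

Then, once the workbook has been opened as above, you can use the read_excel function to read each sheet into a separate dataframe, for example:

df1 = pd.read_excel(xls, sheet_name='sheet_name_1')
df2 = pd.read_excel(xls, sheet_name='sheet_name_2')

Substitute the actual sheet names to read each sheet into its own dataframe.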
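When the sheet names are not known in advance, one option is to collect the dataframes in a dict keyed by sheet name instead of creating one variable per sheet. The following is a minimal sketch of that idea; the helper name read_all_sheets is hypothetical, and reading .xlsx files assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def read_all_sheets(path):
    """Return a dict mapping each sheet name to its DataFrame (hypothetical helper)."""
    # ExcelFile opens the workbook once; sheet_names lists the tabs in workbook order.
    with pd.ExcelFile(path) as workbook:
        return {name: workbook.parse(name) for name in workbook.sheet_names}
```

Calling read_all_sheets('input.xlsx')['sheet1'] would then give the dataframe for the tab named sheet1. Passing sheet_name=None to pd.read_excel returns the same kind of dict in a single call.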

I didn't understand the second half of your question; please explain it in more detail.

Answer 2

Here you go:

import pandas as pd

with pd.ExcelWriter('output.xlsx') as output:
    # sheet_name=None returns a dict that maps each sheet name to its dataframe
    for sheet_name, df in pd.read_excel('input.xlsx', sheet_name=None).items():
        # Drop the columns that are not part of the key ('a' and 'b' in the example)
        df = df.drop(['a', 'b'], axis=1)
        # Cast to str first so the join also works on numeric cells
        df['concat'] = df.astype(str).apply(lambda row: '_'.join(row), axis=1)
        # keep=False flags every occurrence of a repeated key
        repeated = df.duplicated(subset=['concat'], keep=False)
        df['is_duplicate'] = repeated.map({True: 'Yes', False: 'No'})
        # Keep a single row per concatenated key
        df = df.drop_duplicates(keep='last', subset=['concat'])
        df.to_excel(output, sheet_name=sheet_name, index=False)

Check output.xlsx for the result.
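The concatenate-and-flag step can also be tried on its own, without any Excel file, using the sheet2 data from the question. This is only a sketch: the function name flag_duplicates and its parameters are made up for illustration, and it assumes the key columns are passed in explicitly rather than found by dropping 'a' and 'b'.

```python
import pandas as pd


def flag_duplicates(df, key_columns, sep="_"):
    """Return one row per concatenated key with a yes/no duplicate flag."""
    out = df[key_columns].copy()
    # Cast to str so numeric or missing cells do not break the join.
    out["concat"] = out[key_columns].astype(str).apply(sep.join, axis=1)
    # keep=False marks every occurrence of a repeated key, not only the later ones.
    repeated = out["concat"].duplicated(keep=False)
    out["is_duplicate"] = repeated.map({True: "yes", False: "no"})
    # Collapse repeated keys to a single row, as in the expected result.
    return out.drop_duplicates(subset="concat").reset_index(drop=True)


# Data copied from sheet2 in the question.
sheet2 = pd.DataFrame({
    "a": ["a3", "a4", "a5", "a6", "a7"],
    "b": ["b3", "b4", "b5", "b6", "a8"],
    "e": ["e1", "e1", "e2", "e4", "e4"],
    "f": ["f1", "f1", "f2", "f4", "f5"],
})
result = flag_duplicates(sheet2, ["e", "f"])
print(result.to_string(index=False))
```

For this input the result has four rows, with e1_f1 flagged yes and e2_f2, e4_f4 and e4_f5 flagged no, which matches the expected output in the question.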

python

dataframe

python-3.x

pandas

pyspark-dataframes