Removing multiple headers in a single Excel sheet

This is similar to an earlier question.

I want to remove the headers from an Excel sheet.

Using the same example, but edited:

Company ABC

Account Name 1
Account No.1
Description    group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
totals                               sum values

Account Name 2
Account No.2
Description    group     pair          amount    ...  result
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
totals                               sum values

Sales
00234
Description    group     pair          amount    ...  result
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
totals                               sum values

Inventory
00012345
Description    group     pair          amount    ...  result
value7         value7    value7        value7    ...  value7
value8         value8    value8        value8    ...  value8
value9         value9    value9        value9    ...  value9
totals                               sum values

I would like to concatenate it into this:

cabinet_name   group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
value7         value7    value7        value7    ...  value7
value8         value8    value8        value8    ...  value8
value9         value9    value9        value9    ...  value9

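I have managed to remove the top header by skipping rows, e.g. (skiprows = 4). However, this still leaves the other headers and the totals, which end up appended like this: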

cabinet_name   group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
totals                               sum values
Account Name 2
Account No.2
cabinet_name   group     pair          amount    ...  result
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
totals                               sum values
Sales
00234
Description    group     pair          amount    ...  result
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
totals                               sum values
Inventory
00012345
Description    group     pair          amount    ...  result
value7         value7    value7        value7    ...  value7
value8         value8    value8        value8    ...  value8
value9         value9    value9        value9    ...  value9
totals                               sum values

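I would appreciate it if someone could show me how to clean this sheet with pandas, since everything I have found online only works when the Excel sheet contains a single table.

If I have left anything out, please let me know and I will gladly edit the question.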


I think this may be useful:

My usual process for cleaning the file in Excel is to first delete the first 4 rows, leaving only

cabinet_name   group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
totals         *blank*   *blank*       sum values

Account Name 2
Account No.2
cabinet_name   group     pair          amount    ...  result
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
totals         *blank*   *blank*       sum values

Account Name 3
Account No.3
cabinet_name   group     pair          amount    ...  result
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
totals         *blank*   *blank*       sum values

I would then filter that column for blank values and delete those rows.
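The manual process above could be sketched in pandas roughly as follows. This is only an illustration on a made-up frame that mimics the sheet after the first 4 rows are skipped; the sample values and the choice of "group" as the column to check for blanks are assumptions, and the extra step that drops the repeated header rows is not part of the manual process described above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet after skipping the first 4 rows.
cols = ["cabinet_name", "group", "pair", "amount", "result"]
rows = [
    ["value1", "g1", "p1", 10, "r1"],
    ["value2", "g2", "p2", 20, "r2"],
    ["totals", np.nan, np.nan, 30, np.nan],
    [np.nan] * 5,                                   # blank separator row
    ["Account Name 2", np.nan, np.nan, np.nan, np.nan],
    ["Account No.2", np.nan, np.nan, np.nan, np.nan],
    cols,                                           # repeated header row
    ["value3", "g3", "p3", 30, "r3"],
    ["value4", "g4", "p4", 40, "r4"],
    ["totals", np.nan, np.nan, 70, np.nan],
]
df = pd.DataFrame(rows, columns=cols)

# Keep only rows where the "group" cell is not blank; this removes the
# totals rows, the blank rows and the account name/number rows.
cleaned = df[df["group"].notna()]

# Assumption: a repeated header row carries the column name itself in the
# "group" cell, so it can be dropped by comparing against that name.
cleaned = cleaned[cleaned["group"] != "group"].reset_index(drop=True)

print(cleaned)
```

Whether "group" is the right column to test depends on the real file; any column that is filled on data rows and blank on the totals and account rows would serve.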

This is the result of
print(df_total.head(8).to_dict())
import datetime
from numpy import nan
{'Date': {0: nan, 1: datetime.datetime(2021, 1, 1, 0, 0), 2: datetime.datetime(2021, 1, 1, 0, 0), 3: datetime.datetime(2021, 1, 29, 0, 0), 4: datetime.datetime(2021, 1, 31, 0, 0), 5: 'Totals', 6: 'Net difference', 7: nan},
'Journal number': {0: nan, 1: 'AX009473', 2: 'AX009473', 3: 'AX003312', 4: 'AX009641', 5: nan, 6: nan, 7: nan}, 
'Voucher': {0: nan, 1: 'TSPN-2021-3', 2: 'TSPN-2021-3', 3: 'GBJ-2021-1', 4: 'VIT-2021-1', 5: nan, 6: nan, 7: nan}, 
'Posting type': {0: nan, 1: nan, 2: nan, 3: 'Ledger journal', 4: 'Ledger journal', 5: nan, 6: nan, 7: nan}, 
'Ledger account': {0: nan, 1: '00388211', 2: '00388211', 3: '00388211', 4: '00388211', 5: nan, 6: nan, 7: nan}, 
'Description': {0: nan, 1: nan, 2: nan, 3: 'DISBERSMENT FOR PETROL', 4: 'TAXI FAIR', 5: nan, 6: nan, 7: nan}, 
'Unnamed: 6': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 
'Unnamed: 7': {0: nan, 1: 'SGD', 2: 'USD', 3: 'SGD', 4: 'SGD', 5: nan, 6: nan, 7: nan}, 
'Amount in transaction currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: nan, 6: nan, 7: nan}, 'Unnamed: 9': {0: 'Credit', 1: 0, 2: 25, 3: 52, 4: 0, 5: nan, 6: nan, 7: nan}, 
'Amount in accounting currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: 18.55, 6: nan, 7: nan}, 'Unnamed: 11': {0: 'Credit', 1: 0, 2: 33.42, 3: 52, 4: 0, 5: 85.42, 6: 66.87, 7: nan}, 
'Amount in reporting currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: 18.55, 6: nan, 7: nan}, 'Unnamed: 13': {0: 'Credit', 1: 0, 2: 33.42, 3: 52, 4: 0, 5: 85.42, 6: nan, 7: nan}, 
'Unnamed: 14': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 'Unnamed: 15': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 'Unnamed: 16': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}}

A sample I created of the Excel file I have

VS

The Excel data after conversion

You can try removing the unwanted rows like this.

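# na=False treats blank (NaN) cells as non-matches, so the mask stays boolean and can be inverted with ~
df = df[~df['first_column_name'].str.startswith(('Company name', 'Account Name', 'Account No.', 'cabinet_name'), na=False)]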
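To try the idea without the original workbook, here is a small self-contained sketch. The sample frame is invented for illustration, the "Company" prefix is shortened to match the "Company ABC" row from the question, and "totals" is added to the prefixes as an assumption, since the totals rows would otherwise survive the filter.

```python
import numpy as np
import pandas as pd

# Made-up data imitating the first column of the sheet plus one value column.
df = pd.DataFrame({
    "first_column_name": [
        "Company ABC", "Account Name 1", "Account No.1", "cabinet_name",
        "value1", "value2", "totals", np.nan, "Account Name 2", "value3",
    ],
    "amount": [np.nan, np.nan, np.nan, "amount", 1, 2, 3, np.nan, np.nan, 4],
})

# Prefixes of rows to discard (an assumption; adjust to the real sheet).
unwanted = ("Company", "Account Name", "Account No.", "cabinet_name", "totals")

# na=False makes blank cells count as "does not start with", which keeps
# the mask purely boolean.
mask = df["first_column_name"].str.startswith(unwanted, na=False)

# Drop the unwanted rows and the fully blank separator rows.
kept = df[~mask & df["first_column_name"].notna()]

print(kept)
```

Note that account numbers such as 00234 have no fixed prefix, so rows like that would need a different test, for example a blank check on another column.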

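Presumably the first 3 rows, and the rows containing "totals" and "sum values", have empty cells, so dropna should remove those rows. Then drop_duplicates with the keep=False parameter should remove the repeated column names: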

import numpy as np
out = df.replace('', np.nan).replace(' ', np.nan).dropna().drop_duplicates(keep=False)

I could not test it, because read_clipboard complains about the format of your data.
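For anyone who wants to try the dropna / drop_duplicates idea, the sketch below runs it on an invented frame with three blocks. All sample values are assumptions. Two caveats follow from how keep=False works: it only removes the repeated header rows when at least two of them remain in the data, and it also removes genuine data rows that happen to be exact duplicates of each other.

```python
import numpy as np
import pandas as pd

cols = ["cabinet_name", "group", "pair", "amount", "result"]
blank = [np.nan] * 4

# Hypothetical sheet contents after the top header block has been skipped.
rows = [
    ["value1", "g1", "p1", 10, "r1"],
    ["totals", "", " ", 10, np.nan],        # blanks stored as '' and ' '
    ["Account Name 2", *blank],
    ["Account No.2", *blank],
    cols,                                   # repeated header row
    ["value2", "g2", "p2", 20, "r2"],
    ["totals", np.nan, np.nan, 20, np.nan],
    ["Sales", *blank],
    ["00234", *blank],
    cols,                                   # repeated header row
    ["value3", "g3", "p3", 30, "r3"],
    ["totals", np.nan, np.nan, 30, np.nan],
]
df = pd.DataFrame(rows, columns=cols)

out = (
    df.replace("", np.nan)          # normalise empty strings to NaN
      .replace(" ", np.nan)         # normalise single spaces to NaN
      .dropna()                     # drop every row with any blank cell
      .drop_duplicates(keep=False)  # drop all copies of the repeated header
      .reset_index(drop=True)
)

print(out)
```

Because dropna() with no arguments drops a row if any cell is blank, this only suits sheets where real data rows are fully populated; the ledger sample in the question has blank cells in valid rows, so a subset= argument would be needed there.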

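So we can modify the column names slightly by appending the first row to them (this fixes the wanted columns whose names start with "Unnamed"). Then filter out the column names that still start with "Unnamed" (the remaining unwanted ones), and finally use the "Date" column to build a mask and filter the DataFrame: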

df = df.rename(columns={**{f'Unnamed: {i}': j for i,j in zip((9,11,13), 
                                                             ('Amount in transaction currency',
                                                              'Amount in accounting currency',
                                                              'Amount in reporting currency'))}, 
                        **{'Unnamed: 7': 'Currency'}})

df.columns = [f'{col}_{first}' if first==first else col for col, first in zip(df.columns, df.loc[0])]
df = df[df.columns[~df.columns.str.startswith('Unnamed')]]
date_filter = df['Date'].apply(isinstance, args=(datetime.datetime,))
df = df[date_filter]

The same code, with an explicit loop that builds the list used to rename the columns:

cols = {'Unnamed: 7': 'Currency', 'Unnamed: 9': 'Amount in transaction currency', 
        'Unnamed: 11': 'Amount in accounting currency', 'Unnamed: 13': 'Amount in reporting currency'}
df = df.rename(columns=cols)

another_cols = []
for col, first in zip(df.columns, df.loc[0]):
    if first==first:
        another_cols.append(f'{col}_{first}')
    else:
        another_cols.append(col)
df.columns = another_cols
df = df[df.columns[~df.columns.str.startswith('Unnamed')]]
date_filter = df['Date'].apply(isinstance, args=(datetime.datetime,))
df = df[date_filter]

Output:

                  Date Journal number      Voucher    Posting type  \
1  2021-01-01 00:00:00       AX009473  TSPN-2021-3             NaN   
2  2021-01-01 00:00:00       AX009473  TSPN-2021-3             NaN   
3  2021-01-29 00:00:00       AX003312   GBJ-2021-1  Ledger journal   
4  2021-01-31 00:00:00       AX009641   VIT-2021-1  Ledger journal   

  Ledger account             Description Currency  \
1       00388211                     NaN      SGD   
2       00388211                     NaN      USD   
3       00388211  DISBERSMENT FOR PETROL      SGD   
4       00388211               TAXI FAIR      SGD   

  Amount in transaction currency_Debit Amount in transaction currency_Credit  \
1                                13.55                                     0   
2                                    0                                    25   
3                                    0                                    52   
4                                    5                                     0   

  Amount in accounting currency_Debit Amount in accounting currency_Credit  \
1                               13.55                                    0   
2                                   0                                33.42   
3                                   0                                   52   
4                                   5                                    0   

  Amount in reporting currency_Debit Amount in reporting currency_Credit  
1                              13.55                                   0  
2                                  0                               33.42  
3                                  0                                  52  
4                                  5                                   0
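
Both versions above rely on `first == first` being False for NaN, because NaN never equals itself. A slightly more explicit way to write the same check is `pd.notna`. The sketch below applies it to a cut-down, made-up frame with the same layout as the question's dict; the reduced column set and the lambda-based date mask are illustrative choices, not the code from the answer.

```python
import datetime

import numpy as np
import pandas as pd

# Reduced, hypothetical version of the frame from the question.
df = pd.DataFrame({
    "Date": [np.nan, datetime.datetime(2021, 1, 1), "Totals"],
    "Unnamed: 6": [np.nan, np.nan, np.nan],
    "Unnamed: 7": [np.nan, "SGD", np.nan],
    "Amount in transaction currency": ["Debit", 13.55, np.nan],
    "Unnamed: 9": ["Credit", 0, np.nan],
})

# Give the meaningful "Unnamed" columns their intended names.
df = df.rename(columns={
    "Unnamed: 7": "Currency",
    "Unnamed: 9": "Amount in transaction currency",
})

# Append the first-row label (Debit/Credit) where one exists.
# pd.notna(first) is the explicit form of the `first == first` NaN test.
df.columns = [
    f"{col}_{first}" if pd.notna(first) else col
    for col, first in zip(df.columns, df.iloc[0])
]

# Drop the columns that are still unnamed, then keep only rows whose
# "Date" cell is an actual datetime (removes the label and totals rows).
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
is_date = df["Date"].apply(lambda v: isinstance(v, datetime.datetime))
df = df[is_date]

print(df)
```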