Removing multiple headers in a single Excel sheet
This is similar to another question: I want to remove the headers from an Excel sheet.
Using the same example, but edited:
Company ABC
Account Name 1
Account No.1
Description group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
totals sum values
Account Name 2
Account No.2
Description group pair amount ... result
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
totals sum values
Sales
00234
Description group pair amount ... result
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
totals sum values
Inventory
00012345
Description group pair amount ... result
value7 value7 value7 value7 ... value7
value8 value8 value8 value8 ... value8
value9 value9 value9 value9 ... value9
totals sum values
I want to concatenate it into this:
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
value7 value7 value7 value7 ... value7
value8 value8 value8 value8 ... value8
value9 value9 value9 value9 ... value9
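I have managed to remove the top header by skipping rows, e.g. (skiprows = 4). However, that still leaves the other headers and the totals, which end up appended like this: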
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
totals sum values
Account Name 2
Account No.2
cabinet_name group pair amount ... result
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
totals sum values
Sales
00234
Description group pair amount ... result
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
totals sum values
Inventory
00012345
Description group pair amount ... result
value7 value7 value7 value7 ... value7
value8 value8 value8 value8 ... value8
value9 value9 value9 value9 ... value9
totals sum values
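I would appreciate it if someone could tell me how to clean this sheet with pandas, because everything I have found online only works when the Excel sheet contains a single table.
If I have left anything out, please let me know and I will gladly edit the question.
I think this might be useful:
My usual process for cleaning the file in Excel is to first delete the first 4 rows, leaving only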
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
totals *blank* *blank* sum values
Account Name 2
Account No.2
cabinet_name group pair amount ... result
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
totals *blank* *blank* sum values
Account Name 3
Account No.3
cabinet_name group pair amount ... result
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
totals *blank* *blank* sum values
Then I filter the group or pair column for blank values and delete those rows.
This is the result of print(df_total.head(8).to_dict()):
import datetime
from numpy import nan
{'Date': {0: nan, 1: datetime.datetime(2021, 1, 1, 0, 0), 2: datetime.datetime(2021, 1, 1, 0, 0), 3: datetime.datetime(2021, 1, 29, 0, 0), 4: datetime.datetime(2021, 1, 31, 0, 0), 5:
'Totals', 6: 'Net difference', 7: nan},
'Journal number': {0: nan, 1: 'AX009473', 2: 'AX009473', 3: 'AX003312', 4: 'AX009641', 5: nan, 6: nan, 7: nan},
'Voucher': {0: nan, 1: 'TSPN-2021-3', 2: 'TSPN-2021-3', 3: 'GBJ-2021-1', 4: 'VIT-2021-1', 5: nan, 6: nan, 7: nan},
'Posting type': {0: nan, 1: nan, 2: nan, 3: 'Ledger journal', 4: 'Ledger journal', 5: nan, 6: nan, 7: nan},
'Ledger account': {0: nan, 1: '00388211', 2: '00388211', 3: '00388211', 4: '00388211', 5: nan, 6: nan, 7: nan},
'Description': {0: nan, 1: nan, 2: nan, 3: 'DISBERSMENT FOR PETROL', 4: 'TAXI FAIR', 5: nan, 6: nan, 7: nan},
'Unnamed: 6': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan},
'Unnamed: 7': {0: nan, 1: 'SGD', 2: 'USD', 3: 'SGD', 4: 'SGD', 5: nan, 6: nan, 7: nan},
'Amount in transaction currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: nan, 6: nan, 7: nan}, 'Unnamed: 9': {0: 'Credit', 1: 0, 2: 25, 3: 52, 4: 0, 5: nan, 6: nan, 7: nan},
'Amount in accounting currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: 18.55, 6: nan, 7: nan}, 'Unnamed: 11': {0: 'Credit', 1: 0, 2: 33.42, 3: 52, 4: 0, 5: 85.42, 6: 66.87, 7: nan},
'Amount in reporting currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: 18.55, 6: nan, 7: nan}, 'Unnamed: 13': {0: 'Credit', 1: 0, 2: 33.42, 3: 52, 4: 0, 5: 85.42, 6: nan, 7: nan},
'Unnamed: 14': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 'Unnamed: 15': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 'Unnamed: 16': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}}
I created a sample of the Excel file I have
vs.
an Excel file of the data after transformation
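You can try removing the unwanted rows like this (na=False makes empty cells count as non-matches, so the mask stays boolean):
df = df[~df['first_column_name'].str.startswith(('Company name','Account Name','Account No.','cabinet_name'), na=False)]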
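As a minimal sketch of the prefix filter above, the snippet below runs it on a made-up miniature of the sheet as it looks after skiprows=4. The sample values, the column names, and the extra 'totals' prefix are assumptions for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sheet after skiprows=4: the first header row
# has already become the column names, later headers are ordinary rows.
df = pd.DataFrame(
    {
        "cabinet_name": [
            "value1", "value2", "totals",
            "Account Name 2", "Account No.2", "cabinet_name",
            "value3", "value4", "totals",
        ],
        "group": [
            "value1", "value2", np.nan,
            np.nan, np.nan, "group",
            "value3", "value4", np.nan,
        ],
    }
)

# Prefixes that mark non-data rows; 'totals' is added here as an assumption.
unwanted = ("Company", "Account Name", "Account No.", "cabinet_name", "totals")

# na=False treats missing cells as "does not start with", keeping the mask boolean.
mask = df["cabinet_name"].str.startswith(unwanted, na=False)
clean = df[~mask].reset_index(drop=True)
print(clean)
```

A prefix list only catches labels it knows about. Account rows such as "Sales" or "00234" from the edited example would not match any of these prefixes and would need a different test, for instance on blank cells in the other columns.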
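Presumably the first 3 rows and the rows containing "totals" and "sum values" have empty cells, so dropna should remove those rows. Then drop_duplicates with the keep=False argument should remove the repeated column-name rows (along with any data rows that are exact duplicates of each other):
import numpy as np
out = df.replace('', np.nan).replace(' ', np.nan).dropna().drop_duplicates(keep=False)
I cannot test it because read_clipboard complains about the format of your data.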
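Since that suggestion was untested, here is a small sketch of it on invented data. The column names and rows are assumptions modelled on the example in the question, with blank cells represented as empty strings or single spaces.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the sheet after the top header was skipped.
columns = ["cabinet_name", "group", "pair", "amount"]
rows = [
    ["value1", "value1", "value1", "value1"],
    ["value2", "value2", "value2", "value2"],
    ["totals", "", " ", "sum values"],
    ["Account Name 2", "", "", ""],
    ["Account No.2", "", "", ""],
    ["cabinet_name", "group", "pair", "amount"],   # repeated header
    ["value3", "value3", "value3", "value3"],
    ["totals", "", "", "sum values"],
    ["Account Name 3", "", "", ""],
    ["Account No.3", "", "", ""],
    ["cabinet_name", "group", "pair", "amount"],   # repeated header
    ["value5", "value5", "value5", "value5"],
    ["totals", "", "", "sum values"],
]
df = pd.DataFrame(rows, columns=columns)

out = (
    df.replace(["", " "], np.nan)   # turn blank cells into NaN
      .dropna()                     # drops title and totals rows
      .drop_duplicates(keep=False)  # drops every row that occurs more than once
      .reset_index(drop=True)
)
print(out)
```

Note that drop_duplicates only compares rows with each other, so a header row that appears just once among the data rows would survive this step, and genuinely identical data rows would be lost.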
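So we can adjust the column names slightly by appending the first row to them (this fixes the wanted columns whose names start with "Unnamed"). Then we filter out the columns whose names still start with "Unnamed" (the leftover unwanted ones), and finally use the "Date" column to build a mask and filter the DataFrame: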
df = df.rename(columns={**{f'Unnamed: {i}': j for i,j in zip((9,11,13),
('Amount in transaction currency',
'Amount in accounting currency',
'Amount in reporting currency'))},
**{'Unnamed: 7': 'Currency'}})
df.columns = [f'{col}_{first}' if first==first else col for col, first in zip(df.columns, df.loc[0])]
df = df[df.columns[~df.columns.str.startswith('Unnamed')]]
date_filter = df['Date'].apply(isinstance, args=(datetime.datetime,))
df = df[date_filter]
The same code, with the column names built in an explicit loop instead of a list comprehension:
cols = {'Unnamed: 7': 'Currency', 'Unnamed: 9': 'Amount in transaction currency',
        'Unnamed: 11': 'Amount in accounting currency', 'Unnamed: 13': 'Amount in reporting currency'}
df = df.rename(columns=cols)
another_cols = []
for col, first in zip(df.columns, df.loc[0]):
    # NaN != NaN, so this is True only when row 0 holds a label (Debit/Credit)
    if first == first:
        another_cols.append(f'{col}_{first}')
    else:
        another_cols.append(col)
df.columns = another_cols
df = df[df.columns[~df.columns.str.startswith('Unnamed')]]
date_filter = df['Date'].apply(isinstance, args=(datetime.datetime,))
df = df[date_filter]
Output:
                  Date Journal number      Voucher    Posting type  \
1  2021-01-01 00:00:00       AX009473  TSPN-2021-3             NaN
2  2021-01-01 00:00:00       AX009473  TSPN-2021-3             NaN
3  2021-01-29 00:00:00       AX003312   GBJ-2021-1  Ledger journal
4  2021-01-31 00:00:00       AX009641   VIT-2021-1  Ledger journal

  Ledger account             Description Currency  \
1       00388211                     NaN      SGD
2       00388211                     NaN      USD
3       00388211  DISBERSMENT FOR PETROL      SGD
4       00388211               TAXI FAIR      SGD

  Amount in transaction currency_Debit Amount in transaction currency_Credit  \
1                                13.55                                     0
2                                    0                                    25
3                                    0                                    52
4                                    5                                     0

  Amount in accounting currency_Debit Amount in accounting currency_Credit  \
1                               13.55                                    0
2                                   0                                33.42
3                                   0                                   52
4                                   5                                    0

  Amount in reporting currency_Debit Amount in reporting currency_Credit
1                              13.55                                   0
2                                  0                               33.42
3                                  0                                  52
4                                  5                                   0
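The isinstance mask above relies on the Date cells being datetime objects. As an alternative sketch (a swapped-in technique, not the method used in the answer), pd.to_datetime with errors="coerce" can build the same kind of mask by turning anything that is not a date into NaT. The tiny frame below is an assumption modelled on the first rows of the posted dict.

```python
import datetime

import numpy as np
import pandas as pd

# Hypothetical slice of the real data: a sub-header row (NaN), two dated
# rows, then the 'Totals' / 'Net difference' footer and a blank row.
df = pd.DataFrame(
    {
        "Date": [
            np.nan,
            datetime.datetime(2021, 1, 1),
            datetime.datetime(2021, 1, 29),
            "Totals",
            "Net difference",
            np.nan,
        ],
        "Voucher": [np.nan, "TSPN-2021-3", "GBJ-2021-1", np.nan, np.nan, np.nan],
    }
)

# Cells that cannot be read as a date become NaT instead of raising an error.
parsed = pd.to_datetime(df["Date"], errors="coerce")

# Keep only the rows whose Date cell is a real date.
rows = df[parsed.notna()]
print(rows)
```

Either mask keeps only the transaction rows, so the repeated headers, totals, and blank separator rows of every block are dropped in one pass.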