Columns with 'None' header when importing from xlsx to pandas
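Importing a heavily formatted Excel sheet into pandas results in some columns that are completely blank and show up as 'None' when I look at df.columns. I need to remove these columns, but I am getting some strange output that makes it hard to figure out how to remove them.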
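****Edit for clarity****
The Excel sheet is heavily formatted, and the data has to be reshaped before it can be used for analysis. Essentially, col A is a list of questions, col B is an explanation of each question, and col C holds the responses. The desired outcome is that col A becomes the header of a tabular dataset, col B is dropped, and col C becomes the first row. The result then needs to be saved in such a way that col C from another copy of the Excel sheet (filled in for a different client) can be appended to the tabular dataset.
I have been able to import the sheet into Python and pandas, transpose the data, and do some minimal reshaping and cleaning.
Sample code: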
import os
import pandas as pd
import xlwings as xw
# Raw strings, so the backslashes in the Windows paths are not read as escape sequences
dir_path = r"C:\Users\user.name\directory\project\data\january"
file_path = r"C:\Users\user.name\directory\project\data\january\D10A0021_10.01.20.xlsx"
os.chdir(dir_path)  # setting the directory
wb = xw.Book(file_path, password='mypassword')  # getting Python to open the workbook
demographics = wb.sheets[0]  # selecting the demographic sheet
df = demographics['B2:D33'].options(pd.DataFrame, index=False, header=True).value  # importing all the used cells into pandas
df.columns = [0, 1, 2]  # adding column names that I can track
df = df.T  # transposing the data
df.columns = df.loc[0]  # turning the question items into the column headers
df = df.loc[2:]  # remove the unneeded first and second rows from the set
for num, col in enumerate(df.columns):
    print(f'{num}: {col}')  # This code has fixed one of the issues. Suggested by Datanovice.
Output:
0: Client code
1: Client's date of birth
2: Sex
3: Previous symptom recurrence
4: None
5: Has the client attended Primary Care Psychology in the past?
6: None
7: Ethnicity
8: None
9: Did the parent/ guardian/ carer require help completing the scales due to literacy difficulties?
10: Did the parent/ guardian/ carer require help completing the scales due to perceived complexity of questionnaires?
11: Did the client require help completing the scales due to literacy difficulties?
12: Did the client require help completing the scales due to perceived complexity of questionnaires?
13: Accommodation status
14: None
15: Relationship with main carer
16: None
17: Any long term stressors
18: Referral source
19: Referral date
20: Referral reason
21: Actual presenting difficulty (post formulation)
22: Date first seen
23: Discharge date
24: Reason for terminating treatment
25: None
26: Type of intervention
27: Total number of sessions offered (including DNA’s CNA’s)
28: No. of sessions: attended (by type of intervention)
29: No. of sessions: did not attend (by type of intervention)
30: No. of sessions: could not attend (by type of intervention)
31
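Before re-exporting the data to another Excel sheet, I need to be able to delete any columns that have 'None' in the header. That sheet can then be updated with new data as new client records are submitted.
Any suggestions would be appreciated.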
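So you have an Excel sheet in which some columns contain no data. By default, xlwings sets every cell without data to NaN/None.
What you can do is keep only the columns whose name is not None: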
cols = [x for x in df.columns if x is not None]
df = df[cols]
df will then keep only the relevant columns.
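If some of the empty headers come through as NaN rather than None (which may happen depending on how the header row is converted), an `is not None` check would leave them in place. A minimal sketch of a variant that handles both cases, using a small made-up frame in place of the real sheet; the helper name `drop_unnamed_columns` and the sample values are illustrative assumptions, not part of the original code:

```python
import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns whose header is None or NaN."""
    # Index.notna() is False for both None and NaN labels
    keep = df.columns.notna()
    return df.loc[:, keep].copy()


# Toy data standing in for the transposed sheet (values are invented)
sample = pd.DataFrame(
    [["D10A0021", "spacer", "F", "spacer"]],
    columns=["Client code", None, "Sex", float("nan")],
)

cleaned = drop_unnamed_columns(sample)
print(list(cleaned.columns))
```

The same helper could be applied to each client's sheet before the rows are combined, so that every cleaned frame ends up with the same set of named columns.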