I am attempting to use pandas to turn a long dataset into a wide dataset, but I want to keep repeated index values on separate rows
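I am starting with a dataset that looks like this (update: added Qstn Resp TS to help match the Resp values to the Qstn titles):
import numpy as np
import pandas as pd

longDF = pd.DataFrame({'id':[1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3],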
'Qstn Title':['Date Services were P','Date of Request','Services Requested','Type of Oral Service','Date Services were P','Date of Request','Describe how to resolve','Services Requested','Type of Oral Service','Date Services were P','Date Services were P','Date Services were P','Date Services were P','Date of Request','Date of Request','Describe how to resolve','Services Requested','Services Requested','Services Requested','Type of Oral Service','Type of Oral Service','Type of Oral Service'],
'Resp Value':['05/01/2020','05/01/2020','Chinese (Cantonese)','Telephone Interpreter','07/31/2020','07/31/2020','services were provided','Chinese (Cantonese)','Telephone Interpreter','09/24/2020','09/24/2020','11/19/2020','09/24/2020','09/24/2020','11/19/2020','interpreter lm on vm','Vietnamese','Vietnamese','Vietnamese','Telephone Interpreter','Telephone Interpreter','Telephone Interpreter'],
'Qstn Resp TS':['5/1/2020','5/1/2020','5/1/2020','5/1/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020']})
To create the wide dataset, I do this:
wideDF = pd.pivot_table(longDF, values='Resp Value', index=['id'], columns='Qstn Title', aggfunc=np.sum)
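My goal is to produce one row for each set of 'id' and 'Qstn Title' values, with the column names taken from 'Qstn Title' and the cell values taken from 'Resp Value', so wideDF should have 6 rows. When I run the pivot_table command above, I get only 3 rows: for id 3, each 'Qstn Title' column ends up holding several 'Resp Value' entries combined into one cell because of aggfunc=np.sum.
Expected output: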
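To illustrate why only one row per id comes back, here is a minimal sketch on a made-up three-column frame (the `demo` data and the `', '.join` aggregation are assumptions for illustration, not the original data or call). pivot_table groups on the index and columns keys, so repeated ('id', 'Qstn Title') pairs are aggregated into a single cell instead of producing extra rows.

```python
import pandas as pd

# Hypothetical toy data: id 2 answers question 'A' twice.
demo = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'Qstn Title': ['A', 'B', 'A', 'A', 'B'],
    'Resp Value': ['a1', 'b1', 'a2', 'a3', 'b2'],
})

# Any aggfunc collapses duplicates; joining the strings makes that visible.
wide = pd.pivot_table(demo, values='Resp Value', index='id',
                      columns='Qstn Title', aggfunc=', '.join)

n_rows = len(wide)          # one row per unique id
joined = wide.loc[2, 'A']   # both answers of id 2 share one cell
print(wide)
```

The repeated answers are still present, but they are merged into one cell, which is the behavior described in the question.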
wideDFout = pd.DataFrame({'id':[1,2,3,3,3,3],
'Date Services were P':['05/01/2020','07/31/2020','09/24/2020','09/24/2020','09/24/2020','11/19/2020'],
'Date of Request':['05/01/2020','07/31/2020','09/24/2020','','','11/19/2020'],
'Services Requested':['Chinese (Cantonese)','Chinese (Cantonese)','Vietnamese','Vietnamese','','Vietnamese'],
'Type of Oral Service':['Telephone Interpreter','Chinese (Cantonese)','Telephone Interpreter','Telephone Interpreter','','Telephone Interpreter'],
'Describe how to resolve':['','services were provided','','interpreter lm on vm','','']})
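Is there a way to go from long to wide while preserving the index/rows for each set of column values?
Below are the corrected input dataframe and the script that generates the desired output: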
longDF2 = pd.DataFrame({'id':[1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3],
'Qstn Title':['Date Services were P','Date of Request','Services Requested','Type of Oral Service','Date Services were P','Date of Request','Describe how to resolve','Services Requested','Type of Oral Service','Date Services were P','Date Services were P','Date Services were P','Date of Request','Date of Request','Date of Request','Describe how to resolve','Services Requested','Services Requested','Services Requested','Type of Oral Service','Type of Oral Service','Type of Oral Service'],
'Resp Value':['05/01/2020','05/01/2020','Chinese (Cantonese)','Telephone Interpreter','07/31/2020','07/31/2020','services were provided','Chinese (Cantonese)','Telephone Interpreter','09/24/2020','09/24/2020','11/19/2020','09/24/2020','09/24/2020','11/19/2020','interpreter lm on vm','Vietnamese','Vietnamese','Vietnamese','Telephone Interpreter','Telephone Interpreter','Telephone Interpreter'],
'Qstn Resp TS':['5/1/2020','5/1/2020','5/1/2020','5/1/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020']})
# add temp column 'q' to hold sub-id for each repeated id
longDF2['q'] = longDF2.groupby(['id','Qstn Title','Qstn Resp TS'], group_keys = False).cumcount()
# create multiIndex dataframe based on id, sub-id and question title
# and then unstack it
longDFOut2 = longDF2.set_index(['id','q','Qstn Title','Qstn Resp TS']).unstack(level=2, fill_value='')
#### re-pivot ##########
longDFOutNew2 = pd.DataFrame(longDFOut2.to_records())
# rename columns
for c in range(len(longDFOutNew2.columns)):
    #print('index='+str(c)+' - name='+longDFOutNew2.columns[c])
    if c > 2:
        longDFOutNew2.rename(columns={longDFOutNew2.columns[c]:longDFOutNew2.columns[c].split(',')[1].strip()[1:-2]}, inplace=True)
# pad ID with zeros
longDFOutNew2['id'] = longDFOutNew2['id'].astype(str).str.zfill(10)
# output cleaned dataframe
longDFOutFinal2 = longDFOutNew2[['id','Date Services were P','Date of Request','Describe how to resolve','Services Requested','Type of Oral Service']]
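The rename loop above relies on how to_records renders the tuple column names as strings. A possibly simpler variant, sketched here on a small made-up frame (the `demo`, `unstacked`, and `flat` names are hypothetical), is to select the 'Resp Value' level from the unstacked frame directly and then call reset_index, which should leave plain question titles as column names.

```python
import pandas as pd

# Hypothetical toy data with the same four columns as longDF2.
demo = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'Qstn Title': ['A', 'B', 'A', 'B', 'A'],
    'Resp Value': ['a1', 'b1', 'a2', 'b2', 'a3'],
    'Qstn Resp TS': ['5/1/2020', '5/1/2020', '9/24/2020', '9/24/2020', '11/19/2020'],
})

# Sub-id for repeats of the same (id, title, timestamp) combination.
demo['q'] = demo.groupby(['id', 'Qstn Title', 'Qstn Resp TS']).cumcount()

unstacked = (demo.set_index(['id', 'q', 'Qstn Title', 'Qstn Resp TS'])
                 .unstack(level=2, fill_value=''))

# Picking the top-level 'Resp Value' key drops that column level,
# so no string parsing of column names is needed.
flat = unstacked['Resp Value'].reset_index()
flat.columns.name = None

# Pad the id with zeros, as in the script above.
flat['id'] = flat['id'].astype(str).str.zfill(10)
print(flat)
```

This keeps 'q' and 'Qstn Resp TS' as ordinary columns, so the final column selection from the script can be applied unchanged.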
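First, we create a helper column 'q' that is incremented for each repeated 'Qstn Title' within each id. It essentially serves as a sub-id for those id groups that contain repeated 'Qstn Title' values: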
longDF['q'] = longDF.groupby(['id','Qstn Title'], group_keys = False).cumcount()
Now that we have groups defined by 'id' and 'q', we can use them for unstack:
longDF.set_index(['id','q','Qstn Title']).unstack()
Output:
                     Resp Value
Qstn Title Date Services were P Date of Request Describe how to resolve  Services Requested  Type of Oral Service
id q
1  0                 05/01/2020      05/01/2020                     NaN Chinese (Cantonese) Telephone Interpreter
2  0                 07/31/2020      07/31/2020  services were provided Chinese (Cantonese) Telephone Interpreter
3  0                 09/24/2020      09/24/2020    interpreter lm on vm          Vietnamese Telephone Interpreter
   1                 09/24/2020      11/19/2020                     NaN          Vietnamese Telephone Interpreter
   2                 11/19/2020             NaN                     NaN          Vietnamese Telephone Interpreter
   3                 09/24/2020             NaN                     NaN                 NaN                   NaN
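To get from that MultiIndex result to a flat frame shaped like the expected output, something along these lines may work. It is a sketch on made-up data (the `demo` frame is an assumption), and it unstacks only the 'Resp Value' series so that no extra column level is created.

```python
import pandas as pd

# Hypothetical toy data: id 2 answers question 'A' twice.
demo = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'Qstn Title': ['A', 'B', 'A', 'A', 'B'],
    'Resp Value': ['a1', 'b1', 'a2', 'a3', 'b2'],
})

# Sub-id: 0 for the first occurrence of a title within an id, 1 for the next, ...
demo['q'] = demo.groupby(['id', 'Qstn Title']).cumcount()

wide = (
    demo.set_index(['id', 'q', 'Qstn Title'])['Resp Value']
        .unstack('Qstn Title')   # titles become columns, (id, q) stays as rows
        .fillna('')              # blank instead of NaN, as in the expected output
        .reset_index()
        .drop(columns='q')       # the helper sub-id is no longer needed
)
wide.columns.name = None
print(wide)
```

Each repeated title gets its own row under the same id, and cells with no matching answer are left as empty strings.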