I am attempting to use pandas to turn a long dataset into a wide dataset, but I want to keep repeated index values on separate rows

I start with a dataset that looks like this (update: added 'Qstn Resp TS' to help match the 'Resp Value' entries to their 'Qstn Title'):

import pandas as pd
import numpy as np

longDF = pd.DataFrame({'id':[1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3],
                        'Qstn Title':['Date Services were P','Date of Request','Services Requested','Type of Oral Service','Date Services were P','Date of Request','Describe how to resolve','Services Requested','Type of Oral Service','Date Services were P','Date Services were P','Date Services were P','Date Services were P','Date of Request','Date of Request','Describe how to resolve','Services Requested','Services Requested','Services Requested','Type of Oral Service','Type of Oral Service','Type of Oral Service'],
                        'Resp Value':['05/01/2020','05/01/2020','Chinese (Cantonese)','Telephone Interpreter','07/31/2020','07/31/2020','services were provided','Chinese (Cantonese)','Telephone Interpreter','09/24/2020','09/24/2020','11/19/2020','09/24/2020','09/24/2020','11/19/2020','interpreter lm on vm','Vietnamese','Vietnamese','Vietnamese','Telephone Interpreter','Telephone Interpreter','Telephone Interpreter'],
                        'Qstn Resp TS':['5/1/2020','5/1/2020','5/1/2020','5/1/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020']})

To create the wide dataset, I do this:

wideDF = pd.pivot_table(longDF, values='Resp Value',  index=['id'], columns='Qstn Title', aggfunc=np.sum)

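My goal is to produce one row for each set of 'id' and 'Qstn Title' values, with column names = 'Qstn Title' and values = 'Resp Value', so wideDF should have 6 rows. When I try the pivot_table command above, I only get 3 rows. For 'id' 3, each 'Qstn Title' column ends up holding multiple 'Resp Value's combined together because of aggfunc=np.sum.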
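For reference, here is a minimal sketch of why the row count drops. It uses hypothetical toy data (the `toy` frame and the `' | '.join` aggregator are assumptions for illustration, not part of the dataset above). pivot_table aggregates every (id, 'Qstn Title') group into one cell, so the result has one row per id no matter how often a question repeats.

```python
import pandas as pd

# Hypothetical toy data: id 2 answers question 'A' twice.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'Qstn Title': ['A', 'B', 'A', 'A', 'B'],
    'Resp Value': ['x', 'y', 'p', 'q', 'r'],
})

# Find the (id, Qstn Title) pairs that occur more than once.
dupes = toy.groupby(['id', 'Qstn Title']).size()
print(dupes[dupes > 1])

# Each group is aggregated into a single cell; the join below is only an
# illustrative aggregator that makes the merging visible.
wide = pd.pivot_table(toy, values='Resp Value', index='id',
                      columns='Qstn Title',
                      aggfunc=lambda s: ' | '.join(s))
print(wide)
```

The pivoted frame has two rows (one per id), and the two 'A' answers for id 2 share one cell.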

Expected output:

wideDFout = pd.DataFrame({'id':[1,2,3,3,3,3],
                          'Date Services were P':['05/01/2020','07/31/2020','09/24/2020','09/24/2020','09/24/2020','11/19/2020'],
                          'Date of Request':['05/01/2020','07/31/2020','09/24/2020','','','11/19/2020'],
                          'Services Requested':['Chinese (Cantonese)','Chinese (Cantonese)','Vietnamese','Vietnamese','','Vietnamese'],
                          'Type of Oral Service':['Telephone Interpreter','Chinese (Cantonese)','Telephone Interpreter','Telephone Interpreter','','Telephone Interpreter'],
                          'Describe how to resolve':['','services were provided','','interpreter lm on vm','','']})

Is there a way to go from long to wide while keeping separate index values/rows for each set of column values?

Below are the corrected input dataframe and the script that generates the desired output:

longDF2 = pd.DataFrame({'id':[1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3],
                        'Qstn Title':['Date Services were P','Date of Request','Services Requested','Type of Oral Service','Date Services were P','Date of Request','Describe how to resolve','Services Requested','Type of Oral Service','Date Services were P','Date Services were P','Date Services were P','Date of Request','Date of Request','Date of Request','Describe how to resolve','Services Requested','Services Requested','Services Requested','Type of Oral Service','Type of Oral Service','Type of Oral Service'],
                        'Resp Value':['05/01/2020','05/01/2020','Chinese (Cantonese)','Telephone Interpreter','07/31/2020','07/31/2020','services were provided','Chinese (Cantonese)','Telephone Interpreter','09/24/2020','09/24/2020','11/19/2020','09/24/2020','09/24/2020','11/19/2020','interpreter lm on vm','Vietnamese','Vietnamese','Vietnamese','Telephone Interpreter','Telephone Interpreter','Telephone Interpreter'],
                        'Qstn Resp TS':['5/1/2020','5/1/2020','5/1/2020','5/1/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','7/31/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020','9/24/2020','9/24/2020','11/19/2020']})

# add temp column 'q' to hold sub-id for each repeated id
longDF2['q'] = longDF2.groupby(['id','Qstn Title','Qstn Resp TS'], group_keys = False).cumcount()

# create a MultiIndex dataframe based on id, sub-id, question title and timestamp,
# and then unstack the question-title level
longDFOut2 = longDF2.set_index(['id','q','Qstn Title','Qstn Resp TS']).unstack(level=2, fill_value='')


#### flatten back to a regular dataframe ##########
# drop the outer 'Resp Value' column level so the columns are named by question title,
# then move id, q and 'Qstn Resp TS' from the index back into ordinary columns
longDFOutNew2 = longDFOut2.droplevel(0, axis=1).reset_index()

# pad ID with zeros
longDFOutNew2['id'] = longDFOutNew2['id'].astype(str).str.zfill(10)

# output cleaned dataframe
longDFOutFinal2 = longDFOutNew2[['id','Date Services were P','Date of Request','Describe how to resolve','Services Requested','Type of Oral Service']]

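First, we create a helper column 'q' that increments for each repeated 'Qstn Title' within each id. It serves, essentially, as a sub-id for those groups of ids that have repeated 'Qstn Title's:
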
longDF['q'] = longDF.groupby(['id','Qstn Title'], group_keys = False).cumcount()
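To see what cumcount produces, here is a minimal sketch on hypothetical toy data (the `toy` frame is an assumption for illustration, not the dataset above):

```python
import pandas as pd

# Hypothetical toy data: id 2 answers question 'A' twice.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'Qstn Title': ['A', 'B', 'A', 'A', 'B'],
    'Resp Value': ['x', 'y', 'p', 'q', 'r'],
})

# cumcount numbers the rows inside each (id, Qstn Title) group from 0,
# in their original order, so a repeated question gets 0, 1, 2, ...
toy['q'] = toy.groupby(['id', 'Qstn Title']).cumcount()

print(toy['q'].tolist())  # [0, 0, 0, 1, 0]
```

Only the second 'A' answer for id 2 gets q = 1, which is what later puts it on its own row.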

Now that we have groups defined by 'id' and 'q', we can use them for the unstack:

longDF.set_index(['id','q','Qstn Title']).unstack()

Output:

            Resp Value
Qstn Title  Date Services were P  Date of Request  Describe how to resolve  Services Requested   Type of Oral Service
id  q
1   0       05/01/2020            05/01/2020       NaN                      Chinese (Cantonese)  Telephone Interpreter
2   0       07/31/2020            07/31/2020       services were provided   Chinese (Cantonese)  Telephone Interpreter
3   0       09/24/2020            09/24/2020       interpreter lm on vm     Vietnamese           Telephone Interpreter
    1       09/24/2020            11/19/2020       NaN                      Vietnamese           Telephone Interpreter
    2       11/19/2020            NaN              NaN                      Vietnamese           Telephone Interpreter
    3       09/24/2020            NaN              NaN                      NaN                  NaN
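If a flat frame with blanks instead of NaN is wanted, as in the expected output, the unstacked result can be flattened afterwards. This is a sketch under assumptions: it reuses hypothetical toy data rather than the dataset above, and selecting the 'Resp Value' Series before unstacking is just one way to avoid the extra column level.

```python
import pandas as pd

# Hypothetical toy data: id 2 answers question 'A' twice.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'Qstn Title': ['A', 'B', 'A', 'A', 'B'],
    'Resp Value': ['x', 'y', 'p', 'q', 'r'],
})
toy['q'] = toy.groupby(['id', 'Qstn Title']).cumcount()

wide = (
    toy.set_index(['id', 'q', 'Qstn Title'])['Resp Value']  # a Series, so no outer column level
       .unstack('Qstn Title', fill_value='')                # missing answers become ''
       .reset_index()                                       # id and q back to ordinary columns
       .drop(columns='q')                                   # the helper sub-id is no longer needed
)
wide.columns.name = None  # remove the leftover 'Qstn Title' axis label

print(wide)
```

Note that rows are paired purely by position here (the n-th occurrence of each question), so unrelated answers can land on the same row. Adding a column such as 'Qstn Resp TS' to the groupby and index keys, as in the corrected script, ties the answers together by timestamp instead.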