Conditional aggregation on dataframe columns, combining 'n' rows into 1 row
I have an input dataframe that contains the following:
NAME TEXT START END
Tim Tim Wagner is a teacher. 10 20.5
Tim He is from Cleveland, Ohio. 20.5 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. 62 70
Frank He performed at the Carnegie Hall last year. 70 85
Frank It was fantastic listening to him. 85 90
Frank I really enjoyed 90 93
The desired output dataframe is:
NAME TEXT START END
Tim Tim Wagner is a teacher. He is from Cleveland, Ohio. 10 40
Frank Frank is a musician 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. 62 85
Frank It was fantastic listening to him. I really enjoyed 85 93
My current code:
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
# Select columns with a list (double brackets); tuple-style selection fails in recent pandas.
df.groupby(['NAME', grp], sort=False)[['TEXT', 'START', 'END']]\
    .agg({'TEXT': lambda x: ' '.join(x), 'START': 'min', 'END': 'max'})\
    .reset_index().drop('group', axis=1)
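This merges the last 4 rows into one. Instead, I want to merge only 2 rows at a time (or, more generally, any n rows), even if 'NAME' has the same value across more rows.
Thanks for your help.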
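You can group by the consecutive-name blocks and, within each block, by the row position integer-divided by 2, so that each block is split into chunks of at most two rows: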
blocks = df.NAME.ne(df.NAME.shift()).cumsum()
(df.groupby([blocks, df.groupby(blocks).cumcount()//2])
.agg({'NAME':'first', 'TEXT':' '.join,
'START':'min', 'END':'max'})
)
Output:
NAME TEXT START END
NAME
1 0 Tim Tim Wagner is a teacher. He is from Cleveland,... 10.0 40.0
2 0 Frank Frank is a musician. 40.0 50.0
3 0 Tim He like to travel with his family 50.0 62.0
4 0 Frank He is a performing artist who plays the cello.... 62.0 85.0
1 Frank It was fantastic listening to him. I really en... 85.0 93.0
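To merge any n rows instead of 2, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name merge_consecutive and the key names block and chunk are made up for illustration, and the dataframe is rebuilt here from the sample data in the question so the snippet runs on its own.

```python
import pandas as pd

def merge_consecutive(df, n=2):
    """Merge up to n consecutive rows that share the same NAME (hypothetical helper)."""
    # A new block id starts every time NAME differs from the previous row.
    blocks = df["NAME"].ne(df["NAME"].shift()).cumsum().rename("block")
    # Row position inside each block, integer-divided by n: 0, 0, 1, 1, ... for n=2.
    chunks = (df.groupby(blocks).cumcount() // n).rename("chunk")
    return (
        df.groupby([blocks, chunks], sort=False)
          .agg({"NAME": "first", "TEXT": " ".join, "START": "min", "END": "max"})
          .reset_index(drop=True)  # drop the (block, chunk) index to get a flat frame
    )

# Sample data copied from the question.
df = pd.DataFrame({
    "NAME": ["Tim", "Tim", "Frank", "Tim", "Frank", "Frank", "Frank", "Frank"],
    "TEXT": [
        "Tim Wagner is a teacher.",
        "He is from Cleveland, Ohio.",
        "Frank is a musician.",
        "He like to travel with his family",
        "He is a performing artist who plays the cello.",
        "He performed at the Carnegie Hall last year.",
        "It was fantastic listening to him.",
        "I really enjoyed",
    ],
    "START": [10, 20.5, 40, 50, 62, 70, 85, 90],
    "END": [20.5, 40, 50, 62, 70, 85, 90, 93],
})

result = merge_consecutive(df, n=2)
print(result)
```

Calling reset_index(drop=True) removes the two-level index shown in the output above, which gives the flat NAME / TEXT / START / END layout asked for in the question. Passing a different n changes how many consecutive same-name rows end up in one output row.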