How to do a conditional groupby in a pandas DataFrame

Question

I am working with the AMI transcription dataset (link) and have converted the Words file into a DataFrame. A sample of the DataFrame:

index	speaker	word_id	word	start_time	end_time
0	E	0	'Kay	3.34	3.88
1	E	1	.	3.88	3.88
2	A	0	Okay	5.57	5.94
3	E	2	Gosh	5.6	6.01
4	A	1	.	5.94	5.94
5	E	3	.	6.01	6.01
6	E	4	'Kay	10.48	10.88
7	E	5	.	10.88	10.88
8	A	2	Does	11.09	11.25
9	A	3	anyone	11.25	11.5
10	A	4	want	11.5	11.65
11	A	5	to	11.65	11.71
12	A	6	see	11.71	12.15
13	A	7	uh	12.15	12.42
14	A	8	Steve's	12.42	12.94
15	A	9	feedback	12.94	13.5
16	A	10	from	13.5	13.71
17	A	11	the	13.71	14.73
18	A	12	specification	14.73	15.53
19	A	13	?	15.53	15.53
20	E	6	Is	16.77	16.94
21	E	7	there	16.94	17.04
22	E	8	much	17.04	17.25
23	D	0	I	17.08	17.34
24	E	9	more	17.25	17.53
25	D	1	I	17.34	17.47
26	D	2	dry-read	17.47	17.92
27	E	10	in	17.53	17.63
28	E	11	it	17.63	17.73
29	E	12	than	17.73	17.88
30	E	13	he	17.88	18.0

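I define an utterance as follows: a list (sequence) of words by the same speaker in which the gap between each pair of consecutive words is at most 0.5 seconds. The gap between two consecutive words A and B is B's start time minus A's end time.

For example, the data above contains 7 utterances:

['Kay, .] speaker E (indices 0, 1)
[Okay, .] speaker A (indices 2, 4)
[Gosh, .] speaker E (indices 3, 5)
['Kay, .] speaker E (indices 6, 7)
[Does, anyone, want, to, see, uh, Steve's, ..., ?] speaker A (indices 8-19)
[Is, there, much, more, in, it, than, he] speaker E (indices 20-22, 24, 27-30)
[I, I, dry-read] speaker D (indices 23, 25-26)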

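My goal is to extract the utterances as shown above, by building a list of the words that make up each utterance and noting its speaker. In addition, I need to indicate whether there was any crosstalk during the utterance. Utterances whose indices are consecutive are the ones without crosstalk; in the example above, these are utterances 1, 4 and 5.

I have tried several approaches but have not found a way to do the grouping correctly.

Thanks for your help.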

Answer 1

This one is tricky but interesting:

We can start with a groupby followed by a shift, per speaker:

>>> df['end_time_shifted'] = df.groupby('speaker')['end_time'].shift(1)
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted
0   E           0           'Kay            3.34        3.88        NaN
1   E           1           .               3.88        3.88        3.88
2   A           0           Okay            5.57        5.94        NaN
3   E           2           Gosh            5.60        6.01        3.88
4   A           1           .               5.94        5.94        5.94
5   E           3           .               6.01        6.01        6.01
6   E           4           'Kay            10.48       10.88       6.01
7   E           5           .               10.88       10.88       10.88
8   A           2           Does            11.09       11.25       5.94
9   A           3           anyone          11.25       11.50       11.25
10  A           4           want            11.50       11.65       11.50
11  A           5           to              11.65       11.71       11.65
12  A           6           see             11.71       12.15       11.71
13  A           7           uh              12.15       12.42       12.15
14  A           8           Steve's         12.42       12.94       12.42
15  A           9           feedback        12.94       13.50       12.94
16  A           10          from            13.50       13.71       13.50
17  A           11          the             13.71       14.73       13.71
18  A           12          specification   14.73       15.53       14.73
19  A           13          ?               15.53       15.53       15.53
20  E           6           Is              16.77       16.94       10.88
21  E           7           there           16.94       17.04       16.94
22  E           8           much            17.04       17.25       17.04
23  D           0           I               17.08       17.34       NaN
24  E           9           more            17.25       17.53       17.25
25  D           1           I               17.34       17.47       17.34
26  D           2           dry-read        17.47       17.92       17.47
27  E           10          in              17.53       17.63       17.53
28  E           11          it              17.63       17.73       17.63
29  E           12          than            17.73       17.88       17.73
30  E           13          he              17.88       18.00       17.88
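
To try this step in isolation, here is a minimal sketch that rebuilds only the first six rows of the question's table by hand (the `sample` name and the choice of six rows are assumptions made for illustration) and applies the same per-speaker shift:

```python
import pandas as pd

# Hypothetical miniature of the question's table (first six rows only).
sample = pd.DataFrame({
    "speaker":    ["E", "E", "A", "E", "A", "E"],
    "word":       ["'Kay", ".", "Okay", "Gosh", ".", "."],
    "start_time": [3.34, 3.88, 5.57, 5.60, 5.94, 6.01],
    "end_time":   [3.88, 3.88, 5.94, 6.01, 5.94, 6.01],
})

# shift(1) inside each speaker group: every row receives the end_time of the
# previous word by the same speaker; a speaker's first word gets NaN.
sample["end_time_shifted"] = sample.groupby("speaker")["end_time"].shift(1)
print(sample)
```

Because the shift happens inside each group, Gosh (row 3) is compared with E's previous word at row 1, not with A's word at row 2.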

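Then we compute time_delta, the gap between each word's start time and the end time of the same speaker's previous word:

>>> df['time_delta'] = df['start_time'] - df['end_time_shifted']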
>>> df = df.fillna(0)
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted    time_delta
0   E           0           'Kay            3.34        3.88        0.00                0.00
1   E           1           .               3.88        3.88        3.88                0.00
2   A           0           Okay            5.57        5.94        0.00                0.00
3   E           2           Gosh            5.60        6.01        3.88                1.72
4   A           1           .               5.94        5.94        5.94                0.00
5   E           3           .               6.01        6.01        6.01                0.00
6   E           4           'Kay            10.48       10.88       6.01                4.47
7   E           5           .               10.88       10.88       10.88               0.00
8   A           2           Does            11.09       11.25       5.94                5.15
9   A           3           anyone          11.25       11.50       11.25               0.00
10  A           4           want            11.50       11.65       11.50               0.00
11  A           5           to              11.65       11.71       11.65               0.00
12  A           6           see             11.71       12.15       11.71               0.00
13  A           7           uh              12.15       12.42       12.15               0.00
14  A           8           Steve's         12.42       12.94       12.42               0.00
15  A           9           feedback        12.94       13.50       12.94               0.00
16  A           10          from            13.50       13.71       13.50               0.00
17  A           11          the             13.71       14.73       13.71               0.00
18  A           12          specification   14.73       15.53       14.73               0.00
19  A           13          ?               15.53       15.53       15.53               0.00
20  E           6           Is              16.77       16.94       10.88               5.89
21  E           7           there           16.94       17.04       16.94               0.00
22  E           8           much            17.04       17.25       17.04               0.00
23  D           0           I               17.08       17.34       0.00                0.00
24  E           9           more            17.25       17.53       17.25               0.00
25  D           1           I               17.34       17.47       17.34               0.00
26  D           2           dry-read        17.47       17.92       17.47               0.00
27  E           10          in              17.53       17.63       17.53               0.00
28  E           11          it              17.63       17.73       17.63               0.00
29  E           12          than            17.73       17.88       17.73               0.00
30  E           13          he              17.88       18.00       17.88               0.00

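Next, we flag each word with 1 or 0 in a next_utterance column: 1 if the word starts a new utterance (time_delta > 0.5), 0 if it continues the current one (time_delta <= 0.5):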

>>> df.loc[df.time_delta <= 0.5, 'next_utterance'] = 0 
>>> df.loc[df.time_delta > 0.5, 'next_utterance'] = 1
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted    time_delta  next_utterance
0   E           0           'Kay            3.34        3.88        0.00                0.00        0.0
1   E           1           .               3.88        3.88        3.88                0.00        0.0
2   A           0           Okay            5.57        5.94        0.00                0.00        0.0
3   E           2           Gosh            5.60        6.01        3.88                1.72        1.0
4   A           1           .               5.94        5.94        5.94                0.00        0.0
5   E           3           .               6.01        6.01        6.01                0.00        0.0
6   E           4           'Kay            10.48       10.88       6.01                4.47        1.0
7   E           5           .               10.88       10.88       10.88               0.00        0.0
8   A           2           Does            11.09       11.25       5.94                5.15        1.0
9   A           3           anyone          11.25       11.50       11.25               0.00        0.0
10  A           4           want            11.50       11.65       11.50               0.00        0.0
11  A           5           to              11.65       11.71       11.65               0.00        0.0
12  A           6           see             11.71       12.15       11.71               0.00        0.0
13  A           7           uh              12.15       12.42       12.15               0.00        0.0
14  A           8           Steve's         12.42       12.94       12.42               0.00        0.0
15  A           9           feedback        12.94       13.50       12.94               0.00        0.0
16  A           10          from            13.50       13.71       13.50               0.00        0.0
17  A           11          the             13.71       14.73       13.71               0.00        0.0
18  A           12          specification   14.73       15.53       14.73               0.00        0.0
19  A           13          ?               15.53       15.53       15.53               0.00        0.0
20  E           6           Is              16.77       16.94       10.88               5.89        1.0
21  E           7           there           16.94       17.04       16.94               0.00        0.0
22  E           8           much            17.04       17.25       17.04               0.00        0.0
23  D           0           I               17.08       17.34       0.00                0.00        0.0
24  E           9           more            17.25       17.53       17.25               0.00        0.0
25  D           1           I               17.34       17.47       17.34               0.00        0.0
26  D           2           dry-read        17.47       17.92       17.47               0.00        0.0
27  E           10          in              17.53       17.63       17.53               0.00        0.0
28  E           11          it              17.63       17.73       17.63               0.00        0.0
29  E           12          than            17.73       17.88       17.73               0.00        0.0
30  E           13          he              17.88       18.00       17.88               0.00        0.0
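
As a possible variant (not what the answer above does), the two `.loc` assignments can be collapsed into a single comparison cast to integers. The sketch below uses a hypothetical `time_delta` Series for one speaker and deliberately leaves the NaN in place to show that `fillna(0)` is not strictly needed for this step:

```python
import pandas as pd

# Hypothetical gaps for one speaker; NaN marks the speaker's first word.
time_delta = pd.Series([float("nan"), 0.0, 1.72, 0.0, 4.47])

# A comparison against NaN evaluates to False, so the first word is flagged 0
# without needing fillna(0) beforehand.
next_utterance = (time_delta > 0.5).astype(int)
print(next_utterance.tolist())
```

The integer dtype also keeps the later group labels as 0, 1, 2 instead of 0.0, 1.0, 2.0.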

Now a cumsum per speaker gives every utterance its own number, which we use in the next step to build the required lists:

>>> df['cumsum_by_group'] = df.groupby(['speaker'])['next_utterance'].cumsum()
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted    time_delta  next_utterance  cumsum_by_group
0   E           0           'Kay            3.34        3.88        0.00                0.00        0.0             0.0
1   E           1           .               3.88        3.88        3.88                0.00        0.0             0.0
2   A           0           Okay            5.57        5.94        0.00                0.00        0.0             0.0
3   E           2           Gosh            5.60        6.01        3.88                1.72        1.0             1.0
4   A           1           .               5.94        5.94        5.94                0.00        0.0             0.0
5   E           3           .               6.01        6.01        6.01                0.00        0.0             1.0
6   E           4           'Kay            10.48       10.88       6.01                4.47        1.0             2.0
7   E           5           .               10.88       10.88       10.88               0.00        0.0             2.0
8   A           2           Does            11.09       11.25       5.94                5.15        1.0             1.0
9   A           3           anyone          11.25       11.50       11.25               0.00        0.0             1.0
10  A           4           want            11.50       11.65       11.50               0.00        0.0             1.0
11  A           5           to              11.65       11.71       11.65               0.00        0.0             1.0
12  A           6           see             11.71       12.15       11.71               0.00        0.0             1.0
13  A           7           uh              12.15       12.42       12.15               0.00        0.0             1.0
14  A           8           Steve's         12.42       12.94       12.42               0.00        0.0             1.0
15  A           9           feedback        12.94       13.50       12.94               0.00        0.0             1.0
16  A           10          from            13.50       13.71       13.50               0.00        0.0             1.0
17  A           11          the             13.71       14.73       13.71               0.00        0.0             1.0
18  A           12          specification   14.73       15.53       14.73               0.00        0.0             1.0
19  A           13          ?               15.53       15.53       15.53               0.00        0.0             1.0
20  E           6           Is              16.77       16.94       10.88               5.89        1.0             3.0
21  E           7           there           16.94       17.04       16.94               0.00        0.0             3.0
22  E           8           much            17.04       17.25       17.04               0.00        0.0             3.0
23  D           0           I               17.08       17.34       0.00                0.00        0.0             0.0
24  E           9           more            17.25       17.53       17.25               0.00        0.0             3.0
25  D           1           I               17.34       17.47       17.34               0.00        0.0             0.0
26  D           2           dry-read        17.47       17.92       17.47               0.00        0.0             0.0
27  E           10          in              17.53       17.63       17.53               0.00        0.0             3.0
28  E           11          it              17.63       17.73       17.63               0.00        0.0             3.0
29  E           12          than            17.73       17.88       17.73               0.00        0.0             3.0
30  E           13          he              17.88       18.00       17.88               0.00        0.0             3.0
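
Why the running sum works: each 1 marks the first word of a new utterance, so the number of 1s seen so far for a speaker is that speaker's utterance number. A minimal sketch on the flags of the first six rows, where the `flags` frame and the `utterance_id` column name are assumptions for illustration:

```python
import pandas as pd

# Flags for rows 0-5 of the table above (hypothetical standalone copy).
flags = pd.DataFrame({
    "speaker":        ["E", "E", "A", "E", "A", "E"],
    "next_utterance": [0,   0,   0,   1,   0,   0],
})

# Running count of "new utterance" markers, restarted for every speaker.
flags["utterance_id"] = flags.groupby("speaker")["next_utterance"].cumsum()
print(flags)
```

Rows 3 and 5 both end up with id 1 for speaker E even though A's row 4 sits between them, which is what allows interleaved (crosstalk) utterances to be grouped correctly.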

Finally, we groupby on speaker and cumsum_by_group to produce the lists as expected:

>>> df_word = df.groupby(['speaker', 'cumsum_by_group'])['word'].apply(list).to_frame().reset_index()
>>> df_word 
    speaker     cumsum_by_group     word
0   A           0.0                 [Okay, .]
1   A           1.0                 [Does, anyone, want, to, see, uh, Steve's, fee...
2   D           0.0                 [I, I, dry-read]
3   E           0.0                 ['Kay, .]
4   E           1.0                 [Gosh, .]
5   E           2.0                 ['Kay, .]
6   E           3.0                 [Is, there, much, more, in, it, than, he]

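To find out whether an utterance is clean, as requested in the comments, you can do the following (this assumes the row number is available as an index column, as in the question's table; if it is only the DataFrame index, run df = df.reset_index() first):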

>>> df_indice = df.groupby(['speaker', 'cumsum_by_group'])['index'].apply(list).to_frame().reset_index().rename(columns={'index': 'indice'})
>>> df_indice
    speaker     cumsum_by_group     indice
0   A           0.0                 [2, 4]
1   A           1.0                 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2   D           0.0                 [23, 25, 26]
3   E           0.0                 [0, 1]
4   E           1.0                 [3, 5]
5   E           2.0                 [6, 7]
6   E           3.0                 [20, 21, 22, 24, 27, 28, 29, 30]

We define a check_continuity function like this:

>>> def check_continuity(row):
...     # True when every index is exactly one more than the one before it
...     my_list = row['indice']
...     return all(a+1==b for a, b in zip(my_list, my_list[1:]))
...
>>> df_indice["is_clean"] = df_indice.apply(check_continuity, axis=1)
>>> df_indice
    speaker     cumsum_by_group     indice                                          is_clean
0   A           0.0                 [2, 4]                                          False
1   A           1.0                 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  True
2   D           0.0                 [23, 25, 26]                                    False
3   E           0.0                 [0, 1]                                          True
4   E           1.0                 [3, 5]                                          False
5   E           2.0                 [6, 7]                                          True
6   E           3.0                 [20, 21, 22, 24, 27, 28, 29, 30]                False
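
An equivalent check, offered only as a sketch: for a strictly increasing list of integers, having no gaps is the same as `last - first + 1 == length`, which avoids the pairwise loop. The three rows below are copied by hand from the table above into a hypothetical standalone frame:

```python
import pandas as pd

# Hypothetical per-utterance index lists, copied from the table above.
df_indice = pd.DataFrame({
    "speaker": ["A", "A", "E"],
    "indice":  [[2, 4], [8, 9, 10], [0, 1]],
})

# Assumes each list is sorted and free of duplicates, which holds when the
# lists are built from row numbers in their original order.
df_indice["is_clean"] = df_indice["indice"].map(
    lambda idx: idx[-1] - idx[0] + 1 == len(idx)
)
print(df_indice)
```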

Merging the two DataFrames gives the final expected result:

>>> df = pd.merge(df_word,
...               df_indice,
...               how='left',
...               left_on=['speaker', 'cumsum_by_group'],
...               right_on=['speaker', 'cumsum_by_group'])
>>> df
    speaker     cumsum_by_group     word                                                indice                                          is_clean
0   A           0.0                 [Okay, .]                                           [2, 4]                                          False
1   A           1.0                 [Does, anyone, want, to, see, uh, Steve's, fee...   [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  True
2   D           0.0                 [I, I, dry-read]                                    [23, 25, 26]                                    False
3   E           0.0                 ['Kay, .]                                           [0, 1]                                          True
4   E           1.0                 [Gosh, .]                                           [3, 5]                                          False
5   E           2.0                 ['Kay, .]                                           [6, 7]                                          True
6   E           3.0                 [Is, there, much, more, in, it, than, he]           [20, 21, 22, 24, 27, 28, 29, 30]                False
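
For convenience, the steps above can be folded into one function. This is a sketch under stated assumptions, not a drop-in replacement: `extract_utterances`, `max_gap`, `position` and `utterance_id` are names invented here, the input is assumed to have speaker, word, start_time and end_time columns, and rows are assumed to be in the question's order.

```python
import pandas as pd


def extract_utterances(words: pd.DataFrame, max_gap: float = 0.5) -> pd.DataFrame:
    """Group words into per-speaker utterances and flag crosstalk-free ones.

    Hypothetical helper (not part of pandas). Expects the columns
    speaker, word, start_time and end_time.
    """
    df = words.reset_index(drop=True)
    df["position"] = df.index  # row number, used for the continuity check

    # Gap between this word's start and the same speaker's previous end.
    prev_end = df.groupby("speaker")["end_time"].shift(1)
    gap = df["start_time"] - prev_end

    # NaN > max_gap is False, so a speaker's first word belongs to utterance 0.
    starts_new = (gap > max_gap).astype(int)
    df["utterance_id"] = starts_new.groupby(df["speaker"]).cumsum()

    grouped = df.groupby(["speaker", "utterance_id"])
    out = pd.DataFrame({
        "word": grouped["word"].apply(list),
        "indice": grouped["position"].apply(list),
    }).reset_index()

    # Consecutive row numbers mean nobody else spoke during the utterance.
    out["is_clean"] = out["indice"].map(
        lambda idx: idx[-1] - idx[0] + 1 == len(idx)
    )
    return out


# Usage on the first six rows of the question's table.
sample = pd.DataFrame({
    "speaker":    ["E", "E", "A", "E", "A", "E"],
    "word":       ["'Kay", ".", "Okay", "Gosh", ".", "."],
    "start_time": [3.34, 3.88, 5.57, 5.60, 5.94, 6.01],
    "end_time":   [3.88, 3.88, 5.94, 6.01, 5.94, 6.01],
})
result = extract_utterances(sample)
print(result)
```

Keeping the gap and the flag as temporary Series avoids the helper columns and the final merge; the intermediate columns in the step-by-step version remain useful for inspecting each stage.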
