How to do a conditional groupby in a pandas DataFrame

I'm working with the AMI transcription dataset (link) and have converted the Words file into a DataFrame. A sample of the DataFrame:

index speaker word_id word start_time end_time
0 E 0 'Kay 3.34 3.88
1 E 1 . 3.88 3.88
2 A 0 Okay 5.57 5.94
3 E 2 Gosh 5.6 6.01
4 A 1 . 5.94 5.94
5 E 3 . 6.01 6.01
6 E 4 'Kay 10.48 10.88
7 E 5 . 10.88 10.88
8 A 2 Does 11.09 11.25
9 A 3 anyone 11.25 11.5
10 A 4 want 11.5 11.65
11 A 5 to 11.65 11.71
12 A 6 see 11.71 12.15
13 A 7 uh 12.15 12.42
14 A 8 Steve's 12.42 12.94
15 A 9 feedback 12.94 13.5
16 A 10 from 13.5 13.71
17 A 11 the 13.71 14.73
18 A 12 specification 14.73 15.53
19 A 13 ? 15.53 15.53
20 E 6 Is 16.77 16.94
21 E 7 there 16.94 17.04
22 E 8 much 17.04 17.25
23 D 0 I 17.08 17.34
24 E 9 more 17.25 17.53
25 D 1 I 17.34 17.47
26 D 2 dry-read 17.47 17.92
27 E 10 in 17.53 17.63
28 E 11 it 17.63 17.73
29 E 12 than 17.73 17.88
30 E 13 he 17.88 18.0

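I define an utterance as follows: a list (sequence) of words from the same speaker in which the gap between each pair of consecutive words is at most 0.5 seconds. The gap between two consecutive words A and B is defined as the difference between A's end time and B's start time.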

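For example, in the data above we have 7 utterances:

  1. ['Kay, .] speaker E (indices 0, 1)
  2. [Okay, .] speaker A (indices 2, 4)
  3. [Gosh, .] speaker E (indices 3, 5)
  4. ['Kay, .] speaker E (indices 6, 7)
  5. [Does, anyone, want, to, see, uh, Steve's, ..., ?] speaker A (indices 8-19)
  6. [Is, there, much, more, in, it, than, he] speaker E (indices 20-22, 24, 27-30)
  7. [I, I, dry-read] speaker D (indices 23, 25-26)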

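My goal is to extract the utterances as shown above, by creating a list of words for each utterance and indicating the speaker of that utterance. In addition, I need to indicate whether there was any crosstalk during the utterance. Utterances with consecutive indices are the ones without crosstalk. In the example above, these are 1, 4 and 5.

I tried several approaches but couldn't find a way to do the grouping correctly.

Thanks for your help.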

This one is tricky but fun:

We can start with a groupby shift, per speaker:

>>> df['end_time_shifted'] = df.groupby('speaker')['end_time'].shift(1)
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted
0   E           0           'Kay            3.34        3.88        NaN
1   E           1           .               3.88        3.88        3.88
2   A           0           Okay            5.57        5.94        NaN
3   E           2           Gosh            5.60        6.01        3.88
4   A           1           .               5.94        5.94        5.94
5   E           3           .               6.01        6.01        6.01
6   E           4           'Kay            10.48       10.88       6.01
7   E           5           .               10.88       10.88       10.88
8   A           2           Does            11.09       11.25       5.94
9   A           3           anyone          11.25       11.50       11.25
10  A           4           want            11.50       11.65       11.50
11  A           5           to              11.65       11.71       11.65
12  A           6           see             11.71       12.15       11.71
13  A           7           uh              12.15       12.42       12.15
14  A           8           Steve's         12.42       12.94       12.42
15  A           9           feedback        12.94       13.50       12.94
16  A           10          from            13.50       13.71       13.50
17  A           11          the             13.71       14.73       13.71
18  A           12          specification   14.73       15.53       14.73
19  A           13          ?               15.53       15.53       15.53
20  E           6           Is              16.77       16.94       10.88
21  E           7           there           16.94       17.04       16.94
22  E           8           much            17.04       17.25       17.04
23  D           0           I               17.08       17.34       NaN
24  E           9           more            17.25       17.53       17.25
25  D           1           I               17.34       17.47       17.34
26  D           2           dry-read        17.47       17.92       17.47
27  E           10          in              17.53       17.63       17.53
28  E           11          it              17.63       17.73       17.63
29  E           12          than            17.73       17.88       17.73
30  E           13          he              17.88       18.00       17.88
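
To see what the per-speaker shift does in isolation, here is a minimal sketch on the first six rows of the sample. The names `demo` and `prev_end` are only for illustration. Each word receives the end_time of the previous word by the same speaker, and each speaker's first word gets NaN.

```python
import pandas as pd

# First six rows of the sample data; `demo` is an illustrative name.
demo = pd.DataFrame({
    "speaker":    ["E", "E", "A", "E", "A", "E"],
    "word":       ["'Kay", ".", "Okay", "Gosh", ".", "."],
    "start_time": [3.34, 3.88, 5.57, 5.60, 5.94, 6.01],
    "end_time":   [3.88, 3.88, 5.94, 6.01, 5.94, 6.01],
})

# shift(1) inside each speaker group: row i receives the end_time of the
# previous row with the same speaker, even when other speakers are interleaved.
prev_end = demo.groupby("speaker")["end_time"].shift(1)

print(prev_end.tolist())  # [nan, 3.88, nan, 3.88, 5.94, 6.01]
```

Row 3 (Gosh, speaker E) picks up 3.88 from row 1 rather than 5.94 from row 2, because row 2 belongs to speaker A.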

Then, we compute time_delta:

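>>> # gap = current word's start_time minus the same speaker's previous end_time
>>> df['time_delta'] = df['start_time'] - df['end_time_shifted']
>>> df = df.fillna(0)  # each speaker's first word has no predecessor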
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted    time_delta
0   E           0           'Kay            3.34        3.88        0.00                0.00
1   E           1           .               3.88        3.88        3.88                0.00
2   A           0           Okay            5.57        5.94        0.00                0.00
3   E           2           Gosh            5.60        6.01        3.88                1.72
4   A           1           .               5.94        5.94        5.94                0.00
5   E           3           .               6.01        6.01        6.01                0.00
6   E           4           'Kay            10.48       10.88       6.01                4.47
7   E           5           .               10.88       10.88       10.88               0.00
8   A           2           Does            11.09       11.25       5.94                5.15
9   A           3           anyone          11.25       11.50       11.25               0.00
10  A           4           want            11.50       11.65       11.50               0.00
11  A           5           to              11.65       11.71       11.65               0.00
12  A           6           see             11.71       12.15       11.71               0.00
13  A           7           uh              12.15       12.42       12.15               0.00
14  A           8           Steve's         12.42       12.94       12.42               0.00
15  A           9           feedback        12.94       13.50       12.94               0.00
16  A           10          from            13.50       13.71       13.50               0.00
17  A           11          the             13.71       14.73       13.71               0.00
18  A           12          specification   14.73       15.53       14.73               0.00
19  A           13          ?               15.53       15.53       15.53               0.00
20  E           6           Is              16.77       16.94       10.88               5.89
21  E           7           there           16.94       17.04       16.94               0.00
22  E           8           much            17.04       17.25       17.04               0.00
23  D           0           I               17.08       17.34       0.00                0.00
24  E           9           more            17.25       17.53       17.25               0.00
25  D           1           I               17.34       17.47       17.34               0.00
26  D           2           dry-read        17.47       17.92       17.47               0.00
27  E           10          in              17.53       17.63       17.53               0.00
28  E           11          it              17.63       17.73       17.63               0.00
29  E           12          than            17.73       17.88       17.73               0.00
30  E           13          he              17.88       18.00       17.88               0.00
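
As a minimal sketch of this step on the first eight rows of the sample (`demo` and `gap` are illustrative names), the gap can be taken as the current word's start_time minus the previous same-speaker end_time. It fills NaN only in that one Series, so missing values in unrelated columns are left alone.

```python
import pandas as pd

# Rows 0-7 of the sample data; only the columns needed for the gap.
demo = pd.DataFrame({
    "speaker":    ["E", "E", "A", "E", "A", "E", "E", "E"],
    "start_time": [3.34, 3.88, 5.57, 5.60, 5.94, 6.01, 10.48, 10.88],
    "end_time":   [3.88, 3.88, 5.94, 6.01, 5.94, 6.01, 10.88, 10.88],
})

# Gap = this word's start_time minus the previous same-speaker word's end_time.
gap = demo["start_time"] - demo.groupby("speaker")["end_time"].shift(1)

# Fill only this Series (first word of each speaker), then round away
# floating-point noise for display.
gap = gap.fillna(0).round(2)

print(gap.tolist())  # [0.0, 0.0, 0.0, 1.72, 0.0, 0.0, 4.47, 0.0]
```

The values 1.72 and 4.47 match the time_delta column for rows 3 and 6 in the table above.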

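Next, we set a next_utterance flag: 0 if the word satisfies the constraint time_delta <= 0.5 (it continues the current utterance), and 1 otherwise (it starts a new utterance):
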
>>> df.loc[df.time_delta <= 0.5, 'next_utterance'] = 0 
>>> df.loc[df.time_delta > 0.5, 'next_utterance'] = 1
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted    time_delta  next_utterance
0   E           0           'Kay            3.34        3.88        0.00                0.00        0.0
1   E           1           .               3.88        3.88        3.88                0.00        0.0
2   A           0           Okay            5.57        5.94        0.00                0.00        0.0
3   E           2           Gosh            5.60        6.01        3.88                1.72        1.0
4   A           1           .               5.94        5.94        5.94                0.00        0.0
5   E           3           .               6.01        6.01        6.01                0.00        0.0
6   E           4           'Kay            10.48       10.88       6.01                4.47        1.0
7   E           5           .               10.88       10.88       10.88               0.00        0.0
8   A           2           Does            11.09       11.25       5.94                5.15        1.0
9   A           3           anyone          11.25       11.50       11.25               0.00        0.0
10  A           4           want            11.50       11.65       11.50               0.00        0.0
11  A           5           to              11.65       11.71       11.65               0.00        0.0
12  A           6           see             11.71       12.15       11.71               0.00        0.0
13  A           7           uh              12.15       12.42       12.15               0.00        0.0
14  A           8           Steve's         12.42       12.94       12.42               0.00        0.0
15  A           9           feedback        12.94       13.50       12.94               0.00        0.0
16  A           10          from            13.50       13.71       13.50               0.00        0.0
17  A           11          the             13.71       14.73       13.71               0.00        0.0
18  A           12          specification   14.73       15.53       14.73               0.00        0.0
19  A           13          ?               15.53       15.53       15.53               0.00        0.0
20  E           6           Is              16.77       16.94       10.88               5.89        1.0
21  E           7           there           16.94       17.04       16.94               0.00        0.0
22  E           8           much            17.04       17.25       17.04               0.00        0.0
23  D           0           I               17.08       17.34       0.00                0.00        0.0
24  E           9           more            17.25       17.53       17.25               0.00        0.0
25  D           1           I               17.34       17.47       17.34               0.00        0.0
26  D           2           dry-read        17.47       17.92       17.47               0.00        0.0
27  E           10          in              17.53       17.63       17.53               0.00        0.0
28  E           11          it              17.63       17.73       17.63               0.00        0.0
29  E           12          than            17.73       17.88       17.73               0.00        0.0
30  E           13          he              17.88       18.00       17.88               0.00        0.0
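
The two .loc assignments can also be written as a single comparison. Here is a small sketch on a standalone Series holding the time_delta values of rows 0-7 (the variable names are illustrative). It keeps the flag as integers, whereas the table above shows floats (0.0 / 1.0) because the column is first created with missing values.

```python
import pandas as pd

# time_delta values for rows 0-7 of the table above.
time_delta = pd.Series([0.0, 0.0, 0.0, 1.72, 0.0, 0.0, 4.47, 0.0])

# The comparison yields booleans; astype(int) turns True/False into 1/0.
# A gap of exactly 0.5 is not flagged, so it still continues the utterance.
next_utterance = (time_delta > 0.5).astype(int)

print(next_utterance.tolist())  # [0, 0, 0, 1, 0, 0, 1, 0]
```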

Now we can take a cumsum by speaker, which lets us build the desired lists in the next step:

>>> df['cumsum_by_group'] = df.groupby(['speaker'])['next_utterance'].cumsum()
>>> df
    speaker     word_id     word            start_time  end_time    end_time_shifted    time_delta  next_utterance  cumsum_by_group
0   E           0           'Kay            3.34        3.88        0.00                0.00        0.0             0.0
1   E           1           .               3.88        3.88        3.88                0.00        0.0             0.0
2   A           0           Okay            5.57        5.94        0.00                0.00        0.0             0.0
3   E           2           Gosh            5.60        6.01        3.88                1.72        1.0             1.0
4   A           1           .               5.94        5.94        5.94                0.00        0.0             0.0
5   E           3           .               6.01        6.01        6.01                0.00        0.0             1.0
6   E           4           'Kay            10.48       10.88       6.01                4.47        1.0             2.0
7   E           5           .               10.88       10.88       10.88               0.00        0.0             2.0
8   A           2           Does            11.09       11.25       5.94                5.15        1.0             1.0
9   A           3           anyone          11.25       11.50       11.25               0.00        0.0             1.0
10  A           4           want            11.50       11.65       11.50               0.00        0.0             1.0
11  A           5           to              11.65       11.71       11.65               0.00        0.0             1.0
12  A           6           see             11.71       12.15       11.71               0.00        0.0             1.0
13  A           7           uh              12.15       12.42       12.15               0.00        0.0             1.0
14  A           8           Steve's         12.42       12.94       12.42               0.00        0.0             1.0
15  A           9           feedback        12.94       13.50       12.94               0.00        0.0             1.0
16  A           10          from            13.50       13.71       13.50               0.00        0.0             1.0
17  A           11          the             13.71       14.73       13.71               0.00        0.0             1.0
18  A           12          specification   14.73       15.53       14.73               0.00        0.0             1.0
19  A           13          ?               15.53       15.53       15.53               0.00        0.0             1.0
20  E           6           Is              16.77       16.94       10.88               5.89        1.0             3.0
21  E           7           there           16.94       17.04       16.94               0.00        0.0             3.0
22  E           8           much            17.04       17.25       17.04               0.00        0.0             3.0
23  D           0           I               17.08       17.34       0.00                0.00        0.0             0.0
24  E           9           more            17.25       17.53       17.25               0.00        0.0             3.0
25  D           1           I               17.34       17.47       17.34               0.00        0.0             0.0
26  D           2           dry-read        17.47       17.92       17.47               0.00        0.0             0.0
27  E           10          in              17.53       17.63       17.53               0.00        0.0             3.0
28  E           11          it              17.63       17.73       17.63               0.00        0.0             3.0
29  E           12          than            17.73       17.88       17.73               0.00        0.0             3.0
30  E           13          he              17.88       18.00       17.88               0.00        0.0             3.0
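
Why the cumulative sum works: the running total of the flags goes up by one at every word that starts a new utterance, so all words of one utterance end up with the same number. A minimal sketch on rows 0-7 (`demo` and `utterance_id` are illustrative names):

```python
import pandas as pd

# Speaker and boundary flag for rows 0-7 of the sample.
demo = pd.DataFrame({
    "speaker":        ["E", "E", "A", "E", "A", "E", "E", "E"],
    "next_utterance": [0, 0, 0, 1, 0, 0, 1, 0],
})

# Running total of the flags within each speaker. The result is an utterance
# number that is unique per speaker only, not across speakers.
demo["utterance_id"] = demo.groupby("speaker")["next_utterance"].cumsum()

print(demo["utterance_id"].tolist())  # [0, 0, 0, 1, 0, 1, 2, 2]
```

Because the numbering restarts at 0 for every speaker, any later grouping has to use both the speaker and this number as keys.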

Finally, we run a groupby on speaker and cumsum_by_group to produce the lists as expected:

>>> df_word = df.groupby(['speaker', 'cumsum_by_group'])['word'].apply(list).to_frame().reset_index()
>>> df_word 
    speaker     cumsum_by_group     word
0   A           0.0                 [Okay, .]
1   A           1.0                 [Does, anyone, want, to, see, uh, Steve's, fee...
2   D           0.0                 [I, I, dry-read]
3   E           0.0                 ['Kay, .]
4   E           1.0                 [Gosh, .]
5   E           2.0                 ['Kay, .]
6   E           3.0                 [Is, there, much, more, in, it, than, he]

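To know whether an utterance is clean, as asked in the comments, you can do the following (this relies on the row numbers being available as a column named index, as in the question's sample):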

>>> df_indice = df.groupby(['speaker', 'cumsum_by_group'])['index'].apply(list).to_frame().reset_index().rename(columns={'index': 'indice'})
>>> df_indice
    speaker     cumsum_by_group     indice
0   A           0.0                 [2, 4]
1   A           1.0                 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2   D           0.0                 [23, 25, 26]
3   E           0.0                 [0, 1]
4   E           1.0                 [3, 5]
5   E           2.0                 [6, 7]
6   E           3.0                 [20, 21, 22, 24, 27, 28, 29, 30]

We define a check_continuity function like this:

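>>> def check_continuity(row):
...     # row is one row of df_indice; 'indice' holds the utterance's row numbers
...     my_list = row['indice']
...     # clean (no crosstalk) when each index is exactly one more than the previous
...     return all(a + 1 == b for a, b in zip(my_list, my_list[1:]))
...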
>>> df_indice["is_clean"] = df_indice.apply(check_continuity, axis=1)
>>> df_indice
    speaker     cumsum_by_group     indice                                          is_clean
0   A           0.0                 [2, 4]                                          False
1   A           1.0                 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  True
2   D           0.0                 [23, 25, 26]                                    False
3   E           0.0                 [0, 1]                                          True
4   E           1.0                 [3, 5]                                          False
5   E           2.0                 [6, 7]                                          True
6   E           3.0                 [20, 21, 22, 24, 27, 28, 29, 30]                False

By merging the two DataFrames, you get the final expected result:

>>> df = pd.merge(df_word,
...               df_indice,
...               how='left',
...               left_on=['speaker', 'cumsum_by_group'],
...               right_on=['speaker', 'cumsum_by_group'])
>>> df
    speaker     cumsum_by_group     word                                                indice                                          is_clean
0   A           0.0                 [Okay, .]                                           [2, 4]                                          False
1   A           1.0                 [Does, anyone, want, to, see, uh, Steve's, fee...   [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  True
2   D           0.0                 [I, I, dry-read]                                    [23, 25, 26]                                    False
3   E           0.0                 ['Kay, .]                                           [0, 1]                                          True
4   E           1.0                 [Gosh, .]                                           [3, 5]                                          False
5   E           2.0                 ['Kay, .]                                           [6, 7]                                          True
6   E           3.0                 [Is, there, much, more, in, it, than, he]           [20, 21, 22, 24, 27, 28, 29, 30]                False
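
The steps above can also be folded into one function. The following is only a sketch under assumptions: the helper name `extract_utterances`, its `max_gap` parameter and the column names `row`, `utt`, `words`, `first_row`, `last_row` and `n_words` are all made up for illustration. It uses named aggregation so that the word lists and the crosstalk check come out of a single groupby, without a merge.

```python
import pandas as pd

def extract_utterances(df, max_gap=0.5):
    """Group words into per-speaker utterances (hypothetical helper)."""
    out = df.reset_index(drop=True).copy()
    out["row"] = out.index  # original row order, used for the crosstalk check

    # Gap to the previous word of the same speaker (NaN for the first word).
    gap = out["start_time"] - out.groupby("speaker")["end_time"].shift(1)

    # NaN > max_gap is False, so each speaker's first utterance gets id 0.
    out["utt"] = (gap > max_gap).astype(int).groupby(out["speaker"]).cumsum()

    result = (
        out.groupby(["speaker", "utt"], sort=False)
           .agg(words=("word", list),
                first_row=("row", "min"),
                last_row=("row", "max"),
                n_words=("row", "count"))
           .reset_index()
    )
    # No crosstalk when the rows of the utterance form one unbroken run.
    result["is_clean"] = (
        result["last_row"] - result["first_row"] + 1 == result["n_words"]
    )
    return result.sort_values("first_row", ignore_index=True)


# Rows 0-7 of the sample data.
demo = pd.DataFrame({
    "speaker":    ["E", "E", "A", "E", "A", "E", "E", "E"],
    "word":       ["'Kay", ".", "Okay", "Gosh", ".", ".", "'Kay", "."],
    "start_time": [3.34, 3.88, 5.57, 5.60, 5.94, 6.01, 10.48, 10.88],
    "end_time":   [3.88, 3.88, 5.94, 6.01, 5.94, 6.01, 10.88, 10.88],
})

utterances = extract_utterances(demo)
print(utterances[["speaker", "words", "is_clean"]])
```

On these eight rows the sketch yields four utterances in order of their first word (E, A, E, E), with only the first and the last marked clean, which agrees with utterances 1 to 4 in the question.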