How to do conditional groupby in pandas dataframe
I am working with the AMI transcription dataset (link) and have converted the Words files into a dataframe. A sample of the dataframe:
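index | speaker | word_id | word | start_time | end_time |
---|---|---|---|---|---|
0 | E | 0 | 'Kay | 3.34 | 3.88 |
1 | E | 1 | . | 3.88 | 3.88 |
2 | A | 0 | Okay | 5.57 | 5.94 |
3 | E | 2 | Gosh | 5.6 | 6.01 |
4 | A | 1 | . | 5.94 | 5.94 |
5 | E | 3 | . | 6.01 | 6.01 |
6 | E | 4 | 'Kay | 10.48 | 10.88 |
7 | E | 5 | . | 10.88 | 10.88 |
8 | A | 2 | Does | 11.09 | 11.25 |
9 | A | 3 | anyone | 11.25 | 11.5 |
10 | A | 4 | want | 11.5 | 11.65 |
11 | A | 5 | to | 11.65 | 11.71 |
12 | A | 6 | see | 11.71 | 12.15 |
13 | A | 7 | uh | 12.15 | 12.42 |
14 | A | 8 | Steve's | 12.42 | 12.94 |
15 | A | 9 | feedback | 12.94 | 13.5 |
16 | A | 10 | from | 13.5 | 13.71 |
17 | A | 11 | the | 13.71 | 14.73 |
18 | A | 12 | specification | 14.73 | 15.53 |
19 | A | 13 | ? | 15.53 | 15.53 |
20 | E | 6 | Is | 16.77 | 16.94 |
21 | E | 7 | there | 16.94 | 17.04 |
22 | E | 8 | much | 17.04 | 17.25 |
23 | D | 0 | I | 17.08 | 17.34 |
24 | E | 9 | more | 17.25 | 17.53 |
25 | D | 1 | I | 17.34 | 17.47 |
26 | D | 2 | dry-read | 17.47 | 17.92 |
27 | E | 10 | in | 17.53 | 17.63 |
28 | E | 11 | it | 17.63 | 17.73 |
29 | E | 12 | than | 17.73 | 17.88 |
30 | E | 13 | he | 17.88 | 18.0 |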
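I define an utterance as follows: a list (sequence) of words by the same speaker in which the gap between each pair of consecutive words is at most 0.5 seconds. The gap between two consecutive words A and B is the difference between the end time of A and the start time of B.
For example, in the data above we have 7 utterances:
- ['Kay, .] by speaker E (indices 0, 1)
- [Okay, .] by speaker A (indices 2, 4)
- [Gosh, .] by speaker E (indices 3, 5)
- ['Kay, .] by speaker E (indices 6, 7)
- [Does, anyone, want, to, see, uh, Steve's, ..., ?] by speaker A (indices 8-19)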
- [Is, there, much, more, in, it, than, he] by speaker E (indices 20-22, 24, 27-30)
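- [I, I, dry-read] by speaker D (indices 23, 25-26)
My goal is to extract the utterances as shown above: build the list of words that makes up each utterance and record its speaker. In addition, I need to indicate whether there was any crosstalk during the utterance. Utterances whose indices are consecutive are the ones without crosstalk; in the example above, these are utterances 1, 4 and 5.
I have tried several directions but could not find a way to do the grouping correctly.
Thanks for your help.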
This one is tricky but fun:
We can start with a groupby `shift` for each `speaker`:
>>> df['end_time_shifted'] = df.groupby('speaker')['end_time'].shift(1)
>>> df
speaker word_id word start_time end_time end_time_shifted
0 E 0 'Kay 3.34 3.88 NaN
1 E 1 . 3.88 3.88 3.88
2 A 0 Okay 5.57 5.94 NaN
3 E 2 Gosh 5.60 6.01 3.88
4 A 1 . 5.94 5.94 5.94
5 E 3 . 6.01 6.01 6.01
6 E 4 'Kay 10.48 10.88 6.01
7 E 5 . 10.88 10.88 10.88
8 A 2 Does 11.09 11.25 5.94
9 A 3 anyone 11.25 11.50 11.25
10 A 4 want 11.50 11.65 11.50
11 A 5 to 11.65 11.71 11.65
12 A 6 see 11.71 12.15 11.71
13 A 7 uh 12.15 12.42 12.15
14 A 8 Steve's 12.42 12.94 12.42
15 A 9 feedback 12.94 13.50 12.94
16 A 10 from 13.50 13.71 13.50
17 A 11 the 13.71 14.73 13.71
18 A 12 specification 14.73 15.53 14.73
19 A 13 ? 15.53 15.53 15.53
20 E 6 Is 16.77 16.94 10.88
21 E 7 there 16.94 17.04 16.94
22 E 8 much 17.04 17.25 17.04
23 D 0 I 17.08 17.34 NaN
24 E 9 more 17.25 17.53 17.25
25 D 1 I 17.34 17.47 17.34
26 D 2 dry-read 17.47 17.92 17.47
27 E 10 in 17.53 17.63 17.53
28 E 11 it 17.63 17.73 17.63
29 E 12 than 17.73 17.88 17.73
30 E 13 he 17.88 18.00 17.88
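To see the per-speaker shift in isolation, here is a minimal sketch on a made-up four-row frame. The data and the names `toy` and `prev_end` are hypothetical; only the `groupby(...).shift(1)` call comes from the step above. Each row receives the previous `end_time` of the same speaker, and the first row of every speaker gets NaN.

```python
import pandas as pd

# Hypothetical toy data, not taken from the AMI corpus.
toy = pd.DataFrame({
    "speaker": ["E", "A", "E", "A"],
    "end_time": [1.0, 2.0, 3.0, 4.0],
})

# shift(1) inside each speaker group: a row sees the previous end_time
# of the *same* speaker, even when other speakers' rows are interleaved.
toy["prev_end"] = toy.groupby("speaker")["end_time"].shift(1)
print(toy)
```

Rows 0 and 1 have no earlier word by the same speaker, so their `prev_end` is NaN; rows 2 and 3 pick up 1.0 and 2.0.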
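Then we compute `time_delta`, the gap between each word's start time and the end time of the same speaker's previous word:
>>> df['time_delta'] = df['start_time'] - df['end_time_shifted']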
>>> df = df.fillna(0)
>>> df
speaker word_id word start_time end_time end_time_shifted time_delta
0 E 0 'Kay 3.34 3.88 0.00 0.00
1 E 1 . 3.88 3.88 3.88 0.00
2 A 0 Okay 5.57 5.94 0.00 0.00
3 E 2 Gosh 5.60 6.01 3.88 1.72
4 A 1 . 5.94 5.94 5.94 0.00
5 E 3 . 6.01 6.01 6.01 0.00
6 E 4 'Kay 10.48 10.88 6.01 4.47
7 E 5 . 10.88 10.88 10.88 0.00
8 A 2 Does 11.09 11.25 5.94 5.15
9 A 3 anyone 11.25 11.50 11.25 0.00
10 A 4 want 11.50 11.65 11.50 0.00
11 A 5 to 11.65 11.71 11.65 0.00
12 A 6 see 11.71 12.15 11.71 0.00
13 A 7 uh 12.15 12.42 12.15 0.00
14 A 8 Steve's 12.42 12.94 12.42 0.00
15 A 9 feedback 12.94 13.50 12.94 0.00
16 A 10 from 13.50 13.71 13.50 0.00
17 A 11 the 13.71 14.73 13.71 0.00
18 A 12 specification 14.73 15.53 14.73 0.00
19 A 13 ? 15.53 15.53 15.53 0.00
20 E 6 Is 16.77 16.94 10.88 5.89
21 E 7 there 16.94 17.04 16.94 0.00
22 E 8 much 17.04 17.25 17.04 0.00
23 D 0 I 17.08 17.34 0.00 0.00
24 E 9 more 17.25 17.53 17.25 0.00
25 D 1 I 17.34 17.47 17.34 0.00
26 D 2 dry-read 17.47 17.92 17.47 0.00
27 E 10 in 17.53 17.63 17.53 0.00
28 E 11 it 17.63 17.73 17.63 0.00
29 E 12 than 17.73 17.88 17.73 0.00
30 E 13 he 17.88 18.00 17.88 0.00
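As a worked check of the gap arithmetic, the sketch below recomputes `time_delta` for speaker E's first five words from the sample, following the question's definition (start of the current word minus end of the same speaker's previous word). The frame name `words` and the choice to fill only the gap column rather than the whole frame are assumptions made for this illustration.

```python
import pandas as pd

# Speaker E's first five words from the sample (indices 0, 1, 3, 5, 6).
words = pd.DataFrame({
    "speaker": ["E"] * 5,
    "start_time": [3.34, 3.88, 5.60, 6.01, 10.48],
    "end_time": [3.88, 3.88, 6.01, 6.01, 10.88],
})

prev_end = words.groupby("speaker")["end_time"].shift(1)

# Gap = start of this word minus end of the same speaker's previous word.
# Only this column is filled, so NaNs elsewhere in the frame stay untouched;
# round(2) hides floating-point noise such as 1.7199999999999998.
words["time_delta"] = (words["start_time"] - prev_end).fillna(0).round(2)
print(words["time_delta"].tolist())
```

The two non-zero gaps (5.60 - 3.88 and 10.48 - 6.01) correspond to the 1.72 and 4.47 shown for rows 3 and 6 in the table above.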
Next, we mark each word with 1 or 0 in `next_utterance`: 1 if it breaks the constraint `time_delta <= 0.5` and therefore starts a new utterance, 0 otherwise:
>>> df.loc[df.time_delta <= 0.5, 'next_utterance'] = 0
>>> df.loc[df.time_delta > 0.5, 'next_utterance'] = 1
>>> df
speaker word_id word start_time end_time end_time_shifted time_delta next_utterance
0 E 0 'Kay 3.34 3.88 0.00 0.00 0.0
1 E 1 . 3.88 3.88 3.88 0.00 0.0
2 A 0 Okay 5.57 5.94 0.00 0.00 0.0
3 E 2 Gosh 5.60 6.01 3.88 1.72 1.0
4 A 1 . 5.94 5.94 5.94 0.00 0.0
5 E 3 . 6.01 6.01 6.01 0.00 0.0
6 E 4 'Kay 10.48 10.88 6.01 4.47 1.0
7 E 5 . 10.88 10.88 10.88 0.00 0.0
8 A 2 Does 11.09 11.25 5.94 5.15 1.0
9 A 3 anyone 11.25 11.50 11.25 0.00 0.0
10 A 4 want 11.50 11.65 11.50 0.00 0.0
11 A 5 to 11.65 11.71 11.65 0.00 0.0
12 A 6 see 11.71 12.15 11.71 0.00 0.0
13 A 7 uh 12.15 12.42 12.15 0.00 0.0
14 A 8 Steve's 12.42 12.94 12.42 0.00 0.0
15 A 9 feedback 12.94 13.50 12.94 0.00 0.0
16 A 10 from 13.50 13.71 13.50 0.00 0.0
17 A 11 the 13.71 14.73 13.71 0.00 0.0
18 A 12 specification 14.73 15.53 14.73 0.00 0.0
19 A 13 ? 15.53 15.53 15.53 0.00 0.0
20 E 6 Is 16.77 16.94 10.88 5.89 1.0
21 E 7 there 16.94 17.04 16.94 0.00 0.0
22 E 8 much 17.04 17.25 17.04 0.00 0.0
23 D 0 I 17.08 17.34 0.00 0.00 0.0
24 E 9 more 17.25 17.53 17.25 0.00 0.0
25 D 1 I 17.34 17.47 17.34 0.00 0.0
26 D 2 dry-read 17.47 17.92 17.47 0.00 0.0
27 E 10 in 17.53 17.63 17.53 0.00 0.0
28 E 11 it 17.63 17.73 17.63 0.00 0.0
29 E 12 than 17.73 17.88 17.73 0.00 0.0
30 E 13 he 17.88 18.00 17.88 0.00 0.0
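The same flag can probably be written as a single vectorized comparison. This is a sketch rather than the answer's method: `df_demo` is a hypothetical frame holding the `time_delta` values of rows 0-8 from the table above. A side effect is that the column comes out as integers, whereas the two `.loc` assignments produce floats (0.0 / 1.0) because the new column starts out filled with NaN.

```python
import pandas as pd

# time_delta values for rows 0-8 of the table above.
df_demo = pd.DataFrame({
    "time_delta": [0.0, 0.0, 0.0, 1.72, 0.0, 0.0, 4.47, 0.0, 5.15],
})

# One boolean comparison instead of two .loc assignments;
# astype(int) turns True/False into 1/0.
df_demo["next_utterance"] = (df_demo["time_delta"] > 0.5).astype(int)
print(df_demo["next_utterance"].tolist())
```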
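Now we can take the `cumsum` of `next_utterance` by `speaker`, which numbers each speaker's utterances and lets us build the required lists in the next step: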
>>> df['cumsum_by_group'] = df.groupby(['speaker'])['next_utterance'].cumsum()
>>> df
speaker word_id word start_time end_time end_time_shifted time_delta next_utterance cumsum_by_group
0 E 0 'Kay 3.34 3.88 0.00 0.00 0.0 0.0
1 E 1 . 3.88 3.88 3.88 0.00 0.0 0.0
2 A 0 Okay 5.57 5.94 0.00 0.00 0.0 0.0
3 E 2 Gosh 5.60 6.01 3.88 1.72 1.0 1.0
4 A 1 . 5.94 5.94 5.94 0.00 0.0 0.0
5 E 3 . 6.01 6.01 6.01 0.00 0.0 1.0
6 E 4 'Kay 10.48 10.88 6.01 4.47 1.0 2.0
7 E 5 . 10.88 10.88 10.88 0.00 0.0 2.0
8 A 2 Does 11.09 11.25 5.94 5.15 1.0 1.0
9 A 3 anyone 11.25 11.50 11.25 0.00 0.0 1.0
10 A 4 want 11.50 11.65 11.50 0.00 0.0 1.0
11 A 5 to 11.65 11.71 11.65 0.00 0.0 1.0
12 A 6 see 11.71 12.15 11.71 0.00 0.0 1.0
13 A 7 uh 12.15 12.42 12.15 0.00 0.0 1.0
14 A 8 Steve's 12.42 12.94 12.42 0.00 0.0 1.0
15 A 9 feedback 12.94 13.50 12.94 0.00 0.0 1.0
16 A 10 from 13.50 13.71 13.50 0.00 0.0 1.0
17 A 11 the 13.71 14.73 13.71 0.00 0.0 1.0
18 A 12 specification 14.73 15.53 14.73 0.00 0.0 1.0
19 A 13 ? 15.53 15.53 15.53 0.00 0.0 1.0
20 E 6 Is 16.77 16.94 10.88 5.89 1.0 3.0
21 E 7 there 16.94 17.04 16.94 0.00 0.0 3.0
22 E 8 much 17.04 17.25 17.04 0.00 0.0 3.0
23 D 0 I 17.08 17.34 0.00 0.00 0.0 0.0
24 E 9 more 17.25 17.53 17.25 0.00 0.0 3.0
25 D 1 I 17.34 17.47 17.34 0.00 0.0 0.0
26 D 2 dry-read 17.47 17.92 17.47 0.00 0.0 0.0
27 E 10 in 17.53 17.63 17.53 0.00 0.0 3.0
28 E 11 it 17.63 17.73 17.63 0.00 0.0 3.0
29 E 12 than 17.73 17.88 17.73 0.00 0.0 3.0
30 E 13 he 17.88 18.00 17.88 0.00 0.0 3.0
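Why the running total works as an utterance number is easier to see on a few rows. The sketch below uses the flags of rows 0-8 from the table above; the names `flags` and `utterance_id` are hypothetical and stand in for `df` and `cumsum_by_group`.

```python
import pandas as pd

# Flags for rows 0-8 of the table above (1 = word opens a new utterance).
flags = pd.DataFrame({
    "speaker":        ["E", "E", "A", "E", "A", "E", "E", "E", "A"],
    "next_utterance": [0,   0,   0,   1,   0,   0,   1,   0,   1],
})

# Running total of the flags within each speaker: every 1 bumps the counter,
# so all words up to that speaker's next 1 share the same number.
flags["utterance_id"] = flags.groupby("speaker")["next_utterance"].cumsum()
print(flags)
```

Note that the numbering restarts per speaker, so an utterance is identified by the pair (speaker, number), not by the number alone.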
Finally, we run a `groupby` on `speaker` and `cumsum_by_group` to build the lists as expected:
>>> df_word = df.groupby(['speaker', 'cumsum_by_group'])['word'].apply(list).to_frame().reset_index()
>>> df_word
speaker cumsum_by_group word
0 A 0.0 [Okay, .]
1 A 1.0 [Does, anyone, want, to, see, uh, Steve's, fee...
2 D 0.0 [I, I, dry-read]
3 E 0.0 ['Kay, .]
4 E 1.0 [Gosh, .]
5 E 2.0 ['Kay, .]
6 E 3.0 [Is, there, much, more, in, it, than, he]
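To know whether an utterance is clean, as asked in the comments, you can do the following. This assumes `index` is a regular column of `df`, as in the question's table; if it is the DataFrame index instead, call `df.reset_index()` first: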
>>> df_indice = df.groupby(['speaker', 'cumsum_by_group'])['index'].apply(list).to_frame().reset_index().rename(columns={'index': 'indice'})
>>> df_indice
speaker cumsum_by_group indice
0 A 0.0 [2, 4]
1 A 1.0 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2 D 0.0 [23, 25, 26]
3 E 0.0 [0, 1]
4 E 1.0 [3, 5]
5 E 2.0 [6, 7]
6 E 3.0 [20, 21, 22, 24, 27, 28, 29, 30]
We set up a `check_continuity` function like this:
>>> def check_continuity(row):
...     # True when every index is exactly one more than the previous one
...     my_list = row['indice']
...     return all(a + 1 == b for a, b in zip(my_list, my_list[1:]))
...
>>> df_indice["is_clean"] = df_indice.apply(check_continuity, axis=1)
>>> df_indice
speaker cumsum_by_group indice is_clean
0 A 0.0 [2, 4] False
1 A 1.0 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] True
2 D 0.0 [23, 25, 26] False
3 E 0.0 [0, 1] True
4 E 1.0 [3, 5] False
5 E 2.0 [6, 7] True
6 E 3.0 [20, 21, 22, 24, 27, 28, 29, 30] False
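An alternative way to express the same continuity test, offered as a sketch: a list of distinct integers is consecutive exactly when it spans as many values as it has items. The helper name `is_contiguous` and the frame `df_idx` are hypothetical; the index lists follow the shape of `df_indice` above, with the second one shortened.

```python
import pandas as pd

# A few utterances in the shape of df_indice (second list shortened).
df_idx = pd.DataFrame({
    "speaker": ["A", "A", "D", "E"],
    "indice": [[2, 4], [8, 9, 10], [23, 25, 26], [0, 1]],
})

def is_contiguous(indices):
    # Assumes the indices are distinct integers: they are consecutive
    # exactly when max - min + 1 equals the number of items.
    return max(indices) - min(indices) + 1 == len(indices)

df_idx["is_clean"] = df_idx["indice"].map(is_contiguous)
print(df_idx)
```

This avoids the pairwise `zip` but relies on the indices being unique, which holds for row positions.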
By merging the two DataFrames, you get the final expected result:
>>> df = pd.merge(df_word,
... df_indice,
... how='left',
... left_on=['speaker', 'cumsum_by_group'],
... right_on=['speaker', 'cumsum_by_group'])
>>> df
speaker cumsum_by_group word indice is_clean
0 A 0.0 [Okay, .] [2, 4] False
1 A 1.0 [Does, anyone, want, to, see, uh, Steve's, fee... [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] True
2 D 0.0 [I, I, dry-read] [23, 25, 26] False
3 E 0.0 ['Kay, .] [0, 1] True
4 E 1.0 [Gosh, .] [3, 5] False
5 E 2.0 ['Kay, .] [6, 7] True
6 E 3.0 [Is, there, much, more, in, it, than, he] [20, 21, 22, 24, 27, 28, 29, 30] False
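For reference, the steps above can be folded into one function. This is a sketch under stated assumptions, not a definitive implementation: the function name `extract_utterances`, the `max_gap` parameter and the output column names are made up for illustration, and the input is assumed to be sorted by time with the columns `speaker`, `word`, `start_time` and `end_time`. It uses named aggregation so the word lists and the contiguity check come out of a single `groupby`, with no merge.

```python
import pandas as pd

def extract_utterances(df_words, max_gap=0.5):
    """Group words into per-speaker utterances and flag crosstalk-free ones."""
    w = df_words.reset_index(drop=True)
    w["pos"] = w.index  # row position, used for the contiguity check
    prev_end = w.groupby("speaker")["end_time"].shift(1)
    gap = (w["start_time"] - prev_end).fillna(0)
    # 1 where a word opens a new utterance, then a running total per speaker.
    w["utt"] = (gap > max_gap).astype(int).groupby(w["speaker"]).cumsum()
    out = (
        w.groupby(["speaker", "utt"], sort=False)
        .agg(words=("word", list), start_pos=("pos", "min"),
             end_pos=("pos", "max"), n_words=("pos", "count"))
        .reset_index()
    )
    # Clean = the utterance's rows are consecutive (no other speaker in between).
    out["is_clean"] = out["end_pos"] - out["start_pos"] + 1 == out["n_words"]
    return out.sort_values("start_pos").reset_index(drop=True)

# Rows 0-8 of the sample from the question.
sample = pd.DataFrame({
    "speaker": ["E", "E", "A", "E", "A", "E", "E", "E", "A"],
    "word": ["'Kay", ".", "Okay", "Gosh", ".", ".", "'Kay", ".", "Does"],
    "start_time": [3.34, 3.88, 5.57, 5.60, 5.94, 6.01, 10.48, 10.88, 11.09],
    "end_time": [3.88, 3.88, 5.94, 6.01, 5.94, 6.01, 10.88, 10.88, 11.25],
})
result = extract_utterances(sample)
print(result[["speaker", "words", "is_clean"]])
```

Sorting by `start_pos` returns the utterances in the order they begin, which matches the order used in the question's list.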