Pandas groupby on text: get sentence numbering for multiple sentences per group
My dataframe looks like this:
id sentence ind
747 A simple and convenient colorimetric method is... NaN
747 A simple and convenient colorimetric method is... NaN
747 A simple and convenient colorimetric method is... ulcerative
749 Of special significance was the increased acti... NaN
749 Of special significance was the increased acti... NaN
749 Of special significance was the increased acti... head injuries
749 Of special significance was the increased acti... NaN
858 Some patients with acute viral hepatitis or pr... acute viral
858 Some patients with acute viral hepatitis or pr... NaN
858 Some patients with acute viral hepatitis or pr... NaN
948 The other ALP isozyme of FL cells had properti... NaN
948 The other ALP isozyme of FL cells had properti... NaN
948 The other ALP isozyme of FL cells had properti... NaN
948 It was found that a human hepatoma-associated ... NaN
948 It was found that a human hepatoma-associated ... hepatoma
948 It was found that a human hepatoma-associated ... NaN
948 It was more heat stable and more sensitive to ... virus
948 It was more heat stable and more sensitive to ... NaN
948 It was more heat stable and more sensitive to ... NaN
I am using df.groupby(['id', 'sentence']).first().head(20) and I get this:
pmid sentence ind
747 A simple and convenient colorimetric method is... NaN
749 Of special significance was the increased acti... NaN
858 Some patients with acute viral hepatitis or pr... acute viral
948 It was found that a human hepatoma-associated... hepatoma
It was more heat stable and more sensitive to... virus
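As we can see, for id=948 there is more than one (id, sentence) pair.
My question: since a single id can have several (id, sentence) pairs, is there a way to get a sentence number for each id in my dataframe?
For example, something like this: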
id sentence_nr sentence ind
747 01 A simple and convenient colorimetric method is... NaN
749 01 Of special significance was the increased acti... NaN
858 01 Some patients with acute viral hepatitis or pr... acute viral
948 01 It was found that a human hepatoma-associated ... hepatoma
948 02 It was more heat stable and more sensitive to ... virus
You can use GroupBy.cumcount:
df_grouped = df.groupby(['id', 'sentence'], as_index=False).first()
df_grouped['sentence_nr'] = df_grouped.groupby(df_grouped['id']).cumcount() + 1
print(df_grouped)
id sentence ind sentence_nr
0 747 A simple and convenient colorimetric method is... ulcerative 1
1 749 Of special significance was the increased acti... head injuries 1
2 858 Some patients with acute viral hepatitis or pr... acute viral 1
3 948 It was found that a human hepatoma-associated ... hepatoma 1
4 948 It was more heat stable and more sensitive to ... virus 2
5 948 The other ALP isozyme of FL cells had properti... None 3
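Two details of the output above are worth noting. By default groupby sorts the group keys, so the numbering follows the alphabetical order of the sentences rather than their order in the data ("The other ALP isozyme..." gets number 3 although it comes first for id 948). Also, first() returns the first non-null value per column, which is why ind shows "ulcerative" for id 747. The sketch below is one possible variation, not the only way to do it: it assumes you want numbering by order of first appearance (sort=False) and zero-padded labels such as 01 and 02 as in the question. The toy sentences and the sentence_label column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's frame.
df = pd.DataFrame({
    "id": [747, 747, 948, 948, 948, 948],
    "sentence": ["s a", "s a", "s c", "s c", "s b", "s b"],
    "ind": [np.nan, "ulcerative", np.nan, np.nan, "hepatoma", np.nan],
})

# Collapse to one row per (id, sentence); first() picks the first non-null
# value in each column. sort=False keeps groups in order of first appearance
# instead of sorting the keys.
df_grouped = df.groupby(["id", "sentence"], as_index=False, sort=False).first()

# Number the sentences within each id, starting at 1.
df_grouped["sentence_nr"] = df_grouped.groupby("id").cumcount() + 1

# Assumed extra column: zero-padded string labels such as "01", "02".
df_grouped["sentence_label"] = df_grouped["sentence_nr"].astype(str).str.zfill(2)

print(df_grouped)
```

If alphabetical numbering is what you want, drop sort=False and the result matches the answer above.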