Finding keyword +1 and making new column

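Goal:

1) Find the word next to a keyword (e.g. brca)

2) Create a new column with that word

Background:

1) I have a list l, from which I made a dataframe df, and I extracted the word brca from it using the following code: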

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])
df['Gene'] = df['Text'].str.extract(r"(brca)")

Output:

                                                Text    Gene
0   breast invasive lobular carcinoma brca positiv...   brca
1   clinical history brca gene mutation . gross de...   brca
2   left breast invasive ductal carcinoma brca pos...   brca

Problem:

However, I am now trying to find the word next to brca in each row and create a new column from it.

Desired output:

                                                Text    Gene  NextWord
0   breast invasive lobular carcinoma brca positiv...   brca  positive
1   clinical history brca gene mutation . gross de...   brca  gene
2   left breast invasive ductal carcinoma brca pos...   brca  positive

I have looked at other related questions, but they do not quite work for me.

Question:

How do I achieve my goal?

We can use Python's built-in string method partition:

df['NextWord'] = df['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]

Output

                                            Text  Gene  NextWord
0  carcinoma brca positive completion mastectomy  brca  positive
1                    clinical brca gene mutation  brca      gene
2           carcinoma brca positive chemotherapy  brca  positive

Explanation

.partition returns three values:

  • the string before the keyword
  • the keyword itself
  • the string after the keyword

string = 'carcinoma brca positive completion mastectomy'

before, keyword, after = string.partition('brca')

print(before)
print(keyword)
print(after)

Output

carcinoma 
brca
 positive completion mastectomy
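
One detail the example above does not show: when the keyword is absent, partition returns the whole string followed by two empty strings. A minimal plain-Python sketch of the same idea, with that case handled explicitly, might look like this (next_word is a hypothetical helper name, not part of pandas or the standard library):

```python
def next_word(text, keyword):
    """Return the first whitespace-delimited word after keyword, or None."""
    before, found, after = text.partition(keyword)
    if not found:
        # Keyword absent: partition returned (text, '', '')
        return None
    words = after.split()
    # The keyword may be the last thing in the text, leaving no next word
    return words[0] if words else None


print(next_word('carcinoma brca positive completion mastectomy', 'brca'))
print(next_word('clinical brca gene mutation', 'brca'))
print(next_word('no keyword here', 'brca'))
```

Note that partition matches a plain substring, so for a text such as 'brca1 positive' this sketch would return '1' rather than 'positive'.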

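Speed

I was curious how the answers compare on speed, since mine uses .apply, although with a built-in method. To my surprise, my answer was the fastest: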

dfbig = pd.concat([df]*10000, ignore_index=True)
dfbig.shape

(30000, 2)

%%timeit
dfbig['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]
31.5 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
dfbig['NextWord'] = dfbig['Text'].str.split('brca').str[1].str.split(r'\s').str[1]
74.5 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
dfbig['NextWord'] = dfbig['Text'].str.extract(r"(?<=brca)(.+?) ")
40.7 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
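
%%timeit only works inside IPython or Jupyter. To repeat the comparison from a plain script, something like the following sketch with the standard timeit module should work. The function names via_partition, via_split and via_extract are made up for illustration, and the numbers will differ by machine and pandas version, so treat it as a way to rerun the test rather than as a confirmation of the timings above.

```python
import timeit

import pandas as pd

texts = ['carcinoma brca positive completion mastectomy',
         'clinical brca gene mutation',
         'carcinoma brca positive chemotherapy']
# Same 30000-row frame as in the benchmark above
dfbig = pd.concat([pd.DataFrame(texts, columns=['Text'])] * 10000,
                  ignore_index=True)


def via_partition():
    return dfbig['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]


def via_split():
    return dfbig['Text'].str.split('brca').str[1].str.split(r'\s').str[1]


def via_extract():
    # Returns a one-column DataFrame; the captured text keeps a leading space
    return dfbig['Text'].str.extract(r"(?<=brca)(.+?) ")


for func in (via_partition, via_split, via_extract):
    seconds = timeit.timeit(func, number=10) / 10
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per loop")
```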

Heavy use of the pandas Series.str accessor:

df['NextWord'] = df['Text'].str.split('brca').str[1].str.split(r'\s').str[1]
df

                                            Text  Gene  NextWord
0  carcinoma brca positive completion mastectomy  brca  positive
1                    clinical brca gene mutation  brca      gene
2           carcinoma brca positive chemotherapy  brca  positive

Using str.extract:

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])

df['NextWord'] = df['Text'].str.extract(r"(?<=brca)(.+?) ")
print(df)

Output:

                                            Text   NextWord
0  carcinoma brca positive completion mastectomy   positive
1                    clinical brca gene mutation       gene
2           carcinoma brca positive chemotherapy   positive
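
Two caveats with the pattern above: the captured group keeps the space that follows brca, and the pattern needs a space after the next word, so a next word at the very end of the text is not matched. A possible variant, offered only as a sketch, moves the whitespace outside the group and captures a run of non-space characters. The extra sample rows and the choice to require a word boundary before brca are assumptions for illustration, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({'Text': ['carcinoma brca positive completion mastectomy',
                            'clinical brca gene',   # next word ends the text
                            'tested for brca',      # nothing after the keyword
                            'no keyword here']})    # keyword missing

# \b keeps "brca" from matching inside a longer word on its left;
# \s+ skips the separating whitespace; (\S+) captures the next word.
# expand=False makes str.extract return a Series instead of a DataFrame.
df['NextWord'] = df['Text'].str.extract(r"\bbrca\s+(\S+)", expand=False)
print(df)
```

Rows with no following word, or without the keyword at all, come back as NaN. Because the pattern requires whitespace directly after brca, a text like 'brca1 positive' also yields NaN here; whether that is the desired behaviour depends on the data.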