Finding the word after a keyword (keyword +1) and making a new column
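Goal:
1) Find the word next to a keyword (e.g. brca)
2) Create a new column from that word
Background:
1) I have a list l, from which I made a DataFrame df and extracted the word brca using the following code: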
import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])
df['Gene'] = df['Text'].str.extract(r"(brca)")
Output:
Text Gene
0 breast invasive lobular carcinoma brca positiv... brca
1 clinical history brca gene mutation . gross de... brca
2 left breast invasive ductal carcinoma brca pos... brca
Problem:
However, I am now trying to find the word next to brca in each row and create a new column from it.
Desired output:
Text Gene NextWord
0 breast invasive lobular carcinoma brca positiv... brca positive
1 clinical history brca gene mutation . gross de... brca gene
2 left breast invasive ductal carcinoma brca pos... brca positive
I have looked at related questions, but they did not quite work for me.
Question:
How do I achieve my goal?
We can use Python's built-in str.partition method:
df['NextWord'] = df['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]
Output
Text Gene NextWord
0 carcinoma brca positive completion mastectomy brca positive
1 clinical brca gene mutation brca gene
2 carcinoma brca positive chemotherapy brca positive
Explanation
.partition returns three values:
- the string before the keyword
- the keyword itself
- the string after the keyword
string = 'carcinoma brca positive completion mastectomy'
before, keyword, after = string.partition('brca')
print(before)
print(keyword)
print(after)
Output
carcinoma
brca
positive completion mastectomy
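One case worth keeping in mind: when the keyword is absent, str.partition returns the whole string followed by two empty strings, so there is no next word to pick. A minimal sketch of a helper that makes this explicit follows. The name next_word and the choice of returning None are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd


def next_word(text, keyword="brca"):
    """Return the first whitespace-delimited word after `keyword`, or None.

    `next_word` is a hypothetical helper name used only for this sketch.
    """
    # partition always returns a 3-tuple; if `keyword` is absent,
    # the second and third items are empty strings.
    _, found, after = text.partition(keyword)
    if not found:
        return None
    words = after.split()
    # The keyword may be the last token, leaving nothing after it.
    return words[0] if words else None


df = pd.DataFrame({"Text": ["carcinoma brca positive completion mastectomy",
                            "clinical brca gene mutation",
                            "no keyword in this row"]})
df["NextWord"] = df["Text"].apply(next_word)
print(df)
```

Since partition matches substrings, a token such as brca1 would also count as a hit; a regex with word boundaries may be a better fit if that matters.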
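Speed
I was curious how the answers compare in speed, since mine uses .apply, although with a built-in method. To my surprise, my answer turned out to be the fastest: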
dfbig = pd.concat([df]*10000, ignore_index=True)
dfbig.shape
(30000, 2)
%%timeit
dfbig['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]
31.5 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfbig['NextWord'] = dfbig['Text'].str.split('brca').str[1].str.split(r'\s').str[1]
74.5 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfbig['NextWord'] = dfbig['Text'].str.extract(r"(?<=brca)(.+?) ")
40.7 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
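The %%timeit cells above only run in IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard-library timeit module, could look like this. The function names are made up for the sketch, and timings depend on the machine and pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
dfbig = pd.concat([pd.DataFrame(l, columns=['Text'])] * 10000,
                  ignore_index=True)


def via_partition(s):
    # Plain Python str.partition applied row by row
    return s.apply(lambda x: x.partition('brca')[2]).str.split().str[0]


def via_split(s):
    # Chained Series.str calls
    return s.str.split('brca').str[1].str.split(r'\s').str[1]


def via_extract(s):
    # expand=False returns a Series instead of a one-column DataFrame
    return s.str.extract(r"(?<=brca)(.+?) ", expand=False)


for func in (via_partition, via_split, via_extract):
    # Average over 10 calls, reported in milliseconds
    seconds = timeit.timeit(lambda: func(dfbig['Text']), number=10) / 10
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per loop")
```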
Making heavy use of the pandas Series.str accessor:
df['NextWord'] = df['Text'].str.split('brca').str[1].str.split(r'\s').str[1]
df
Text Gene NextWord
0 carcinoma brca positive completion mastectomy brca positive
1 clinical brca gene mutation brca gene
2 carcinoma brca positive chemotherapy brca positive
Use str.extract with a lookbehind:
import pandas as pd
l = ['carcinoma brca positive completion mastectomy',
'clinical brca gene mutation',
'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])
df['NextWord'] = df['Text'].str.extract(r"(?<=brca)(.+?) ")
print(df)
Output:
Text NextWord
0 carcinoma brca positive completion mastectomy positive
1 clinical brca gene mutation gene
2 carcinoma brca positive chemotherapy positive
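Two details of this pattern are easy to miss. It requires a space after the captured word, so a word at the very end of a string is not matched, and the lazy group starts at the space directly after brca, so the captured value carries a leading space. A possible variant, offered as a sketch rather than as the answerer's method, captures the next run of non-whitespace characters instead:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['carcinoma brca positive completion mastectomy',
                            'clinical brca gene mutation',
                            'carcinoma brca positive']})

# \s+ consumes the whitespace after the keyword; (\S+) captures the next word.
# expand=False returns a Series rather than a one-column DataFrame.
df['NextWord'] = df['Text'].str.extract(r"brca\s+(\S+)", expand=False)
print(df['NextWord'].tolist())  # ['positive', 'gene', 'positive']
```

The third row ends right after the word following brca and is still matched, because the pattern does not depend on a trailing space.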