Using selected keywords from a string column to form one-hot-encoding-type columns of a pandas DataFrame

Question

To illustrate my problem, consider the following example. Suppose I have the following DataFrame:

index	ignore_x	ignore_y	phrases
0	43	23	cat eats mice
1	1.3	33	water is pure
2	13	63	machine learning
3	15	35	where there is a will, there is a way
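For anyone who wants to reproduce the example, the frame above can be built roughly as follows. This is a minimal sketch that assumes the `index` header in the table is just the default integer index rather than a real column; the column names and values are taken from the table.

```python
import pandas as pd

# Rebuild the example frame; "index" is assumed to be the default RangeIndex
df = pd.DataFrame({
    "ignore_x": [43, 1.3, 13, 15],
    "ignore_y": [23, 33, 63, 35],
    "phrases": [
        "cat eats mice",
        "water is pure",
        "machine learning",
        "where there is a will, there is a way",
    ],
})

# The keywords that should each get their own dummy column
keywords = ["cat", "is"]

print(df)
```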

Now suppose there are some words for which I want to form dummy variables, and only for those words.

keywords = ['cat', 'is']

To do this, a separate column is populated for each keyword:

index	ignore_x	ignore_y	phrases
0	43	23	cat eats mice
1	1.3	33	water is pure
2	13	63	machine learning
3	15	35	where there is a will, there is a way

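Each phrase is scanned for the keywords, and if a keyword is present, its column gets True or 1. (An alternative would be to count the occurrences, but let's keep it simple for now.)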

index	ignore_x	ignore_y	phrases	kw_cat	kw_is
0	43	23	cat eats mice	1	0
1	1.3	33	water is pure	0	1
2	13	63	machine learning	0	0
3	15	35	where there is a will, there is a way	0	1

What have I tried? Loosely, I have been trying to do something like this:

for row, element in enumerate(df):
    for item in keywords:
        if item in df['phrases'].str.split(' '):
            df.loc[row, element] = 1

But this is not helping me. Instead, it gives me a diagonal of 1s in those dummy variables.

Thanks :)

Edit: just bolded the keywords to help you skim through quickly :)

Answer 1

You can use nltk.tokenize.word_tokenize() to split each sentence into a list of words:

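import nltk
# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing,
# NLTK raises a LookupError naming the resource to fetch with nltk.download()

keywords = ['cat', 'is']

# Tokenize each phrase into a list of words and punctuation marks
tokenize = df['phrases'].apply(nltk.tokenize.word_tokenize)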

print(tokenize)

0                                    [cat, eats, mice]
1                                    [water, is, pure]
2                                  [machine, learning]
3    [where, there, is, a, will, ,, there, is, a, way]

Then loop over keywords and check whether each keyword is in the resulting list of words.

for keyword in keywords:
    df[f'kw_{keyword}'] = tokenize.apply(lambda lst: int(keyword in lst))

print(df)

   index  ignore_x  ignore_y                                phrases  kw_cat  kw_is
0      0      43.0        23                          cat eats mice       1      0
1      1       1.3        33                          water is pure       0      1
2      2      13.0        63                       machine learning       0      0
3      3      15.0        35  where there is a will, there is a way       0      1
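The question also mentions counting occurrences instead of flagging presence. A minimal sketch of that variant is below. It swaps the NLTK tokenizer for a word-boundary regular expression with `Series.str.count`, and the `n_` column prefix is a made-up name used only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "phrases": [
        "cat eats mice",
        "water is pure",
        "machine learning",
        "where there is a will, there is a way",
    ],
})
keywords = ["cat", "is"]

for kw in keywords:
    # \b anchors the match at word boundaries so "is" does not match inside "this"
    pattern = rf"\b{re.escape(kw)}\b"
    # str.count returns the number of regex matches in each phrase
    df[f"n_{kw}"] = df["phrases"].str.count(pattern)

print(df[["n_cat", "n_is"]])
```

With this sample data the last phrase gets a count of 2 for "is", where the presence-only version reports 1.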

Answer 2

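# str.find returns -1 when the substring is absent, so this gives a boolean mask per row
a_cat = df['phrases'].str.find('cat') != -1
a_is = df['phrases'].str.find('is') != -1

# Cast the masks to int so non-matching rows get 0 rather than NaN
df['kw_cat'] = a_cat.astype(int)
df['kw_is'] = a_is.astype(int)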

Output

   index  x_ignore  ...  kw_cat kw_is
0      0      43.0  ...       1     0
1      1       1.3  ...       0     1
2      2      13.0  ...       0     0
3      3      15.0  ...       0     1

If there are many keywords, here is the code as a loop.

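keywords = ['cat', 'is']
prefix = 'kw_'
for kw in keywords:
    # True where the keyword occurs anywhere in the phrase (substring match)
    found = df['phrases'].str.find(kw) >= 0
    # Cast to int so every row gets 1 or 0 instead of 1 or NaN
    df[prefix + kw] = found.astype(int)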

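Here the required string is searched for in each phrase, the comparison yields True or False for each row, and the column values are set from that result.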
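One thing to keep in mind is that `str.find` matches substrings, so "is" is also found inside "this". If whole-word matching is wanted, a possible sketch uses `str.contains` with word boundaries. The `demo` frame and the `sub_` columns below are made up purely to show the difference.

```python
import re
import pandas as pd

# Hypothetical phrases chosen so that substring and whole-word matching disagree
demo = pd.DataFrame({"phrases": ["this cat", "water is pure", "concatenate"]})
keywords = ["cat", "is"]

for kw in keywords:
    # Substring match, as in the str.find approach
    substring = demo["phrases"].str.find(kw) >= 0
    # Whole-word match: \b requires a word boundary on both sides of the keyword
    whole_word = demo["phrases"].str.contains(rf"\b{re.escape(kw)}\b", regex=True)
    demo[f"sub_{kw}"] = substring.astype(int)
    demo[f"kw_{kw}"] = whole_word.astype(int)

print(demo)
```

Here "concatenate" is flagged for "cat" and "this cat" is flagged for "is" only in the substring columns.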

Answer 3

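Here is one way to do it. Since the phrases are strings, convert them to lists in a new column (phrases2 in my example). explode turns the list elements into separate rows, which are then filtered by the keywords. get_dummies converts the categorical data into columns, and finally the duplicates are dropped.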

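import pandas as pd

# Work on a copy so the original df does not gain the helper column
df2 = df.copy()
df2['phrases2'] = df2['phrases'].str.split(' ')
df2 = df2.explode('phrases2')
# Keep only the exploded rows whose word is one of the keywords
df2 = df2[df2['phrases2'].isin(keywords)]
# dtype=int gives 0/1 columns; newer pandas versions default to booleans
pd.get_dummies(df2, columns=['phrases2'], dtype=int).drop_duplicates()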

    index   ignore_x    ignore_y    phrases                             phrases2_cat phrases2_is
0       0   43            23        cat eats mice                               1     0
1       1   1.3           33        water is pure                               0     1
3       3   15            35        where there is a will, there is a way       0     1
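Note that the row without any keyword ("machine learning") drops out of the result above. If every row should be kept, one alternative sketch uses `Series.str.get_dummies` and then keeps only the keyword columns. This is a different technique from the explode approach, not a rewrite of it, and it splits on single spaces only, so a token such as "will," keeps its comma.

```python
import pandas as pd

df = pd.DataFrame({
    "phrases": [
        "cat eats mice",
        "water is pure",
        "machine learning",
        "where there is a will, there is a way",
    ],
})
keywords = ["cat", "is"]

dummies = (
    df["phrases"]
    .str.get_dummies(sep=" ")                 # one 0/1 column per distinct token
    .reindex(columns=keywords, fill_value=0)  # keep only the keywords; absent ones become 0
    .add_prefix("kw_")
)

# Join on the index so all original rows survive, including those with no keyword
result = df.join(dummies)
print(result)
```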


python

pandas

dummy-variable