How to create a column that contains words found in a string?

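I have a dataset with a column of strings that holds multiple titles. I want to create a new column (colB) using contains or like (for example like("%hello%")). If the string contains "discount", colB should be filled with discount; if it contains "hello", colB should be filled with hello. If it contains both, they should be separated by a comma.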

Here is an example of the table I want:

colA                       colB
Hello World                hello
Hi World                   null
Discounts for apples       discount
Check this discount!       discount
Hello World and discount!  discount, hello

How can I create this kind of dataset? Many thanks!

Here is one way to do it:

import pandas as pd
import numpy as np

# Sample data: one title per row
df = pd.DataFrame({'colA': 'Hello World,Hi World,Discounts for apples,Check this discount!,Hello World and discount!'.split(',')})
# Join every keyword found in the lowercased title; an empty result becomes NaN
df['colB'] = df['colA'].apply(lambda x: ', '.join(y for y in ['discount', 'hello'] if y in x.lower())).apply(lambda x: x if x else np.nan)
print(df)

Output:

                        colA             colB
0                Hello World            hello
1                   Hi World              NaN
2       Discounts for apples         discount
3       Check this discount!         discount
4  Hello World and discount!  discount, hello
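
The matching logic in the lambda above can also be pulled into a small named function, which makes it easier to test and to reuse with a different keyword list. This is only a sketch: the name find_keywords and its keywords parameter are hypothetical, not part of the original answer, and it returns None instead of NaN for rows with no match.

```python
def find_keywords(text, keywords=("discount", "hello")):
    """Return the keywords found in text, comma-separated, or None if none match.

    Hypothetical helper: matching is a case-insensitive substring test,
    so "Discounts" matches the keyword "discount".
    """
    lowered = text.lower()
    # Keep the keywords in the order given, not the order they appear in the text
    found = [k for k in keywords if k in lowered]
    return ", ".join(found) if found else None


for title in ["Hello World", "Hi World", "Hello World and discount!"]:
    print(title, "->", find_keywords(title))
```

With the DataFrame from the answer, this could be applied as df['colB'] = df['colA'].apply(find_keywords).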

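You can use concat_ws to join the matched keywords together. A when() without an otherwise() returns null when the title does not match, and concat_ws skips nulls, so only the matching keywords are joined. Note that rows with no match get an empty string rather than null. This assumes a PySpark DataFrame df with the titles in a column named a.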

from pyspark.sql import functions as F

(df
    .withColumn('b', F.concat_ws(', ', F.array(
        F.when(F.lower('a').like('%hello%'), F.lit('hello')),
        F.when(F.lower('a').like('%discount%'), F.lit('discount')),
    )))
    .show(10, False)
)

+-------------------------+---------------+
|a                        |b              |
+-------------------------+---------------+
|Hello World              |hello          |
|Hi World                 |               |
|Discounts for apples     |discount       |
|Check this discount!     |discount       |
|Hello World and discount!|hello, discount|
+-------------------------+---------------+