How to identify if a particular string/pattern exists in a column using pySpark

Question

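Below is my sample data frame of household items.

Here W stands for Wooden, G for Glass, and P for Plastic, and different items are grouped under those categories. I want to identify which items fall under the W, G, and P categories. As a first step, I tried to classify this for Chair.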

M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
                                ('W-Chair',''),
                                ('W-Shelf;G-Cup;P-Chair',''),
                                ('G-Cup;P-ShowerCap;W-Board','')],
                               ['Household_chores_arrangements','Chair'])

M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|   W-Chair-Shelf;G-Vase;P-Cup|     |
|                      W-Chair|     |
|        W-Shelf;G-Cup;P-Chair|     |
|    G-Cup;P-ShowerCap;W-Board|     |
+-----------------------------+-----+

I tried to do this with a condition that would label the row as W, but I did not get the expected result; my condition is probably wrong.

df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)

Is there a better way to do this in pySpark?

Expected output

+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|   W-Chair-Shelf;G-Vase;P-Cup|    W|
|                      W-Chair|    W|
|        W-Shelf;G-Cup;P-Chair|    P|
|    G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+

Thanks to @mck for providing the solution.

Update: In addition to this, I tried to explore the regexp_extract option further, so I changed the sample set.

M = sqlContext.createDataFrame([('Wooden|Chair',''),
                                ('Wooden|Cup;Glass|Chair',''),
                                ('Wooden|Cup;Glass|Showercap;Plastic|Chair','')],
                               ['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
    select 
        Household_chores_arrangements, 
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair 
    from M
""")
display(df)

Result:

+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+

I changed the delimiter to | instead of - and updated the query to match. I expected the following result, but got the wrong one:

+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+

If only the delimiter changes, do we need to change anything else?

Update 2

I have found the solution for the update above.

For the pipe delimiter, we have to use four backslashes (\\\\) to escape it.
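As a rough illustration of why the unescaped pattern always returned Wooden, the sketch below runs both versions of the pattern through Python's re module instead of Spark. This is an assumption made for convenience: Spark's regexp_extract uses Java regular expressions, which treat these simple patterns the same way. The helper first_group is hypothetical, and the comment about the SQL parser assumes Spark's default string-literal handling.

```python
import re

rows = [
    "Wooden|Chair",
    "Wooden|Cup;Glass|Chair",
    "Wooden|Cup;Glass|Showercap;Plastic|Chair",
]

# Unescaped pipe: "(|Chair)" is an alternation between the empty string and
# "Chair", so the first category word in the row always matches.
unescaped = r"(Wooden|Glass|Plastic)(|Chair)"

# Escaped pipe: "\|" is a literal "|", so the category must be directly
# followed by "|Chair".
escaped = r"(Wooden|Glass|Plastic)\|Chair"


def first_group(pattern, text):
    """Hypothetical helper: return capture group 1 of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


unescaped_result = [first_group(unescaped, row) for row in rows]
escaped_result = [first_group(escaped, row) for row in rows]

print(unescaped_result)  # ['Wooden', 'Wooden', 'Wooden']
print(escaped_result)    # ['Wooden', 'Glass', 'Plastic']

# In a regular (non-raw) Python string passed to spark.sql, Python reduces
# four backslashes to two, and (assuming default settings) the Spark SQL
# parser reduces those two to the single backslash the regex engine needs.
sql_fragment = "(Wooden|Glass|Plastic)\\\\|Chair"
print(sql_fragment)      # (Wooden|Glass|Plastic)\\|Chair
```

Writing the query as a raw Python string (r"""...""") would remove one layer of escaping, so two backslashes should then be enough; that is worth checking against your own Spark configuration.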

Answer 1

You can use regexp_extract to extract the category, and use nullif to replace the empty string with null when no match is found.

df = spark.sql("""
    select 
        Household_chores_arrangements, 
        nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair 
    from M
""")

df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup   |W    |
|W-Chair                      |W    |
|W-Shelf;G-Cup;P-Chair        |P    |
|G-Cup;P-ShowerCap;W-Board    |null |
+-----------------------------+-----+
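One caveat, offered as a sketch rather than part of the answer above: the pattern '([A-Z])-Chair' only matches when Chair is the first item after the category letter. The sample data contains 'W-Chair-Shelf', so a row such as 'W-Shelf-Chair' seems possible, although that row is hypothetical and not in the question. The example below uses Python's re module to try a more tolerant pattern; the function category_of is an illustrative name, and the assumption is that Spark's Java regex engine accepts the same constructs (non-capturing group, negative lookahead).

```python
import re

rows = [
    "W-Chair-Shelf;G-Vase;P-Cup",
    "W-Chair",
    "W-Shelf;G-Cup;P-Chair",
    "G-Cup;P-ShowerCap;W-Board",
    "W-Shelf-Chair;G-Cup",  # hypothetical row: Chair is not the first item
]


def category_of(item, text):
    """Return the one-letter category whose item list contains `item`, or None."""
    # ([A-Z])            the category letter
    # (?:[A-Za-z]+-)*    skips earlier items in the same category group
    # (?![A-Za-z])       stops "Chair" from matching a longer name like "Chairs"
    pattern = r"([A-Z])-(?:[A-Za-z]+-)*" + re.escape(item) + r"(?![A-Za-z])"
    match = re.search(pattern, text)
    return match.group(1) if match else None


chair_categories = [category_of("Chair", row) for row in rows]
print(chair_categories)  # ['W', 'W', 'P', None, 'W']

# The simpler pattern from the answer misses the hypothetical last row.
print(re.search(r"([A-Z])-Chair", rows[4]))  # None
```

Because the item name is a parameter, the same function can be called with "Cup" or "Shelf" for the other items. If the pattern is moved into a spark.sql query, any backslashes it gains would need escaping for the SQL string literal.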
