Unable to perform row operations in pyspark dataframe

I have a dataset of this form:

Store_Name     Items                                   Ratings
Cartmax        Cosmetics, Clothing, Perfumes           4.6/5
DollarSmart    Watches, Clothing                       NEW
Megaplex       Shoes, Cosmetics, Medicines, Sports     4.2/5
1. I want to create a new column containing the number of items at each store. For example, in the first row the Items column has 3 items, so the value of the new column for that row should be 3.

2. Several rows in the Ratings column have 'NEW' and 'NULL' values. I want to drop all of those rows.

First filter out the rows where Ratings is null or 'NEW', then use the size and split functions to get the number of items.

import pyspark.sql.functions as F

......
# Drop the rows where Ratings is null or 'NEW', then count the comma-separated entries in Items
df = df.filter('Ratings is not null and Ratings != "NEW"') \
       .withColumn('num_items', F.size(F.split('Items', ',')))
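
For reference, the same filter can also be written with Column expressions instead of a SQL expression string; a minimal equivalent sketch:

# Same condition expressed with Column objects rather than a SQL string
df = df.filter(F.col('Ratings').isNotNull() & (F.col('Ratings') != 'NEW'))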

You can achieve this using filter and split, as shown below -

Data Preparation

from io import StringIO

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

sql = SparkSession.builder.getOrCreate()  # SparkSession handle used below

# Build a small pandas DataFrame from the sample rows (tab-separated, matching delimiter='\t'),
# then convert it to a Spark DataFrame
s = StringIO("""
Store_Name  Items   Ratings
Cartmax Cosmetics, Clothing, Perfumes   4.6/5
DollarSmart Watches, Clothing   NEW
Megaplex    Shoes, Cosmetics, Medicines, Sports 4.2/5
""")

df = pd.read_csv(s, delimiter='\t')

sparkDF = sql.createDataFrame(df)

sparkDF.show(truncate=False)

+-----------+-----------------------------------+-------+
|Store_Name |Items                              |Ratings|
+-----------+-----------------------------------+-------+
|Cartmax    |Cosmetics, Clothing, Perfumes      |4.6/5  |
|DollarSmart|Watches, Clothing                  |NEW    |
|Megaplex   |Shoes, Cosmetics, Medicines, Sports|4.2/5  |
+-----------+-----------------------------------+-------+

Filter and Split

# Keep rows whose Ratings is neither the literal strings 'NEW'/'NULL' nor an actual null,
# then count the comma-separated entries in Items
sparkDF = sparkDF.filter((~F.col('Ratings').isin(['NEW','NULL'])) & F.col('Ratings').isNotNull())\
                 .withColumn('NumberOfItems',F.size(F.split(F.col('Items'),',')))


sparkDF.show(truncate=False)

+----------+-----------------------------------+-------+-------------+
|Store_Name|Items                              |Ratings|NumberOfItems|
+----------+-----------------------------------+-------+-------------+
|Cartmax   |Cosmetics, Clothing, Perfumes      |4.6/5  |3            |
|Megaplex  |Shoes, Cosmetics, Medicines, Sports|4.2/5  |4            |
+----------+-----------------------------------+-------+-------------+
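
A side note on the split: splitting on ',' leaves a leading space on every element after the first (e.g. ' Clothing'). That does not affect the count, but if you also want a clean array of item names, split accepts a regular expression, so something like the sketch below works (ItemsArray is just an illustrative column name):

# split's pattern is a regex, so ',\s*' also swallows the space that follows each comma
sparkDF = sparkDF.withColumn('ItemsArray', F.split(F.col('Items'), r',\s*'))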