Unable to perform row operations in a PySpark DataFrame
I have a dataset of this form:
Store_Name   Items                                 Ratings
Cartmax      Cosmetics, Clothing, Perfumes         4.6/5
DollarSmart  Watches, Clothing                     NEW
Megaplex     Shoes, Cosmetics, Medicines, Sports   4.2/5
I want to create a new column with the number of items at each store. For example, in the first row the Items column has 3 items, so the first value in the new column should be 3.
In the Ratings column, several rows have 'NEW' and 'NULL' values. I want to drop all of those rows.
First filter out the rows where Ratings is null or 'NEW', then use the size and split functions to get the number of items:
import pyspark.sql.functions as F
...
# Keep only rows with a real rating, then count the comma-separated items
df = df.filter('Ratings is not null and Ratings != "NEW"') \
       .withColumn('num_items', F.size(F.split('Items', ',')))
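For reference, here is a minimal end-to-end check of that one-liner, assuming an active SparkSession named spark (not part of the original snippet). Note that if 'NULL' can appear as a literal string rather than a true null, you would also need Ratings != 'NULL' in the filter:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data mirroring the question
df = spark.createDataFrame(
    [('Cartmax', 'Cosmetics, Clothing, Perfumes', '4.6/5'),
     ('DollarSmart', 'Watches, Clothing', 'NEW'),
     ('Megaplex', 'Shoes, Cosmetics, Medicines, Sports', '4.2/5')],
    ['Store_Name', 'Items', 'Ratings'])

df = df.filter('Ratings is not null and Ratings != "NEW"') \
       .withColumn('num_items', F.size(F.split('Items', ',')))
df.show(truncate=False)
# Expected: Cartmax -> 3, Megaplex -> 4; the DollarSmart row is dropped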
You can achieve this using filter and split as shown below -
Data preparation
from io import StringIO
import pandas as pd
import pyspark.sql.functions as F

s = StringIO("""
Store_Name Items Ratings
Cartmax Cosmetics, Clothing, Perfumes 4.6/5
DollarSmart Watches, Clothing NEW
Megaplex Shoes, Cosmetics, Medicines, Sports 4.2/5
""")
df = pd.read_csv(s, delimiter='\t')

# `sql` is the SparkSession (or SQLContext) created elsewhere
sparkDF = sql.createDataFrame(df)
sparkDF.show(truncate=False)
+-----------+-----------------------------------+-------+
|Store_Name |Items |Ratings|
+-----------+-----------------------------------+-------+
|Cartmax |Cosmetics, Clothing, Perfumes |4.6/5 |
|DollarSmart|Watches, Clothing |NEW |
|Megaplex |Shoes, Cosmetics, Medicines, Sports|4.2/5 |
+-----------+-----------------------------------+-------+
Filter and split
sparkDF = sparkDF.filter((~F.col('Ratings').isin(['NEW', 'NULL'])) & F.col('Ratings').isNotNull())\
                 .withColumn('NumberOfItems', F.size(F.split(F.col('Items'), ',')))
sparkDF.show(truncate=False)
+----------+-----------------------------------+-------+-------------+
|Store_Name|Items |Ratings|NumberOfItems|
+----------+-----------------------------------+-------+-------------+
|Cartmax |Cosmetics, Clothing, Perfumes |4.6/5 |3 |
|Megaplex |Shoes, Cosmetics, Medicines, Sports|4.2/5 |4 |
+----------+-----------------------------------+-------+-------------+
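Note that the two conditions must be combined with & (logical AND), not |: with |, any non-null rating, including 'NEW', would pass the filter and the DollarSmart row would survive. If you prefer SQL syntax, here is a sketch of an equivalent filter over the same sparkDF:
# Same logic expressed as a SQL predicate string
sparkDF = sparkDF.filter("Ratings IS NOT NULL AND Ratings NOT IN ('NEW', 'NULL')") \
                 .withColumn('NumberOfItems', F.size(F.split(F.col('Items'), ',')))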