PySpark: replace characters in a DataFrame column and cast to float
Any ideas on this one in PySpark?
My Salary column contains values like the ones shown below. I am trying to remove the $:
df = df.withColumn('clean_salary', regexp_replace(col("Salary"), '$', ''))
df.show()
As you can see, it did nothing. Any idea why?
Thanks
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| id|first_name| last_name|gender| City| Job Title| Salary| Latitude| Longitude|clean_salary|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| 1| Melinde| Shilburne|Female| Nowa Ruda| Assistant Professor|438.18|50.5774075| 16.4967184| 438.18|
| 2| Kimberly|Von Welden|Female| Bulgan| Programmer II|846.60|48.8231572|103.5218199| 846.60|
| 3| Alvera| Di Boldi|Female| null| null|576.52|39.9947462|116.3397725| 576.52|
| 4| Shannon| O'Griffin| Male| Divnomorskoye|Budget/Accounting...|489.23|44.5047212| 38.1300171| 489.23|
| 5| Sherwood| Macieja| Male| Mytishchi| VP Sales|863.09| null| 37.6489954| 863.09|
| 6| Maris| Folk|Female|Kinsealy-Drinan| Civil Engineer|101.16|53.4266145| -6.1644997| 101.16|
| 7| Masha| Divers|Female| Dachun| null|090.87| 24.879416| 118.930111| 090.87|
| 8| Goddart| Flear| Male| Trélissac|Desktop Support T...|116.36|45.1905186| 0.7423124| 116.36|
| 9| Roth|O'Cannavan| Male| Heitan|VP Product Manage...|697.10| 32.027934| 106.657113| 697.10|
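In a regular expression, an unescaped $ is the end-of-string anchor, so the pattern '$' never matches the dollar character itself. Put it in a character class (or escape it) to match it literally. Try the regexp_replace code below: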
updatedDF = df.withColumn('clean_salary', regexp_replace(col("Salary"), "[$]", ""))
updatedDF.show()
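The difference between the two patterns can be checked without a cluster. The following is a minimal sketch that uses Python's `re` module as a stand-in for the Java regex engine Spark uses; the two engines agree on how `$` behaves in this case. The sample value `$1234.50` is hypothetical and not taken from the question's data.

```python
import re

raw = "$1234.50"  # hypothetical sample value

# An unescaped $ is an anchor: it matches the empty position at the end
# of the string, so replacing it with "" leaves the text unchanged.
unchanged = re.sub("$", "", raw)

# Inside a character class (or escaped as \$) it matches the literal character.
cleaned = re.sub("[$]", "", raw)
also_cleaned = re.sub(r"\$", "", raw)

# Once the symbol is gone, the string parses as a number.
value = float(cleaned)

print(unchanged, cleaned, also_cleaned, value)
```

The same reasoning applies to the pattern passed to regexp_replace: either "[$]" or an escaped "\\$" targets the literal dollar sign.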
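Rather than using a regular expression, it is easier to just drop the first character (unless the values in the salary column are not that simple):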
>>> df = sc.parallelize([('$123',),('$873',)]).toDF(['salary'])
>>> df.show()
+------+
|salary|
+------+
|  $123|
|  $873|
+------+
>>> df.select(df.salary.substr(2,100).cast('float').alias('salary')).show() #Float
+------+
|salary|
+------+
| 123.0|
| 873.0|
+------+
>>> df.select(df.salary.substr(2,100).cast('decimal(10,2)').alias('salary')).show() #Decimal
+------+
|salary|
+------+
|123.00|
|873.00|
+------+