Pyspark replace characters in DF column and cast as float

Any ideas on this one in PySpark?

The salaries in my Salary column look like the ones shown below. I am trying to remove the $:

df = df.withColumn('clean_salary', regexp_replace(col("Salary"), '$', ''))
df.show()

As you can see, it does nothing. Any idea why?

Thanks

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| id|first_name| last_name|gender|           City|           Job Title|   Salary|  Latitude|  Longitude|clean_salary|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|438.18|50.5774075| 16.4967184|   438.18|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|846.60|48.8231572|103.5218199|   846.60|
|  3|    Alvera|  Di Boldi|Female|           null|                null|576.52|39.9947462|116.3397725|   576.52|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|489.23|44.5047212| 38.1300171|   489.23|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|863.09|      null| 37.6489954|   863.09|
|  6|     Maris|      Folk|Female|Kinsealy-Drinan|      Civil Engineer|101.16|53.4266145| -6.1644997|   101.16|
|  7|     Masha|    Divers|Female|         Dachun|                null|090.87| 24.879416| 118.930111|   090.87|
|  8|   Goddart|     Flear|  Male|      Trélissac|Desktop Support T...|116.36|45.1905186|  0.7423124|   116.36|
|  9|      Roth|O'Cannavan|  Male|         Heitan|VP Product Manage...|697.10| 32.027934| 106.657113|   697.10|

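Try the regexp_replace call below. In a regular expression an unescaped $ is an anchor that matches the end of the string, so your pattern matched an empty position and replaced nothing. Inside a character class, [$], it is treated as a literal dollar sign.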

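from pyspark.sql.functions import col, regexp_replace

# [$] matches a literal dollar sign; a bare $ would only match the end of the string
updatedDF = df.withColumn('clean_salary', regexp_replace(col("Salary"), "[$]", ""))
updatedDF.show()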
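
To see why the original pattern had no effect, the anchor behavior can be reproduced outside Spark. Spark's regexp_replace uses Java regular expressions, and for this particular case Python's re module behaves the same way, so the following is only a plain Python sketch of the regex semantics, not Spark code. The sample value is made up for illustration.

```python
import re

sample = "$123.45"  # made-up salary string, not taken from the question's data

# An unescaped `$` is an anchor: it matches the empty position at the end of
# the string, so replacing it with "" leaves the text unchanged.
unchanged = re.sub("$", "", sample)

# Inside a character class, or escaped with a backslash, `$` is a literal.
via_class = re.sub("[$]", "", sample)
via_escape = re.sub(r"\$", "", sample)

print(unchanged)         # $123.45
print(via_class)         # 123.45
print(via_escape)        # 123.45
print(float(via_class))  # 123.45
```

In PySpark the escaped form would be written as "\\$" (or r"\$") in place of "[$]".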

Rather than using a regular expression, it is easier to just drop the first character (unless the salary values are not that simple):

>>> df = sc.parallelize([('$123',),('$873',)]).toDF(['salary'])
>>> df.show()
+------+
|salary|
+------+
|  $123|
|  $873|
+------+

>>> df.select(df.salary.substr(2,100).cast('float').alias('salary')).show() #Float
+------+
|salary|
+------+
| 123.0|
| 873.0|
+------+

>>> df.select(df.salary.substr(2,100).cast('decimal(10,2)').alias('salary')).show() #Decimal
+------+
|salary|
+------+
|123.00|
|873.00|
+------+
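
If the salary strings are less regular than in the example above (for instance, if they contain thousands separators), dropping the first character is not enough, and the regex replacement and the cast can be combined in one step. The following is a minimal sketch under that assumption. The helper name add_clean_salary, the constant SALARY_NOISE, and the default column names are hypothetical and chosen only for illustration.

```python
import re

# Character class matching a literal dollar sign or a thousands separator.
SALARY_NOISE = "[$,]"


def add_clean_salary(df, source="Salary", target="clean_salary"):
    """Return df with `target` = `source` stripped of `$` and `,`, cast to float.

    Hypothetical helper. pyspark is imported inside the function so that the
    pattern below can be checked without a Spark installation.
    """
    from pyspark.sql.functions import col, regexp_replace

    stripped = regexp_replace(col(source), SALARY_NOISE, "")
    return df.withColumn(target, stripped.cast("float"))


# Plain-Python check of the pattern on a made-up value (no Spark needed).
print(re.sub(SALARY_NOISE, "", "$1,234.50"))  # 1234.50
```

With a DataFrame at hand this would be called as df = add_clean_salary(df) followed by df.show(). Replacing "float" with "decimal(10,2)" gives the Decimal variant shown above.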