Remove specific words from a DataFrame with PySpark

Question

I have a DataFrame:

+-----+------+--------+----------------+
|   id|titulo|tipo    |formacion       |
+-----+------+--------+----------------+
|32084|A     |Material|VION00001 TRADE |
|32350|B     |Curso   |CUS11222  LEADER|
|32362|C     |Curso   |ITIN9876  EVALUA|
|32347|D     |Curso   |CUMPLI VION1234 |
|32036|E     |Curso   |EVAN1111  INFORM|
+-----+------+--------+----------------+

In the formacion column, I need to remove the words that start with VION|CUS|ITIN|EVAN, so that the DataFrame looks like this:

+-----+------+--------+---------+
|   id|titulo|tipo    |formacion|
+-----+------+--------+---------+
|32084|A     |Material|TRADE    |
|32350|B     |Curso   |LEADER   |
|32362|C     |Curso   |EVALUA   |
|32347|D     |Curso   |CUMPLI   |
|32036|E     |Curso   |INFORM   |
+-----+------+--------+---------+

Thanks for your help.

Answer 1

Sorry everyone, this is the original column from the DataFrame:

formacion = [VION00001 TRADE, CUS11222 LEADER, ITIN9876 EVALUA, VION1234 CUMPLI, EVAN11 FR]

And this is the expected result:

formacion = [TRADE, LEADER, EVALUA, CUMPLI, FR]

Answer 2

Use the split function to split the column on whitespace, then take the last element of the resulting array.

On Spark 2.4+, use the element_at function.
On Spark < 2.4, use reverse(split(array))[0].

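from pyspark.sql.functions import col, element_at, reverse, split, trim

# trim first, then split on runs of whitespace so padding and double spaces yield no empty elements
words = split(trim(col("formacion")), r"\s+")

# using element_at (Spark 2.4+); a negative index counts from the end of the array
df.withColumn("formacion", element_at(words, -1)).show()

# or using an array index (0-based, so [1] is the second word; assumes exactly two words)
df.withColumn("formacion", words[1]).show()

# or reverse the array and take the first element
df.withColumn("formacion", reverse(words)[0]).show()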

#+-----+------+--------+---------+
#|   id|titulo|tipo    |formacion|
#+-----+------+--------+---------+
#|32084|A     |Material|TRADE    |
#|32350|B     |Curso   |LEADER   |
#|32362|C     |Curso   |EVALUA   |
#|32347|D     |Curso   |CUMPLI   |
#|32036|E     |Curso   |INFORM   |
#+-----+------+--------+---------+
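
Taking the last word only works when the code always comes first. The question's table also has a row where the code comes second (CUMPLI VION1234), so a regex that removes any word starting with one of the prefixes, wherever it sits, may be more robust. Below is a minimal sketch of that idea, not a tested production solution. The names PREFIXES, CODE_PATTERN, strip_codes and strip_codes_column are hypothetical, and the prefix list is assumed from the question.

```python
import re

# Prefixes taken from the question; adjust to your data (assumption).
PREFIXES = ["VION", "CUS", "ITIN", "EVAN"]

# A whole word that starts with one of the prefixes.
# This syntax is valid both in Python's re and in the Java regex engine Spark uses.
CODE_PATTERN = r"\b(?:" + "|".join(PREFIXES) + r")\w*"


def strip_codes(text):
    """Plain-Python check of the pattern; no Spark session needed."""
    return re.sub(CODE_PATTERN, "", text).strip()


def strip_codes_column(df, column="formacion"):
    """Apply the same pattern to a DataFrame column (hypothetical helper)."""
    # Imported here so the pattern can be tried out without PySpark installed.
    from pyspark.sql.functions import col, regexp_replace, trim

    return df.withColumn(column, trim(regexp_replace(col(column), CODE_PATTERN, "")))


print(strip_codes("VION00001 TRADE"))  # TRADE
print(strip_codes("CUMPLI VION1234"))  # CUMPLI
```

With a DataFrame df as in the question, strip_codes_column(df).show() should give the expected formacion column. Note that this pattern would also drop ordinary words such as CUSTOM; if the codes always end in digits, replacing \w* with \d+ is probably safer.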
