How to split a text file into 2 columns using multiple space characters as the separator with Scala Spark

I am having trouble splitting a text data file whose delimiter is multiple spaces into DataFrame columns. The data file I loaded looks like this:

results1.show()

+--------------------+
|                 all|
+--------------------+
|1     hjvh hjk 9 gkk|
|2     yjg vv 87 9bh |
|3     kjn 90j jn kjn|
|4     hb jkbkj j jb |
|....                |
|....                |
|....                |
|9997  jn kjn kjn jkn|
|9998  njkj jn8 98 in|
|9999  nkj kjnkn kjnk|

I want it split into 2 separate columns, like this:

|     No|          Address |
+-------+------------------|
|      1|    hjvh hjk 9 gkk|  
|      2|    yjg vv 87 9bh |      
|      3|    kjn 90j jn kjn|     
|      4|    hb jkbkj j jb |  
|     ..|             
|     ..|             
|     ..|             
|   9997|    jn kjn kjn jkn| 
|   9998|    njkj jn8 98 in|
|   9999|    nkj kjnkn kjnk|

You can use split with a regex that matches two or more consecutive spaces. The snippet below is PySpark:

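from pyspark.sql import functions as f

# Split on runs of two or more spaces, then keep the first two pieces.
results1.withColumn('all', f.expr("split(all, '[ ]{2,}')")) \
  .select(f.col('all')[0], f.col('all')[1]) \
  .toDF('No', 'Address').show()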

+----+--------------+
|  No|       Address|
+----+--------------+
|   1|hjvh hjk 9 gkk|
|   2|yjg vv 87 9bh |
|   3|kjn 90j jn kjn|
|   4|hb jkbkj j jb |
|9997|jn kjn kjn jkn|
|9998|njkj jn8 98 in|
|9999|nkj kjnkn kjnk|
+----+--------------+
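
Since the question asks about Scala, here is a rough Scala sketch of the same idea as the PySpark snippet above. The object name SplitOnSpaces, the helper splitColumns, and the local SparkSession are assumptions made for illustration; only the regex and the column names come from the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split}

object SplitOnSpaces {
  // Hypothetical helper: splits column "all" on runs of two or more spaces
  // and keeps the first two pieces as "No" and "Address".
  def splitColumns(df: DataFrame): DataFrame = {
    val parts = split(col("all"), "[ ]{2,}")
    df.select(parts.getItem(0).as("No"), parts.getItem(1).as("Address"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just so the sketch runs on its own.
    val spark = SparkSession.builder().master("local[*]").appName("split-demo").getOrCreate()
    import spark.implicits._

    val results1 = Seq("1     hjvh hjk 9 gkk", "9997  jn kjn kjn jkn").toDF("all")
    splitColumns(results1).show()
    spark.stop()
  }
}
```

One caveat with this approach: if an address itself contains two or more consecutive spaces, the row is split into more than two pieces and only the first piece of the address is kept.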

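You want to use a regular expression that splits the column at the first occurrence of whitespace.

The approach is the same as splitting on the first occurrence of any other delimiter; the difference is that your delimiter is whitespace (\s).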

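import org.apache.spark.sql.functions.split
import spark.implicits._  // enables $"col"; assumes a SparkSession named spark

// Triple quotes keep \s literal; a plain "\s" is an invalid escape in Scala.
results1.withColumn("Temp", split($"all", """(?<=^[^\s]*)\s"""))
  .withColumn("No", $"Temp"(0))
  .withColumn("Address", $"Temp"(1))
  .drop("all", "Temp")
  .show()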

Output

+----+--------------------+
|  No|             Address|
+----+--------------------+
|   1|      hjvh hjk 9 gkk|
|   2|      yjg vv 87 9bh |
|   3|      kjn 90j jn kjn|
|   4|    hb jkbkj j jb...|
|9997|      jn kjn kjn jkn|
|9998|      njkj jn8 98 in|
|9999|      nkj kjnkn kjnk|
+----+--------------------+
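
As the truncated row in the output suggests, splitting on only the first whitespace character leaves the remaining leading spaces in Address. A possible alternative, sketched below, uses regexp_extract with two capture groups so the whole run of whitespace after the number is consumed; it also avoids the look-behind, which, as far as I know, not every Java regex version accepts in unbounded form. The object name SplitFirstWhitespace, the helper splitColumns, and the local SparkSession are assumptions for illustration, not part of the answers above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_extract}

object SplitFirstWhitespace {
  // Group 1: the leading non-whitespace token; group 2: everything after
  // the first run of whitespace. Triple quotes keep the backslashes literal.
  private val pattern = """^(\S+)\s+(.*)$"""

  // Hypothetical helper: extracts "No" and "Address" from column "all".
  def splitColumns(df: DataFrame): DataFrame =
    df.select(
      regexp_extract(col("all"), pattern, 1).as("No"),
      regexp_extract(col("all"), pattern, 2).as("Address")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just so the sketch runs on its own.
    val spark = SparkSession.builder().master("local[*]").appName("extract-demo").getOrCreate()
    import spark.implicits._

    val results1 = Seq("1     hjvh hjk 9 gkk", "2     yjg vv 87 9bh ").toDF("all")
    splitColumns(results1).show(false)
    spark.stop()
  }
}
```

Rows that do not match the pattern come back as empty strings in both columns, so it may be worth filtering those out afterwards. Trailing spaces inside the address are preserved.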