How to split a text file into 2 columns using multiple space characters as the separator with Scala Spark
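I am having trouble splitting a text data file whose delimiter is a run of multiple spaces into DataFrame columns. The data file I loaded looks like this: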
results1.show()
+--------------------+
| all|
+--------------------+
|1 hjvh hjk 9 gkk|
|2 yjg vv 87 9bh |
|3 kjn 90j jn kjn|
|4 hb jkbkj j jb |
|.... |
|.... |
|.... |
|9997 jn kjn kjn jkn|
|9998 njkj jn8 98 in|
|9999 nkj kjnkn kjnk|
I want it split into 2 separate columns, like this:
| No| Address |
+-------+------------------|
| 1| hjvh hjk 9 gkk|
| 2| yjg vv 87 9bh |
| 3| kjn 90j jn kjn|
| 4| hb jkbkj j jb |
| ..|
| ..|
| ..|
| 9997| jn kjn kjn jkn|
| 9998| njkj jn8 98 in|
| 9999| nkj kjnkn kjnk|
You can use split.
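import pyspark.sql.functions as f
# PySpark: split on every run of two or more spaces, then keep the first two parts.
df.withColumn('all', f.expr("split(all, '[ ]{2,}')")) \
  .select(f.col('all')[0], f.col('all')[1]) \
  .toDF('No', 'Address').show()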
+----+--------------+
|  No|       Address|
+----+--------------+
| 1|hjvh hjk 9 gkk|
| 2|yjg vv 87 9bh |
| 3|kjn 90j jn kjn|
| 4|hb jkbkj j jb |
|9997|jn kjn kjn jkn|
|9998|njkj jn8 98 in|
|9999|nkj kjnkn kjnk|
+----+--------------+
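Since the question asks about Scala, the same `[ ]{2,}` split can be sketched with the Scala DataFrame API. This is only a minimal sketch: the object name `SplitOnSpaceRuns`, the helper `splitColumns`, the local SparkSession, and the sample rows with three spaces after the number are all assumptions made for illustration, because the exact spacing of the original file is not visible here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split}

// Hypothetical helper object, not part of the original answers.
object SplitOnSpaceRuns {
  // Split the "all" column on every run of two or more spaces
  // and keep the first two parts as No and Address.
  def splitColumns(df: DataFrame): DataFrame = {
    val parts = split(col("all"), "[ ]{2,}")
    df.select(parts.getItem(0).as("No"), parts.getItem(1).as("Address"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: local run for demonstration
      .appName("split-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: three spaces between the number and the address.
    val results1 = Seq("1   hjvh hjk 9 gkk", "2   yjg vv 87 9bh ").toDF("all")
    splitColumns(results1).show()
    spark.stop()
  }
}
```

Note that this pattern splits on every run of two or more spaces, so an address that itself contains a double space would lose everything after that point.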
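You want to use a regular expression that splits the column at the first occurrence of a space.
This is the same approach as splitting at the first occurrence of any other delimiter; the difference is that your delimiter is a space (\s).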
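import org.apache.spark.sql.functions.split
import spark.implicits._  // for the $"..." syntax; assumes a SparkSession named spark

// Split only at the first whitespace character: the lookbehind allows nothing but non-space before it.
results1.withColumn("Temp", split($"all", "(?<=^[^\\s]*)\\s"))
  .withColumn("No", $"Temp"(0))
  .withColumn("Address", $"Temp"(1))
  .drop("all", "Temp")
  .show()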
Output:
+----+--------------------+
| No| Address|
+----+--------------------+
| 1| hjvh hjk 9 gkk|
| 2| yjg vv 87 9bh |
| 3| kjn 90j jn kjn|
| 4| hb jkbkj j jb...|
|9997| jn kjn kjn jkn|
|9998| njkj jn8 98 in|
|9999| nkj kjnkn kjnk|
+----+--------------------+
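The "split only at the first separator" idea can also be expressed without a lookbehind, by limiting the number of parts. The sketch below shows the regex behaviour in plain Scala using `String.split(regex, limit)`. The object name `FirstRunSplit` and the helper `splitNoAddress` are hypothetical. Spark's `split` function has accepted an optional `limit` argument since Spark 3.0 and should behave similarly, but that is an assumption worth checking against your Spark version.

```scala
// Hypothetical helper object, for illustration only.
object FirstRunSplit {
  // Split a line at the first run of whitespace into (No, Address).
  // A limit of 2 means the pattern is applied at most once, so the
  // address keeps its inner spaces and has no leading spaces.
  def splitNoAddress(line: String): (String, String) = {
    val parts = line.split("\\s+", 2)
    if (parts.length == 2) (parts(0), parts(1))
    else (parts(0), "")   // no whitespace found: the address is empty
  }
}
```

For example, `FirstRunSplit.splitNoAddress("1   hjvh hjk 9 gkk")` yields `("1", "hjvh hjk 9 gkk")`. Matching the whole run with `\s+` avoids the leading spaces that remain in Address when only the first space character is consumed.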