Finding the rows, along with their row numbers, of the first DataFrame that are not found in the second DataFrame using PySpark

Question

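I want to compare a large amount of data (several GB) stored in 2 CSV files. The CSV files have no headers and contain only a single column holding a complex string that mixes numbers and letters, for example: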

+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
+------------------------------+

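So far I have been able to load the files into DataFrames, but I am not sure how to continue. Is there a way to get the rows of df1, together with their row numbers, that are not found in df2?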

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

file1 = 'file_path'
file2 = 'file_path'


df1 = spark.read.csv(file1)
df2 = spark.read.csv(file2)


df1.show(truncate=False)

Answer 1

Since you are learning, let's go through it step by step.

df1

+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
|Hello | world | 1.3123.412 | C|
+------------------------------+

df2

+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | D|
|Hello | world | 1.3123.412 | C|
+------------------------------+

Generate row numbers with a window function (the numbers follow the sort order of _c0):

from pyspark.sql import Window
from pyspark.sql.functions import row_number

df1 = df1.withColumn('id', row_number().over(Window.orderBy('_c0')))

df2 = df2.withColumn('id', row_number().over(Window.orderBy('_c0')))

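Use a left semi join. These joins do not keep any values from the right DataFrame. They only compare values and keep the rows of the left DataFrame that are also found in the right DataFrame.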

df1.join(df2, how='left_semi', on='_c0').show(truncate=False)

+------------------------------+---+
|_c0                           |id |
+------------------------------+---+
|Hello | world | 1.3123.412 | C|2  |
+------------------------------+---+
