In a PySpark DataFrame, when I rename a column, the previous name can still be used for filtering. Bug or feature?

Question

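I am working on Databricks with a PySpark DataFrame that contains string-type columns. I renamed one of them with .withColumnRenamed(). Later in the process, I used .filter() to select the rows containing a particular substring. I accidentally used the old column name, yet the filter still ran and produced the 'correct' result, as if I had used the new column name. My question: is this a bug or a feature?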

I reproduced the problem in a simple case:

_test = sqlContext.createDataFrame([("abcd","efgh"), ("kalp","quarto"), ("aceg","egik")], [ 'x1', 'x2'])
_test.show()

+----+------+
|  x1|    x2|
+----+------+
|abcd|  efgh|
|kalp|quarto|
|aceg|  egik|
+----+------+

_test2 = _test.withColumnRenamed('x1', 'new')

_test2.filter("x1 == 'aceg'").show()

+----+----+
| new|  x2|
+----+----+
|aceg|egik|
+----+----+

_test2.filter("substring(x1,1,2) == 'ka'").show()
+----+------+
| new|    x2|
+----+------+
|kalp|quarto|
+----+------+

I expected the filter command to fail, because column x1 no longer exists in _test2. Oddly, the output shows the new name ('new').

Another example:

_test2.filter("substring(x1,1,1) == 'a'").show()

gives

+----+----+
| new|  x2|
+----+----+
|abcd|efgh|
|aceg|egik|
+----+----+

and _test2.filter("substring(x1,1,1) == 'a'").filter(F.col('x1') == 'abcd').show() (with F being pyspark.sql.functions) gives

+----+----+
| new|  x2|
+----+----+
|abcd|efgh|
+----+----+

But _test2.select(['x1', 'x2']).show() throws an error saying that 'x1' does not exist.

Answer 1

This is a known issue in Spark. The community decided not to fix it. See this related JIRA for details.
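If the goal is to fail fast when a stale column name slips into a filter, one option is to check the name against the DataFrame's current schema before filtering. The sketch below is only an illustration: require_columns is a hypothetical helper, not part of PySpark, and it assumes nothing about its argument beyond a .columns attribute listing the current column names (which PySpark DataFrames provide).

```python
def require_columns(df, *names):
    """Return df unchanged if every name is in df.columns, else raise.

    Hypothetical helper (not a PySpark API). It relies only on the
    `.columns` attribute, and the comparison is an exact,
    case-sensitive string match.
    """
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(
            f"Column(s) {missing} not in current schema {list(df.columns)}"
        )
    return df
```

With the DataFrames from the question, require_columns(_test2, 'x1').filter("x1 == 'aceg'") would raise a ValueError because _test2.columns is ['new', 'x2'], while require_columns(_test2, 'new').filter("new == 'aceg'") would proceed as usual. The column names have to be passed explicitly; the helper does not parse them out of the filter expression.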

python

pyspark

azure-databricks