Pyspark Dataframes Resolved attribute(s) error with no matching column names
I have a dataframe graphcounts with hero IDs and connection counts, as shown below:
+------+-----------+
|heroId|connections|
+------+-----------+
| 691| 7|
| 1159| 12|
| 3959| 143|
| 1572| 36|
| 2294| 15|
| 1090| 5|
| 3606| 172|
| 3414| 8|
| 296| 18|
| 4821| 17|
| 2162| 42|
| 1436| 10|
| 1512| 12|
I have another dataframe graph_names with hero IDs and names, as shown below:
+---+--------------------+
| id| name|
+---+--------------------+
| 1|24-HOUR MAN/EMMANUEL|
| 2|3-D MAN/CHARLES CHAN|
| 3| 4-D MAN/MERCURIO|
| 4| 8-BALL/|
| 5| A|
| 6| A'YIN|
| 7| ABBOTT, JACK|
| 8| ABCISSA|
| 9| ABEL|
| 10|ABOMINATION/EMIL BLO|
| 11|ABOMINATION | MUTANT|
| 12| ABOMINATRIX|
| 13| ABRAXAS|
| 14| ADAM 3,031|
| 15| ABSALOM|
I am trying to create a map column that I can use to look up each heroId in graphcounts and get its name from graph_names, but this fails with an error. I found the same issue mentioned in https://issues.apache.org/jira/browse/SPARK-10925, but my column names are different. I don't understand the exception message and don't know how to debug it.
>>> mapper = fn.create_map([graph_names.id, graph_names.name])
>>> mapper
Column<b'map(id, name)'>
>>>
>>> graphcounts.printSchema()
root
|-- heroId: string (nullable = true)
|-- connections: long (nullable = true)
>>>
>>> graph_names.printSchema()
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
>>>
>>>
>>> graphcounts.withColumn('name', mapper[graphcounts['heroId']]).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/dataframe.py", line 2096, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Resolved attribute(s) id#242,name#243 missing from heroId#189,connections#203L in operator !Project [heroId#189, connections#203L, map(id#242, name#243)[heroId#189] AS name#286].;;
!Project [heroId#189, connections#203L, map(id#242, name#243)[heroId#189] AS name#286]
+- Project [heroId#189, sum(connections)#200L AS connections#203L]
+- Aggregate [heroId#189], [heroId#189, sum(cast(connections#192 as bigint)) AS sum(connections)#200L]
+- Project [value#7, heroId#189, (size(split(value#7, , -1), true) - 1) AS connections#192]
+- Project [value#7, split(value#7, , 2)[0] AS heroId#189]
+- Relation[value#7] text
The root cause of the error was that the input files have a header row. Because I was reading them with an explicit schema and without header=True, the header row was loaded as ordinary data, so the header names ended up among the column values and the lookup failed.