Pyspark Dataframes Resolved attribute(s) error with no matching column names

I have a DataFrame, graphcounts, with hero IDs and connection counts as below:

+------+-----------+
|heroId|connections|
+------+-----------+
|   691|          7|
|  1159|         12|
|  3959|        143|
|  1572|         36|
|  2294|         15|
|  1090|          5|
|  3606|        172|
|  3414|          8|
|   296|         18|
|  4821|         17|
|  2162|         42|
|  1436|         10|
|  1512|         12|
+------+-----------+

I have another DataFrame, graph_names, with hero IDs and names as below:

+---+--------------------+
| id|                name|
+---+--------------------+
|  1|24-HOUR MAN/EMMANUEL|
|  2|3-D MAN/CHARLES CHAN|
|  3|    4-D MAN/MERCURIO|
|  4|             8-BALL/|
|  5|                   A|
|  6|               A'YIN|
|  7|        ABBOTT, JACK|
|  8|             ABCISSA|
|  9|                ABEL|
| 10|ABOMINATION/EMIL BLO|
| 11|ABOMINATION | MUTANT|
| 12|         ABOMINATRIX|
| 13|             ABRAXAS|
| 14|          ADAM 3,031|
| 15|             ABSALOM|
+---+--------------------+

I am trying to create a map column that lets me look up each heroId in graphcounts and get its name from graph_names, but this throws an error. The same exception is discussed in another thread, https://issues.apache.org/jira/browse/SPARK-10925, but unlike there, my column names are all distinct. I don't understand the exception message and don't know how to debug it.

>>> mapper = fn.create_map([graph_names.id, graph_names.name])
>>> mapper
Column<b'map(id, name)'>
>>>
>>> graphcounts.printSchema()
root
 |-- heroId: string (nullable = true)
 |-- connections: long (nullable = true)
>>>
>>> graph_names.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
>>>
>>>
>>> graphcounts.withColumn('name', mapper[graphcounts['heroId']]).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/dataframe.py", line 2096, in withColumn
    return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
  File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Resolved attribute(s) id#242,name#243 missing from heroId#189,connections#203L in operator !Project [heroId#189, connections#203L, map(id#242, name#243)[heroId#189] AS name#286].;;
!Project [heroId#189, connections#203L, map(id#242, name#243)[heroId#189] AS name#286]
+- Project [heroId#189, sum(connections)#200L AS connections#203L]
   +- Aggregate [heroId#189], [heroId#189, sum(cast(connections#192 as bigint)) AS sum(connections)#200L]
      +- Project [value#7, heroId#189, (size(split(value#7,  , -1), true) - 1) AS connections#192]
         +- Project [value#7, split(value#7,  , 2)[0] AS heroId#189]
            +- Relation[value#7] text

The cause of the error was that each source file has a header row. When I read the files with an explicit schema but without header=True, the header row was ingested as data, so the header column names became column values. The lookup failed because that row had no matching name.