TypeError: tuple indices must be integers or slices, not str using Python Core API?

Question

I'm trying to filter some data using the Python Core API, running through Apache Spark, but I keep hitting this error and can't resolve it with the data I have:

TypeError: tuple indices must be integers or slices, not str

Here is a sample of my data structure:

This is the code I'm using to filter the data, but it keeps raising this error. I just want to return business_id, city, and stars from my dataset.

(my_rdd
    .filter(lambda x: x['city']=='Toronto')
    .map(lambda x: (x['business_id'], x['city'], x['stars']))
).take(5)

Any guidance on how to filter my data would be helpful.

Thanks.

Answer 1

I think you are misusing filter and map here. Both build a new list from an existing one (in Spark, a new RDD from an existing RDD).

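Both take a function as an argument (that is the object version; there is also a functional version that takes the input list as a second argument) and apply it to each item of the input list to build the output list. What differs is how they use that function: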

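filter uses it to filter the input list. The function should return a boolean indicating whether the item is included in the output list.
map uses it to build a new list of the same length as the old one, with each value replaced by the function's result.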
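
As a minimal sketch of that distinction, here are Python's built-in filter and map applied to a plain list. The sample values are made up for illustration and are not taken from the question's dataset.

```python
# Hypothetical sample values, used only to illustrate filter vs. map.
stars = [3.0, 4.5, 2.0, 5.0]

# filter keeps only the items for which the function returns True.
high = list(filter(lambda s: s >= 4.0, stars))

# map applies the function to every item, so the length is unchanged.
doubled = list(map(lambda s: s * 2, stars))

print(high)     # [4.5, 5.0]
print(doubled)  # [6.0, 9.0, 4.0, 10.0]
```

The RDD methods of the same name follow the same idea, except that they return a new RDD instead of a list.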

That said, I believe you hit the error TypeError: tuple indices must be integers or slices, not str while trying to filter the list.

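On the first iteration, filter tries to run the function against the first element of the list. That first element is the tuple ('7v91woy8IpLrqXsRvxj_vw', (({'average_stars': 3.41, 'compliment_cool': 9, ...}))). The problem is that you are trying to access this tuple's values with a string, as if it were a dictionary, which Python does not allow (and which would not make much sense).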
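
The failure can be reproduced without Spark. This sketch assumes each record has the shape (key, (dict_a, dict_b)); that shape, the key, and the field values are placeholders for illustration, since the full sample is not shown in the question.

```python
# Assumed record shape: (key, (dict_a, dict_b)). All values are placeholders.
record = ("some_key", ({"average_stars": 3.41}, {"city": "Toronto"}))

try:
    record["city"]  # a tuple cannot be indexed with a string
except TypeError as exc:
    message = str(exc)
    print(message)

# Integer indices walk down to the inner dict, which does accept string keys.
city = record[1][1]["city"]
print(city)  # Toronto
```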

To extract the data you need, I would take a simpler approach:

# An RDD cannot be indexed with [0]; first() returns its first element.
item = my_rdd.first()
(item[1][1]['business_id'], item[1][1]['city'], item[1][1]['stars'])

Answer 2

Since your data is nested inside tuples, you need to specify the tuple indices in filter and map:

result = (my_rdd
    .filter(lambda x: x[1][1]['city']=='Toronto')
    .map(lambda x: (x[1][1]['business_id'], x[1][1]['city'], x[1][1]['stars']))
)

print(result.collect())
[('7v91woy8IpLrqXsRvxj_vw', 'Toronto', 3.0)]
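
If the repeated x[1][1] is hard to read, one option is to name the unpacking step. The sketch below is only an illustration: business_of and to_row are hypothetical helper names, the record shape (key, (other_dict, business_dict)) is inferred from the indexing above, and the sample records are placeholders. It uses a plain list so it runs without Spark.

```python
def business_of(record):
    """Return the inner dict holding the business fields.

    Assumes each record looks like (key, (other_dict, business_dict)),
    which is inferred from the x[1][1] indexing, not confirmed by the data.
    """
    _key, (_other, business) = record
    return business


def to_row(record):
    """Pick out the three requested fields as a tuple."""
    b = business_of(record)
    return (b["business_id"], b["city"], b["stars"])


# Placeholder records standing in for the elements of my_rdd.
records = [
    ("k1", ({"average_stars": 3.41},
            {"business_id": "b1", "city": "Toronto", "stars": 3.0})),
    ("k2", ({"average_stars": 4.0},
            {"business_id": "b2", "city": "Las Vegas", "stars": 4.5})),
]

rows = [to_row(r) for r in records if business_of(r)["city"] == "Toronto"]
print(rows)  # [('b1', 'Toronto', 3.0)]
```

With an RDD, the same helpers could be passed along as my_rdd.filter(lambda r: business_of(r)['city'] == 'Toronto').map(to_row).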


python

hadoop

apache-spark

rdd

pyspark