Apache Spark with SQLContext: IndexError

Question

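I am trying to run the basic example from the "Inferring the Schema Using Reflection" section of the Apache Spark documentation.

I am running it on the Cloudera Quickstart VM (CDH5).

The example I am trying to run is as follows: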

# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
  print(teenName)

I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).

The input file book6_sample is available at book6_sample.csv.

Please point out where I am going wrong.

Thanks in advance.

Regards, Sri

Answer 1

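Your file has an empty line at the end, and that is what causes this error. Open the file in a text editor and delete that line; hopefully it will then work.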
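If editing the file is not convenient, another option is to skip such lines in code. The sketch below uses plain Python lists instead of an RDD so it can run without Spark; the helper name has_name_and_age and the sample data are made up for illustration and do not come from the question. It shows why a blank line fails (splitting an empty string yields a single-element list, so p[1] is out of range) and how a simple length check avoids it.

```python
def has_name_and_age(line):
    """Return True if the line has at least two comma-separated fields.

    Hypothetical helper: "".split(",") returns [''], a one-element list,
    so indexing p[1] on a blank line raises IndexError.
    """
    return len(line.split(",")) >= 2


# Made-up sample data with a trailing blank line, standing in for the CSV file.
sample_lines = ["Alice,15", "Bob,30", ""]

# Without filtering, the blank line reproduces the error from the question.
try:
    [(p[0], int(p[1])) for p in (l.split(",") for l in sample_lines)]
    error = None
except IndexError as exc:
    error = str(exc)

# With filtering, only lines that have both fields are converted.
rows = [
    (p[0], int(p[1]))
    for p in (l.split(",") for l in sample_lines if has_name_and_age(l))
]

print(error)
print(rows)
```

In the pipeline from the question, the same check could presumably be applied as lines.filter(has_name_and_age) before the split step. The error only shows up at the for loop because Spark transformations such as map are lazy, so nothing is actually parsed until collect() runs.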

apache-spark

apache-spark-sql

pyspark

pyspark-sql