Spark 2.3.0 Read Text File With Header Option Not Working

Question

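The code below runs and creates a Spark DataFrame from a text file. However, I am trying to use the header option so that the first row is used as the header, and for some reason that does not happen. I cannot see why. It is probably something silly, but I have not been able to solve it.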

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("Word Count")\
...     .config("spark.some.config.option", "some-value")\
...     .getOrCreate()
>>> df = spark.read.option("header", "true")\
...     .option("delimiter", ",")\
...     .option("inferSchema", "true")\
...     .text("StockData/ETFs/aadr.us.txt")
>>> df.take(3)

Returns the following:

[Row(value=u'Date,Open,High,Low,Close,Volume,OpenInt'), Row(value=u'2010-07-21,24.333,24.333,23.946,23.946,43321,0'), Row(value=u'2010-07-22,24.644,24.644,24.362,24.487,18031,0')]

>>> df.columns

Returns the following:

['value']

Answer 1

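The problem

The problem is that you are calling the .text API instead of .csv or .load. The .text reader does not parse fields, so options such as header, delimiter, and inferSchema have no effect on it. Its API documentation says: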

def text(self, paths):
    """
    Loads text files and returns a :class:`DataFrame` whose schema starts with a
    string column named "value", and followed by partitioned columns if there
    are any.

    Each line in the text file is a new row in the resulting DataFrame.

    :param paths: string, or list of strings, for input path(s).

    >>> df = spark.read.text('python/test_support/sql/text-test.txt')
    >>> df.collect()
    [Row(value=u'hello'), Row(value=u'this')]
    """
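To see why the header line ends up as data, here is a rough plain-Python analogy. It uses the standard-library `csv` module rather than Spark (so it only mimics the behaviour, under Python 3), and the sample lines are copied from the `df.take(3)` output in the question. A text-style read keeps every line, header included, as one string; a CSV-style read uses the first line for the column names.

```python
import csv
import io

# Sample lines taken from the df.take(3) output in the question.
raw = (
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "2010-07-21,24.333,24.333,23.946,23.946,43321,0\n"
    "2010-07-22,24.644,24.644,24.362,24.487,18031,0\n"
)

# "Text" view: one string per line, so the header is just another row.
text_rows = [{"value": line} for line in raw.splitlines()]

# "CSV" view: the first line supplies the column names.
reader = csv.DictReader(io.StringIO(raw), delimiter=",")
csv_rows = list(reader)

print(text_rows[0])          # the header line shows up as data
print(reader.fieldnames)     # column names taken from the header
print(csv_rows[0]["Close"])  # fields can now be addressed by name
```

The text view has three rows with a single "value" key, which matches what the question observed; the CSV view has two data rows and seven named fields.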

Solution using .csv

Change the .text call to .csv and it should work:

df = spark.read.option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .csv("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)

which should give you

+-------------------+------+------+------+------+------+-------+
|Date               |Open  |High  |Low   |Close |Volume|OpenInt|
+-------------------+------+------+------+------+------+-------+
|2010-07-21 00:00:00|24.333|24.333|23.946|23.946|43321 |0      |
|2010-07-22 00:00:00|24.644|24.644|24.362|24.487|18031 |0      |
+-------------------+------+------+------+------+------+-------+
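A quick way to confirm the fix is to look at `df.columns` again. The helper below is a small sketch of that check; the name `looks_unparsed` is made up for illustration and is not part of Spark, and the test is only a heuristic, since a .text read can also carry partition columns.

```python
def looks_unparsed(columns):
    """Return True when the column list is the lone 'value' column
    that the .text reader produces."""
    return list(columns) == ["value"]


# Column lists copied from the question and from the .csv output above.
before = ["value"]
after = ["Date", "Open", "High", "Low", "Close", "Volume", "OpenInt"]

print(looks_unparsed(before))  # True
print(looks_unparsed(after))   # False
```

With a live session you would pass `df.columns` directly, for example `looks_unparsed(df.columns)`.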

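Solution using .load

If no format is specified, .load assumes the default data source, which is parquet unless spark.sql.sources.default has been changed. So you also need to specify the format: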

df = spark.read\
    .format("com.databricks.spark.csv")\
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)
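The .csv and .load routes can be wrapped in one small helper. This is only a sketch: `load_csv` is a hypothetical name, `spark` is assumed to be an existing SparkSession like the one in the question, and the format is given by the short name "csv" (the same name Answer 2 uses). Options are passed as keyword arguments through `DataFrameReader.options`.

```python
def load_csv(spark, path, **overrides):
    """Read a delimited file with a header row via format(...).load(...).

    `spark` is assumed to be an existing SparkSession.
    """
    # Defaults mirror the options used in the answer; callers may override them.
    opts = {"header": "true", "delimiter": ",", "inferSchema": "true"}
    opts.update(overrides)
    # Naming the format explicitly avoids falling back to the default source.
    return spark.read.format("csv").options(**opts).load(path)
```

Usage would look like `df = load_csv(spark, "StockData/ETFs/aadr.us.txt")`, or `load_csv(spark, path, delimiter="|")` for a pipe-separated file.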

Hope this answer helps.

Answer 2

Try the following:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('CaseStudy').getOrCreate()

# "delimiter" must match the file: "|" here, "," for the file in the question.
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "true") \
    .load("file name")
df.show()
