com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters

Question

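I am trying to run a PySpark script, but I get the error above. I set option("maxCharsPerCol","1100000"), but that did not resolve the problem.

Can you help me resolve this? PySpark version: 2.0.0

I use the following code to read and write the file:

Read: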

df_read_file = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", '\001').option("maxCharsPerCol","1000001L").options(header='true',inferSchema='false').load(row['Source File Name Lnd'])

Write:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').save(output_path, sep='\001',header='True',nullValue=None)

Error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 6.0 failed 1 times, most recent failure: Lost task 9.0 in stage 6.0 (TID 15, localhost): com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). 
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:

Answer 1

I fixed the problem with the following options.

For some of the files I was reading, I set quote='':

df_read_file = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", '\001').options(header='true', quote='',inferSchema='false').load(row['Source File Name Lnd'])

For other files, I set escape='':

df_read_file = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", '\001').options(header='true', escape='',inferSchema='false').load(row['Source File Name Lnd'])

With these options in place, the script ran and the error went away.
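
A likely reason this helps: the error message notes that line separators were found inside the parsed content, which suggests a stray quote character opened a field that never closed, so the parser kept reading following lines into one value until it hit the character limit. The sketch below reproduces that effect with Python's standard csv module rather than with Spark or univocity, so it is only an analogy. The sample data is made up for illustration.

```python
import csv
import io

# Hypothetical sample data: '\x01' is the field delimiter, and the second
# line contains a stray double quote that is never closed.
SAMPLE = 'id\x01note\n1\x01"unterminated\n2\x01plain\n3\x01also plain\n'


def parse(text, quoting):
    """Parse text with Python's csv module using the given quoting mode."""
    reader = csv.reader(
        io.StringIO(text, newline=''),
        delimiter='\x01',
        quoting=quoting,
    )
    return list(reader)


# Quote handling on: the stray quote opens a field that swallows later lines.
with_quotes = parse(SAMPLE, csv.QUOTE_MINIMAL)

# Quote handling off: the quote is an ordinary character, one row per line.
without_quotes = parse(SAMPLE, csv.QUOTE_NONE)

print(len(with_quotes), len(without_quotes))
```

With quote handling on, everything after the stray quote ends up in a single field; with it off, each line stays a separate row. Setting quote='' in the Spark reader plays a similar role, assuming the input files do not rely on quoting to protect delimiters inside values.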

python

dataframe

pyspark

pyspark-sql

pyspark-dataframes