UTF-8 text in HDInsight cluster with Spark results in encoding error: 'ascii' codec can't encode characters in position: ordinal not in range(128)

I'm getting an encoding error when trying to consume a UTF-8 TSV file with Hebrew characters in an HDInsight cluster, using Spark on Linux. Any suggestions?

Here is my pyspark notebook code:

from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]),str(p[1]),int(p[2])))

# Infer the schema, create a DataFrame and register it as a temp table
transactionsTable = sqlContext.createDataFrame(transactions)
transactionsTable.registerTempTable("transactionsTempTable")

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)

for name in names.collect():
  print(name)

Error:

Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)

Hebrew text file content:

id  name    age 
1   גיא 37
2   maor    32 
3   danny   55

When I try an English file, it works fine:

English text file content:

id  name    age
1   guy     37
2   maor    32
3   danny   55

Output:

name: guy
name: maor
name: danny

If you run the following code with the Hebrew text file:

from pyspark.sql import *

path = "/people.txt"
transactionsText = sc.textFile(path)

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.split("\t"))

transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))

transactions.collect()

you will notice that you get back a list of Rows in which the names are of type unicode:

[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'), Row(id=u'2', name=u'maor', age=u'32 '), Row(id=u'3', name=u'danny', age=u'55')]
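One likely source of the reported error is the use of str() on non-ASCII text: in Python 2, calling str() on a unicode value implicitly encodes it with the 'ascii' codec, which raises exactly this UnicodeEncodeError for Hebrew characters, while unicode() leaves the text intact. A minimal sketch of the difference (plain Python 2, outside Spark, using the sample name):

name = u'\u05d2\u05d9\u05d0'   # the Hebrew name from the sample file

try:
    str(name)                  # implicit ascii encode -> UnicodeEncodeError
except UnicodeEncodeError as e:
    print(e)                   # 'ascii' codec can't encode characters in position 0-2 ...

print(unicode(name) == name)   # True -- unicode() keeps the characters as-is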

Now, we'll register a table using the transactions RDD:

table_name = "transactionsTempTable"

# Infer the schema and create a table       
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))

results.collect()

You will notice that all of the strings in the PySpark DataFrame returned from sqlContext.sql(... are of Python unicode type:

[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]

Now running:

%%sql
SELECT * FROM transactionsTempTable

gives the expected results:

name: גיא
name: maor
name: danny
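If you also want to reproduce the name: ... output from the question in PySpark itself, one option (a hedged sketch, not from the original answer) is to keep the values as unicode while building the strings and only encode at the point of output, in case the driver's stdout defaults to ASCII:

names = results.map(lambda p: u"name: " + p.name)   # unicode + unicode -> unicode

for name in names.collect():
    print(name.encode('utf-8'))                      # encode to bytes only when printing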

Note that if you want to do some processing on those names, you will want to work with them as unicode strings. From this article:

When you’re dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode strings as they abstract characters in a manner that’s appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte str as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.
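A small illustration of that rule (assuming plain Python 2 outside Spark; the output path is just an example): do the text logic on unicode, and encode to bytes only at the I/O boundary.

name = u'\u05d2\u05d9\u05d0'               # unicode: a sequence of 3 letters

print(len(name))                            # 3 -- character count, right for text manipulation
print(len(name.encode('utf-8')))            # 6 -- byte count, what disk and network actually carry

with open('/tmp/names.txt', 'w') as f:      # hypothetical output file
    f.write(name.encode('utf-8'))           # encode when writing out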