UTF-8 text in HDInsight cluster with Spark results in encoding error: 'ascii' codec can't encode characters in position: ordinal not in range(128)
I get an encoding error when trying to use a UTF-8 TSV file containing Hebrew characters in an HDInsight cluster with Spark on Linux. Any suggestions?
Here is my pyspark notebook code:
from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")
header = transactionsText.first()
# Create a schema for our data
Entry = Row('id','name','age')
# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x:x !=header) .map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]),str(p[1]),int(p[2])))
# Infer the schema and create a table
transactionsTable = sqlContext.createDataFrame(transactions)
# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")
# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)
for name in names.collect():
    print(name)
Error:
'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
Hebrew text file contents:
id name age
1 גיא 37
2 maor 32
3 danny 55
It works fine when I try an English file:
English text file contents:
id name age
1 guy 37
2 maor 32
3 danny 55
Output:
name: guy
name: maor
name: danny
If you run the following code with the Hebrew text:
from pyspark.sql import *
path = "/people.txt"
transactionsText = sc.textFile(path)
header = transactionsText.first()
# Create a schema for our data
Entry = Row('id','name','age')
# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x:x !=header).map(lambda l: l.split("\t"))
transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))
transactions.collect()
you'll notice that you get back a list of names of type unicode:
[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'), Row(id=u'2', name=u'maor', age=u'32'), Row(id=u'3', name=u'danny', age=u'55')]
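As a side note (my own variant, not part of the answer above, which keeps every column as unicode): if, as in the original question, you want age to stay an int while the text columns stay unicode, the parsing step could look like the sketch below. sc.textFile already returns unicode lines, so no explicit encode/decode is needed:
# Minimal sketch: keep id/name as unicode, cast age to int (Python 2 / PySpark)
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.split("\t"))
transactions = transactionsParts.map(lambda p: Entry(p[0], p[1], int(p[2])))
transactions.collect()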
Now, we'll register a table using the transactions RDD:
table_name = "transactionsTempTable"
# Infer the schema and create a table
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)
# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))
results.collect()
You'll notice that all the strings returned from sqlContext.sql(...) in the PySpark DataFrame are of the Python unicode type:
[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]
Now running:
%%sql
SELECT * FROM transactionsTempTable
will give the expected result:
name: גיא
name: maor
name: danny
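With the temp table registered, the loop from the original question works too, as long as the strings are kept as unicode and only encoded when printing. A minimal sketch (assuming Python 2 and a UTF-8 terminal):
# Build the output strings as unicode; encode to UTF-8 bytes only at print time
names = results.map(lambda p: u"name: " + p.name)
for name in names.collect():
    print(name.encode('utf-8'))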
Note that if you want to do any processing on these names, you will need to work with them as unicode strings. From this article:
When you’re dealing with text manipulations (finding the number of
characters in a string or cutting a string on word boundaries) you
should be dealing with unicode strings as they abstract characters in
a manner that’s appropriate for thinking of them as a sequence of
letters that you will see on a page. When dealing with I/O, reading to
and from the disk, printing to a terminal, sending something over a
network link, etc, you should be dealing with byte str as those
devices are going to need to deal with concrete implementations of
what bytes represent your abstract characters.
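To illustrate the distinction, here is a small sketch of my own (not from the quoted article): do character-level work on the unicode value, and encode to bytes only when the text crosses an I/O boundary such as a file or the terminal (the file path below is just an example):
# -*- coding: utf-8 -*-
import io

name = u'\u05d2\u05d9\u05d0'              # the Hebrew name from the sample data
print(len(name))                          # 3 -- counted as characters, not bytes
reversed_name = name[::-1]                # unicode-aware slicing / reversal
print(reversed_name.encode('utf-8'))      # encode only when sending to the terminal

# Writing to disk is I/O, so encode (here via io.open) at that boundary
with io.open('/tmp/names.txt', 'w', encoding='utf-8') as f:
    f.write(u"name: " + name + u"\n")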