Handling special characters such as "\000", "\n", "\r", and bell characters in a PySpark DataFrame

Question

I have the following Python data for a DataFrame:

d = [{'name': ' Alice', 'age': "1 ''  2"}, {'name': '  "   ', 'age': "â"}, {'name': '', 'age': "ây"}, {'name': '', 'age': "null"}]

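I have already solved the whitespace problem, but I also want to remove any special characters such as "\000", "\n", "\r", and bell characters that make it into the DataFrame.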

I tried the following code to handle the special characters:

for col_i in df_test.columns:
        df_ascii = df_test.withColumn(col_i, unidecode(unicode(col_i, encoding = "utf-8")))

But it gave the following:

I also tried the following code:

def nonasciitoascii(unicodestring):
    return unicodestring.encode("ascii","ignore")

convertedudf = udf(nonasciitoascii)

for cols in df_test.columns:
   print(cols)
   converted = df_test.withColumn(cols,convertedudf(df_test[cols]))

But the output was:

Is there any way to solve this? I tried a few other code samples, but none of them could handle the characters above (by "handle" I mean remove).

Answer 1

Try pyspark.sql.functions.regexp_replace instead.

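import pyspark.sql.functions as F

# Assumes an existing `sqlContext`; on Spark 2+, `spark.createDataFrame` works the same way.
df = sqlContext.createDataFrame(
    [{'name': ' Alice', 'age': "1 ''  2"},
     {'name': '  "   ', 'age': "â"},
     {'name': '', 'age': "ây"},
     {'name': '', 'age': "null"}])

# Strip newline, carriage return, NUL ("\000", written \x00 here) and space from every column.
# Inside [...] each character is matched literally, so no grouping parentheses are needed.
df.select([
    F.regexp_replace(col, r'[\n\r\x00 ]', "").alias(col)
    for col in df.columns]).collect()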

Output:

[Row(age="1''2", name='Alice'),
 Row(age='â', name='"'),
 Row(age='ây', name=''),
 Row(age='null', name='')]
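The pattern above lists "\n", "\r", and "\000" explicitly but not the bell character the question also mentions. A minimal sketch that covers all of them at once, assuming it is acceptable to drop every ASCII control character (0x00 to 0x1F, plus DEL), could look like this. The names `CONTROL_CHARS` and `strip_control_chars` are made up for illustration, and the pattern is checked locally with Python's `re` so it can be tried without a Spark session.

```python
import re

# Control characters: NUL (\x00) through unit separator (\x1F), plus DEL (\x7F).
# This range includes "\000", "\n", "\r" and the bell character "\x07".
# Note that it also includes tab ("\t").
CONTROL_CHARS = r"[\x00-\x1F\x7F]"


def strip_control_chars(df):
    """Return `df` with control characters removed from every string column.

    Hypothetical helper: non-string columns are passed through unchanged.
    """
    # Imported lazily so the pattern itself can be tested without Spark.
    from pyspark.sql import functions as F

    return df.select([
        F.regexp_replace(F.col(name), CONTROL_CHARS, "").alias(name)
        if dtype == "string" else F.col(name)
        for name, dtype in df.dtypes
    ])


# Quick local check of the pattern using Python's own regex engine.
sample = "Al\x00i\x07ce\r\n"
cleaned = re.sub(CONTROL_CHARS, "", sample)
print(repr(cleaned))
```

With a DataFrame at hand this would be called as `strip_control_chars(df).collect()`. Unlike the pattern above, it leaves ordinary spaces alone, so it can be combined with whatever whitespace handling is already in place.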

How do you remove all non-alphanumeric characters?

import pyspark.sql.functions as F

# Assumes an existing `sqlContext`.
df = sqlContext.createDataFrame(
    [{'name': ' Ali.|ce', 'age': "1 ''  2"},
     {'name': '  "   ', 'age': "â"},
     {'name': '', 'age': "ây"},
     {'name': '', 'age': "null"}])

# The negated class keeps word characters (\w: letters, digits, underscore)
# plus any symbols listed after it; here '.' and '|' are kept.
# Java's \w is ASCII-only by default, so accented letters such as 'â' are removed too.
df.select([
    F.regexp_replace(col, r'[^\w.|]', "").alias(col)
    for col in df.columns]).collect()

Output:

[Row(age='12', name='Ali.|ce'),
 Row(age='', name=''),
 Row(age='y', name=''),
 Row(age='null', name='')]
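If a UDF is still preferred over a regular expression, two things in the question's second attempt look likely to cause trouble: on Python 3, `encode("ascii", "ignore")` returns `bytes` rather than `str`, and each loop iteration starts again from `df_test`, so only the last column ends up converted. A hedged sketch of a corrected version follows; `to_ascii` and `ascii_only` are hypothetical names, and the None check assumes the columns may contain nulls.

```python
def to_ascii(text):
    """Drop non-ASCII characters and return a str (not bytes)."""
    if text is None:
        return None
    return text.encode("ascii", "ignore").decode("ascii")


def ascii_only(df):
    """Apply `to_ascii` to every column, accumulating the result."""
    # Imported lazily so `to_ascii` can be tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    to_ascii_udf = F.udf(to_ascii, StringType())
    result = df
    for name in df.columns:
        # Build on `result`, not on the original `df`, so earlier columns stay converted.
        result = result.withColumn(name, to_ascii_udf(result[name]))
    return result


# Plain-Python check of the conversion function.
print(repr(to_ascii("ây")))
```

This assumes every column is a string column, as in the sample data. The `regexp_replace` approach is usually the cheaper option, since it avoids shipping each row through Python.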


python-3.x

pyspark

pyspark-dataframes
