Error converting Unicode values - Databricks notebook

Question

I created a function in a Databricks notebook to remove accents from words:

import unicodedata
import sys

from pyspark.sql.functions import translate, regexp_replace

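def make_trans():
    """Build two aligned strings: accented characters and their base characters."""
    matching_string = ""
    replace_string = ""

    for i in range(ord(" "), sys.maxunicode):
        name = unicodedata.name(chr(i), "")
        # Names such as "LATIN SMALL LETTER E WITH ACUTE" mark accented variants.
        if "WITH" in name:
            try:
                # Look up the base character, e.g. "LATIN SMALL LETTER E".
                base = unicodedata.lookup(name.split(" WITH")[0])
                matching_string += chr(i)
                replace_string += base
            except KeyError:
                # No character exists under the shortened name; skip it.
                pass

    return matching_string, replace_string

def clean_text(c):
    matching_string, replace_string = make_trans()
    return translate(
        # Remove combining marks first, then map accented characters to their base.
        regexp_replace(c, r"\p{M}", ""),
        matching_string, replace_string
    ).alias(c)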
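
For anyone who wants to see what this expression does to a single string, here is a minimal plain-Python sketch with no Spark involved. The helper names build_accent_map and strip_accents are hypothetical (they are not part of the code above), and the category check is only a stand-in for the \p{M} regex:

```python
import sys
import unicodedata


def build_accent_map():
    """Map each code point whose name contains ' WITH' to its base character."""
    table = {}
    for code_point in range(ord(" "), sys.maxunicode + 1):
        name = unicodedata.name(chr(code_point), "")
        if " WITH" in name:
            try:
                base = unicodedata.lookup(name.split(" WITH")[0])
            except KeyError:
                # No character exists under the shortened name; skip it.
                continue
            table[code_point] = base
    return table


# Build the table once; scanning every code point is the slow part.
ACCENT_MAP = build_accent_map()


def strip_accents(text):
    """Drop combining marks, then replace accented characters with their base."""
    without_marks = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return without_marks.translate(ACCENT_MAP)


print(strip_accents("Café crème"))  # Cafe creme
```

The table is built once at module level here, whereas clean_text above rebuilds it on every call.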

However, I cannot use it to change the values in the DataFrame. It works inside a select, but the following assignment raises an error:

df['productName'] = clean_text(df['productName'])

TypeError: Column is not iterable

This command runs successfully:

df.select(clean_text("productName"))

Do I have to loop over the rows one at a time? Is this the right way to use Spark + Databricks?

Answer 1

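DataFrames are immutable, so you cannot assign new values to a column in place. You can, however, add a new column with withColumn, which returns a new DataFrame. So in your case:

df = df.withColumn("cleanProductName", clean_text("productName"))

Pass the column name as a string here, as in your working select. clean_text ends with .alias(c), and handing alias a Column instead of a string is the likely source of "TypeError: Column is not iterable".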

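This "feels" like duplication at first. But keep in mind that DataFrames are immutable: a transformation never alters the existing DataFrame, it only describes a new one. Think of it as a view in a SQL database. That is why select works.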

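If you really need to, you can drop the old column from the DataFrame. But unless you actually use that column (for example through a select *), it has no effect on overall performance.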
