Remove strange characters from tokenization array


I have a very dirty PySpark DataFrame, i.e. one full of strange characters.

I am doing data processing and cleaning (tokenization, stopword removal, ...), and this is my DataFrame:

+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|             content|score|label|classWeigth|               words|            filtered|       terms_stemmed|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|absolutely love d...|    5|    1|       0.48|[absolutely, love...|[absolutely, love...|[absolut, love, d...|
|absolutely love t...|    5|    1|       0.48|[absolutely, love...|[absolutely, love...|[absolut, love, g...|
|absolutely phenom...|    5|    1|       0.48|[absolutely, phen...|[absolutely, phen...|[absolut, phenome...|
|absolutely shocki...|    1|    0|       0.52|[absolutely, shoc...|[absolutely, shoc...|[absolut, shock, ...|
|accept the phone ...|    1|    0|       0.52|[accept, the, pho...|[accept, phone, n...|[accept, phone, n...|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
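
For reference, the words and filtered columns come from a tokenization and stopword-removal step. A minimal sketch of such a pipeline, assuming pyspark.ml's Tokenizer and StopWordsRemover (the stemming step that produces terms_stemmed is omitted here):

>>> from pyspark.ml.feature import Tokenizer, StopWordsRemover
>>> # Split the raw text into tokens, then drop English stopwords.
>>> tokenizer = Tokenizer(inputCol='content', outputCol='words')
>>> remover = StopWordsRemover(inputCol='words', outputCol='filtered')
>>> df = remover.transform(tokenizer.transform(df))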

How can I access the words column and remove all the strange characters like the ones mentioned above?

Try this UDF.

>>> from pyspark.sql.functions import udf
>>> @udf('array<string>')
... def filter_udf(a):
...     # Import the builtin filter explicitly, in case a star import of
...     # pyspark.sql.functions has shadowed it with the SQL filter function.
...     from builtins import filter
...     # Keep only tokens made entirely of ASCII characters (str.isascii needs Python 3.7+).
...     return list(filter(lambda s: s.isascii(), a))
... 

>>> df = spark.createDataFrame([(['pyspark','பரமசிவம்'],)])
>>> df.select(filter_udf('_1')).show()
+--------------+
|filter_udf(_1)|
+--------------+
|     [pyspark]|
+--------------+
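
To clean the DataFrame from the question, the same UDF can be applied to each tokenized column. A minimal usage sketch, assuming the column names shown above (words, filtered, terms_stemmed):

>>> # Overwrite each array column with its ASCII-only tokens.
>>> for c in ['words', 'filtered', 'terms_stemmed']:
...     df = df.withColumn(c, filter_udf(c))
... 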