Remove strange characters from tokenization array
I have a very dirty PySpark dataframe, i.e. one full of strange characters, for example:
- ɴɪᴄᴇ ᴏɴᴇ ᴀᴩᴩ
- பரமசிவம்
- and many others
I am doing data processing and cleaning (tokenization, stop-word removal, ...), and this is my dataframe:
content | score | label | classWeigth | words | filtered | terms_stemmed |
---|---|---|---|---|---|---|
absolutely love d... | 5 | 1 | 0.48 | [absolutely, love... | [absolutely, love... | [absolut, love, d... |
absolutely love t... | 5 | 1 | 0.48 | [absolutely, love... | [absolutely, love... | [absolut, love, g... |
absolutely phenom... | 5 | 1 | 0.48 | [absolutely, phen... | [absolutely, phen... | [absolut, phenome... |
absolutely shocki... | 1 | 0 | 0.52 | [absolutely, shoc... | [absolutely, shoc... | [absolut, shock, ... |
accept the phone ... | 1 | 0 | 0.52 | [accept, the, pho... | [accept, phone, n... | [accept, phone, n... |
How can I access the words column and remove all the strange characters like the ones mentioned above?
Try this UDF.
>>> from pyspark.sql.functions import udf
>>> @udf('array<string>')
... def filter_udf(a):
...     # make sure we call Python's builtin filter, not a shadowed name
...     from builtins import filter
...     # keep only tokens made entirely of ASCII characters (str.isascii needs Python 3.7+)
...     return list(filter(lambda s: s.isascii(), a))
...
>>> df = spark.createDataFrame([(['pyspark','பரமசிவம்'],)])
>>> df.select(filter_udf('_1')).show()
+--------------+
|filter_udf(_1)|
+--------------+
| [pyspark]|
+--------------+
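To apply this to the dataframe from the question, pass the tokenized column to the UDF with withColumn. A minimal sketch, assuming your dataframe is named df, the token column is words, and words_ascii is just an illustrative name for the new column:

>>> # keep only the ASCII tokens from the existing 'words' column (names are illustrative)
>>> df = df.withColumn('words_ascii', filter_udf('words'))

If you are on Spark 3.1 or later, you can get the same result without a Python UDF by using the built-in higher-order filter function together with an ASCII-only regex; this is only a sketch under that version assumption:

>>> from pyspark.sql import functions as F
>>> # Spark 3.1+ only: keep tokens that consist entirely of ASCII characters
>>> df = df.withColumn('words_ascii', F.filter(F.col('words'), lambda w: w.rlike(r'^[\x00-\x7F]+$')))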