More efficient way to replace special characters with their Unicode names in a pandas DataFrame
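I have a large pandas dataframe and would like to perform a thorough text cleaning on it. For this, I have written the code below, which checks whether a character is an emoji, a number, a Roman numeral, or a currency symbol, and replaces it with its Unicode name from the unicodedata package.
The code uses a double for loop, and I believe there must be a far more efficient solution than that, but I haven't yet figured out how to implement it in a vectorized manner.
My current code is as follows: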
from unicodedata import category, name as unicodename

def clean_text(text):
    newtext = ""  # accumulator for the cleaned string
    for item in text:
        for char in item:
            # Simple space
            if char == ' ':
                newtext += char
            # Letters
            elif category(char)[0] == 'L':
                newtext += char
            # Other symbols: emojis
            elif category(char) == 'So':
                newtext += f" {unicodename(char)} "
            # Decimal numbers
            elif category(char) == 'Nd':
                newtext += f" {unicodename(char).replace('DIGIT ', '').lower()} "
            # Letterlike numbers e.g. Roman numerals
            elif category(char) == 'Nl':
                newtext += f" {unicodename(char)} "
            # Currency symbols
            elif category(char) == 'Sc':
                newtext += f" {unicodename(char).replace(' SIGN', '').lower()} "
            # Punctuation, invisibles (separator, control chars), maths symbols...
            else:
                newtext += " "
    return newtext
I currently use this function on my dataframe with apply:
df['Texts'] = df['Texts'].apply(lambda x: clean_text(x))
Sample data:
import pandas as pd
l = [
"thumbs ups should be replaced: ",
"hearts also should be replaced: ❤️️❤️️❤️️❤️️",
"also other emojis: ☺️☺️",
"numbers and digits should also go: 40/40",
"Ⅰ, Ⅱ, Ⅲ these are roman numerals, change 'em"
]
df = pd.DataFrame(l, columns=['Texts'])
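A good start is to avoid doing more work than necessary:
- Once you have resolved the representation for a character, cache it (lru_cache() does that for you).
- Don't call category() and name() more often than you need to.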
from functools import lru_cache
from unicodedata import name as unicodename, category

@lru_cache(maxsize=None)
def map_char(char: str) -> str:
    if char == " ":  # Simple space
        return char
    cat = category(char)
    if cat[0] == "L":  # Letters
        return char
    # The default "" avoids a ValueError for characters without a name
    # (e.g. control characters such as "\n"); those fall through to " ".
    name = unicodename(char, "")
    if cat == "So":  # Other symbols: emojis
        return f" {name} "
    if cat == "Nd":  # Decimal numbers
        return f" {name.replace('DIGIT ', '').lower()} "
    if cat == "Nl":  # Letterlike numbers e.g. Roman numerals
        return f" {name} "
    if cat == "Sc":  # Currency symbols
        return f" {name.replace(' SIGN', '').lower()} "
    # Punctuation, invisibles (separator, control chars), maths symbols...
    return " "

def clean_text(text):
    for item in text:
        new_text = "".join(map_char(char) for char in item)
        # ...
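To make the idea concrete, here is a minimal, self-contained sketch of how the cached per-character mapping might be wired into a complete clean_text. It assumes that clean_text receives one string at a time (which is what Series.apply passes for a column of strings), so the outer loop over items is dropped. That simplification and the compact grouping of the "So" and "Nl" branches are my assumptions, not part of the original code.

```python
from functools import lru_cache
from unicodedata import category, name as unicodename


@lru_cache(maxsize=None)
def map_char(char: str) -> str:
    """Map one character to its replacement; results are cached per character."""
    if char == " ":  # Simple space
        return char
    cat = category(char)  # Looked up once per distinct character
    if cat[0] == "L":  # Letters pass through unchanged
        return char
    if cat in ("So", "Nl"):  # Emojis/other symbols and letterlike numbers
        return f" {unicodename(char)} "
    if cat == "Nd":  # Decimal digits: "DIGIT FOUR" -> "four"
        return f" {unicodename(char).replace('DIGIT ', '').lower()} "
    if cat == "Sc":  # Currency symbols: "DOLLAR SIGN" -> "dollar"
        return f" {unicodename(char).replace(' SIGN', '').lower()} "
    # Punctuation, invisibles (separator, control chars), maths symbols...
    return " "


def clean_text(text: str) -> str:
    """Clean a single string (assumed to be one cell of the Texts column)."""
    return "".join(map_char(char) for char in text)


if __name__ == "__main__":
    for sample in ["numbers and digits should also go: 40/40", "price: 5$"]:
        print(clean_text(sample))
    # Hit/miss counters show how often the cache avoided recomputation.
    print(map_char.cache_info())
```

With the sample dataframe from the question, this could be used as df['Texts'] = df['Texts'].map(clean_text). The work is still done character by character in Python, so this is not vectorized in the NumPy sense; the gain comes from each distinct character being classified only once.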