Accounting for Special Characters in ASCII

Question

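I am trying to filter non-English apps out of a dataset for a problem I am working on.

How can I remove the non-English apps from the dataset? My initial approach was to check whether the string can be encoded using only ASCII characters. If it cannot, the string contains characters from other alphabets or special characters.

Testing this approach on a few toy examples gives: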

def is_english(app_name):
    # Round-trip through UTF-8 bytes; decoding as ASCII fails on any non-ASCII character.
    try:
        app_name.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

print(is_english('Instagram'))                      # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))   # False
print(is_english('Docs To Go™ Free Office Suite'))  # False
print(is_english('Instachat 😜'))                   # False

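Clearly, the initial approach has a problem: 'Docs To Go™ Free Office Suite' and 'Instachat 😜', both English apps, are classified as non-English because they contain special characters ('™' and '😜').

Any suggestions on how to allow special characters such as '™' and emojis?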

Answer 1

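You could define a function that counts how many characters are likely to be English characters and returns True when their share is above a certain threshold. It is still not 100% perfect (think of German words that use the same letters, such as Tastatur [keyboard]), but it may be a start: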

import re

def is_probably_english(app_name, threshold=0.9):
    # Characters counted as "English": ASCII letters, digits, underscore, hyphen, space.
    rx = re.compile(r'[-a-zA-Z0-9_ ]')
    matches = [char for char in app_name if rx.search(char)]
    # Share of the name made up of those characters (assumes a non-empty name).
    quotient = len(matches) / len(app_name)
    return quotient >= threshold, quotient


print(is_probably_english('Instagram'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))

This yields

(True, 1.0)
(False, 0.3157894736842105)
(True, 0.9655172413793104)
(True, 0.9090909090909091)

Answer 2

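Here is one way:

def is_english(app_name):
    # In UTF-8 only ASCII characters take exactly one byte, so the two lengths
    # match only when every character in the name is ASCII.
    y = app_name.encode()
    return len(app_name) == len(y)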
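
The length comparison works because UTF-8 encodes every ASCII character as exactly one byte and every other character as two to four bytes. The quick check below is an illustration added here, not part of the original answer, and the sample characters are arbitrary. It also suggests that this test accepts the same names as the ASCII check in the question, so '™' and emojis would still be rejected.

```python
# Byte length of single characters when encoded as UTF-8.
samples = ['A', 'ß', '™', '😜']
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
print(byte_lengths)  # {'A': 1, 'ß': 2, '™': 3, '😜': 4}
```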

Answer 3

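You could look into whether you can obtain dictionaries (for example, spell-check dictionaries) for various world languages. If a word in the app name is not an English word, check whether it appears in a foreign-language dictionary. If it does, the app is more likely to be foreign.
You could look at which script the name is written in. This would rule out your example where the name consists mostly of CJK characters.
You could apply your current approach, but first filter out characters from certain Unicode categories (for example, "symbol" characters).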
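
The last suggestion could be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the helper names `strip_symbols` and `is_english_ignoring_symbols` are hypothetical, and it assumes that dropping every character whose Unicode general category starts with "S" (symbols) is acceptable for the dataset. It relies on `str.isascii()`, which needs Python 3.7 or later.

```python
import unicodedata

def strip_symbols(text):
    """Drop characters whose Unicode general category starts with 'S' (symbols)."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('S'))

def is_english_ignoring_symbols(app_name):
    """ASCII-only check applied after symbol characters are removed."""
    return strip_symbols(app_name).isascii()

print(is_english_ignoring_symbols('Instagram'))
print(is_english_ignoring_symbols('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_ignoring_symbols('Docs To Go™ Free Office Suite'))
print(is_english_ignoring_symbols('Instachat 😜'))
```

Both '™' and '😜' fall in the "So" (Symbol, other) category, so they are removed before the ASCII test. Emoji sequences that include combining marks or joiners would need extra handling.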

Answer 4

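You can use the Unicode database to get a character's category and name. For example, "T" has the category "Lu" (Letter, uppercase) and the name "LATIN CAPITAL LETTER T". The full set of categories is documented at https://unicodebook.readthedocs.io/unicode.html. This example accepts Latin letters, digits, and any character that is neither a letter nor a number. It may need refinement to catch more cases.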

import unicodedata

# See unicode categories at 
# https://unicodebook.readthedocs.io/unicode.html#categories

def is_englishy(c):
    """Is character expected in english text"""
    category = unicodedata.category(c)
    if category.startswith("L"):
        # letter. accept latin
        name = unicodedata.name(c)
        return name.startswith("LATIN")
    if category.startswith("N"):
        # number. accept digit
        name = unicodedata.name(c)
        return name.startswith("DIGIT")
    # accepting everything else
    return True

def is_english(app_name):
    return all(is_englishy(c) for c in app_name)

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
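
One possible refinement, sketched here as an assumption rather than as part of the original answer, is to borrow the threshold idea from Answer 1 and measure the share of letters that are Latin instead of rejecting a name on its first non-Latin letter. The names `latin_letter_share` and `is_mostly_latin` and the 0.9 cutoff are hypothetical choices for illustration.

```python
import unicodedata

def latin_letter_share(text):
    """Fraction of the letters in text whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith('L')]
    if not letters:
        # No letters at all (e.g. only digits or symbols): nothing to reject.
        return 1.0
    latin = [ch for ch in letters
             if unicodedata.name(ch, '').startswith('LATIN')]
    return len(latin) / len(letters)

def is_mostly_latin(app_name, threshold=0.9):
    """True when at least `threshold` of the letters are Latin."""
    return latin_letter_share(app_name) >= threshold

print(is_mostly_latin('Instagram'))
print(is_mostly_latin('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_mostly_latin('Docs To Go™ Free Office Suite'))
print(is_mostly_latin('Instachat 😜'))
```

Like the other approaches, this only identifies the script, so names in other Latin-script languages would still pass.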

Tags: python, ascii, utf-8, data-science