How do I convert Unicode text into text that Python can read, so that I can find a specific word in web-scraping results?

Question

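I am trying to scrape text from Instagram and check whether certain keywords appear in a user's bio. Some users write in special fonts, so I cannot match specific words. How can I remove the font or formatting from the text so that I can search for words?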

import re
test="      . "


x = re.findall(re.compile('past'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Another example:

import re
test="ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
test=test.lower()

x = re.findall(re.compile('graphic'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Answer 1

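You can use unicodedata.normalize, which returns the normal form of a Unicode string. The NFKD form replaces compatibility characters, such as styled or full-width letters, with their plain equivalents, and encoding to ASCII with errors ignored then drops whatever is left. For your example, see the following code snippet: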

import re
import unicodedata

test="      . "
 
# NFKD folds styled letters to their base letters; non-ASCII leftovers are then dropped
formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('ascii')

x = re.findall(re.compile('past'), formatted_test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

The output will be:

TEXT FOUND
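
One caveat on the second example from the question: the small-capital letters in that string have no compatibility decomposition, so NFKD leaves them unchanged and the ASCII step then discards them. A minimal sketch of one way around this is to translate known look-alikes before normalizing. The LOOKALIKES table and the to_plain_ascii helper are hypothetical names introduced here for illustration, the table covers only the letters in that one string, and the "past" input is a stand-in built from Mathematical Bold letters, not the string from the question.

```python
import unicodedata

# Hypothetical, hand-built table: it covers only the look-alike letters
# that occur in the example bio, not every small-capital code point.
LOOKALIKES = str.maketrans({
    "ғ": "f", "ʀ": "r", "ᴇ": "e", "ʟ": "l", "ᴀ": "a", "ɴ": "n",
    "ᴄ": "c", "ɢ": "g", "ᴘ": "p", "ʜ": "h", "ɪ": "i", "ᴅ": "d",
})

def to_plain_ascii(text):
    """Map known look-alikes, apply NFKD, then drop whatever is still non-ASCII."""
    text = text.translate(LOOKALIKES)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii").lower()

# Stand-in styled input: "past" shifted into the Mathematical Bold Small
# block, which starts at U+1D41A. NFKD alone folds these back to ASCII.
styled = "".join(chr(0x1D41A + ord(c) - ord("a")) for c in "past")
print("past" in to_plain_ascii(styled))

# The small capitals from the second example need the table above.
bio = "ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
print("graphic" in to_plain_ascii(bio))
```

Both checks should print True. A hand-built table only catches the characters you list, so for bios in the wild you would have to extend it as new look-alikes turn up.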

python

web-scraping

python-3.x

python-unicode

python-re