Removing non-Unicode characters from a file

Question

I know this is a repeat question, but I have tried every solution I could find so far. Can anyone help me get rid of characters like \xc3\xa2\xc2\x84\xc2\xa2 in a file?

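The content of the file I am currently trying to clean is: b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""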

I tried re.sub('[^\x00-\x7F]+',' ',whatevertext), but it got me nowhere. I suspect the \ here is not being treated as a special character.

Answer 1

You can do it like this:

>>> f = open("test.txt","r")
>>> whatevertext = f.read()
>>> print whatevertext
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""

>>> import re
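>>> result = re.sub(r'\\x[0-9a-fA-F]{2}', '', whatevertext)
>>> print result
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves Potato Chips for serving']"""

In the regular expression r'\\x[0-9a-fA-F]{2}', the raw-string \\ matches one literal backslash, x matches the letter x, and [0-9a-fA-F]{2} matches exactly two hexadecimal digits. The file stores these escapes as plain ASCII text (a backslash, an x, and two hex digits) rather than as actual non-ASCII bytes, which is why [^\x00-\x7F] finds nothing to replace.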
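To illustrate the point, here is a minimal sketch that wraps the substitution in a function and also shows why the ASCII-range pattern from the question has no effect. The names `HEX_ESCAPE`, `strip_hex_escapes`, and `sample` are hypothetical and chosen only for this example. The sketch assumes the file contains the escapes as literal text, as the printed content above suggests.

```python
import re

# Matches literal text: one backslash, the letter x, then exactly two hex digits.
HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')


def strip_hex_escapes(text):
    # Hypothetical helper: drop every literal backslash-x-NN sequence from text.
    return HEX_ESCAPE.sub('', text)


# Raw string, so the sample holds real backslash characters, as the file does.
sample = r"Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving"

# Every character in the sample is already ASCII, so the pattern from the
# question matches nothing and returns the text unchanged.
assert re.sub(r'[^\x00-\x7F]+', ' ', sample) == sample

cleaned = strip_hex_escapes(sample)
print(cleaned)  # Cape Cod Waves Potato Chips for serving
```

Note that this removes the sequence rather than recovering the character behind it. The bytes c3 a2 c2 84 c2 a2 look like a trademark sign (U+2122) that was UTF-8 encoded twice, so if the symbol matters, fixing the encoding where the file is produced is likely a better route than stripping. The pattern would also delete any legitimate text that happens to look like a backslash-x escape.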

ascii

non-ascii-characters

non-unicode

python-2.7

python-unicode