Get a proper list from a list containing a unicode string

Question

I have a list that contains a single unicode string, and that string is itself formatted like a list.

my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']

I want a list that I can iterate over, for example:

name_list = [James, Williams, Kevin, Parker, Alex, Emma, Katie, Annie]

I have tried several of the possible solutions here, but none of them worked for me.

# Tried
name_list =  name_list.encode('ascii', 'ignore').decode('utf-8')

#Gives unicode return type

# Tried
ast.literal_eval(name_list)

#Gives me invalid token error

Answer 1

import unicodedata

lst = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
lst = unicodedata.normalize("NFKD", lst[0])
lst2 = lst[1:-1].split(", ") # remove open and close brackets
print(lst2)

The output will be:

['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie ', 'Annie']

If you want to remove all leading/trailing whitespace:

lst3 = [i.strip() for i in lst2]
print(lst3)

The output will be:

['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']
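
The two steps above can be wrapped in a small helper. The following is only a sketch: the name parse_name_list is hypothetical (it does not come from the answer), and it assumes the input string always begins with '[' and ends with ']'.

```python
import unicodedata


def parse_name_list(raw):
    # Hypothetical helper: assumes `raw` is wrapped in one pair of brackets.
    # NFKD maps the non-breaking space (U+00A0) to an ordinary space.
    normalized = unicodedata.normalize("NFKD", raw)
    # Drop the surrounding brackets, split on commas, trim each piece.
    return [part.strip() for part in normalized[1:-1].split(",")]


lst = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
names = parse_name_list(lst[0])
print(names)
```

Keep in mind that NFKD also decomposes accented letters (for example, é becomes e plus a combining accent), so if the names may contain such characters, replacing u'\xa0' with a plain space directly may be the safer choice.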

Answer 2

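First, lists do not have an encode method; you have to apply any string method to the items in the list.

Second, if you are looking to normalize the string, you can use the normalize function from Python's unicodedata library (read more here). With the NFKD form it converts the unwanted '\xa0' character (a non-breaking space) into an ordinary space, and it will help you normalize any other characters as well.

Then, instead of using eval, which is generally unsafe, use a list comprehension to build the list: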

import unicodedata

li = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
inner_li = unicodedata.normalize("NFKD", li[0]) #<--- notice the list selection

#get only part of the string you want to convert into a list
new_li = [i.strip() for i in inner_li[1:-1].split(',')] 
new_li
>> ['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']

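Note that in your expected output the items are actually a list of variables, which will give you an error unless they are declared beforehand.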
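
To illustrate that last point, here is a minimal sketch (the two-name strings are made up for the example). It shows that ast.literal_eval rejects unquoted names because they are not literals, while the quoted version parses into a list of strings that can be iterated:

```python
import ast

# Unquoted names are not literals, so literal_eval rejects them.
try:
    ast.literal_eval("[James, Williams]")
except (ValueError, SyntaxError) as exc:
    print("rejected:", type(exc).__name__)

# With quotes, the same text is a list of string literals and parses fine.
name_list = ast.literal_eval("['James', 'Williams']")
for name in name_list:
    print(name)
```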

Answer 3

This is a good application for regular expressions:

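import re

my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']
body = re.findall(r"\[\s*(.+)\s*]", my_list[0])[0]  # extract the text inside the []s
names = re.split(r"\s*,\s*", body)  # split on commas, dropping surrounding whitespace
# ['James', 'Williams', 'Kevin', 'Parker', 'Alex', 'Emma', 'Katie', 'Annie']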
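
One caveat, offered as a sketch rather than a correction: the snippet above relies on \s matching the non-breaking space '\xa0'. That is the default for str patterns on Python 3, but on Python 2 it only happens when the re.UNICODE flag is set. The variant below passes the flag explicitly so that it should behave the same on both; the lazy group (.+?) and the use of re.search are adjustments that are not part of the original answer.

```python
import re

my_list = [u'[James, Williams, Kevin, Parker, Alex, Emma, Katie\xa0, Annie]']

# re.UNICODE makes \s match the non-breaking space on Python 2 as well;
# on Python 3 this is already the default for str patterns.
body = re.search(r"\[\s*(.+?)\s*\]", my_list[0], re.UNICODE).group(1)
names = re.split(r"\s*,\s*", body, flags=re.UNICODE)
print(names)
```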

Tags: python, unicode, list, unicode-escapes