Python 从字符串列表中删除空格变体的最佳实践

Question

是否有从 Python 中的字符串中删除奇怪的空白 unicode 字符的最佳做法？

例如，如果一个字符串在此 table 中包含以下 unicode 之一，我想将其删除。

我正在考虑将 unicode 放入一个列表中，然后使用 replace 进行循环，但我确信有一种更 pythonic 的方式来这样做。

Answer 1

在这种情况下，正则表达式是您的好帮手，您可以简单地使用正则表达式替换来遍历您的列表

import re
r = re.compile(r"^\s+")

dirty_list = [...]
# iterate over dirty_list substituting
# any whitespace with an empty string
clean_list = [
  r.sub("", s)
  for s in dirty_list
]

Answer 2

你应该可以使用这个

[''.join(letter for letter in word if not letter.isspace()) for word in word_list]

因为如果您阅读 str.isspace 的文档，它会说：

Return True if there are only whitespace characters in the string and there is at least one character, False otherwise.

A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.

如果您查看类别 Zs 的 unicode character list。

Python 从字符串列表中删除空格变体的最佳实践

Python Best Practice to Remove Whitespace Variations from List of Strings

python

removing-whitespace

data-cleaning