Removing punctuation/symbols from a list with Python, except periods and commas
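In Python, I need to remove almost all punctuation from a list, but keep periods and commas. Should I create a function to do this, or a variable? Basically, I want to remove every symbol except letters (I have already converted uppercase letters to lowercase), periods, and commas (and possibly apostrophes).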
# Clean tokens up (remove symbols except ',' and '.')
def depunctuate(lc_tokens):
    allowed = set("abcdefghijklmnopqrstuvwxyz.,")  # the [a-z.,] idea as a set
    clean_tokens = []
    for i in lc_tokens:
        if i not in allowed:
            ...
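You can build a set of unwanted punctuation from string.punctuation, which provides a string containing the punctuation characters, and then use a list comprehension to filter out the tokens contained in that set: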
import string
to_delete = set(string.punctuation) - {'.', ','}  # all punctuation except comma and full stop
clean_tokens = [x for x in lc_tokens if x not in to_delete]
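To see what this filter does in practice, here is a minimal sketch. It assumes lc_tokens is a list of lowercase tokens in which each punctuation mark is already its own token; the sample list is made up for illustration. Note that the comprehension drops tokens that are themselves punctuation, while punctuation embedded in a longer token (such as "don't!") is left untouched.

```python
import string

# Hypothetical sample input: punctuation already split into separate tokens.
lc_tokens = ["hello", ",", "world", "!", "it", "works", ".", "(", "ok", ")"]

# All ASCII punctuation except the two characters we want to keep.
to_delete = set(string.punctuation) - {".", ","}

# Keep every token that is not one of the unwanted punctuation marks.
clean_tokens = [x for x in lc_tokens if x not in to_delete]

print(clean_tokens)
# ['hello', ',', 'world', 'it', 'works', '.', 'ok']
```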
import string
# Create a set of all allowed characters.
# {...} is the syntax for a set literal in Python.
allowed = {",", "."}.union(string.ascii_lowercase)
# This is our starting string.
lc_tokens = 'hello, "world!"'
# Now we use list comprehension to only allow letters in our allowed set.
# The result of list comprehension is a list, so we use "".join(...) to
# turn it back into a string.
filtered = "".join([letter for letter in lc_tokens if letter in allowed])
# Our final result has everything but lowercase letters, commas, and
# periods removed.
assert filtered == "hello,world"
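The question also asks whether this should be a function, and mentions possibly keeping apostrophes. One way to combine both is sketched below. The name depunctuate comes from the question, while the keep parameter and the sample tokens are assumptions for illustration. The sketch applies the same character filter to each token in a list and drops tokens that end up empty.

```python
import string


def depunctuate(tokens, keep=".,'"):
    """Strip every character that is not a lowercase letter or in `keep`."""
    allowed = set(string.ascii_lowercase) | set(keep)
    # Filter each token character by character.
    cleaned = ("".join(ch for ch in tok if ch in allowed) for tok in tokens)
    # Discard tokens that consisted only of removed characters.
    return [tok for tok in cleaned if tok]


# Hypothetical sample input with punctuation attached to the words.
lc_tokens = ["hello,", '"world!"', "don't", "--", "stop."]
print(depunctuate(lc_tokens))
# ['hello,', 'world', "don't", 'stop.']
```

Because the filter is an allow-list, digits and whitespace are removed too. Passing keep=".," reproduces the stricter behaviour of the answer above.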