How to return the word count given one list of strings and one list of lists?
Suppose I have a long list of lists whose strings contain punctuation, spaces, and so on, like this:
list_1 = [['the guy was plaguy but unable to play football, but he was able to play tennis'],['That was absolute cool'],...,['This is an implicit living.']]
I also have another long list like this:
list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']
How can I extract, for each sublist of list_1, the count or frequency of all the words that appear in list_2? For example, for the lists above:
list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']
[the guy was unable to play football, but he was able to play tennis]
Since unable from list_2 appears in the sublist above, the count for this sublist is 1.
list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']
[That was absolute cool]
Since no word from list_2 appears in the sublist above, the count is 0.
list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']
[This is an implicit living.]
Since implicit and living from list_2 appear in the sublist above, the count for this sublist is 2.
The desired output is [1,0,2].
Any idea how to approach this task so that it returns the list of counts? Thanks in advance, everyone.
For example:
>>> [sum(1 for word in list_2 if word in sentence) for sublist in list_1 for sentence in sublist]
is wrong, because it confuses the two words guy and plaguy. Any idea how to fix this?
Use the built-in function sum and a list comprehension:
>>> list_1 = [['the guy was unable to play football, but he was able to play tennis'],['That was absolute cool'],['This is implicit living.']]
>>> list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']
>>> [sum(1 for word in list_2 if word in sentence) for sublist in list_1 for sentence in sublist]
[1, 0, 2]
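One caveat with the snippet above: `word in sentence` on a string is a substring test, which is the confusion the question describes. The sketch below contrasts it with a test against whitespace-separated tokens. The `targets` list is hypothetical, chosen only to expose the difference; it is not the asker's list_2.

```python
# A str `in` test matches substrings, so "guy" is found inside "plaguy"
# and "able" is found inside "unable".
sentence = "the guy was plaguy but unable to play football"
targets = ["guy", "able"]  # hypothetical words, chosen to expose the issue

# Substring test on the raw string.
substring_matches = [w for w in targets if w in sentence]

# Whole-word test: membership in the list of whitespace-separated tokens.
tokens = sentence.split()
whole_word_matches = [w for w in targets if w in tokens]

print(substring_matches)   # both words are reported
print(whole_word_matches)  # only the standalone word is reported
```

Here "able" is reported by the substring test only because it sits inside "unable", so the two approaches can give different counts on the same input.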
The trick is to use the split() method and a list comprehension. If you only split on whitespace:
list_1 = ["the guy was unable to play football but he was able to play tennis", "That was absolute cool", "This is implicit living"]
list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']
# list_1 is a flat list of strings here, so iterate over it directly
print([sum(1 for j in list_2 if j in i.split()) for i in list_1])
However, if you want to split the words on every non-alphanumeric character, you should use re:
import re
list_1 = ["the guy was unable to play football,but he was able to play tennis", "That was absolute cool", "This is implicit living"]
list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']
# Count the words of list_2 that appear among the tokens of each sentence
print([sum(1 for j in list_2 if j in re.split(r"\W", i)) for i in list_1])
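\W matches any character that is not a word character, that is, anything other than a letter, a digit, or an underscore.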
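To make the difference between the two tokenizers concrete, here is a small sketch under a few assumptions: the sentence is the one from this answer with a trailing period added, and `targets` is a hypothetical set built from list_2 so that membership tests stay cheap when list_2 is long. Note that this variant counts every matching token, not distinct words.

```python
import re

sentence = "the guy was unable to play football,but he was able to play tennis."
list_2 = ['unable', 'unquestioning', 'implicit', 'living', 'relative', 'comparative']
targets = set(list_2)  # hypothetical helper: set lookup instead of scanning the list

# Whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = sentence.split()

# Splitting on runs of non-word characters separates "football" and "but",
# but leaves an empty string when the sentence ends with punctuation.
regex_tokens = re.split(r"\W+", sentence)

# Extracting runs of word characters avoids the empty strings altogether.
word_tokens = re.findall(r"\w+", sentence)

# Number of tokens that are listed words (occurrences, not distinct words).
count = sum(1 for token in word_tokens if token in targets)
print(count)
```

The empty strings produced by re.split are harmless for the membership test, since an empty string is never in list_2, but re.findall gives a cleaner token list if the tokens are reused elsewhere.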
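I would rather use a regular expression. First, because you need to match whole words, which is awkward with other string-search methods. Second, even though it may look like a bazooka, it is usually very efficient.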
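You first build a regular expression from list_2, then use it to search the sentences of list_1. The regular expression is constructed as "(\bword1\b|\bword2\b|...)", which means "either whole word1 or whole word2 or ...". \b matches at a word boundary, that is, at the beginning or end of a word.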
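As an illustration of what \b buys you, the sketch below builds such an alternation for two hypothetical words and runs it on a sentence that contains them both standalone and embedded in longer words. It is a slight variation on the pattern described above: it wraps the alternatives in a non-capturing group and passes each word through re.escape, which is an added precaution in case a word contains regex metacharacters.

```python
import re

words = ["guy", "able"]  # hypothetical words, chosen to expose the issue

# \b(?:guy|able)\b : either whole "guy" or whole "able".
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")

text = "the guy was plaguy but unable to play, yet able to watch"

# "plaguy" and "unable" are skipped: there is no word boundary
# immediately before the "guy" or "able" inside them.
matches = pattern.findall(text)
print(matches)
```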
I am assuming that you want one result per sublist of list_1, rather than one result per sentence of each sublist.
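import re

# Alternation of whole-word patterns: (\bword1\b|\bword2\b|...)
_regex = re.compile(r"(\b{}\b)".format(r"\b|\b".join(list_2)))
# Assumes list_1 is a list of lists of sentences, as in the question.
# One count per sublist: total matches over all sentences in that sublist.
word_counts = [
    sum(len(_regex.findall(sentence)) for sentence in sublist)
    for sublist in list_1
]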
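For convenience, the same idea can be wrapped in a function and run end to end. This is only a sketch: the name count_listed_words is hypothetical, the words are escaped with re.escape as a precaution, an empty word list is handled separately because an empty alternation would match at every word boundary, and matching is case-sensitive.

```python
import re

def count_listed_words(sublists, words):
    """Return, for each sublist of sentences, how many whole-word
    occurrences of the given words it contains (case-sensitive)."""
    if not words:
        # An empty alternation would match at every word boundary.
        return [0 for _ in sublists]
    pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, words))))
    return [
        sum(len(pattern.findall(sentence)) for sentence in sublist)
        for sublist in sublists
    ]

list_1 = [
    ['the guy was unable to play football, but he was able to play tennis'],
    ['That was absolute cool'],
    ['This is implicit living.'],
]
list_2 = ['unable', 'unquestioning', 'implicit', 'living', 'relative', 'comparative']

print(count_listed_words(list_1, list_2))  # [1, 0, 2]
```

Because findall returns every match, a word that occurs twice in a sentence is counted twice; if distinct words are wanted instead, the matches can be passed through set() before counting.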
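Here you can find a complete code sample with a performance comparison against plain string search, bearing in mind that matching whole words requires more work and is therefore less efficient.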