Create tuples consisting of pairs of words
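I have a string (or a list of words). I would like to create tuples of every possible word pair combination so that I can pass them to Counter for dictionary creation and frequency calculation. The frequency is calculated as follows: if the pair exists in a string (regardless of order, or of any other words between them), the frequency = 1 (even if word1 has a frequency of 7 and word2 a frequency of 3, the frequency of the pair word1 and word2 is still 1).
I am using loops to create tuples of all pairs, but got stuck: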
tweetList = ('I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work', 'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready')
words = set(tweetList.split())
n = 10
for tweet in tweetList:
    for word1 in words:
        for word2 in words:
            pairW = [(word1, word2)]
c1 = Counter(pairW for pairW in tweet)
c1.most_common(n)
However, the output is strange:
[('k', 1)]
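It seems to be iterating over letters rather than words.
How can I fix this? Should I convert the string to a list of words using split()?
Another question: how do I avoid creating duplicate tuples such as (word1, word2) and (word2, word1)? Enumerate?
As output I expect a dictionary where key = all word pairs (see the remark about duplicates, though) and value = the frequency of the pair in the list.
Thanks!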
tweet is a string, so Counter(pairW for pairW in tweet) will compute the frequency of the letters in tweet, which is probably not what you want.
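To make the difference concrete, here is a small sketch (the sample string is made up for illustration) comparing a Counter built from a string directly with one built from the result of split():

```python
from collections import Counter

tweet = "I went to work"  # illustrative sample, not the original data

# Iterating over a string yields single characters, so this counts letters
# (and spaces), not words.
letter_counts = Counter(ch for ch in tweet)

# Splitting on whitespace first yields words, so this counts whole words.
word_counts = Counter(tweet.split())

print(letter_counts.most_common(3))
print(word_counts.most_common(3))
```

So yes, calling split() before counting is the step that switches the iteration from letters to words.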
Not sure if this is what you are after:
import collections
import itertools

tweets = ['I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work',
          'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready']
words = set(word.lower() for tweet in tweets for word in tweet.split())
_pairs = list(itertools.permutations(words, 2))
# We need to clean up similar pairs: sort the words in each pair and then
# convert them to a tuple so we can convert the whole list into a set.
pairs = set(map(tuple, map(sorted, _pairs)))
c = collections.Counter()
for tweet in tweets:
    for pair in pairs:
        # Note: `in` on a string is a substring test, not a whole-word test.
        if pair[0] in tweet and pair[1] in tweet:
            c.update({pair: 1})
print(c.most_common(10))
The result is: [(('a', 'went'), 2), (('a', 'the'), 2), (('but', 'i'), 2), (('i', 'the'), 2), (('but', 'the'), 2), (('a', 'i'), 2), (('a', 'we'), 2), (('but', 'we'), 2), (('no', 'went'), 2), (('but', 'went'), 2)]
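One caveat: the check `pair[0] in tweet` is a substring test against the raw string. That is why ('a', 'went') scores 2, even though "a" is not a word in the second tweet. If whole-word matching is what you need, a possible variant is sketched below. It assumes that lower-casing and stripping surrounding punctuation is acceptable normalization, and the helper name `pair_counts` is made up for this example. Running `itertools.combinations` over the sorted set of distinct words yields each unordered pair once per tweet, so (word1, word2) and (word2, word1) are never both created.

```python
import string
from collections import Counter
from itertools import combinations

tweets = ['I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work',
          'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready']


def pair_counts(tweets):
    """Count, for each unordered word pair, how many tweets contain both words.

    Hypothetical helper: the normalization (lower-casing, stripping
    surrounding punctuation) is an assumption, not part of the question.
    """
    counts = Counter()
    for tweet in tweets:
        # Distinct words of this tweet, so repeats inside one tweet count once.
        words = {w.strip(string.punctuation).lower() for w in tweet.split()}
        words.discard("")
        # Sorting first means every pair comes out as (smaller, larger),
        # so each unordered pair is emitted exactly once per tweet.
        counts.update(combinations(sorted(words), 2))
    return counts


c = pair_counts(tweets)
print(c.most_common(10))
```

With this version a pair's count is the number of tweets in which both words occur as whole words, so it can never exceed len(tweets).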