Counting the number of times a unique data double appears in a double list (Python 3)

Suppose I have a double list in Python, [[],[]]:
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], 
              ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]

I want to count how many times doublelist[0][0] & doublelist[1][0] = all, the appears in the double list. The second [] is the index.

For example, you see one count at doublelist[0][0] doublelist[1][0] and another count at doublelist[0][6] doublelist[1][6].

What code would I use in Python 3 to iterate through doublelist[i][i], getting each value set, e.g. [["all"],["the"]], along with an integer value for the number of times that value set appears in the list?

Ideally, I would like to output this to a triple list, triplelist[[i],[i],[i]], containing the [i][i] values and the integer in the third [i].

Example code:

for i in range(len(triplelist[0])):
    print(triplelist[0][i])
    print(triplelist[1][i])
    print(triplelist[2][i])

Output:

>"all"
>"the"
>2
>"the"
>"big"
>1
>"big"
>"dogs"
>1

etc...

Also, it would preferably skip duplicates, so there would not be 2 indexes with [i][i][i] = [[all],[the],[2]] just because there are 2 instances in the original list ([0][0] [1][0] & [0][6] [1][6]). I only want every unique double word set along with the number of times it appears in the original.
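
For the sample doublelist above, the full result I am after would look like this (written out by hand to show the intended shape, not output from working code):

triplelist = [["all", "the", "big", "dogs", "eat", "chicken", "the", "small", "kids", "eat", "paste"],
              ["the", "big", "dogs", "eat", "chicken", "all", "small", "kids", "eat", "paste", "lumps"],
              [2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

Note that the ("all", "the") pair shows up only once here, carrying its count of 2, while every other pair occurs once.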

The purpose of the code is to look at how often one word follows another in a given text. It is essentially for building a smart Markov chain generator that weights word values. For that, I already have the code that breaks the text into a double list, with the words in the first list and the words that follow them in the second list.

Here is the code I have so far, for reference (the problem is that after I initialize wordlisttriple, I don't know how to make it do what I described above):

#import
import re #for regex expression below

#main
with open("text.txt") as rawdata:    #open text file and create a datastream
    rawtext = rawdata.read()    #read through the stream and create a string containing the text
rawtext = rawtext.replace('\n', ' ')    #remove newline characters from text
rawtext = rawtext.replace('\r', ' ')    #remove newline characters from text
rawtext = rawtext.replace('--', ' -- ')    #break up blah--blah words so it can read 2 separate words blah -- blah
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)    #regex pattern for grabbing everything from a capital letter up to sentence-ending punctuation
sentencelist = []    #initialize list for sentences in text
sentencelist = pat.findall(rawtext)    #apply regex pattern to string to create a list of all the sentences in the text
firstwordlist = []    #initialize the list for the first word in each sentence
for index, firstword in enumerate(sentencelist):    #enumerate through the sentence list
    sentenceindex = int(index)    #get the index for below operation
    firstword = sentencelist[sentenceindex].split(' ')[0]    #use split to only grab the first word in each sentence
    firstwordlist.append(firstword)    #append each sentence starting word to first word list
rawtext = rawtext.replace(', ', ' , ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('. ', ' . ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('"', ' " ')    #break up punctuation so they are not considered part of words
sentencelistforwords = []    #initialize sentence list for parsing words
sentencelistforwords = pat.findall(rawtext)    #run the regex pattern again this time with the punctuation broken up by spaces
wordsinsentencelist = []    #initialize list for all of the words that appear in each sentence
for index, words in enumerate(sentencelistforwords):    #enumerate through the punctuation-adjusted sentence list
    sentenceindex = int(index)    #grab the index for below operation
    words = sentencelistforwords[sentenceindex].split(' ')    #split up the words in each sentence so we have nested lists that contain each word in each sentence
    wordsinsentencelist.append(words)    #append above described to the list
wordlist = []    #initialize list of all words
wordlist = rawtext.split(' ')    #create list of all words by splitting the entire text by spaces
wordlist = list(filter(None, wordlist))    #use filter to get rid of empty strings in the list
wordlistdouble = [[], []]    #initialize the word list double to contain words and the words that follow them in sentences
for index, word in enumerate(wordlist):    #enumerate through word list
    if(int(index) < int(len(wordlist))-1):    #only go to 1 before the end of list so we don't get an index out of bounds error
        wordlistindex1 = int(index)    #grab index for first word
        wordlistindex2 = int(index)+1    #grab index for following word
        wordlistdouble[0].append(wordlist[wordlistindex1])    #append first word to first list of word list double
        wordlistdouble[1].append(wordlist[wordlistindex2])    #append following word to second list of word list double
wordlisttriple = [[], [], []]    #initialize word list triple
for index, unit in enumerate(wordlistdouble[0]):    #enumerate through word list double
    word1 = wordlistdouble[0][index]    #grab word at first list of word list double at the current index
    word2 = wordlistdouble[1][index]    #grab word at second list of word list double at the current index
    count = 0    #initialize word double data set counter
    wordlisttriple[0].append(word1)    #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[1].append(word2)    #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[2].append(count)    #these need to be encapsulated in some kind of loop/if/for idk
    #for index2, unit1 in enumerate(wordlistdouble[0]):
        #if wordlistdouble[0][index2] == word1 and wordlistdouble[1][index2] == word2:
            #count += 1

#sentencelist = list of all sentences
#firstwordlist = list of words that start sentencelist
#sentencelistforwords = list of all sentences mutated for ease of extracting words
#wordsinsentencelist = list of lists containing all of the words in each sentence
#wordlist = list of all words
#wordlistdouble = dual list of all words plus the words that follow them

Any advice would be greatly appreciated. If I am going about this the wrong way and there is an easier way to accomplish the same thing, that would be amazing as well. Thank you!

Assuming that you have already parsed the text into a list of words, you can create an iterator that starts from the second word, zip it with the words, and run the result through Counter:

from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)    # a second iterator over the same word list
next(nxt, None)    # pull one word out so it starts at the second word

print(*Counter(zip(words, nxt)).items(), sep='\n')

Output:

(('big', 'dogs'), 1)
(('kids', 'eat'), 1)
(('small', 'kids'), 1)
(('the', 'big'), 1)
(('dogs', 'eat'), 1)
(('eat', 'paste'), 1)
(('all', 'the'), 2)
(('chicken', 'all'), 1)
(('paste', 'lumps'), 1)
(('eat', 'chicken'), 1)
(('the', 'small'), 1)

nxt above is an iterator over the word list. Since we want it to start from the second word, we pull one word out of it with next before using it:

>>> nxt = iter(words)
>>> next(nxt)
'all'
>>> list(nxt)
['the', 'big', 'dogs', 'eat', 'chicken', 'all', 'the', 'small', 'kids', 'eat', 'paste', 'lumps']

Then we pass the original list and the iterator to zip, which returns an iterable of tuples, where each tuple has one item from each:

>>> list(zip(words, nxt))
[('all', 'the'), ('the', 'big'), ('big', 'dogs'), ('dogs', 'eat'), ('eat', 'chicken'), ('chicken', 'all'), ('all', 'the'), ('the', 'small'), ('small', 'kids'), ('kids', 'eat'), ('eat', 'paste'), ('paste', 'lumps')]

Finally, the output of zip is passed to Counter, which counts each pair and returns a dict-like object where the keys are the pairs and the values are the counts:

>>> Counter(zip(words, nxt))
Counter({('all', 'the'): 2, ('eat', 'chicken'): 1, ('big', 'dogs'): 1, ('small', 'kids'): 1, ('kids', 'eat'): 1, ('paste', 'lumps'): 1, ('chicken', 'all'): 1, ('dogs', 'eat'): 1, ('the', 'big'): 1, ('the', 'small'): 1, ('eat', 'paste'): 1})
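
If you want the result in the exact triplelist layout from your question, one way (just a sketch building on the code above, not the only option) is to unpack the Counter items into three parallel lists:

from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)    # skip the first word so the pairs line up
counts = Counter(zip(words, nxt))    # unique (word, next word) pairs with counts

triplelist = [[], [], []]
for (word1, word2), count in counts.items():    # each unique pair appears exactly once here
    triplelist[0].append(word1)
    triplelist[1].append(word2)
    triplelist[2].append(count)

Since Counter keys are unique, the duplicates you wanted to skip never make it into triplelist.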

This might help if you are only looking for the words all and the.

Code:

from collections import Counter
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"],
              ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]
for i in range(len(doublelist)):
    count = Counter(doublelist[i])    # count occurrences of every word in this sublist
    print("List {} - all = {},the = {}".format(i, count['all'], count['the']))

Output:

List 0 - all = 2,the = 2
List 1 - all = 1,the = 2
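
If it is the word pairs rather than the individual words you are after, the same Counter approach can be applied to the two parallel lists directly, since doublelist[0][i] and doublelist[1][i] already form a (word, following word) pair. A sketch along those lines:

pair_count = Counter(zip(doublelist[0], doublelist[1]))    # pair the lists up column by column
print(pair_count[('all', 'the')])    # 2 for the sample data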

So, initially I was going to propose a straightforward approach to generating ngrams:

>>> from collections import Counter
>>> from itertools import chain, islice
>>> from pprint import pprint
>>> def ngram_generator(token_sequence, order):
...     for i in range(len(token_sequence) + 1 - order):
...         yield tuple(token_sequence[i: i + order])
...
>>> counts = Counter(chain.from_iterable(ngram_generator(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
         ('the', 'big'): 2,
         ('chicken', 'all'): 2,
         ('eat', 'paste'): 2,
         ('the', 'small'): 2,
         ('kids', 'eat'): 2,
         ('dogs', 'eat'): 2,
         ('eat', 'chicken'): 2,
         ('small', 'kids'): 2,
         ('big', 'dogs'): 2,
         ('paste', 'lumps'): 1})

But I was inspired by niemmi's answer to write what looks like a more efficient approach, one that generalizes to higher-order ngrams:

>>> def efficient_ngrams(tokens_sequence, n):
...     iterators = []
...     for i in range(n):
...         it = iter(tokens_sequence)
...         tuple(islice(it, 0, i))    # advance this iterator i positions
...         iterators.append(it)
...     yield from zip(*iterators)
...
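
Each of the n iterators is advanced a different number of positions up front with islice, so zip can then walk them in lockstep and yield the ngrams directly, without building shifted copies of the token sequence.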

So, observe:

>>> pprint(list(efficient_ngrams(doublelist[0], 1)))
[('all',),
 ('the',),
 ('big',),
 ('dogs',),
 ('eat',),
 ('chicken',),
 ('all',),
 ('the',),
 ('small',),
 ('kids',),
 ('eat',),
 ('paste',)]
>>> pprint(list(efficient_ngrams(doublelist[0], 2)))
[('all', 'the'),
 ('the', 'big'),
 ('big', 'dogs'),
 ('dogs', 'eat'),
 ('eat', 'chicken'),
 ('chicken', 'all'),
 ('all', 'the'),
 ('the', 'small'),
 ('small', 'kids'),
 ('kids', 'eat'),
 ('eat', 'paste')]
>>> pprint(list(efficient_ngrams(doublelist[0], 3)))
[('all', 'the', 'big'),
 ('the', 'big', 'dogs'),
 ('big', 'dogs', 'eat'),
 ('dogs', 'eat', 'chicken'),
 ('eat', 'chicken', 'all'),
 ('chicken', 'all', 'the'),
 ('all', 'the', 'small'),
 ('the', 'small', 'kids'),
 ('small', 'kids', 'eat'),
 ('kids', 'eat', 'paste')]
>>>

And of course, it still works for what you are trying to accomplish:

>>> counts = Counter(chain.from_iterable(efficient_ngrams(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
         ('the', 'big'): 2,
         ('chicken', 'all'): 2,
         ('eat', 'paste'): 2,
         ('the', 'small'): 2,
         ('kids', 'eat'): 2,
         ('dogs', 'eat'): 2,
         ('eat', 'chicken'): 2,
         ('small', 'kids'): 2,
         ('big', 'dogs'): 2,
         ('paste', 'lumps'): 1})
>>>
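
And since the end goal is a weighted Markov chain generator, here is a rough sketch (my own illustration with made-up names, not part of the question's code) of how these pair counts could drive a weighted pick of the next word:

import random
from collections import defaultdict

# group pair counts by first word: word -> ([followers], [their counts])
transitions = defaultdict(lambda: ([], []))
for (word1, word2), count in counts.items():    # counts is the Counter built above
    transitions[word1][0].append(word2)
    transitions[word1][1].append(count)

def next_word(word):
    # pick a follower of `word`, weighted by how often that pair occurred
    followers, weights = transitions[word]
    return random.choices(followers, weights=weights)[0]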