
Find and count all occurrences and position of numbers in a range in a list

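I want to know how many times each number appears at each index position in a list of 6-number sets. I don't know in advance which numbers will appear, only that they are in the range 0-99.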


data = [['22', '45', '6', '72', '1', '65'], ['2', '65', '67', '23', '98', '1'], ['13', '45', '98', '4', '12', '65']]

Eventually I will put the resulting counts into a pandas DataFrame that looks something like this:

num  numofoccurances  position  numoftimesinposition
01   02               04        01
01   02               05        01
02   01               00        01
04   01               03        01
06   01               02        01
12   01               04        01
13   01               00        01
and so on...

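The real result will look a little different, because a num is repeated on a new row for each different index position it occurs in, but hopefully this helps you understand what I'm looking for.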


import json

# f is an already opened file handle for the JSON file
data = json.load(f)
numbers = []
contains = []

# This section simply takes the data from the JSON file and puts it all into
# a list of lists, each containing the 6 elements I need
for i in data['data']:
    item = [i[9], i[10]]
    item = [words for segments in item for words in segments.split()]
    numbers.append(item)

# This is my attempt to count the number of occurrences of each number in the
# range and then add it to a list. It is unfinished: I am stuck at the if.
x = range(1, 99)
for i in numbers:
    if x in i and not contains:
        pass

import pandas as pd

# one (number, position) pair per value in the list of lists
num_pos = [(num, pos) for i in data for pos, num in enumerate(i)]
df = pd.DataFrame(num_pos, columns=['number', 'position']).assign(numoftimesinposition=1)

# count how many times each number appears in each position
df = df.astype(int).groupby(['number', 'position']).count().reset_index()

# total occurrences per number, merged back onto the per-position counts
df1 = df.groupby('number').numoftimesinposition.sum().reset_index().\
    rename(columns={'numoftimesinposition': 'numofoccurences'}).\
    merge(df, on='number')

    number  numofoccurences  position  numoftimesinposition
0        1                2         4                     1
1        1                2         5                     1
4        2                1         0                     1
7        4                1         3                     1
9        6                1         2                     1
2       12                1         4                     1
3       13                1         0                     1
5       22                1         0                     1
6       23                1         3                     1
8       45                2         1                     2
10      65                3         1                     1
11      65                3         5                     2
12      67                1         2                     1
13      72                1         3                     1
14      98                2         2                     1
15      98                2         4                     1

If the above code feels slow, then use Counter from collections:

import pandas as pd
from collections import Counter

num_pos = [(int(num),pos) for i in data for pos,num in enumerate(i)]

count_data = [(num,pos,occurence) for (num,pos), occurence in Counter(num_pos).items()]

df = pd.DataFrame(count_data, columns = ['num','pos','occurence']).sort_values(by='num')

# total occurrences of each number, summed over all of its positions
df['total_occurence'] = df.groupby('num')['occurence'].transform('sum')
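
For checking the numbers without pandas, here is a minimal sketch of the same Counter idea in plain Python, run on the sample data from the question. The function name `count_rows` and the tuple layout are my own choices for illustration, and I am assuming the total should be the number of times a number appears across all positions.

```python
from collections import Counter

data = [['22', '45', '6', '72', '1', '65'],
        ['2', '65', '67', '23', '98', '1'],
        ['13', '45', '98', '4', '12', '65']]


def count_rows(rows):
    """Return sorted (num, total, position, times_in_position) tuples.

    count_rows is a hypothetical helper, not part of any library.
    """
    # Count each (number, position) pair once per appearance.
    pair_counts = Counter(
        (int(num), pos) for row in rows for pos, num in enumerate(row)
    )
    # Total appearances of each number, summed over all positions.
    totals = Counter()
    for (num, _pos), n in pair_counts.items():
        totals[num] += n
    return sorted(
        (num, totals[num], pos, n) for (num, pos), n in pair_counts.items()
    )


result = count_rows(data)
for row in result:
    print(row)
```

Each tuple can be passed straight to `pd.DataFrame(result, columns=[...])` if a DataFrame is wanted at the end.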

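This should solve your query. It should be faster than groupby, which is very slow (and you would need two of them), and faster than other pandas operations on larger data: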

import numpy as np
import pandas as pd

# get the list of lists into a 2D numpy array
dd = np.array(data).astype(int)

# get vocab of all unique numbers
vocab = np.unique(dd.flatten())

# loop through vocab and sum the occurrences in each index position
df = pd.DataFrame([[i] + list(np.sum((dd == i).astype(int), axis=0)) for i in vocab])

# rename cols
df.columns = ['num', 0, 1, 2, 3, 4, 5]

# total occurrences of each number
df['numoccurances'] = df.iloc[:, 1:].sum(axis=1)

# stack the position counts and rename cols
stats = (df.set_index(['num', 'numoccurances'])
           .stack()
           .reset_index()
           .set_axis(['num', 'numoccurances', 'position', 'numtimesinposition'], axis=1))

# keep only rows with occurrences
stats = stats[stats['numtimesinposition'] > 0].reset_index(drop=True)

    num  numoccurances  position  numtimesinposition
0     1              2         4                   1
1     1              2         5                   1
2     2              1         0                   1
3     4              1         3                   1
4     6              1         2                   1
5    12              1         4                   1
6    13              1         0                   1
7    22              1         0                   1
8    23              1         3                   1
9    45              2         1                   2
10   65              3         1                   1
11   65              3         5                   2
12   67              1         2                   1
13   72              1         3                   1
14   98              2         2                   1
15   98              2         4                   1
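
Since the question says the values always fall in 0-99, the counts can also be collected into a fixed-size table in a single pass, with no need to find the unique values first. The sketch below shows that idea in plain Python so it runs without numpy or pandas. The name `position_grid` and the `max_value` parameter are hypothetical, and I make no claim about how its speed compares with the code above.

```python
data = [['22', '45', '6', '72', '1', '65'],
        ['2', '65', '67', '23', '98', '1'],
        ['13', '45', '98', '4', '12', '65']]


def position_grid(rows, max_value=99):
    """Return a grid where grid[num][pos] counts num at index pos.

    position_grid is a hypothetical helper; max_value=99 reflects the
    0-99 range stated in the question.
    """
    width = max(len(row) for row in rows)
    grid = [[0] * width for _ in range(max_value + 1)]
    for row in rows:
        for pos, num in enumerate(row):
            grid[int(num)][pos] += 1
    return grid


grid = position_grid(data)

# Long-form rows (num, total, position, times_in_position), skipping zeros.
stats = [(num, sum(counts), pos, n)
         for num, counts in enumerate(grid)
         for pos, n in enumerate(counts) if n > 0]

for row in stats:
    print(row)
```

The grid is the wide form (one row per number, one column per position) and `stats` is the long form that matches the table asked for in the question.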

