Find occurrences of a value in a numpy array and assign it appropriate weights

Question

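I have a text file with close to 1 million lines. It has 2 columns. Column 1 contains numbers from 0 to 99, and column 2 contains one of 4 sizes: S, M, L, or XL. The numbers from 0 to 99 keep repeating throughout the 1 million lines with different sizes, as shown below: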

11 S
19 S
19 M
19 M
63 L
14 S
11 L
63 XL
14 S
11 L
63 XL

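My objective is to find one final size for each number. The plan of action is to find every occurrence of each number, find the size at each occurrence, and then assign the number the size that occurs most often for it.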

Expected output:

11 L
14 S
19 M
63 XL

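Because of the size of the data set, I am looking at numpy, not that I have any prior experience with it. I have started by creating a basic numpy array as follows: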

import numpy as np

data = np.loadtxt('size_data.txt')

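This does create a numpy array. However, from the documentation I have read so far, there does not seem to be a straightforward way to accomplish what I want to do. Can someone give me some pointers on how to move forward?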

Answer 1

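We can do this by applying numpy.unique to a reversed version of the first column read from the file. The reversal is needed because otherwise numpy.unique (with return_index=True) returns the index of the first occurrence of each item, searching from the start; on the reversed column it finds each number's last occurrence instead.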

>>> arr = np.loadtxt('foo.txt', dtype=object)
>>> _, indices = np.unique(arr[:, 0][::-1], return_index=True)
>>> arr[::-1][indices]
array([['11', 'L'],
       ['14', 'S'],
       ['19', 'M'],
       ['63', 'XL']], dtype=object)
# or
>>> arr[len(arr) - indices - 1]
array([['11', 'L'],
       ['14', 'S'],
       ['19', 'M'],
       ['63', 'XL']], dtype=object)
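
Note that the snippet above keeps the size at each number's last occurrence, which happens to coincide with the most frequent size in the sample data. If the most frequent size is what is actually needed, one possible sketch in plain numpy is to count (number, size) pairs and take the per-number argmax. The inline sample rows and the helper name most_common_size are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

# Sample rows from the question, inlined so the sketch runs without a file
rows = np.array([
    ['11', 'S'], ['19', 'S'], ['19', 'M'], ['19', 'M'], ['63', 'L'], ['14', 'S'],
    ['11', 'L'], ['63', 'XL'], ['14', 'S'], ['11', 'L'], ['63', 'XL'],
])

def most_common_size(rows):
    """Hypothetical helper: return (numbers, sizes), one most frequent size per number."""
    # Encode each column as small integer codes
    numbers, num_codes = np.unique(rows[:, 0].astype(int), return_inverse=True)
    sizes, size_codes = np.unique(rows[:, 1], return_inverse=True)
    # counts[i, j] = how often numbers[i] appears together with sizes[j]
    counts = np.zeros((len(numbers), len(sizes)), dtype=int)
    np.add.at(counts, (num_codes, size_codes), 1)
    # argmax returns the first maximum, so ties go to the alphabetically first size
    return numbers, sizes[counts.argmax(axis=1)]

numbers, best = most_common_size(rows)
for n, s in zip(numbers, best):
    print(n, s)
```

For the real file, rows could presumably come from np.loadtxt with dtype=str; the count table has at most 100 x 4 cells, so it stays small even with a million lines.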

Answer 2

Use pandas. You need to groupby('Id') and then aggregate each group with a custom aggregation; collections.Counter.most_common(n=1) is very useful in this case:

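import pandas as pd
from collections import Counter

# Read the two whitespace-separated columns; 'Id' and 'Size' match the output below
dat = pd.read_csv('size_data.txt', sep=r'\s+', header=None, names=['Id', 'Size'])

# For each Id, keep the Size that occurs most often
dat.groupby('Id').aggregate(lambda grp: Counter(grp).most_common(1)[0][0])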

   Size
Id     
11    L
14    S
19    M
63   XL
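
The same aggregation can probably also be expressed without Counter, for example by building a count table with pd.crosstab and picking the column with the largest count per row. This is only a sketch of an alternative: the inline DataFrame stands in for the file, and the column names Id and Size are taken from the output above.

```python
import pandas as pd

# Sample rows from the question, inlined so the sketch runs without a file
dat = pd.DataFrame({
    'Id':   [11, 19, 19, 19, 63, 14, 11, 63, 14, 11, 63],
    'Size': ['S', 'S', 'M', 'M', 'L', 'S', 'L', 'XL', 'S', 'L', 'XL'],
})

# Rows are Ids, columns are Sizes, cells are occurrence counts
counts = pd.crosstab(dat['Id'], dat['Size'])

# idxmax(axis=1) returns, for each Id, the Size column holding the largest count
result = counts.idxmax(axis=1)
print(result)
```

Because the whole table is counted in one pass, this avoids calling a Python lambda once per group, which may matter at a million rows.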
