How can I use pandas explode, crosstab and count the number of occurrences for a large dataset? (7MM rows)
I have a large dataset with 7 million rows and 2 columns. The second column contains a list of distinct cases per row, and the number of cases is quite large (after running my algorithm on just the first 10,000 rows, I already found 27k+ cases).
Here is a representative example of what I have and the result I am looking for:
My initial DataFrame:
import pandas as pd

# DataFrame.append was removed in pandas 2.0; build the frame directly instead
df = pd.DataFrame([
    {"id": 1, "listElements": ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
    {"id": 2, "listElements": ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]},
    {"id": 3, "listElements": ["steak", "beef", "[beef, potatoes]", "banana"]},
])
print(df)
# id listElements
#0 1 [apple, peer, [apple, peer], banana, chocolate...
#1 2 [ginger, peer, [ginger, sugar], tofu, [tofu, v...
#2 3 [steak, beef, [beef, potatoes], banana]
My end goal:
For each element (or group of elements), get the number of occurrences and the ids where it occurs.
What I am doing now:
Explode the second column, then use crosstab like this:
df2 = df['listElements'].explode()
df = df[['id',]].join(pd.crosstab(df2.index, df2, colnames=['listElements']))
print(df)
#which gives me:
# id [apple, peer] [beef, potatoes] [chocolate, apple] [ginger, sugar] ... chocolate ginger peer steak tofu
#0 1 1 0 1 0 ... 1 0 1 0 0
#1 2 0 0 0 1 ... 1 1 1 0 1
#2 3 0 1 0 0 ... 0 0 0 1 0
Then I was thinking of aggregating the results to get a count per element, keeping the ids for later investigation.
My problem:
The data has about 7 million rows, and I suspect there are around 100,000 distinct element types!
I am fairly sure my computer will run out of memory on such a dataset and/or it will take very long to process!
2 questions:
- Is there a more direct and faster way to produce my result (maybe I am performing some unnecessary steps)?
- How can I avoid memory or speed issues? Could I speed up the algorithm by converting the elements to numbers, or by running it in batches and joining the results afterwards?
Any insight is very welcome!!! If anything is unclear, feel free to ask me for more details!
Consider using a long rather than a wide representation. Here is a convtools-based example:
from convtools import conversion as c
input_data = [
{ "id": 1, "listElements": [ "apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]", ], },
{ "id": 2, "listElements": [ "ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ], },
{ "id": 3, "listElements": [ "steak", "beef", "[beef, potatoes]", "banana", ], },
]
# the ad hoc converter function is generated once at startup and can be reused
converter = (
c.iter(
c.zip(
c.repeat(c.item("id")),
c.item("listElements"),
)
)
.flatten()
.pipe(
c.group_by(c.item(1)).aggregate(
{
"ingredient": c.item(1),
"ids": c.ReduceFuncs.Array(c.item(0)),
"count": c.ReduceFuncs.Count(),
}
)
)
.gen_converter()
)
result = converter(input_data)
assert result == [
{"ingredient": "apple", "ids": [1], "count": 1},
{"ingredient": "peer", "ids": [1, 2], "count": 2},
{"ingredient": "[apple, peer]", "ids": [1], "count": 1},
{"ingredient": "banana", "ids": [1, 3], "count": 2},
{"ingredient": "chocolate", "ids": [1, 2], "count": 2},
{"ingredient": "[chocolate, apple]", "ids": [1], "count": 1},
{"ingredient": "ginger", "ids": [2], "count": 1},
{"ingredient": "[ginger, sugar]", "ids": [2], "count": 1},
{"ingredient": "tofu", "ids": [2], "count": 1},
{"ingredient": "[tofu, veggie]", "ids": [2], "count": 1},
{"ingredient": "steak", "ids": [3], "count": 1},
{"ingredient": "beef", "ids": [3], "count": 1},
{"ingredient": "[beef, potatoes]", "ids": [3], "count": 1},
]
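For comparison, the same long-format aggregation can be sketched in plain pandas: explode to one (id, element) pair per row, then group by element and collect ids and counts. This avoids the wide crosstab entirely, so memory scales with the number of (id, element) pairs rather than rows × distinct elements.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "listElements": [
        ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"],
        ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"],
        ["steak", "beef", "[beef, potatoes]", "banana"],
    ],
})

# Long form: one (id, element) pair per row.
long = df.explode("listElements")

# Group by element; collect the ids and count occurrences.
result = (
    long.groupby("listElements", sort=False)["id"]
    .agg(ids=list, count="size")
    .reset_index()
)
print(result)
```

Each row of `result` holds one element, the list of ids it appears in, and its count, matching the convtools output above.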
It may also make sense to build a dict, so the result is easy to query by ingredient:
converter = (
c.iter(
c.zip(
c.repeat(c.item("id")),
c.item("listElements"),
)
)
.flatten()
.pipe(
c.group_by(c.item(1)).aggregate(
(
c.item(1),
c.ReduceFuncs.Array(c.item(0)),
)
)
)
.as_type(dict)
.gen_converter()
)
result = converter(input_data)
assert result == {
"apple": [1],
"peer": [1, 2],
"[apple, peer]": [1],
"banana": [1, 3],
"chocolate": [1, 2],
"[chocolate, apple]": [1],
"ginger": [2],
"[ginger, sugar]": [2],
"tofu": [2],
"[tofu, veggie]": [2],
"steak": [3],
"beef": [3],
"[beef, potatoes]": [3],
}
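Regarding the batching question: since the aggregation is a simple fold, the long representation also lends itself to streaming. A minimal sketch, assuming rows can be read chunk by chunk (e.g. via `pd.read_csv(..., chunksize=...)` after parsing the list column) as `(id, elements)` pairs, accumulates everything in plain dicts so memory is bounded by the result size, not the input size:

```python
from collections import defaultdict

def accumulate(rows, element_to_ids=None):
    """Fold (id, elements) pairs into an element -> list-of-ids mapping.

    Pass the returned mapping back in for the next chunk to process
    the data in batches.
    """
    if element_to_ids is None:
        element_to_ids = defaultdict(list)
    for row_id, elements in rows:
        for element in elements:
            element_to_ids[element].append(row_id)
    return element_to_ids

# Example with the toy data; in practice `rows` would come from a chunked reader.
rows = [
    (1, ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]),
    (2, ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]),
    (3, ["steak", "beef", "[beef, potatoes]", "banana"]),
]
mapping = accumulate(rows)
counts = {element: len(ids) for element, ids in mapping.items()}
```

With ~100k distinct elements the final dict stays small relative to 7M rows, and no 7M × 100k wide table is ever materialized.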