How can I use pandas explode and crosstab to count the number of occurrences in a large dataset (7 million rows)?

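I have a large dataset with 7 million rows and 2 columns. The second column contains lists of different cases, and the number of distinct cases will be fairly large (after running my algorithm on the first 10,000 rows, I found 27k+ cases).

Here is a representative example of what I have and of the result I am looking for:

My initial dataframe: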

import pandas as pd

# Build the frame directly from a list of records, one dict per row.
df = pd.DataFrame([
    {"id": 1, "listElements": ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
    {"id": 2, "listElements": ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]},
    {"id": 3, "listElements": ["steak", "beef", "[beef, potatoes]", "banana"]},
])

print(df)
#  id                                       listElements
#0  1  [apple, peer, [apple, peer], banana, chocolate...
#1  2  [ginger, peer, [ginger, sugar], tofu, [tofu, v...
#2  3            [steak, beef, [beef, potatoes], banana]

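My end goal: for each element (or group of elements), get the number of occurrences and the ids in which it occurs.

What I am doing now: exploding the second column, then using crosstab like this: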

df2 = df['listElements'].explode()
df = df[['id',]].join(pd.crosstab(df2.index, df2, colnames=['listElements']))
print(df)

#which gives me:
#  id  [apple, peer]  [beef, potatoes]  [chocolate, apple]  [ginger, sugar]  ...  chocolate  ginger  peer  steak  tofu     
#0  1              1                 0                   1                0  ...          1       0     1      0     0     
#1  2              0                 0                   0                1  ...          1       1     1      0     1     
#2  3              0                 1                   0                0  ...          0       0     0      1     0

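I was then thinking of aggregating the result to get the count for each kind of element, keeping the ids for later investigation.

My problem: the data has about 7 million rows, and I suspect there are about 100,000 distinct kinds of elements!

I am pretty sure my computer will run out of memory on such a dataset, and/or it will take a very long time to process!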
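
To put a rough number on that worry, here is a back-of-envelope estimate of the dense wide table. The row and column counts are the figures quoted above; the 8 bytes per cell is an assumption (one 64-bit integer per cell), so treat the result as an order of magnitude only.

```python
# Rough size of a dense wide (crosstab-style) table.
rows = 7_000_000       # number of rows, as stated in the question
columns = 100_000      # suspected number of distinct elements, as stated
bytes_per_cell = 8     # assumption: one 64-bit integer per cell

dense_bytes = rows * columns * bytes_per_cell
dense_terabytes = dense_bytes / 1e12
print(f"about {dense_terabytes:.1f} TB for the dense table")
```

Almost all of those cells would be zeros, since each id only lists a handful of elements.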

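Two questions:

  1. Is there a more direct, faster way to produce my result (maybe I am performing some unnecessary steps)?
  2. How can I avoid memory or speed problems? Could I speed the algorithm up by converting the elements to numbers? Or by running it in batches before joining the results?

Any insight is very welcome! If anything is unclear, feel free to ask me for more information.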

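Consider using a long rather than a wide representation, i.e. one entry per element instead of one column per element. Here is a convtools-based example: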

from convtools import conversion as c


input_data = [
    { "id": 1, "listElements": [ "apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]", ], },
    { "id": 2, "listElements": [ "ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ], },
    { "id": 3, "listElements": [ "steak", "beef", "[beef, potatoes]", "banana", ], },
]

# generated ad hoc converter function; run on startup and reuse further
converter = (
    c.iter(
        c.zip(
            c.repeat(c.item("id")),
            c.item("listElements"),
        )
    )
    .flatten()
    .pipe(
        c.group_by(c.item(1)).aggregate(
            {
                "ingredient": c.item(1),
                "ids": c.ReduceFuncs.Array(c.item(0)),
                "count": c.ReduceFuncs.Count(),
            }
        )
    )
    .gen_converter()
)

result = converter(input_data)

assert result == [
    {"ingredient": "apple", "ids": [1], "count": 1},
    {"ingredient": "peer", "ids": [1, 2], "count": 2},
    {"ingredient": "[apple, peer]", "ids": [1], "count": 1},
    {"ingredient": "banana", "ids": [1, 3], "count": 2},
    {"ingredient": "chocolate", "ids": [1, 2], "count": 2},
    {"ingredient": "[chocolate, apple]", "ids": [1], "count": 1},
    {"ingredient": "ginger", "ids": [2], "count": 1},
    {"ingredient": "[ginger, sugar]", "ids": [2], "count": 1},
    {"ingredient": "tofu", "ids": [2], "count": 1},
    {"ingredient": "[tofu, veggie]", "ids": [2], "count": 1},
    {"ingredient": "steak", "ids": [3], "count": 1},
    {"ingredient": "beef", "ids": [3], "count": 1},
    {"ingredient": "[beef, potatoes]", "ids": [3], "count": 1},
]
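
For readers who would rather stay in pandas, roughly the same long-format result can be sketched with `explode` followed by a `groupby`. This is not part of the convtools approach above, just a minimal equivalent on the sample data; the names `long_df` and `summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"id": 1, "listElements": ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
    {"id": 2, "listElements": ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]},
    {"id": 3, "listElements": ["steak", "beef", "[beef, potatoes]", "banana"]},
])

# Long representation: one row per (id, element) pair.
long_df = df.explode("listElements")

# One row per element, with the ids it occurs in and how many ids there are.
summary = (
    long_df.groupby("listElements")["id"]
    .agg(ids=list, count="count")
    .reset_index()
)
print(summary)
```

The long table has one row per (id, element) pair, so its size grows with the total number of list items rather than with rows times distinct elements.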

It also makes sense to build a dict, so that the result can easily be queried by ingredient:

converter = (
    c.iter(
        c.zip(
            c.repeat(c.item("id")),
            c.item("listElements"),
        )
    )
    .flatten()
    .pipe(
        c.group_by(c.item(1)).aggregate(
            (
                c.item(1),
                c.ReduceFuncs.Array(c.item(0)),
            )
        )
    )
    .as_type(dict)
    .gen_converter()
)

result = converter(input_data)
assert result == {
    "apple": [1],
    "peer": [1, 2],
    "[apple, peer]": [1],
    "banana": [1, 3],
    "chocolate": [1, 2],
    "[chocolate, apple]": [1],
    "ginger": [2],
    "[ginger, sugar]": [2],
    "tofu": [2],
    "[tofu, veggie]": [2],
    "steak": [3],
    "beef": [3],
    "[beef, potatoes]": [3],
}
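
On the batching question: a dict keyed by ingredient can be filled incrementally, so the rows never need to be loaded all at once. Below is a minimal sketch in plain Python, without convtools. The helper names `iter_batches` and `index_ids_by_element` and the batch size are hypothetical, and it assumes the rows can be read as an iterable of dicts shaped like `input_data`.

```python
from collections import defaultdict
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows (hypothetical helper)."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def index_ids_by_element(rows, batch_size=2):
    """Map each element to the list of ids whose list contains it."""
    ids_by_element = defaultdict(list)
    for batch in iter_batches(rows, batch_size):
        # Only the current batch and the accumulated dict are held in memory.
        for row in batch:
            for element in row["listElements"]:
                ids_by_element[element].append(row["id"])
    return dict(ids_by_element)


rows = [
    {"id": 1, "listElements": ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
    {"id": 2, "listElements": ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]},
    {"id": 3, "listElements": ["steak", "beef", "[beef, potatoes]", "banana"]},
]

result = index_ids_by_element(rows)
counts = {element: len(ids) for element, ids in result.items()}
```

The accumulated dict still stores one id per (id, element) pair, so this bounds the memory used for reading the input, not the size of the final index.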