How can I use pandas explode, crosstab and count the number of occurrences for a large dataset? (7MM rows)
I have a large dataset with 7 million rows and 2 columns. The second column contains a list of distinct cases per row, and the number of cases is quite large (after running my algorithm on just the first 10,000 rows, I already found 27k+ cases).
Here is a representative example of what I have and the result I am looking for:
My initial DataFrame:
import pandas as pd

# DataFrame.append was removed in pandas 2.0; build the frame directly instead
df = pd.DataFrame([
    {"id": 1, "listElements": ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
    {"id": 2, "listElements": ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]},
    {"id": 3, "listElements": ["steak", "beef", "[beef, potatoes]", "banana"]},
])
print(df)
# id listElements
#0 1 [apple, peer, [apple, peer], banana, chocolate...
#1 2 [ginger, peer, [ginger, sugar], tofu, [tofu, v...
#2 3 [steak, beef, [beef, potatoes], banana]
My end goal:
For each element (or group of elements), get the number of occurrences and the ids where it occurs.
What I am doing now:
Explode the second column, then use crosstab like this:
df2 = df['listElements'].explode()
df = df[['id',]].join(pd.crosstab(df2.index, df2, colnames=['listElements']))
print(df)
#which gives me:
# id [apple, peer] [beef, potatoes] [chocolate, apple] [ginger, sugar] ... chocolate ginger peer steak tofu
#0 1 1 0 1 0 ... 1 0 1 0 0
#1 2 0 0 0 1 ... 1 1 1 0 1
#2 3 0 1 0 0 ... 0 0 0 1 0
Then I was thinking of aggregating the results to get a count per element, keeping the ids for later investigation.
My problem:
The data has about 7 million rows, and I suspect there are around 100,000 distinct element types!
I am fairly sure my computer will run out of memory on such a dataset and/or it will take very long to process!
2 questions:
- Is there a more direct and faster way to produce my result (maybe I am performing some unnecessary steps)?
- How can I avoid memory or speed issues? Could I speed up the algorithm by converting the elements to numbers, or by running it in batches and joining the results afterwards?
Any insight is very welcome!!! If anything is unclear, feel free to ask me for more details!
Consider using a long rather than a wide representation. Here is a convtools-based example:
from convtools import conversion as c
input_data = [
{ "id": 1, "listElements": [ "apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]", ], },
{ "id": 2, "listElements": [ "ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ], },
{ "id": 3, "listElements": [ "steak", "beef", "[beef, potatoes]", "banana", ], },
]
# the ad hoc converter function is generated once at startup and can be reused
converter = (
c.iter(
c.zip(
c.repeat(c.item("id")),
c.item("listElements"),
)
)
.flatten()
.pipe(
c.group_by(c.item(1)).aggregate(
{
"ingredient": c.item(1),
"ids": c.ReduceFuncs.Array(c.item(0)),
"count": c.ReduceFuncs.Count(),
}
)
)
.gen_converter()
)
result = converter(input_data)
assert result == [
{"ingredient": "apple", "ids": [1], "count": 1},
{"ingredient": "peer", "ids": [1, 2], "count": 2},
{"ingredient": "[apple, peer]", "ids": [1], "count": 1},
{"ingredient": "banana", "ids": [1, 3], "count": 2},
{"ingredient": "chocolate", "ids": [1, 2], "count": 2},
{"ingredient": "[chocolate, apple]", "ids": [1], "count": 1},
{"ingredient": "ginger", "ids": [2], "count": 1},
{"ingredient": "[ginger, sugar]", "ids": [2], "count": 1},
{"ingredient": "tofu", "ids": [2], "count": 1},
{"ingredient": "[tofu, veggie]", "ids": [2], "count": 1},
{"ingredient": "steak", "ids": [3], "count": 1},
{"ingredient": "beef", "ids": [3], "count": 1},
{"ingredient": "[beef, potatoes]", "ids": [3], "count": 1},
]
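For comparison, the same long-format aggregation can be sketched in plain pandas: explode to one (id, element) pair per row, then group by element and collect ids and counts. This avoids the wide crosstab entirely, so memory scales with the number of (id, element) pairs rather than rows × distinct elements.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "listElements": [
        ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"],
        ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"],
        ["steak", "beef", "[beef, potatoes]", "banana"],
    ],
})

# Long form: one (id, element) pair per row.
long = df.explode("listElements")

# Group by element; collect the ids and count occurrences.
result = (
    long.groupby("listElements", sort=False)["id"]
    .agg(ids=list, count="size")
    .reset_index()
)
print(result)
```

Each row of `result` holds one element, the list of ids it appears in, and its count, matching the convtools output above.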
It may also make sense to build a dict, so the result is easy to query by ingredient:
converter = (
c.iter(
c.zip(
c.repeat(c.item("id")),
c.item("listElements"),
)
)
.flatten()
.pipe(
c.group_by(c.item(1)).aggregate(
(
c.item(1),
c.ReduceFuncs.Array(c.item(0)),
)
)
)
.as_type(dict)
.gen_converter()
)
result = converter(input_data)
assert result == {
"apple": [1],
"peer": [1, 2],
"[apple, peer]": [1],
"banana": [1, 3],
"chocolate": [1, 2],
"[chocolate, apple]": [1],
"ginger": [2],
"[ginger, sugar]": [2],
"tofu": [2],
"[tofu, veggie]": [2],
"steak": [3],
"beef": [3],
"[beef, potatoes]": [3],
}
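Regarding the batching question: since the aggregation is a simple fold, the long representation also lends itself to streaming. A minimal sketch, assuming rows can be read chunk by chunk (e.g. via `pd.read_csv(..., chunksize=...)` after parsing the list column) as `(id, elements)` pairs, accumulates everything in plain dicts so memory is bounded by the result size, not the input size:

```python
from collections import defaultdict

def accumulate(rows, element_to_ids=None):
    """Fold (id, elements) pairs into an element -> list-of-ids mapping.

    Pass the returned mapping back in for the next chunk to process
    the data in batches.
    """
    if element_to_ids is None:
        element_to_ids = defaultdict(list)
    for row_id, elements in rows:
        for element in elements:
            element_to_ids[element].append(row_id)
    return element_to_ids

# Example with the toy data; in practice `rows` would come from a chunked reader.
rows = [
    (1, ["apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]),
    (2, ["ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate"]),
    (3, ["steak", "beef", "[beef, potatoes]", "banana"]),
]
mapping = accumulate(rows)
counts = {element: len(ids) for element, ids in mapping.items()}
```

With ~100k distinct elements the final dict stays small relative to 7M rows, and no 7M × 100k wide table is ever materialized.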