Grouping similar items in Pandas

Question

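I'm trying to do something, and I'd like to know whether it can be done in Pandas or whether there is a better tool for the job (at the moment I'm just using plain Python for it). Here is the starting data: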

# We have a listing of files for the movie Titanic
# And we want to break them into groups of similar titles,
# To see which of those are possible duplicates.
import pandas as pd
titanic_files = [
    {"File": "Titanic_HD2398.mov",  "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov",        "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov",        "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic.mov",         "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
    {"File": "MY_HD2398.mov",       "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102}
]
df = pd.DataFrame(titanic_files)

I want to group these files by similar data, without ever collapsing the row-level data. For example:

Step 1 -- Group by Resolution


---- HD ----
File               Resolution             FrameRate              RunTime
Titanic_HD2398.mov HD                     23.98                  102
Titanic1.mov       HD                     23.98                  102
Titanic.mov        HD                     24.00                  103
MY_HD2398.mov      HD                     23.98                  102

---- SD ----
File               Resolution             FrameRate              RunTime
Titanic1.mov       SD                     23.98                  102

Step 2 -- Group by FrameRate

---- HD -----------------------
 +----------- 23.98 ------------
File               Resolution             FrameRate              RunTime
Titanic_HD2398.mov HD                     23.98                  102
Titanic1.mov       HD                     23.98                  102
MY_HD2398.mov      HD                     23.98                  102

 +----------- 24.00 ------------
File               Resolution             FrameRate              RunTime
Titanic.mov        HD                     24.00                  103


---- SD -----------------------
 + ---------- 23.98 ------------

File               Resolution             FrameRate              RunTime
Titanic1.mov       SD                     23.98                  102

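In the end, I basically want a separate DataFrame for each of the smallest groupings. In Python, I'm currently doing this with the following data structure: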

{
   'GroupingKeys': [{File1WithinThatBucket}, {File2WithinThatBucket}, ...]
}

For example:

{
   'HD+23.98': [{'File': ...}],
   'HD+24.00': [{'File': ...}]
}

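Also, keep in mind that I need to group on about 10-15 fields; I've only included two in the question above, so the approach needs to be very general. In addition, some match criteria are not exact: for example, Runtime might be bucketed to +/- 2 seconds, some values may be empty, and so on.

Back to the original question: can something like this be done in Pandas, and if so, how?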

Answer 1

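Pandas' groupby looks like the tool to use. It accepts as many groupers as you need, and each one can be a list, a Series, a column name, an index level, a callable... you name it.

For example, you could do this: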
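As a minimal sketch of those grouper types, the snippet below groups a small made-up frame by a column name, by a derived Series, and by a callable. The sample data and the names `rounded`, `by_name`, `by_series` and `by_callable` are assumptions for illustration only; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical sample data, only for illustrating grouper types.
df = pd.DataFrame({
    "File": ["a.mov", "b.mov", "c.mov", "d.mov"],
    "Resolution": ["HD", "SD", "HD", "HD"],
    "FrameRate": [23.98, 23.98, 24.00, 23.98],
})

# 1. A column name: exact match on the column's values.
by_name = df.groupby("Resolution").size()

# 2. A Series aligned with df's index: here the frame rate rounded
#    to the nearest whole number, mixed with a column name in one list.
rounded = df["FrameRate"].round().rename("RoundedRate")
by_series = df.groupby(["Resolution", rounded]).size()

# 3. A callable: pandas calls it on each index label of df.
by_callable = df.groupby(lambda idx: idx % 2).size()

print(by_name)
print(by_series)
print(by_callable)
```

Mixing grouper kinds in one list is what makes this approach scale to many fields: exact-match fields can be passed by name, while fuzzy fields can be passed as a derived Series.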

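# Assign to a new name so the original df stays available afterwards.
grouped = df.groupby(
    [
        'Resolution',                                     # exact match on a column
        df.FrameRate // 0.02 * 0.02,                      # frame rate floored to 0.02 steps
        pd.cut(df.Runtime, bins=[45, 90, 95, 100, 120])   # runtime binned into ranges
    ]
).File.apply(list)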

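This returns a Series with a unique 3-level MultiIndex (one level per grouper), where each entry holds a list of file names.

If, for some reason, you would rather split the df into several pieces and keep it that way, you can also get the full rows of each group.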
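The question also mentions empty values. By default, groupby drops rows whose grouping key is missing; passing `dropna=False` (available in pandas 1.1 and later) keeps them as their own group. The sketch below uses made-up data and an assumed fixed 5-second bucket width; note that fixed buckets only approximate a "+/- 2 seconds" tolerance, since two values close to a bucket edge can still land in different buckets.

```python
import pandas as pd

# Hypothetical sample data with one missing Resolution.
df = pd.DataFrame({
    "File": ["a.mov", "b.mov", "c.mov", "d.mov"],
    "Resolution": ["HD", "HD", "HD", None],
    "Runtime": [100, 102, 107, 102],
})

# Floor each runtime to the start of its 5-second-wide bucket
# (the width of 5 is an assumption chosen for illustration).
runtime_bucket = (df["Runtime"] // 5 * 5).rename("RuntimeBucket")

# Default behaviour: the row with a missing Resolution is dropped.
default_groups = df.groupby(["Resolution", runtime_bucket]).File.apply(list)

# dropna=False keeps rows with missing keys in a group of their own.
with_missing = df.groupby(
    ["Resolution", runtime_bucket], dropna=False
).File.apply(list)

print(default_groups)
print(with_missing)
```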

for group_id, group_rows in df.groupby(...):
    # group_id is a tuple holding one unique combination of the grouping values
    # group_rows is a DataFrame of the matching rows, with the same columns as df
    print(group_id, len(group_rows))
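To get something close to the dict-of-buckets structure described in the question, one option is to collect the groups from that loop into a dictionary of DataFrames. This is a sketch on the question's own data, grouping on just two columns; the name `buckets` is an assumption for illustration.

```python
import pandas as pd

titanic_files = [
    {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov",       "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov",       "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic.mov",        "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
    {"File": "MY_HD2398.mov",      "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
]
df = pd.DataFrame(titanic_files)

# One DataFrame per (Resolution, FrameRate) combination.
# Each key is a tuple such as ("HD", 23.98); rows are never collapsed.
buckets = {
    key: rows.reset_index(drop=True)
    for key, rows in df.groupby(["Resolution", "FrameRate"])
}

for key, rows in buckets.items():
    print(key)
    print(rows, end="\n\n")
```

Adding more fields only means extending the list passed to groupby; the dictionary keys then become longer tuples.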
