Striding through a pandas dataframe

I have a dataframe of the form:

date_time                                                            uids
2018-10-16 23:00:00                                                 1000,1321,7654,1321
2018-10-16 23:10:00                                                 7654
2018-10-16 23:20:00                                                  NaN
2018-10-16 23:30:00                                                 7654,1000,7654,1321,1000
2018-10-16 23:40:00                                                 691,3974,3974,323
2018-10-16 23:50:00                                                  NaN
2018-10-17 00:00:00                                                  NaN
2018-10-17 00:10:00                                                  NaN
2018-10-17 00:20:00                                                 27,33,3974,3974,7665,27 

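This is a very large dataframe. Each row covers one time interval (10 minutes in the sample above) and lists the ids that occurred during it, one entry per occurrence.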

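I want to step through this dataframe 6 rows at a time (6 rows correspond to 1 hour) and, for each step, create a dataframe containing the ids and the number of times each id occurs in each interval of that hour.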

The expected output is one dataframe per hour. For the sample above, the dataframe for the hour from 23:00 to 00:00 would look like this:

uid   1   2   3   4   5   6

1000  1   0   0   2   0  0
1321  2   0   0   1   0  0

and so on.

How can I do this efficiently?

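I don't have an exact solution, but you can create a pivot table with the ids on the index and the datetimes on the columns. Then you only need to select the columns you want.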

import pandas as pd
import numpy as np

df = pd.DataFrame(
{
    "date_time": [
        "2018-10-16 23:00:00",
        "2018-10-16 23:10:00",
        "2018-10-16 23:20:00",
        "2018-10-16 23:30:00",
        "2018-10-16 23:40:00",
        "2018-10-16 23:50:00",
        "2018-10-17 00:00:00",
        "2018-10-17 00:10:00",
        "2018-10-17 00:20:00",
    ],
    "uids": [
        "1000,1321,7654,1321",
        "7654",
        np.nan,
        "7654,1000,7654,1321,1000",
        "691,3974,3974,323",
        np.nan,
        np.nan,
        np.nan,
        "27,33,3974,3974,7665,27",
    ],
}
)

df["date_time"] = pd.to_datetime(df["date_time"])

df = (
    df.set_index("date_time")  # skip set_index if date_time is already the index
    .loc[:, "uids"]
    .str.extractall(r"(?P<uids>\d+)")  # one row per id; NaN rows yield no matches
    .droplevel(level=1)  # drop the match-number level added by extractall
)

# Slot number 1 to 6 within the hour, derived from the minutes
df["number"] = df.index.minute.astype(float) / 10 + 1

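# Count occurrences per id and timestamp: ids on the index, datetimes on the
# columns. The default aggfunc (mean) would average the slot numbers instead.
df_pivot = df.pivot_table(
    values="number",
    index="uids",
    columns=["date_time"],
    aggfunc="count",
    fill_value=0,
)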

You can apply this to the whole dataframe or to a subset of only 6 rows. Then rename the columns.
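To make the "select the columns you want, then rename them" step concrete, here is a minimal sketch under a few assumptions: the wide count table is built with groupby/size/unstack rather than pivot_table, the intervals are exactly 10 minutes apart, and hour_table is a hypothetical helper name, not something pandas provides.

```python
import numpy as np
import pandas as pd

# Sample data from the question (timestamps are 10 minutes apart).
df = pd.DataFrame({
    "date_time": pd.date_range("2018-10-16 23:00:00", periods=9, freq="10min"),
    "uids": [
        "1000,1321,7654,1321", "7654", np.nan,
        "7654,1000,7654,1321,1000", "691,3974,3974,323",
        np.nan, np.nan, np.nan, "27,33,3974,3974,7665,27",
    ],
})

# One row per (timestamp, id); intervals without ids are dropped.
long = (
    df.assign(uids=df["uids"].str.split(","))
      .explode("uids")
      .dropna(subset=["uids"])
)

# Wide table of counts: ids on the index, timestamps on the columns.
counts = long.groupby(["uids", "date_time"]).size().unstack(fill_value=0)


def hour_table(counts, hour_start):
    """Hypothetical helper: the six 10-minute columns of one hour, renamed 1..6."""
    slots = pd.date_range(hour_start, periods=6, freq="10min")
    block = counts.reindex(columns=slots, fill_value=0)  # empty intervals become 0
    block.columns = list(range(1, 7))                    # timestamps -> 1..6
    return block.loc[block.sum(axis=1) > 0]              # keep ids seen in this hour


table_23 = hour_table(counts, "2018-10-16 23:00:00")
print(table_23)
```

The reindex step is there because intervals with no ids (23:20 and 23:50 in the sample) never appear as columns in the count table, so selecting by position alone would not always give six columns.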

You can use the function crosstab:

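# Assumes df['date_time'] already has a datetime dtype
df['uids'] = df['uids'].str.split(',')  # "a,b,c" -> ['a', 'b', 'c']
df = df.explode('uids')  # one row per id
df['date_time'] = df['date_time'].dt.minute.floordiv(10).add(1)  # slot 1 to 6 within the hour
pd.crosstab(df['uids'], df['date_time'], dropna=False)  # count ids per slot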

Output:

date_time  1  2  3  4  5  6
uids                       
1000       1  0  0  2  0  0
1321       2  0  0  1  0  0
27         0  0  2  0  0  0
323        0  0  0  0  1  0
33         0  0  1  0  0  0
3974       0  0  2  0  2  0
691        0  0  0  0  1  0
7654       1  1  0  2  0  0
7665       0  0  1  0  0  0
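Note that the slot number above depends only on the minute, so rows from different hours end up in the same column: in this output, 3974 has counts in slot 3 (from 00:20) and in slot 5 (from 23:40). If one table per hour is wanted, a possible sketch is to floor each timestamp to its hour and group on that. The assumptions here are 10-minute intervals, groupby/size/unstack used in place of crosstab, and the illustrative names long, hour, slot and tables.

```python
import numpy as np
import pandas as pd

# Sample data from the question (timestamps are 10 minutes apart).
df = pd.DataFrame({
    "date_time": pd.date_range("2018-10-16 23:00:00", periods=9, freq="10min"),
    "uids": [
        "1000,1321,7654,1321", "7654", np.nan,
        "7654,1000,7654,1321,1000", "691,3974,3974,323",
        np.nan, np.nan, np.nan, "27,33,3974,3974,7665,27",
    ],
})

# One row per (timestamp, id), tagged with its hour and its slot in that hour.
long = (
    df.assign(uids=df["uids"].str.split(","))
      .explode("uids")
      .dropna(subset=["uids"])
      .assign(
          hour=lambda d: d["date_time"].dt.floor("60min"),    # start of the hour
          slot=lambda d: d["date_time"].dt.minute // 10 + 1,  # 1..6 within the hour
      )
)

# One count table per hour, always with the six slot columns.
tables = {
    hour: (
        group.groupby(["uids", "slot"]).size()
             .unstack(fill_value=0)
             .reindex(columns=range(1, 7), fill_value=0)
    )
    for hour, group in long.groupby("hour")
}

print(tables[pd.Timestamp("2018-10-16 23:00:00")])
```

Hours in which no id occurs at all simply have no entry in tables, which may or may not be what you want for a very large dataframe.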

We can achieve this by extracting the minutes from your datetime column and then using pivot_table to get the wide format:

df['date_time'] = pd.to_datetime(df['date_time'])

df['minute'] = df['date_time'].dt.minute // 10

piv = (df.assign(uids=df['uids'].str.split(','))
         .explode('uids')
         .pivot_table(index='uids', columns='minute', values='minute', aggfunc='size')
      )

Output:

minute    0    1    2    3    4
uids                           
1000    1.0  NaN  NaN  2.0  NaN
1321    2.0  NaN  NaN  1.0  NaN
27      NaN  NaN  2.0  NaN  NaN
323     NaN  NaN  NaN  NaN  1.0
33      NaN  NaN  1.0  NaN  NaN
3974    NaN  NaN  2.0  NaN  2.0
691     NaN  NaN  NaN  NaN  1.0
7654    1.0  1.0  NaN  2.0  NaN
7665    NaN  NaN  1.0  NaN  NaN
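The NaN cells are id/slot pairs that never occur, and slot 5 is missing because no id occurred at minute 50. If integer counts with all six slots are preferred, one way to get there is sketched below. It assumes the same sample data, and piv is rebuilt with groupby/size/unstack as a stand-in for the pivot_table result above.

```python
import numpy as np
import pandas as pd

# Sample data from the question (timestamps are 10 minutes apart).
df = pd.DataFrame({
    "date_time": pd.date_range("2018-10-16 23:00:00", periods=9, freq="10min"),
    "uids": [
        "1000,1321,7654,1321", "7654", np.nan,
        "7654,1000,7654,1321,1000", "691,3974,3974,323",
        np.nan, np.nan, np.nan, "27,33,3974,3974,7665,27",
    ],
})

long = (
    df.assign(uids=df["uids"].str.split(","),
              minute=df["date_time"].dt.minute // 10)
      .explode("uids")
      .dropna(subset=["uids"])
)

# Stand-in for the pivot_table result: absent (uid, minute) pairs are NaN.
piv = long.groupby(["uids", "minute"]).size().unstack()

clean = (
    piv.reindex(columns=range(6))        # add slots where no id occurred (here 5)
       .fillna(0)                        # NaN means "never occurred"
       .astype(int)                      # counts as integers, not floats
       .rename(columns=lambda m: m + 1)  # 0..5 -> 1..6, as in the question
)
print(clean)
```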