Pandas: How to add a grouping variable based upon another column?
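I have a dataframe containing some IDs and some dates. I want to group the ids according to changes in the date, so as to create a general "grouping_variable". In R I would do it like this: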
library(dplyr)      # tibble(), group_by(), mutate(), %>%
library(lubridate)  # as_date()
df <- tibble(id = c(rep("1", 4), rep("2", 4), rep("3", 4)),
             dates = as_date(c('2022-02-07', '2022-02-07', '2022-02-08', '2022-02-08',
                               '2022-02-09', '2022-02-09', '2022-02-10', '2022-02-10',
                               '2022-02-11', '2022-02-11', '2022-02-11', '2022-02-11')))
df <- df %>% group_by(id) %>% mutate(grouping_var = match(dates, unique(dates)))
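Basically, this code groups by id, then within each group assigns a value to each unique date, and then matches those values back to the actual dates, producing a column with these values: 1 1 2 2 1 1 2 2 1 1 1 1
In Python/pandas I can't find an equivalent of the match function. Does anyone know how to do this?
Here is some sample data in Python: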
import pandas as pd

d = {'user': ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"],
     'dates': ['2022-02-07', '2022-02-07', '2022-02-08', '2022-02-08',
               '2022-02-09', '2022-02-09', '2022-02-10', '2022-02-10',
               '2022-02-11', '2022-02-11', '2022-02-11', '2022-02-11'],
     'hoped_for_output': [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1]}
example_df = pd.DataFrame(data=d)
Thanks a lot!
After grouping by 'user', we can use factorize:
d['hoped_for_output'] = d.groupby(['user'])['dates'].transform(lambda x: pd.factorize(x)[0]) + 1
-output
d
    user       dates  hoped_for_output
0      1  2022-02-07                 1
1      1  2022-02-07                 1
2      1  2022-02-08                 2
3      1  2022-02-08                 2
4      2  2022-02-09                 1
5      2  2022-02-09                 1
6      2  2022-02-10                 2
7      2  2022-02-10                 2
8      3  2022-02-11                 1
9      3  2022-02-11                 1
10     3  2022-02-11                 1
11     3  2022-02-11                 1
data
d = pd.DataFrame(d)
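For readers who want to run the factorize approach end to end, here is a minimal self-contained sketch. It assumes the question's sample data; the output column name `grouping_var` and the small `["b", "a", "b"]` demo series are illustrative choices, not part of the original answer. `pd.factorize` returns 0-based codes in order of first appearance, which is why 1 is added to mirror R's `match(dates, unique(dates))`.

```python
import pandas as pd

# Sample data from the question (without the hoped-for column).
example_df = pd.DataFrame({
    "user": ["1"] * 4 + ["2"] * 4 + ["3"] * 4,
    "dates": ["2022-02-07", "2022-02-07", "2022-02-08", "2022-02-08",
              "2022-02-09", "2022-02-09", "2022-02-10", "2022-02-10",
              "2022-02-11", "2022-02-11", "2022-02-11", "2022-02-11"],
})

# pd.factorize returns (codes, uniques); codes are 0-based and follow
# the order in which each distinct value first appears.
codes, uniques = pd.factorize(pd.Series(["b", "a", "b"]))
print(codes.tolist())  # [0, 1, 0]

# Apply factorize within each user and shift to 1-based numbering.
# "grouping_var" is an illustrative column name.
example_df["grouping_var"] = (
    example_df.groupby("user")["dates"]
    .transform(lambda s: pd.factorize(s)[0])
    + 1
)
print(example_df["grouping_var"].tolist())
# [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1]
```

Because the codes follow first appearance rather than sorted order, the numbering depends on row order within each user, just as it does with `match(dates, unique(dates))` in R.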
It might be interesting to see this implemented with datar:
>>> from datar.all import f, c, rep, as_date, tibble, group_by, mutate, match, unique
>>> df = tibble(
...     id=c(rep("1", 4), rep("2", 4), rep("3", 4)),
...     dates=as_date(
...         c(
...             "2022-02-07",
...             "2022-02-07",
...             "2022-02-08",
...             "2022-02-08",
...             "2022-02-09",
...             "2022-02-09",
...             "2022-02-10",
...             "2022-02-10",
...             "2022-02-11",
...             "2022-02-11",
...             "2022-02-11",
...             "2022-02-11",
...         )
...     ),
... )
>>> df >> group_by(f.id) >> mutate(grouping_var=match(f.dates, unique(f.dates)) + 1)
         id            dates  grouping_var
   <object> <datetime64[ns]>       <int64>
0         1       2022-02-07             1
1         1       2022-02-07             1
2         1       2022-02-08             2
3         1       2022-02-08             2
4         2       2022-02-09             1
5         2       2022-02-09             1
6         2       2022-02-10             2
7         2       2022-02-10             2
8         3       2022-02-11             1
9         3       2022-02-11             1
10        3       2022-02-11             1
11        3       2022-02-11             1
[TibbleGrouped: id (n=3)]