Finding the most common value separated by day
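I want to see the category that occurs most often per participant per day. Each day has multiple categories, and I want a new column that states which category occurred most for a specific participant on a specific date.
I have a column 'user_id', a column 'date' and a column 'category' (character). What code should I use to add a new column that states only the category that occurs most often for a given user on a given date?
dput output: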
structure(list(user_id = c("10257", "10580", "10280", "10202", "10275","10281"),
date = structure(c(1552521600, 1552003200, 1551139200,1551484800, 1552867200, 1552521600), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
better_category = c("Email", "Internet_Browser", "Instant_Messaging","News","Background_Process","Instant_Messaging")),
row.names = c(176300L, 184332L, 469288L, 119462L, 112507L, 399236L),
class = "data.frame")
Let's create some data:
library(dplyr)
set.seed(100)
data <- data.frame(user_id = rep(c(1, 2, 3), 10),
                   date = rep(c("tuesday", "wednesday", "thursday"), each = 10),
                   category = sample(1:3, 30, replace = TRUE))
If we arrange the rows for easier viewing, we get this:
data <- data %>% arrange(user_id, date)
data
   user_id      date category
1        1  thursday        3
2        1  thursday        2
3        1  thursday        3
4        1   tuesday        1
5        1   tuesday        1
6        1   tuesday        3
7        1   tuesday        1
8        1 wednesday        1
9        1 wednesday        3
10       1 wednesday        2
11       2  thursday        2
12       2  thursday        1
13       2  thursday        2
14       2   tuesday        1
15       2   tuesday        2
16       2   tuesday        2
17       2 wednesday        2
18       2 wednesday        2
19       2 wednesday        1
20       2 wednesday        3
21       3  thursday        2
22       3  thursday        3
23       3  thursday        3
24       3  thursday        1
25       3   tuesday        2
26       3   tuesday        2
27       3   tuesday        2
28       3 wednesday        3
29       3 wednesday        3
30       3 wednesday        2
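Now we group the data by user_id and date and create a new column called max that holds the most frequent category in each group. We do this by calling table on category, which builds a frequency table of that column for each group:
data %>%
  group_by(user_id, date) %>%
  dplyr::mutate(max = names(sort(table(category), decreasing = TRUE))[1])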
# A tibble: 30 x 4
# Groups:   user_id, date [9]
   user_id date      category max
     <dbl> <fct>        <int> <chr>
 1       1 thursday         3 3
 2       1 thursday         2 3
 3       1 thursday         3 3
 4       1 tuesday          1 1
 5       1 tuesday          1 1
 6       1 tuesday          3 1
 7       1 tuesday          1 1
 8       1 wednesday        1 1
 9       1 wednesday        3 1
10       1 wednesday        2 1
# ... with 20 more rows
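As you can see, each user-day group has its own max. In the last example shown here (user 1, wednesday), each of the three categories occurs once, so the first one, 1, is chosen.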
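To see what the expression inside mutate does for a single group, here is a minimal base R sketch. The vectors tied and clear_winner and the helper name most_common are made up for illustration and are not part of the data above; tied mirrors the user 1 / wednesday group.

```r
# Hypothetical single-group inputs (not taken from the question's data).
tied <- c(1, 3, 2)             # every category occurs once, like user 1 on wednesday
clear_winner <- c(1, 1, 3, 1)  # category 1 occurs three times

# most_common is a name chosen for this sketch; it wraps the same
# table / sort / names steps used inside mutate().
most_common <- function(x) {
  counts <- table(x)                         # frequency of each distinct value
  ranked <- sort(counts, decreasing = TRUE)  # highest count first
  names(ranked)[1]                           # label of the top entry, as character
}

most_common(clear_winner)
most_common(tied)
```

Because the result comes from names(), it is a character string even when the category itself is numeric, which is why the max column prints as <chr>.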
Here is the result using your dput data (where every row has a unique user/date pair):
# A tibble: 6 x 4
# Groups:   user_id, date [6]
  user_id date                better_category    max
  <fct>   <dttm>              <fct>              <chr>
1 10257   2019-03-14 00:00:00 Email              Email
2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser
3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging
4 10202   2019-03-02 00:00:00 News               News
5 10275   2019-03-18 00:00:00 Background_Process Background_Process
6 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging
So I created an identical table but duplicated the last row twice, changed one of those categories to "News", and ran the same code:
# A tibble: 8 x 4
# Groups:   user_id, date [6]
  user_id date                better_category    max
  <chr>   <dttm>              <chr>              <chr>
1 10257   2019-03-14 00:00:00 Email              Email
2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser
3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging
4 10202   2019-03-02 00:00:00 News               News
5 10275   2019-03-18 00:00:00 Background_Process Background_Process
6 10281   2019-03-14 00:00:00 News               Instant_Messaging
7 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging
8 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging
Note the last three rows.
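The code behind this last check is not shown above. One way it could be reproduced is sketched here, assuming the dput data is rebuilt into a data frame called df (a name chosen for this sketch) and the two extra rows are appended with dplyr's bind_rows; the original answer may have built the table differently.

```r
library(dplyr)

# The question's dput data, rebuilt by hand; df is a name chosen for this sketch.
df <- data.frame(
  user_id = c("10257", "10580", "10280", "10202", "10275", "10281"),
  date = as.POSIXct(c(1552521600, 1552003200, 1551139200,
                      1551484800, 1552867200, 1552521600),
                    origin = "1970-01-01", tz = "UTC"),
  better_category = c("Email", "Internet_Browser", "Instant_Messaging",
                      "News", "Background_Process", "Instant_Messaging"),
  stringsAsFactors = FALSE
)

# Append two copies of the last row, then relabel the first of the
# three identical rows as "News".
df2 <- bind_rows(df, df[c(6, 6), ])
df2$better_category[6] <- "News"

# Same grouping and mutate as above, applied to better_category.
result <- df2 %>%
  group_by(user_id, date) %>%
  mutate(max = names(sort(table(better_category), decreasing = TRUE))[1]) %>%
  ungroup()

tail(result, 3)
```

For user 10281 on 2019-03-14 the group holds one "News" and two "Instant_Messaging" rows, so all three rows get "Instant_Messaging" in max.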