Label unique values by multiple groups in dataframe
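I have a large data frame in R in which users were tasked with describing objects in scenes. I need exactly 3 unique users per scene, but some scenes were described more than 3 times. I am trying to keep the first 3 unique users and remove the rest.
Toy data (the real dataset has many more rows and columns):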
user <- c("A", "A", "A", "B", "B", "C", "C", "D", "E", "E", "F", "F", "F")
scene <- c("library", "library", "library", "park", "park", "library", "library", "park", "library", "library", "library", "library", "library")
object <- c("book", "book", "lamp", "dog", "cat", "book", "lamp", "dog", "desk", "desk", "book", "lamp", "lamp")
index <- c(1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2)
dat <- data.frame(user, scene, object, index)
user scene object index
A library book 1
A library book 2
A library lamp 1
B park dog 1
B park cat 1
C library book 1
C library lamp 1
D park dog 1
E library desk 1
E library desk 2
F library book 1
F library lamp 1
F library lamp 2
... ... ... ...
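For example, A, C, and E are the first three users to describe the scene library, so F's descriptions are no longer needed. My main problem is that while I can get the overall number of unique users, I don't know how to label them 1, 2, 3, and so on, so that I can cut off values above 3.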
Desired output
user scene object index count
A library book 1 1
A library book 2 1
A library lamp 1 1
B park dog 1 1
B park cat 1 1
C library book 1 2
C library lamp 1 2
D park dog 1 2
E library desk 1 3
E library desk 2 3
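This was useful, but it only groups by one column, so I could not apply it here:
Within each scene, you can use match to number the users in order of first appearance (the count variable), then filter to keep only the rows where count <= 3: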
library(dplyr)
dat %>%
group_by(scene) %>%
mutate(count = match(user, unique(user))) %>%
filter(count <= 3)
# user scene object index count
# <chr> <chr> <chr> <dbl> <int>
# 1 A library book 1 1
# 2 A library book 2 1
# 3 A library lamp 1 1
# 4 B park dog 1 1
# 5 B park cat 1 1
# 6 C library book 1 2
# 7 C library lamp 1 2
# 8 D park dog 1 2
# 9 E library desk 1 3
#10 E library desk 2 3
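To see why this works, it may help to look at match(x, unique(x)) on a single vector, outside any grouping. The following is only a minimal sketch in base R; the vector x is made up for illustration and stands for the users of one scene in row order.

```r
# Hypothetical users within a single scene, in order of appearance
x <- c("A", "A", "C", "E", "C", "F")

# unique() keeps the first occurrence of each value, preserving order
u <- unique(x)          # "A" "C" "E" "F"

# match() gives, for each element of x, its position in u,
# i.e. the rank of that user by first appearance
count <- match(x, u)    # 1 1 2 3 2 4

# Keep only the rows belonging to the first three distinct users
kept <- x[count <= 3]   # "A" "A" "C" "E" "C"
```

Grouping by scene simply repeats this numbering separately for each scene.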
The same in data.table:
library(data.table)
setDT(dat)[, count := match(user, unique(user)), scene]
dat[count <= 3]
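And in base R:
# ave() returns a vector of the same type as its first argument (character here),
# so convert the result to integer before comparing it with 3
dat$count <- with(dat, as.integer(ave(as.character(user), scene, FUN = function(x) match(x, unique(x)))))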
subset(dat, count <= 3)
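Because ave() returns a vector of the same type as its first argument, an integer count can also be obtained by passing row indices rather than the user column itself. The sketch below is one possible variant, not necessarily the preferred one; it rebuilds only the user and scene columns of the toy data, and the names dat2 and res are chosen here purely for illustration.

```r
# Toy data from the question (only the two columns needed here)
user  <- c("A", "A", "A", "B", "B", "C", "C", "D", "E", "E", "F", "F", "F")
scene <- c("library", "library", "library", "park", "park", "library", "library",
           "park", "library", "library", "library", "library", "library")
dat2 <- data.frame(user, scene)

# Pass row indices to ave(); within each scene, look up the users for those
# rows and number them by first appearance. The result stays integer.
dat2$count <- ave(seq_along(dat2$user), dat2$scene,
                  FUN = function(i) match(dat2$user[i], unique(dat2$user[i])))

# Keep the first three distinct users per scene
res <- subset(dat2, count <= 3)
```

On the toy data this keeps 10 rows and drops the three rows from user F, matching the desired output.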