Identify unique observations that satisfy two conditions and then remove them (R)
I have a df as follows:
data
names fruit
7 john apple
13 john orange
14 john apple
2 mary orange
5 mary apple
8 mary orange
10 mary apple
12 mary apple
1 tom apple
6 tom apple
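I want to do two things. First, count the number of unique individuals who have both an apple and an orange (i.e. 2: mary and john).
Then, I want to remove them from my data frame, so that I am left only with the unique individuals who have apples only.
This is what I have tried: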
toremove<-unique(data[data$fruit=='apple' & data$fruit=='orange',"names"]) ##this part doesn't work, if it had I would have used the below code to remove the names identified
data2<-data[!data$names %in% toremove,]
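Really, I would like to use grepl, because my real data is a bit more complicated than fruit. This is what I tried (after first converting to data.table):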
data1<-data.table(data1)
z<-data1[,ind := grepl('app.*? & orang.*?', fruit), by='names'] ## this works fine when i just use 'app.*?' but collapses when I try to add the & sign, so I'm making an error with the operator. In addition the by='names' doesn't work out for me, which is important. My plan here was to create an indicator (if an individual has an apple and an orange, then they get an indicator==1 and I would then filter them out on the basis of this indicator).
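So, to sum up, my problem is identifying the individuals who have both an apple and an orange. This seems simple, so please feel free to point me to a resource that would teach me this!
Desired output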
names fruit
1 tom apple
6 tom apple
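I am using the dplyr package to flag/spot the users who have oranges and the users who have both fruits. (I added an extra row at the end to get an orange-only case.)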
data =
read.table(text="
names fruit
7 john apple
13 john orange
14 john apple
2 mary orange
5 mary apple
8 mary orange
10 mary apple
12 mary apple
1 tom apple
6 tom apple
21 kathy orange", header=T)
# names fruit
# 7 john apple
# 13 john orange
# 14 john apple
# 2 mary orange
# 5 mary apple
# 8 mary orange
# 10 mary apple
# 12 mary apple
# 1 tom apple
# 6 tom apple
# 21 kathy orange
library(dplyr)
data %>%
group_by(names) %>% # for each user name
mutate(N_dist = n_distinct(fruit), # count distinct number of fruits
N_oranges = sum(fruit=="orange")) %>% # count number of oranges
filter(N_oranges == 0 & N_dist < 2) %>% # keep users with no oranges and fewer than two distinct fruits
select(names, fruit)
# names fruit
# 1 tom apple
# 2 tom apple
Note that before the filter is applied, your dataset looks like this:
# names fruit N_dist N_oranges
# 1 john apple 2 1
# 2 john orange 2 1
# 3 john apple 2 1
# 4 mary orange 2 2
# 5 mary apple 2 2
# 6 mary orange 2 2
# 7 mary apple 2 2
# 8 mary apple 2 2
# 9 tom apple 1 0
# 10 tom apple 1 0
# 11 kathy orange 1 1
From there you can get the unique names of the users who have both fruits, or of the users who have oranges.
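As a rough sketch of that last step (not part of the original answer), the per-user flags can be collapsed to one row per name and then filtered. The names fruit_df, flags, has_both, has_orange, both_names and orange_names are made up for this illustration, and the data is the 11-row example including the extra kathy row.

```r
library(dplyr)

# Same 11 rows as above, built directly (fruit_df is a hypothetical name)
fruit_df <- data.frame(
  names = c("john", "john", "john", "mary", "mary", "mary", "mary", "mary",
            "tom", "tom", "kathy"),
  fruit = c("apple", "orange", "apple", "orange", "apple", "orange", "apple",
            "apple", "apple", "apple", "orange"),
  stringsAsFactors = FALSE
)

# One row per user, with a logical flag for each condition
flags <- fruit_df %>%
  group_by(names) %>%
  summarise(has_both   = any(fruit == "apple") & any(fruit == "orange"),
            has_orange = any(fruit == "orange"))

both_names   <- flags %>% filter(has_both)   %>% pull(names)  # users with both fruits
orange_names <- flags %>% filter(has_orange) %>% pull(names)  # users with any orange

length(both_names)  # number of users with both fruits
```

The vectors can then be used with %in% to drop those users from the original data frame, as in the question's second line of code.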
If you are only looking for the names that have nothing but apple, here is a simple data.table approach
library(data.table)
setDT(data)[ , if(all(fruit == "apple")) .SD, by = names] # setDT converts data to a data.table by reference
# names fruit
# 1: tom apple
# 2: tom apple
To count the unique individuals who have both "apple" and "orange", you could do something like
data[, any(fruit == "apple") & any(fruit == "orange"), by = names][, sum(V1)]
## [1] 2
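Since the question asks for grepl, here is a minimal sketch of the same per-group test using patterns instead of ==. The pattern 'app.*? & orang.*?' in the question cannot work because & is not a regex operator: it matches a literal " & " inside a single string, and no single fruit value contains both words. Testing each pattern separately and combining the results with any() and & within each group avoids that. The names dt and ind and the anchored patterns "^app" and "^orang" are assumptions for illustration.

```r
library(data.table)

# Same 11 rows as above (dt is a hypothetical name)
dt <- data.table(
  names = c("john", "john", "john", "mary", "mary", "mary", "mary", "mary",
            "tom", "tom", "kathy"),
  fruit = c("apple", "orange", "apple", "orange", "apple", "orange", "apple",
            "apple", "apple", "apple", "orange")
)

# Per name: TRUE if at least one row matches each pattern
dt[, ind := any(grepl("^app", fruit)) & any(grepl("^orang", fruit)), by = names]

n_both   <- dt[ind == TRUE, uniqueN(names)]  # number of names with both fruits
filtered <- dt[ind == FALSE]                 # rows for everyone else
```

With the extra orange-only row, filtered still contains kathy; adding a condition such as all(grepl("^app", fruit)) would be one way to keep apple-only names instead.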
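Finally, if all you are looking for is the users who have only one unique fruit, you could try uniqueN from the devel version on GH (or length(unique())). Note that the output below is for the question's original data; with the extra orange-only row added above, kathy would be returned as well.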
data[, if(uniqueN(fruit) < 2L) .SD, by = names]
# names fruit
# 1: tom apple
# 2: tom apple
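For completeness, a base R sketch of what the first attempt in the question was aiming for. The condition data$fruit=='apple' & data$fruit=='orange' is evaluated row by row, and a single row can never be both, so it selects nothing. Collecting the names for each fruit separately and intersecting them gives the individuals who have both. The names fruit_df, apple_names and orange_names are made up for this illustration; the data is the question's original 10 rows.

```r
# The question's original 10 rows (fruit_df is a hypothetical name)
fruit_df <- data.frame(
  names = c("john", "john", "john", "mary", "mary", "mary", "mary", "mary",
            "tom", "tom"),
  fruit = c("apple", "orange", "apple", "orange", "apple", "orange", "apple",
            "apple", "apple", "apple"),
  stringsAsFactors = FALSE
)

apple_names  <- unique(fruit_df$names[fruit_df$fruit == "apple"])
orange_names <- unique(fruit_df$names[fruit_df$fruit == "orange"])

# Individuals appearing in both sets
toremove <- intersect(apple_names, orange_names)
length(toremove)  # count of individuals with both fruits

# Drop every row belonging to those individuals
data2 <- fruit_df[!fruit_df$names %in% toremove, ]
```

On this data, data2 is left with tom's two rows, which matches the desired output in the question.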