Convert NA to the most frequent value based on another column
I have a data frame called df that looks like this:
Author_ID Country Cited Name Title
1: 1 Spain 10 Alex Whatever
2: 1 France 15 Ale Whatever2
3: 1 NA 10 Alex Whatever3
4: 1 Spain 10 Alex Whatever4
5: 2 Italy 10 Alice Whatever5
6: 2 Greece 10 Alice Whatever6
7: 2 Greece 10 Alice Whatever7
8: 2 NA 10 Alce Whatever8
9: 2 NA 10 Alce Whatever8
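I would like to get something like this, where each NA is replaced by the country that appears most often for that Author_ID (if two countries appear the same number of times, picking one of the two at random would be fine):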
Author_ID Country Cited Name Title
1: 1 Spain 10 Alex Whatever
2: 1 France 15 Ale Whatever2
3: 1 Spain 10 Alex Whatever3
4: 1 Spain 10 Alex Whatever4
5: 2 Italy 10 Alice Whatever5
6: 2 Greece 10 Alice Whatever6
7: 2 Greece 10 Alice Whatever7
8: 2 Greece 10 Alce Whatever8
9: 2 Greece 10 Alce Whatever8
Thanks in advance.
With data.table:
library(data.table)
# setDT(df)  # uncomment if df is a plain data.frame rather than a data.table
df[,Country := replace(Country,is.na(Country),names(which.max(table(Country)))),by=Author_ID]
# Author_ID Country Cited Name Title
# 1: 1 Spain 10 Alex Whatever
# 2: 1 France 15 Ale Whatever2
# 3: 1 Spain 10 Alex Whatever3
# 4: 1 Spain 10 Alex Whatever4
# 5: 2 Italy 10 Alice Whatever5
# 6: 2 Greece 10 Alice Whatever6
# 7: 2 Greece 10 Alice Whatever7
# 8: 2 Greece 10 Alce Whatever8
# 9: 2 Greece 10 Alce Whatever8
In base R:
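# unlist(tapply(...)) returns the values grouped by Author_ID,
# so this assumes df is already sorted by Author_ID
df$Country <- unlist(tapply(df$Country,df$Author_ID,function(x)
replace(x,is.na(x),names(which.max(table(x))))))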
# Author_ID Country Cited Name Title
# 1 1 Spain 10 Alex Whatever
# 2 1 France 15 Ale Whatever2
# 3 1 Spain 10 Alex Whatever3
# 4 1 Spain 10 Alex Whatever4
# 5 2 Italy 10 Alice Whatever5
# 6 2 Greece 10 Alice Whatever6
# 7 2 Greece 10 Alice Whatever7
# 8 2 Greece 10 Alce Whatever8
# 9 2 Greece 10 Alce Whatever8
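The tapply() version concatenates the results group by group, so it only lines up with the original rows when df is sorted by Author_ID. A minimal sketch of an alternative that keeps the original row order uses base R's ave(); the small unsorted data frame below is made up purely for illustration:

```r
# Illustrative data, deliberately NOT sorted by Author_ID
df <- read.table(text = "Author_ID Country
2 Greece
1 Spain
2 NA
1 NA
2 Greece
1 Spain", header = TRUE, stringsAsFactors = FALSE)

# ave() applies FUN within each Author_ID group and writes the results
# back into the positions the values came from
df$Country <- ave(df$Country, df$Author_ID, FUN = function(x)
  replace(x, is.na(x), names(which.max(table(x)))))

print(df)
```

Like the solutions above, this sketch assumes every group has at least one non-NA country.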
With dplyr:
library(dplyr)
df %>% group_by(Author_ID) %>%
mutate(Country = replace(
Country,
is.na(Country),
names(which.max(table(Country)))))
# # A tibble: 9 x 5
# # Groups: Author_ID [2]
# Author_ID Country Cited Name Title
# <int> <chr> <int> <chr> <chr>
# 1 1 Spain 10 Alex Whatever
# 2 1 France 15 Ale Whatever2
# 3 1 Spain 10 Alex Whatever3
# 4 1 Spain 10 Alex Whatever4
# 5 2 Italy 10 Alice Whatever5
# 6 2 Greece 10 Alice Whatever6
# 7 2 Greece 10 Alice Whatever7
# 8 2 Greece 10 Alce Whatever8
# 9 2 Greece 10 Alce Whatever8
If several countries tie for the maximum count, the first one (in the sorted order used by table()) is taken, not a random one.
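If a random pick among tied countries is wanted, as the question suggests, one possible sketch is the helper below. The name fill_mode_random and the sample data are hypothetical and not part of the solutions above; the helper collects every value that reaches the maximum count and samples one of them.

```r
# Hypothetical helper (not from the solutions above): fill NAs with the most
# frequent value, picking one of the tied values at random.
fill_mode_random <- function(x) {
  counts <- table(x)                          # table() drops NA by default
  if (length(counts) == 0) return(x)          # all-NA group: leave unchanged
  tied <- names(counts)[as.vector(counts) == max(counts)]
  pick <- tied[sample.int(length(tied), 1)]   # one tied value, chosen uniformly
  replace(x, is.na(x), pick)
}

# Illustrative data: author 1 has a Spain/France tie, author 2 does not
df <- data.frame(Author_ID = c(1, 1, 1, 2, 2, 2),
                 Country = c("Spain", "France", NA, "Italy", "Italy", NA),
                 stringsAsFactors = FALSE)

set.seed(42)  # only to make the random pick reproducible
df$Country <- ave(df$Country, df$Author_ID, FUN = fill_mode_random)
print(df)
```

The same function could be used in place of the inline replace(...) expression in the data.table or dplyr versions, assuming Country is a character column.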
If Country is NA in every row for some authors
First run this to modify the sample data:
df$Country[df$Author_ID ==2] <- NA
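Then here are the 3 adapted solutions. They are not very elegant, but they work. I suspect there may be a base/dplyr/data.table function that converts zero-length elements to NA more smoothly.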
# data.table
setDT(df)
df[,Country := replace(Country,is.na(Country),{
nm <- names(which.max(table(Country)))
if(length(nm)==0) NA else nm}),
by=Author_ID]
df <- df[!is.na(df$Country),]
# Author_ID Country Cited Name Title
# 1: 1 Spain 10 Alex Whatever
# 2: 1 France 15 Ale Whatever2
# 3: 1 Spain 10 Alex Whatever3
# 4: 1 Spain 10 Alex Whatever4
# base R
df$Country <- unlist(tapply(df$Country,df$Author_ID,function(x)
replace(x,is.na(x),{
nm <- names(which.max(table(x)))
if(length(nm)==0) NA else nm
})))
df <- df[!is.na(df$Country),]
# Author_ID Country Cited Name Title
# 1 1 Spain 10 Alex Whatever
# 2 1 France 15 Ale Whatever2
# 3 1 Spain 10 Alex Whatever3
# 4 1 Spain 10 Alex Whatever4
# dplyr
df %>% group_by(Author_ID) %>%
mutate(Country = replace(
Country,
is.na(Country),
names(which.max(table(Country))) %>%
{if(length(.)==0) NA else .})) %>%
filter(!is.na(Country))
# # A tibble: 4 x 5
# # Groups: Author_ID [1]
# Author_ID Country Cited Name Title
# <int> <chr> <int> <chr> <chr>
# 1 1 Spain 10 Alex Whatever
# 2 1 France 15 Ale Whatever2
# 3 1 Spain 10 Alex Whatever3
# 4 1 Spain 10 Alex Whatever4
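The zero-length check repeated in the three adapted solutions could also be factored into a small scalar helper. The sketch below is one way to do that in base R; mode_or_na is a hypothetical name, the sample data is made up, and the helper assumes Country is a character column.

```r
# Hypothetical helper: most frequent non-NA value of x,
# or NA_character_ when x contains only NAs.
mode_or_na <- function(x) {
  counts <- table(x)                      # table() ignores NA by default
  if (length(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]        # first maximum, as in the answers above
}

# Illustrative data: author 2 has no known country at all
df <- data.frame(Author_ID = c(1, 1, 1, 2, 2),
                 Country = c("Spain", NA, "Spain", NA, NA),
                 stringsAsFactors = FALSE)

# Fill within each author, then drop the rows that are still NA
df$Country <- ave(df$Country, df$Author_ID,
                  FUN = function(x) replace(x, is.na(x), mode_or_na(x)))
df <- df[!is.na(df$Country), ]
print(df)
```

Returning a typed NA_character_ keeps the replacement value the same type as the column, which avoids relying on implicit coercion inside replace().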
Data
df <- read.table(text="Author_ID Country Cited Name Title
1 Spain 10 Alex Whatever
1 France 15 Ale Whatever2
1 NA 10 Alex Whatever3
1 Spain 10 Alex Whatever4
2 Italy 10 Alice Whatever5
2 Greece 10 Alice Whatever6
2 Greece 10 Alice Whatever7
2 NA 10 Alce Whatever8
2 NA 10 Alce Whatever8", header = TRUE, stringsAsFactors = FALSE)