Assign and create values using existing factors for groups with non-existing values in an R data frame
I have a huge dataset from a beetle-counting experiment with the following example structure:
species_name1 <- c("A", "A", "A", "A", "B") # two levels for name1
species_name2 <- c("a", "a", "b", "b", "c") # three levels for name2
date <- c("2021-06-02", "2021-08-20", "2021-06-15", "2021-08-20", "2021-08-20") # three distinct dates
number <- c("30", "30", "11", "15", "40") # number of beetles encountered on the "date"
df <- data.frame(species_name1, species_name2, date, number) # create data frame
df$species_full_name <- paste(df$species_name1, df$species_name2) # new column combining the first two columns, separated by a space
df$date <- as.Date(df$date, format ="%Y-%m-%d")
df$number <- as.numeric(df$number)
df$species_name1 <- as.factor(df$species_name1)
df$species_name2 <- as.factor(df$species_name2)
df$species_full_name <- as.factor(df$species_full_name)
str(df)
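Overall there are three dates (2021-06-02, 2021-06-15, 2021-08-20), but not every "species_full_name" has all of them.
I need to create a data frame that contains each of the three dates for every level of "species_full_name".
Where a date does not exist for a "species_full_name" level in the original data frame, R should write 0 in the "number" column.
I found some code that almost produces my target data frame. The problem is that the other columns ("species_name1" and "species_name2") disappear: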
as.data.frame(xtabs(number ~ species_full_name+date, df)) # crosses every "species_full_name" level with every "date"; the counts end up in column "Freq"
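I need a data frame similar to this output, but with every column from the original data frame "df". Assume that the values in the columns "species_name1" and "species_name2" are important too.
Thanks for your help!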
You can use complete() from tidyr:
library(tidyr) # complete()
library(dplyr) # %>%, mutate(), if_else()
complete(df, species_full_name,date) %>%
mutate(number=if_else(is.na(number),0,number))
Output:
  species_full_name date       species_name1 species_name2 number
  <fct>             <date>     <fct>         <fct>          <dbl>
1 A a               2021-06-02 A             a                 30
2 A a               2021-06-15 NA            NA                 0
3 A a               2021-08-20 A             a                 30
4 A b               2021-06-02 NA            NA                 0
5 A b               2021-06-15 A             b                 11
6 A b               2021-08-20 A             b                 15
7 B c               2021-06-02 NA            NA                 0
8 B c               2021-06-15 NA            NA                 0
9 B c               2021-08-20 B             c                 40
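In the output above, species_name1 and species_name2 are NA on the rows that complete() added. One possible way to keep them filled, sketched here on the question's example data, is to wrap the three name columns in nesting() so that only name combinations already present in the data are expanded, and to pass fill = list(number = 0) instead of a separate mutate() step. The variable name res is only for illustration.

```r
library(tidyr)

# Rebuild the example data from the question (same values).
df <- data.frame(
  species_name1 = factor(c("A", "A", "A", "A", "B")),
  species_name2 = factor(c("a", "a", "b", "b", "c")),
  date = as.Date(c("2021-06-02", "2021-08-20", "2021-06-15",
                   "2021-08-20", "2021-08-20")),
  number = c(30, 30, 11, 15, 40)
)
df$species_full_name <- factor(paste(df$species_name1, df$species_name2))

# nesting() treats the three name columns as one unit, so each existing
# species is crossed with every date and its name columns stay filled.
# fill = list(number = 0) writes 0 where a species/date pair was missing.
res <- complete(
  df,
  nesting(species_full_name, species_name1, species_name2),
  date,
  fill = list(number = 0)
)
res
```

This should give nine rows (three species times three dates) with no NA in the name columns, provided each species_full_name maps to exactly one pair of species_name1 and species_name2.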
However, a data.table approach will usually be faster on a large dataset. You can use data.table with CJ() as follows:
# load libraries (magrittr provides the %>% pipe used below)
library(data.table)
library(magrittr)
# convert df to a data.table by reference
setDT(df)
# get unique values of species_full_name and date
species_full_name = unique(df$species_full_name)
date = unique(df$date)
# cross join all date/species combinations and left-join the counts,
# then set missing counts to 0 and rebuild name1 and name2 from the full name
merge(CJ(date,species_full_name),df,by=c('date','species_full_name'),all.x = TRUE) %>%
.[, number:=fifelse(is.na(number),0,as.double(number))] %>%
.[, c("species_name1","species_name2"):=tstrsplit(species_full_name, " ")] %>%
.[]
Output:
         date species_full_name species_name1 species_name2 number
       <Date>            <fctr>        <char>        <char>  <num>
1: 2021-06-02               A a             A             a     30
2: 2021-06-02               A b             A             b      0
3: 2021-06-02               B c             B             c      0
4: 2021-06-15               A a             A             a      0
5: 2021-06-15               A b             A             b     11
6: 2021-06-15               B c             B             c      0
7: 2021-08-20               A a             A             a     30
8: 2021-08-20               A b             A             b     15
9: 2021-08-20               B c             B             c     40
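The tstrsplit() step above recovers species_name1 and species_name2 by splitting the full name on a space, which only works if neither part contains a space itself. As a hedged alternative sketch on the same example data, the name columns can instead be looked up from a small per-species table, which also avoids the pipe. The names dt, species, grid and res are illustrative, not part of the original answer.

```r
library(data.table)

# Rebuild the example data from the question (same values).
dt <- data.table(
  species_name1 = c("A", "A", "A", "A", "B"),
  species_name2 = c("a", "a", "b", "b", "c"),
  date = as.Date(c("2021-06-02", "2021-08-20", "2021-06-15",
                   "2021-08-20", "2021-08-20")),
  number = c(30, 30, 11, 15, 40)
)
dt[, species_full_name := paste(species_name1, species_name2)]

# One row per species: a lookup table for the name columns.
species <- unique(dt[, .(species_full_name, species_name1, species_name2)])

# Every species crossed with every observed date.
grid <- CJ(species_full_name = species$species_full_name,
           date = unique(dt$date))

# Left-join the counts onto the grid and replace missing counts with 0.
res <- merge(grid, dt[, .(species_full_name, date, number)],
             by = c("species_full_name", "date"), all.x = TRUE)
res[is.na(number), number := 0]

# Attach the name columns from the lookup table.
res <- merge(res, species, by = "species_full_name")
res[]
```

The lookup costs one extra merge, but it does not depend on how the full name was pasted together.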