Re-populate a column in a relational data frame after randomization in R
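I have a data frame of individuals and their spouses, in which some personal information (namely surnames) has been randomly reassigned with plyr::mapvalues to protect identities. Here is a reproducible example from before and after the surnames were changed: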
# before
d <- data.frame(id = c(1:6),
                first_name = c('Jeff', 'Marilyn', 'Gwyn',
                               'Alice', 'Sam', 'Sarah'),
                surname = c('Goldbloom', 'Monroe', 'Paltrow', 'Goldbloom',
                            'Smith', 'Silverman'),
                spouse_id = c(2, 1, 1, 5, 4, "NA"),
                spouse = c('Marilyn Monroe', 'Jeff Goldbloom', 'Jeff Goldbloom',
                           'Sam Smith', 'Alice Goldbloom', 'NA'))
d
id  first_name  surname    spouse_id  spouse
1   Jeff        Goldbloom  2          Marilyn Monroe
2   Marilyn     Monroe     1          Jeff Goldbloom
3   Gwyn        Paltrow    1          Jeff Goldbloom
4   Alice       Goldbloom  5          Sam Smith
5   Sam         Smith      4          Alice Goldbloom
6   Sarah       Silverman  NA         NA
# replacement names to serve as surnames (it doesn't matter what they are, just
# that the ratios remain the same as before; mapvalues takes care of this)
repnames <- c("Arman" , "Clovis" , "Garner" , "Casey" , "Birch")
s <- unique(d$surname)
d$surname <- plyr::mapvalues(d$surname, from = s, to = repnames) #replace surnames
# After replacement, the dataframe looks like:
d
id  first_name  surname  spouse_id  spouse
1   Jeff        Arman    2          Marilyn Monroe
2   Marilyn     Clovis   1          Jeff Goldbloom
3   Gwyn        Garner   1          Jeff Goldbloom
4   Alice       Arman    5          Sam Smith
5   Sam         Casey    4          Alice Goldbloom
6   Sarah       Birch    NA         NA
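Every person has their own id number, but not everyone has a spouse. If a person does have a spouse, the spouse's id appears in the spouse_id column. I set it up this way so that later I can filter individuals and their spouses separately, using something like dplyr::filter(d, spouse %in% spouse_id).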
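As a rough sketch of that kind of filter, assuming the comparison is meant to be between id and spouse_id and that spouse_id holds real missing values, a base R version could look like the following. The names listed_as_spouse and no_spouse are made up for illustration.

```r
# Cut-down copy of the example with only the two relational columns.
d <- data.frame(
  id        = 1:6,
  spouse_id = c(2L, 1L, 1L, 5L, 4L, NA)
)

# People whose id is listed as somebody else's spouse_id.
listed_as_spouse <- d[d$id %in% d$spouse_id, ]

# People with no spouse recorded.
no_spouse <- d[is.na(d$spouse_id), ]
```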
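My question is: how can I use the relational id and spouse_id columns to re-populate the spouse column so that it reflects the new randomized surnames? That is, the final expected output is: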
id  first_name  surname  spouse_id  spouse
1   Jeff        Arman    2          Marilyn Clovis
2   Marilyn     Clovis   1          Jeff Arman
3   Gwyn        Garner   1          Jeff Arman
4   Alice       Arman    5          Sam Casey
5   Sam         Casey    4          Alice Arman
6   Sarah       Birch    NA         NA
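...so some concatenation of the first_name and surname columns will be involved. I've never done anything this conditional in R; in Excel I guess it would be a nested VLOOKUP...
Thanks, and sorry that it's so specific, but hopefully it makes an interesting challenge for someone out there.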
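Assuming your NAs are real NAs (not the string "NA" used in the example, which turns spouse_id into a character column), then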
d$spouse <- paste(d$first_name, d$surname)[d$spouse_id]
d$spouse
#[1] "Marilyn Clovis" "Jeff Arman" "Jeff Arman" "Sam Casey" "Alice Arman" NA
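The one-liner above indexes by row position, so it relies on each id matching its row number. A slightly more defensive sketch, assuming spouse_id arrives as character because of the quoted "NA" and that ids might not line up with rows, converts the column first and looks each spouse up with match(). The variable full_name is introduced here only for illustration.

```r
# Rebuild the post-randomization example, keeping the quoted "NA" from the question.
d <- data.frame(
  id         = 1:6,
  first_name = c("Jeff", "Marilyn", "Gwyn", "Alice", "Sam", "Sarah"),
  surname    = c("Arman", "Clovis", "Garner", "Arman", "Casey", "Birch"),
  spouse_id  = c(2, 1, 1, 5, 4, "NA"),
  stringsAsFactors = FALSE
)

# The quoted "NA" makes the whole column character; converting to integer
# turns the last entry into a real missing value.
d$spouse_id <- as.integer(d$spouse_id)

# Look up each spouse's row by id rather than by row position.
full_name <- paste(d$first_name, d$surname)
d$spouse  <- full_name[match(d$spouse_id, d$id)]

d$spouse
```

Because match() returns NA when spouse_id is missing, people without a spouse keep an NA in the spouse column.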