Create a new column by combining information from two columns
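I have a data frame with two columns. I want to create a new column based on the second column: use its value when it is not NA, and fall back to the first column when the value is NA or is a duplicate of an earlier value.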
df:
200610-1 rs28619217
200610-10 NA
200610-100 rs367572771
200610-102 rs144402189
200610-105 rs375896687
200610-107 NA
200610-108 NA
200610-109 NA
200610-110 rs199838004
200610-111 rs374875201
200610-112 NA
200610-113 rs377546596
200610-114 NA
200610-115 NA
200610-116 NA
200610-117 rs67858721
200610-118 rs67858721
200610-119 rs9876735
200610-120 rs9876735
desired output:
200610-1 rs28619217 rs28619217
200610-10 NA 200610-10
200610-100 rs367572771 rs367572771
200610-102 rs144402189 rs144402189
200610-105 rs375896687 rs375896687
200610-107 NA 200610-107
200610-108 NA 200610-108
200610-109 NA 200610-109
200610-110 rs199838004 rs199838004
200610-111 rs374875201 rs374875201
200610-112 NA 200610-112
200610-113 rs377546596 rs377546596
200610-114 NA 200610-114
200610-115 NA 200610-115
200610-116 NA 200610-116
200610-117 rs67858721 rs67858721
200610-118 rs67858721 200610-118
200610-119 rs9876735 rs9876735
200610-120 rs9876735 200610-120
How should I do this? I was thinking of using an apply function.
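Consider the following variant...
df <- data.frame(colA = c(1, 2, 3, 4),
                 colB = c("a", NA, "b", "c"),
                 stringsAsFactors = FALSE)
# Start from the second column, then overwrite NA or duplicated entries with colA
df$colC <- df[, 2]
use_first <- is.na(df$colC) | duplicated(df$colB)
df[use_first, "colC"] <- df[use_first, "colA"]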
We can use ifelse
df1$Col3 <- with(df1, ifelse(is.na(Col2), Col1, Col2))
df1$Col3
#[1] "rs28619217" "200610-10" "rs367572771" "rs144402189" "rs375896687"
#[6] "200610-107" "200610-108" "200610-109" "rs199838004" "rs374875201"
#[11] "200610-112" "rs377546596" "200610-114" "200610-115" "200610-116"
Update
If there are duplicates, as @Sotos mentioned in the comments, we can build a logical vector with duplicated inside ifelse:
with(df1, ifelse(is.na(Col2)|duplicated(Col2), Col1, Col2))
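To check the updated call against the question's full 19-row example, here is a minimal sketch. The names df1, Col1, Col2 and Col3 follow the answer above; rebuilding the data with data.frame() is an assumption, since the thread does not show how it was loaded.

```r
# Rebuild the question's data (assumed construction; values copied from the question)
df1 <- data.frame(
  Col1 = paste0("200610-", c(1, 10, 100, 102, 105, 107:120)),
  Col2 = c("rs28619217", NA, "rs367572771", "rs144402189", "rs375896687",
           NA, NA, NA, "rs199838004", "rs374875201", NA, "rs377546596",
           NA, NA, NA, "rs67858721", "rs67858721", "rs9876735", "rs9876735"),
  stringsAsFactors = FALSE
)

# Use Col1 where Col2 is NA or repeats an earlier Col2 value; otherwise keep Col2
df1$Col3 <- with(df1, ifelse(is.na(Col2) | duplicated(Col2), Col1, Col2))

# Rows 16 to 19 are the ones with repeated rs IDs
print(df1[16:19, ])
```

duplicated() flags only the second and later occurrences, so the first row carrying each rs ID keeps it, which matches the desired output in the question.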
A mutate with an ifelse statement does the job:
library(readr)
library(dplyr)
df <- read_table("200610-1 rs28619217
200610-10 NA
200610-100 rs367572771
200610-102 rs144402189
200610-105 rs375896687
200610-107 NA
200610-108 NA
200610-109 NA
200610-110 rs199838004
200610-111 rs374875201
200610-112 NA
200610-113 rs377546596
200610-114 NA
200610-115 NA
200610-116 NA", col_names = c("col1", "col2"), col_types = "cc")
df %>%
mutate(fill = ifelse(is.na(col2), col1, col2))
# A tibble: 15 × 3
col1 col2 fill
<chr> <chr> <chr>
1 200610-1 rs28619217 rs28619217
2 200610-10 <NA> 200610-10
3 200610-100 rs367572771 rs367572771
4 200610-102 rs144402189 rs144402189
5 200610-105 rs375896687 rs375896687
6 200610-107 <NA> 200610-107
7 200610-108 <NA> 200610-108
8 200610-109 <NA> 200610-109
9 200610-110 rs199838004 rs199838004
10 200610-111 rs374875201 rs374875201
11 200610-112 <NA> 200610-112
12 200610-113 rs377546596 rs377546596
13 200610-114 <NA> 200610-114
14 200610-115 <NA> 200610-115
15 200610-116 <NA> 200610-116
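The output above covers only the NA case. A sketch of how the same mutate() call might be extended to also fall back on col1 for repeated col2 values is below. It assumes dplyr is installed; the three-row tibble and the name df_filled are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical shortened sample: the last two rows share the same col2 value
df <- tibble(
  col1 = c("200610-116", "200610-117", "200610-118"),
  col2 = c(NA, "rs67858721", "rs67858721")
)

# Fall back to col1 when col2 is missing or has already appeared in an earlier row
df_filled <- df %>%
  mutate(fill = if_else(is.na(col2) | duplicated(col2), col1, col2))

print(df_filled)
```

if_else() is used here instead of ifelse() because it checks that both branches have the same type; either function should work for two character columns.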
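# Drop rows whose second column is NA (the trailing comma selects rows)
df <- df[!is.na(df[[2]]), ]
# Paste both columns into a combined key
df[[3]] <- paste0(df[[1]], df[[2]])
# Keep the first occurrence of each key, then keep only that column
df <- df[!duplicated(df[[3]]), ]
df <- df[[3]]
Did that work?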
df <- df %>%
  mutate(fill = ifelse(is.na(col2), col1, col2)) %>%
  # assuming the intent was to keep one row per col1 value
  distinct(col1, .keep_all = TRUE)
Did that work?