Create a new column by combining information from two columns

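I have a data frame with two columns, and I want to create a new column based on the second one: take the value from the second column when it is not NA, and fall back to the first column when the value is NA or is a duplicate of an earlier value.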

df:
200610-1    rs28619217
200610-10   NA
200610-100  rs367572771
200610-102  rs144402189
200610-105  rs375896687
200610-107  NA
200610-108  NA
200610-109  NA
200610-110  rs199838004
200610-111  rs374875201
200610-112  NA
200610-113  rs377546596
200610-114  NA
200610-115  NA
200610-116  NA
200610-117  rs67858721
200610-118  rs67858721
200610-119  rs9876735
200610-120  rs9876735

desired output:
200610-1    rs28619217  rs28619217
200610-10   NA          200610-10
200610-100  rs367572771 rs367572771
200610-102  rs144402189 rs144402189
200610-105  rs375896687 rs375896687
200610-107  NA          200610-107
200610-108  NA          200610-108
200610-109  NA          200610-109
200610-110  rs199838004 rs199838004
200610-111  rs374875201 rs374875201
200610-112  NA          200610-112
200610-113  rs377546596 rs377546596
200610-114  NA          200610-114
200610-115  NA          200610-115
200610-116  NA          200610-116
200610-117  rs67858721  rs67858721
200610-118  rs67858721  200610-118
200610-119  rs9876735   rs9876735
200610-120  rs9876735   200610-120

How should I do this? I was thinking of using one of the apply functions.

Consider the following variation:

df <- data.frame(colA=c(1,2,3,4),
                 colB=c("a",NA,"b","c"),
                 stringsAsFactors = FALSE)

df$colC <- df[,2]
df[is.na(df$colC) | duplicated(df$colB),"colC"]<- df[is.na(df$colC)| duplicated(df$colB),"colA"]
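
The same replacement can be written with the condition computed once. This is a minimal sketch rather than the answerer's code: the mask name use_first and the extra fifth row (a repeated "b") are assumptions added here to show the duplicate case. Because colA is numeric and colC is character, the copied values are coerced to strings.

```r
# Toy data from the answer, plus one assumed duplicate row ("b" appears twice)
df <- data.frame(colA = c(1, 2, 3, 4, 5),
                 colB = c("a", NA, "b", "c", "b"),
                 stringsAsFactors = FALSE)

# TRUE where colB is NA or repeats an earlier value (use_first is a hypothetical name)
use_first <- is.na(df$colB) | duplicated(df$colB)

# Start from colB, then overwrite the flagged rows with colA
df$colC <- df$colB
df$colC[use_first] <- df$colA[use_first]

print(df$colC)
```

With the question's data both columns are character, so no coercion takes place there.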

We can use ifelse:

df1$Col3 <- with(df1, ifelse(is.na(Col2), Col1, Col2))
df1$Col3
#[1] "rs28619217"  "200610-10"   "rs367572771" "rs144402189" "rs375896687"
#[6] "200610-107"  "200610-108"  "200610-109"  "rs199838004" "rs374875201"
#[11] "200610-112"  "rs377546596" "200610-114"  "200610-115"  "200610-116" 

Update

If there are duplicates, as @Sotos mentioned in the comments, we can build a logical vector with duplicated inside the ifelse:

with(df1, ifelse(is.na(Col2)|duplicated(Col2), Col1, Col2))
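
To check this against the duplicated rows from the question, here is a small sketch that uses only its last five rows. The column names Col1 and Col2 follow the answer above; building df1 by hand like this is an assumption, since the answer does not show how df1 was created.

```r
# Last five rows of the question's data, typed in by hand for this sketch
df1 <- data.frame(
  Col1 = c("200610-116", "200610-117", "200610-118", "200610-119", "200610-120"),
  Col2 = c(NA, "rs67858721", "rs67858721", "rs9876735", "rs9876735"),
  stringsAsFactors = FALSE  # keep strings as character so ifelse returns text
)

# Use Col1 when Col2 is NA or repeats an earlier value, otherwise keep Col2
df1$Col3 <- with(df1, ifelse(is.na(Col2) | duplicated(Col2), Col1, Col2))

print(df1)
```

duplicated() also flags the second and later NA values as repeats, which is harmless here because is.na() already sends those rows to Col1.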

A mutate with an ifelse statement does the job:

library(readr)
library(dplyr)

df <- read_table("200610-1    rs28619217
200610-10   NA
200610-100  rs367572771
200610-102  rs144402189
200610-105  rs375896687
200610-107  NA
200610-108  NA
200610-109  NA
200610-110  rs199838004
200610-111  rs374875201
200610-112  NA
200610-113  rs377546596
200610-114  NA
200610-115  NA
200610-116  NA", col_names = c("col1", "col2"), col_types = "cc")

df %>% 
  mutate(fill = ifelse(is.na(col2), col1, col2))

# A tibble: 15 × 3
         col1        col2        fill
        <chr>       <chr>       <chr>
1    200610-1  rs28619217  rs28619217
2   200610-10        <NA>   200610-10
3  200610-100 rs367572771 rs367572771
4  200610-102 rs144402189 rs144402189
5  200610-105 rs375896687 rs375896687
6  200610-107        <NA>  200610-107
7  200610-108        <NA>  200610-108
8  200610-109        <NA>  200610-109
9  200610-110 rs199838004 rs199838004
10 200610-111 rs374875201 rs374875201
11 200610-112        <NA>  200610-112
12 200610-113 rs377546596 rs377546596
13 200610-114        <NA>  200610-114
14 200610-115        <NA>  200610-115
15 200610-116        <NA>  200610-116
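
The mutate call above only covers the NA case. A sketch that also falls back to col1 for repeated values, assuming the same col1 and col2 names; the data frame df_dup is hypothetical, and dplyr's type-strict if_else is swapped in for ifelse purely for illustration:

```r
library(dplyr)

# Last five rows of the question's data, typed in by hand for this sketch
df_dup <- data.frame(
  col1 = c("200610-116", "200610-117", "200610-118", "200610-119", "200610-120"),
  col2 = c(NA, "rs67858721", "rs67858721", "rs9876735", "rs9876735"),
  stringsAsFactors = FALSE
)

# Fall back to col1 when col2 is NA or repeats an earlier value
res <- df_dup %>%
  mutate(fill = if_else(is.na(col2) | duplicated(col2), col1, col2))

print(res)
```

if_else requires both branches to have the same type, which holds here because col1 and col2 are both character.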
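
# Drop rows whose second column is NA, then build a combined key
df <- df[!is.na(df[[2]]), ]
df[[3]] <- paste0(df[[1]], df[[2]])
# Keep the first occurrence of each combined key, then keep only that column
df <- df[!duplicated(df[[3]]), ]
df <- df[, 3]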

Did that work?

df <- df %>%
  mutate(fill = ifelse(is.na(col2), col1, col2)) %>%
  distinct(col1, .keep_all = TRUE)  # keep one row per col1 value

Did that work?