How do I create a new column based on multiple conditions from multiple columns?
I am trying to add a new column to a data frame based on several conditions from other columns. I have the following data:
> commute <- c("walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry")
> kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
> distance <- c(1, 12, 5, 25, 7, 2, "", 8, 19, 7, "", 4, 16, 12, 7)
>
> df = data.frame(commute, kids, distance)
> df
   commute kids distance
1     walk  Yes        1
2     bike  Yes       12
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7
6     walk  Yes        2
7     bike   No
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7
11    walk   No
12    bike   No        4
13  subway  Yes       16
14   drive   No       12
15   ferry  Yes        7
If the following three conditions are met:
commute = walk OR bike OR subway OR ferry
AND
kids = Yes
AND
distance is less than 10
then I want a new column called get.flyer to equal "Yes". The final data frame should look like this:
   commute kids distance get.flyer
1     walk  Yes        1       Yes
2     bike  Yes       12
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7       Yes
6     walk  Yes        2       Yes
7     bike   No
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7       Yes
11    walk   No
12    bike   No        4
13  subway  Yes       16
14   drive   No       12
15   ferry  Yes        7       Yes
We can use %in% to compare a column against several values at once, and & to check that all of the conditions are TRUE.
library(dplyr)
# (condition) + 1 is 1 for FALSE and 2 for TRUE, so it indexes c("", "Yes")
df %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                     as.character(kids) == "Yes" &
                                     as.numeric(as.character(distance)) < 10) + 1])
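To see why the `+ 1` indexing works, here is a minimal sketch on three made-up rows (the toy vectors below are illustrative assumptions, not the asker's data). It builds the logical condition first and then uses it to pick from `c("", "Yes")`.

```r
# Toy vectors (hypothetical values, used only for illustration)
commute  <- c("walk", "drive", "bike")
kids     <- c("Yes", "Yes", "No")
distance <- c(3, 4, 2)

# %in% tests membership; & combines the three tests element by element
cond <- commute %in% c("walk", "bike", "subway", "ferry") &
  kids == "Yes" &
  distance < 10

# Adding 1 coerces the logical vector to numbers: FALSE becomes 1, TRUE becomes 2
idx <- cond + 1

# Position 1 of the lookup vector is "", position 2 is "Yes"
flyer <- c("", "Yes")[idx]
print(data.frame(commute, kids, distance, cond, flyer))
```

Only the first row passes all three tests, so only that row gets "Yes".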
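It is better to create the data.frame with stringsAsFactors=FALSE, because the default was TRUE before R 4.0.0 (it has been FALSE since then). If we check str(df) on an older version, we find that all the columns are of class factor. Also, if there are missing values, use NA instead of "" so that the class of the numeric column is not converted to something else.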
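A small sketch of that class change, using two short made-up vectors (`with_blank` and `with_na` are illustrative names, not part of the original post):

```r
# "" is a string, so c() coerces every element to character
with_blank <- c(1, 12, "", 8)
# NA is a missing value of any type, so the vector stays numeric
with_na <- c(1, 12, NA, 8)

print(class(with_blank))
print(class(with_na))

# Numeric comparison works on the NA version and propagates NA
print(with_na < 10)

# Converting the character version back turns "" into NA
print(as.numeric(with_blank))
```

The same coercion happens inside data.frame(), which is why the distance column of the original df is not numeric.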
If we rewrite the creation of 'df':
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)
the code above can be simplified:
df1 %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                     kids == "Yes" &
                                     distance < 10) + 1])
For readability, some people prefer ifelse:
df1 %>%
  mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                              kids == "Yes" &
                              distance < 10,
                            "Yes", ""))
This can also be done easily with base R:
df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                                    kids == "Yes" &
                                    distance < 10,
                                  "Yes", ""))
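@akrun already pointed out the solution. I would like to present it in a more 'wrapped up' way.
You can use the ifelse statement to create a column based on one (or more) conditions. But first you have to change the 'encoding' of the missing values in the distance column. You used "" to indicate a missing value, but this converts the whole column to strings and prevents numeric comparison: distance < 10 no longer compares numbers (on a character column it compares strings, and on a factor it returns NA with a warning). R's way of representing a missing value is NA, so your distance column should be defined as follows (and df recreated from it):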
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
The ifelse statement then looks like this:
df$get.flyer <- ifelse(
  (
    (df$commute %in% c("walk", "bike", "subway", "ferry")) &
    (df$kids == "Yes") &
    (df$distance < 10)
  ),
  1, # if the condition is met, put 1
  0  # else put 0
)
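One thing to keep in mind once distance contains NA: when the other two tests are TRUE and distance is missing, the combined condition is NA, and ifelse returns NA for that row. In the asker's data both missing distances belong to rows with kids == "No", so this does not show up there. The sketch below uses a made-up three-row data frame (`toy` is a hypothetical name) to show the behaviour, plus one possible way to treat an unknown distance as "does not qualify".

```r
# Hypothetical toy data: row 2 matches on commute and kids but has no distance
toy <- data.frame(
  commute  = c("walk", "bike", "drive"),
  kids     = c("Yes", "Yes", "Yes"),
  distance = c(2, NA, 5),
  stringsAsFactors = FALSE
)

cond <- with(toy, commute %in% c("walk", "bike", "subway", "ferry") &
                    kids == "Yes" &
                    distance < 10)

# TRUE & TRUE & NA is NA, so ifelse yields NA for row 2
raw <- ifelse(cond, "Yes", "")
print(raw)

# One option (an assumption about the desired behaviour): count NA as FALSE
toy$get.flyer <- ifelse(!is.na(cond) & cond, "Yes", "")
print(toy)
```

Whether a missing distance should mean "no flyer" or stay NA depends on the analysis; the second form is only one choice.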
Optional: consider encoding the other columns differently as well:
- You could use TRUE and FALSE instead of "Yes" and "No" for the kids variable
- You could use a factor for commute
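As a rough sketch of those two suggestions, the snippet below re-encodes the first five rows of the question's data (`df2` is a hypothetical name chosen here). With kids stored as a logical, the test `kids == "Yes"` reduces to just `kids`, and the result can itself be kept as TRUE/FALSE.

```r
# First five rows of the question's data
commute  <- c("walk", "bike", "subway", "drive", "ferry")
kids     <- c("Yes", "Yes", "No", "No", "Yes")
distance <- c(1, 12, 5, 25, 7)

df2 <- data.frame(
  commute  = factor(commute),   # categorical variable stored as a factor
  kids     = kids == "Yes",     # logical instead of "Yes"/"No"
  distance = distance
)

# %in% also works on a factor because it matches on the labels
df2$get.flyer <- with(df2, commute %in% c("walk", "bike", "subway", "ferry") &
                             kids &
                             distance < 10)
print(df2)
```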
Example: check whether the value in first_column_name is contained in second_column_name and write the result to new_column:
df$new_column <- apply(df, 1, function(x) grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))
Details:
df$new_column <-        # create a new column named new_column on df
  apply(df, 1,          # apply the function below to df; `1` means once per row
    function(x)         # the function applied to each row; `x` is the current row
      grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))
      # grepl tests whether the value of first_column_name occurs in second_column_name;
      # fixed = TRUE matches it as plain text instead of as a regular expression
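For a runnable version, here is a sketch on a made-up data frame (`toy` and its values are illustrative assumptions). It also shows an mapply variant, offered as one possible alternative: apply converts each row to a character vector first, whereas mapply walks the two columns pairwise without that conversion.

```r
# Hypothetical toy data; "a.b" shows the effect of fixed = TRUE
toy <- data.frame(
  first_column_name  = c("bike", "a.b", "ferry"),
  second_column_name = c("bike lane", "aXb", "subway"),
  stringsAsFactors = FALSE
)

# Row-wise version from the answer above
toy$new_column <- apply(toy[, c("first_column_name", "second_column_name")], 1,
  function(x) grepl(x["first_column_name"], x["second_column_name"], fixed = TRUE))

# Pairwise alternative: call grepl(pattern, x, fixed = TRUE) for each pair of values
toy$new_column2 <- mapply(grepl, toy$first_column_name, toy$second_column_name,
                          MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
print(toy)
```

With fixed = TRUE the pattern "a.b" does not match "aXb", because the dot is treated as a literal character rather than as a regex wildcard.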