How to create a new column with substrings based on a column with long strings using ifelse and grepl?
First, look at the rows of the ac$Summary column:
1 during a demonstration flight, a u.s. army flyer flown by orville wright nose-dived into the ground from a height of approximately 75 feet, killing lt. thomas e. selfridge who was a passenger. this was the first recorded airplane fatality in history. one of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft. orville wright suffered broken ribs, pelvis and a leg. selfridge suffered a crushed skull and died a short time later.
2 first u.s. dirigible akron exploded just offshore at an altitude of 1,000 ft. during a test flight.
3 the first fatal airplane accident in canada occurred when american barnstormer, john m. bryant, california aviator was killed.
4 the airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of helgoland island into the sea. the ship broke in two and the control car immediately sank drowning its occupants.
5 hydrogen gas which was being vented was sucked into the forward engine and ignited causing the airship to explode and burn at 3,000 ft..
6 crashed into trees while attempting to land after being shot down by british and french aircraft.
7 exploded and burned near neuwerk island, when hydrogen gas, being vented, was ignited by lightning.
8 crashed near the black sea, cause unknown.
9 shot down by british aircraft crashing in flames.
10 shot down in flames by the british 39th home defence squadron.
11 crashed in a storm.
12 shot down by british anti-aircraft fire and aircraft and crashed into the north sea.
13 caught fire and crashed.
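I want to create the ac$sumnew column based on ac$Summary.
I wrote the code below, but it does not return the desired output.
I used & and | in the patterns. When | is used, the results are inconsistent: sometimes right, sometimes wrong.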
ac$sumnew = ifelse(grepl("missing & crashed",ac$Summary),"missing and crashed",
ifelse(grepl("shot | crashed",ac$Summary),"shot down and crashed",
ifelse(grepl("struck | lightening",ac$Summary),"struck by lightening and crashed",
ifelse(grepl("struck | bird & crashed",ac$Summary),"struck by bird and crashed",
ifelse(grepl("exploded | crashed",ac$Summary),"exploded and crashed",
ifelse(grepl("engine | failure",ac$Summary),"engine failure",
ifelse(grepl("fog | crashed",ac$Summary),"crashed due to heavy fog",
ifelse(grepl("fire | crashed",ac$Summary),"caught fire and crashed",
ifelse(grepl("shot",ac$Summary),"shot down",
ifelse(grepl("crashed",ac$Summary),"Crashed",
ifelse(grepl("shot",ac$Summary),"Shot down",
ifelse(grepl("disappeared",ac$Summary),"Disappeared",
ifelse(grepl("struck | obstacle | crashed ",ac$Summary),"struck by obstacle and Crashed",
ifelse(grepl("crashed",ac$Summary),"crashed",
ifelse(grepl("exploded",ac$Summary),"exploded",
ifelse(grepl("fire",ac$Summary),"caught fire","others"))))))))))))))))
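For example, if the aircraft was shot down, it should return "shot down".
If it only crashed, the output should return "crashed".
If it both went missing and crashed, it should return "missing and crashed".
I cannot get this part right using & and |.
The output I get is shown below: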
1 others
2 exploded and crashed
3 others
4 others
5 engine failure
6 shot down and crashed
7 exploded and crashed
8 Crashed
9 shot down and crashed
10 shot down and crashed
11 Crashed
12 missing and crashed
13 missing and crashed
14 missing and crashed
15 Crashed
16 shot down and crashed
17 shot down and crashed
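I think you have a hierarchy problem. R tests these conditions in order, so you have to arrange them appropriately, with the most specific tests first. Here is a link that may help: https://www.programiz.com/r-programming/if-else-statement.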
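To make the two issues concrete, here is a minimal sketch on four made-up summaries (shortened from the question's data, so the vector x and the variable names are illustrative assumptions). It shows that inside a grepl() pattern "|" means OR with the surrounding spaces taken literally, that "&" is just an ordinary character, and that placing the most specific test first in the ifelse chain gives the intended label.

```r
# Toy summaries, shortened from the question's data (illustrative only)
x <- c("crashed in a storm.",
       "shot down in flames.",
       "caught fire and crashed.",
       "crashed into trees after being shot down.")

# "|" is regex alternation and the spaces are literal, so this pattern
# means: contains "shot " OR contains " crashed"
either <- grepl("shot | crashed", x)

# "&" has no special meaning in a regex; this only matches the literal
# text "missing & crashed"
literal_amp <- grepl("missing & crashed", x)

# A real AND needs one grepl() per word, combined with the logical operator &
both <- grepl("shot", x) & grepl("crash", x)

# Most specific test first, single-word tests afterwards
label <- ifelse(both, "shot down and crashed",
  ifelse(grepl("shot", x), "shot down",
  ifelse(grepl("crash", x), "crashed", "others")))

print(either)
print(literal_amp)
print(both)
print(label)
```

Note that the first summary does not match " crashed" because the word sits at the start of the string with no space before it, which may be why some rows in the question fall through to the plain "Crashed" branch.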
# x holds the text to classify; ac$Summary is the column from the question
x <- as.character(ac$Summary)
# Each test is TRUE for a row only when every word stem listed in c() is found in it
ac$new <- ifelse(apply(sapply(c("struck", "bird", "crash"), grepl, x), 1, all), "struck by bird and crashed",
  ifelse(apply(sapply(c("struck", "obstacle", "crash"), grepl, x), 1, all), "struck by obstacle and Crashed",
  ifelse(apply(sapply(c("miss", "crash"), grepl, x), 1, all), "missing and crashed",
  ifelse(apply(sapply(c("shot", "crash"), grepl, x), 1, all), "shot down and crashed",
  ifelse(apply(sapply(c("struck", "lightening"), grepl, x), 1, all), "struck by lightening and crashed",
  ifelse(apply(sapply(c("explode", "crash"), grepl, x), 1, all), "exploded and crashed",
  ifelse(apply(sapply(c("engine|failure"), grepl, x), 1, all), "engine failure",
  ifelse(apply(sapply(c("fog", "crash"), grepl, x), 1, all), "crashed due to heavy fog",
  ifelse(apply(sapply(c("fire", "crash"), grepl, x), 1, all), "caught fire and crashed",
  ifelse(apply(sapply("shot", grepl, x), 1, all), "shot down",
  ifelse(apply(sapply("crash", grepl, x), 1, all), "crashed",
  ifelse(apply(sapply("explode", grepl, x), 1, all), "exploded",
  ifelse(apply(sapply("fire", grepl, x), 1, all), "caught fire",
  ifelse(apply(sapply("disappear", grepl, x), 1, all), "Disappeared", "others"))))))))))))))
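Now it works by checking for all of the words in c() and then assigning the matching value to ac$new, with the exception of engine|failure, which matches if either word is present. Also, because we are dealing with words, you want to use the simplest word stem to catch all variants: for example, use "miss" instead of "missing".
I got: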
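As a small illustration of that mechanism, the sketch below runs the "shot" and "crash" stems against three summaries taken from the question (the variable names hits and all_found are just illustrative). sapply() builds a logical matrix with one column per stem, and apply(..., 1, all) keeps only the rows where every stem was found.

```r
x <- c("shot down by british aircraft crashing in flames.",
       "shot down in flames by the british 39th home defence squadron.",
       "crashed in a storm.")

# One column per stem, one row per summary
hits <- sapply(c("shot", "crash"), grepl, x)
print(hits)

# A row passes only if every stem was found in it
all_found <- apply(hits, 1, all)
print(all_found)

# The stem "crash" covers both "crashed" and "crashing";
# the full word "crashed" would miss the first summary
crashed_only <- grepl("crashed", x)
print(crashed_only)
```

Only the first summary contains both stems, and it does so through "crashing", which is the reason for searching on "crash" rather than "crashed".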
1 others
2 exploded
3 others
4 crashed
5 engine failure
6 shot down and crashed
7 exploded
8 crashed
9 shot down and crashed
10 shot down
11 crashed
12 shot down and crashed
13 caught fire and crashed
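Some of the values above do not match yours because I check for all of the words. The reason I check for all of the words is that you already identify the single words in the second half of the ifelse chain. I did an eyeball test, and I think my result is correct given that all words are checked.
By the way, this is tedious, especially when you want to extend the list. You may want to use something like: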
library(dplyr)

# Working columns: s is the text (taken from ac$Summary), word.que is the row id
ac$s <- as.character(ac$Summary)
ac$word.que <- seq_len(nrow(ac))
# Split each summary into words and record how many words each row has
words <- strsplit(ac$s, split = " ", fixed = TRUE)
ac$word.count <- lengths(words)
# Long format: one row per word, tagged with the row id it came from
new.mat <- data.frame(word.que = rep.int(ac$word.que, ac$word.count), word = unlist(words), stringsAsFactors = FALSE)
words.of.interest <- "struck|bird|crash|obstacle|miss|shot|lightening|explode|engine|failure|fog|fire|disappear"
new.mats <- new.mat %>%
  mutate(word = gsub("[.,]", "", word)) %>%  # strip periods and commas
  mutate(word.interest = ifelse(grepl(words.of.interest, word), 1, 0)) %>%
  filter(word.interest == 1) %>%
  group_by(word.que) %>%
  summarise(word.list = paste0(unique(word), collapse = "; ")) %>%
  full_join(ac, by = "word.que") %>%
  arrange(word.que) %>%
  mutate(word.list = ifelse(is.na(word.list), "other", word.list))
This creates a more efficient search list for you to build on. The result is:
word.que word.list
1 1 other
2 2 exploded
3 3 other
4 4 crashing
5 5 engine; explode
6 6 crashed; shot
7 7 exploded
8 8 crashed
9 9 shot; crashing
10 10 shot
11 11 crashed
12 12 shot; fire; crashed
13 13 fire; crashed
along with your text variable and word.count. This may also be more efficient in the long run.
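If extending the nested ifelse chain becomes unwieldy, one possible alternative (not part of the answer above, just a sketch) is to keep the rules in a named list and loop over it. The rules list and the classify() helper below are hypothetical names introduced for illustration; each label maps to the stems that must all be present, and more specific rules are listed first because the first matching rule wins.

```r
# Hypothetical rule table: label -> stems that must all occur in the text
rules <- list(
  "shot down and crashed"   = c("shot", "crash"),
  "caught fire and crashed" = c("fire", "crash"),
  "shot down"               = "shot",
  "crashed"                 = "crash",
  "exploded"                = "explode"
)

# Hypothetical helper: returns the label of the first rule whose stems all match
classify <- function(x, rules, default = "others") {
  out <- rep(default, length(x))
  done <- rep(FALSE, length(x))
  for (label in names(rules)) {
    # TRUE where every stem of this rule occurs in the text
    hit <- Reduce(`&`, lapply(rules[[label]], grepl, x = x, fixed = TRUE))
    out[hit & !done] <- label
    done <- done | hit
  }
  out
}

x <- c("crashed in a storm.",
       "shot down by british aircraft crashing in flames.",
       "caught fire and crashed.",
       "the first fatal airplane accident in canada occurred.")
result <- classify(x, rules)
print(result)
```

Adding a category then means adding one entry to rules at the right position, rather than inserting another ifelse level and rebalancing the parentheses.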