Replace common expressions in a data frame

I have a data frame made up of Wikipedia text. Here is an example:

dput(text3)
structure(list(texts = c("Apollo 13 was the seventh crewed mission in the Apollo space program and the third meant to land on the Moon. The craft was launched from Kennedy Space Center on April 11, 1970, but the lunar landing was aborted after an oxygen tank in the service module (SM) failed two days into the mission. The crew instead looped around the Moon, and returned safely to Earth on April 17. The mission was commanded by Lovell with Swigert as command module (CM) pilot and Haise as lunar module (LM) pilot. Swigert was a late replacement for Mattingly, who was grounded after exposure to rubella.", 
"A routine stir of an oxygen tank ignited damaged wire insulation inside it, causing an explosion that vented the contents of both of the SM's oxygen tanks to space. Without oxygen, needed for breathing and for generating electric power, the SM's propulsion and life support systems could not operate. The CM's systems had to be shut down to conserve its remaining resources for reentry, forcing the crew to transfer to the LM as a lifeboat. With the lunar landing canceled, mission controllers worked to bring the crew home alive. ", 
"Although the LM was designed to support two men on the lunar surface for two days, Mission Control in Houston improvised new procedures so it could support three men for four days. The crew experienced great hardship caused by limited power, a chilly and wet cabin and a shortage of potable water. There was a critical need to adapt the CM's cartridges for the carbon dioxide removal system to work in the LM; the crew and mission controllers were successful in improvising a solution. The astronauts' peril briefly renewed interest in the Apollo program; tens of millions watched the splashdown in the South Pacific Ocean on television."
), paragraph = c("p1", "p2", "p3"), source = c("wiki", "wiki", 
"wiki"), astronauts = c("Lovell", "Swigert", "Haise")), row.names = c(NA, 
-3L), class = "data.frame")

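For my research I need to study the people in the articles according to their social roles; I am not interested in their real names. So I need to replace each name with a unique social identifier.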

Lovell = @Astronaut1

Swigert = @Astronaut2

Haise = @Astronaut3

Mattingly = @Astronaut4

a01 <- c('Lovell', 'Swigert', 'Haise', 'Mattingly')
a02 <- c('@Astronaut1', '@Astronaut2', '@Astronaut3', '@Astronaut4')

Since I have to replace the strings in two columns and keep the data frame format, I tried the following, which failed:

library(stringi)
text3$texts <-  stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02)
Error in `$<-.data.frame`(`*tmp*`, texts, value = c("Apollo 13 was the seventh crewed mission in the Apollo space program and the third meant to land on the Moon. The craft was launched from Kennedy Space Center on April 11, 1970, but the lunar landing was aborted after an oxygen tank in the service module (SM) failed two days into the mission. The crew instead looped around the Moon, and returned safely to Earth on April 17. The mission was commanded by @Astronaut1 with Swigert as command module (CM) pilot and Haise as lunar module (LM) pilot. Swigert was a late replacement for Mattingly, who was grounded after exposure to rubella.",  : 
  replacement has 4 rows, data has 3
In addition: Warning message:
In stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02) :
  longer object length is not a multiple of shorter object length

text3$astronauts <-  stri_replace_all_fixed(str = text3$astronauts, pattern = a01, replacement = a02)
Error in `$<-.data.frame`(`*tmp*`, astronauts, value = c("@Astronaut1",  : 
  replacement has 4 rows, data has 3
In addition: Warning message:
In stri_replace_all_fixed(str = text3$astronauts, pattern = a01,  :
  longer object length is not a multiple of shorter object length

Any help would be appreciated.

You can try the following, using stringr::str_replace_all with a named vector (names are the patterns, values are the replacements):

library(stringr)

text3$texts <- str_replace_all(text3$texts, c("Lovell" = "@Astronaut1", 
"Swigert" = "@Astronaut2", "Haise" = "@Astronaut3", "Mattingly" = "@Astronaut4"))

text3


texts
1                                       Apollo 13 was the seventh crewed 
mission in the Apollo space program and the third meant to land on the Moon. 
The craft was launched from Kennedy Space Center on April 11, 1970, but the 
lunar landing was aborted after an oxygen tank in the service module (SM) 
failed two days into the mission. The crew instead looped around the Moon, 
and returned safely to Earth on April 17. The mission was commanded by 
@Astronaut1 with...
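If the names and labels already sit in two parallel vectors such as a01 and a02, the named vector does not have to be typed by hand. The following is a minimal sketch, assuming stringr is installed; the small data frame `df` and the helper names `lookup` and `cols` are made up for illustration and stand in for text3. Note that str_replace_all treats the names as regular expressions, which is harmless here because the names contain no special characters.

```r
library(stringr)

a01 <- c("Lovell", "Swigert", "Haise", "Mattingly")
a02 <- c("@Astronaut1", "@Astronaut2", "@Astronaut3", "@Astronaut4")

# Named vector: names are the patterns, values are the replacements
lookup <- setNames(a02, a01)

# Hypothetical stand-in for text3 (two rows only)
df <- data.frame(
  texts = c("Commanded by Lovell with Swigert.",
            "Haise flew the LM; Mattingly was grounded."),
  astronauts = c("Lovell", "Haise"),
  stringsAsFactors = FALSE
)

# Replace in both columns; assigning into df[cols] keeps the data frame
cols <- c("texts", "astronauts")
df[cols] <- lapply(df[cols], str_replace_all, pattern = lookup)

print(df)
```

Assigning into `df[cols]` rather than overwriting `df` is what preserves the data frame class and the untouched columns.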

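This should work in base R:

for (i in seq_along(a01)) {
  # Apply gsub() column by column; calling gsub() on text3 itself would
  # coerce the data frame to a character vector. text3[] keeps the structure.
  text3[] <- lapply(text3, gsub, pattern = a01[i], replacement = a02[i], fixed = TRUE)
}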

The error you get from:

stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02)

comes from the function's default vectorized behavior. See help("stringi-arguments"):

Almost all functions are vectorized with respect to all their arguments and the recycling rule is applied whenever necessary. Due to this property you may, for instance, search for one pattern in each given string, search for each pattern in one given string, and search for the i-th pattern within the i-th string. This behavior sometimes leads to peculiar results - we assume you know what you are doing.

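So the result will be 4 strings:

[1] the first row of texts, with the first pattern/replacement applied (Astronaut1)

[2] the second row of texts, with the second pattern/replacement applied (Astronaut2)

[3] the third row of texts, with the third pattern/replacement applied (Astronaut3)

[4] the first row of texts again (recycled), with the fourth pattern/replacement applied (Astronaut4)

Because the returned object has length 4, which is more than the 3 strings in text3$texts that you started with, the assignment back into the data frame fails.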
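The recycling is easier to see on a toy input. This is a small sketch, assuming stringi is installed; the vectors `s`, `pat` and `repl` are invented for illustration. The warning is suppressed only to keep the output readable.

```r
library(stringi)

s    <- c("a b", "b c", "c d")     # 3 strings
pat  <- c("a", "b", "c", "d")      # 4 patterns
repl <- c("1", "2", "3", "4")      # 4 replacements

# Default (vectorize_all = TRUE): the i-th pattern is applied to the i-th
# string only, and the shorter vector `s` is recycled to length 4
res <- suppressWarnings(stri_replace_all_fixed(s, pat, repl))

print(length(res))   # 4, one more than length(s)
print(res)           # "1 b" "2 c" "3 d" "a b"
```

The fourth element is the first string again, searched for "d", which it does not contain, so it comes back unchanged.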

To fix this, set vectorize_all = FALSE:

stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02, vectorize_all = FALSE)

This should return 3 strings, with every pattern replaced by its corresponding replacement in each of the 3 strings.
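To cover both columns and keep the data frame, the same call can be applied per column. The following is only a sketch, assuming stringi is installed; `df` is a hypothetical two-row stand-in for text3 and `cols` is an illustrative helper name.

```r
library(stringi)

a01 <- c("Lovell", "Swigert", "Haise", "Mattingly")
a02 <- c("@Astronaut1", "@Astronaut2", "@Astronaut3", "@Astronaut4")

# Hypothetical stand-in for text3
df <- data.frame(
  texts = c("Swigert was a late replacement for Mattingly.",
            "Lovell commanded; Haise was LM pilot."),
  astronauts = c("Swigert", "Lovell"),
  stringsAsFactors = FALSE
)

# With vectorize_all = FALSE every pattern is applied to every string,
# so each column keeps its original length
cols <- c("texts", "astronauts")
df[cols] <- lapply(df[cols], stri_replace_all_fixed,
                   pattern = a01, replacement = a02,
                   vectorize_all = FALSE)

print(df)
```

With vectorize_all = FALSE the patterns are applied one after another, so a replacement that itself contains a later pattern would be replaced again; that does not happen with these labels.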