Conditionally split string cells in an R data frame
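I have a dataset of species names in which some of the originally used names are now obsolete, so they are labelled "old_species***retired***use new_species", while the correct cells are labelled just "new_species". Here is a sample of the data: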
df<- data.frame(species=c("Etheostoma spectabile","Ictalurus furcatus","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus","Ictalurus punctatus","Ictalurus punctatus","Micropterus salmoides","Etheostoma olmstedi","Noturus insignis","Lepomis auritus","Lepomis auritus","Nocomis leptocephalus","Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes","Lepomis cyanellus","Notropis chlorocephalus","Scartomyzon cervinus***retired***use Moxostoma cervinum","Ictalurus punctatus","Lythrurus ardens","Moxostoma pappillosum","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus"))
I tried:
sapply(strsplit(df$species, split='***retired***use', fixed = TRUE), function(x) x[2])
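but the cells that are already correct return NA, because they do not contain the split string.
Is there a way to split only the cells that actually contain it?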
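To make the NA behaviour concrete, here is a minimal sketch on a two-element vector. The vector x is made up for illustration and is not part of the original data:

```r
# Hypothetical two-element vector: one clean name, one retired name
x <- c("Lepomis auritus",
       "Scartomyzon cervinus***retired***use Moxostoma cervinum")

parts <- strsplit(x, split = "***retired***use", fixed = TRUE)

# The clean name yields a single piece, the retired name yields two
n_pieces <- lengths(parts)

# Indexing the second piece of a one-piece result gives NA
res <- sapply(parts, function(p) p[2])
```

Because the split string above has no trailing space, the second piece of the retired entry also keeps a leading space.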
We can use grep to create an index, then split only those rows:
i1 <- grep('retired', df$species)
df$species <- as.character(df$species)
df$species[i1] <- sapply(strsplit(df$species[i1], "***retired***use ",
fixed = TRUE), `[`, 2)
df$species
#[1] "Etheostoma spectabile" "Ictalurus furcatus" "Micropterus salmoides" "Micropterus salmoides" "Ictalurus punctatus"
#[6] "Ictalurus punctatus" "Ictalurus punctatus" "Micropterus salmoides" "Etheostoma olmstedi" "Noturus insignis"
#[11] "Lepomis auritus" "Lepomis auritus" "Nocomis leptocephalus" "Moxostoma rupiscartes" "Lepomis cyanellus"
#[16] "Notropis chlorocephalus" "Moxostoma cervinum" "Ictalurus punctatus" "Lythrurus ardens" "Moxostoma pappillosum"
#[21] "Micropterus salmoides" "Micropterus salmoides" "Ictalurus punctatus"
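The same index-then-split idea can be wrapped in a small helper so it can be reused on other columns. This is only a sketch: the function name fix_retired and its marker argument are hypothetical, and vapply is swapped in for sapply so that a vector with no retired entries still comes back as a character vector.

```r
# Hypothetical helper: turn "old***retired***use new" into "new",
# leaving every other value untouched
fix_retired <- function(x, marker = "***retired***use ") {
  x <- as.character(x)                    # also handles factor input
  hit <- grepl(marker, x, fixed = TRUE)   # which elements contain the marker
  # keep the second piece of each split; vapply guarantees a character result
  x[hit] <- vapply(strsplit(x[hit], marker, fixed = TRUE), `[`, character(1), 2)
  x
}

x <- c("Lepomis auritus",
       "Scartomyzon cervinus***retired***use Moxostoma cervinum")
out <- fix_retired(x)
```

With the data above this could be applied as df$species <- fix_retired(df$species).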
Or use sub with a regular expression:
df$species <- sub(".*\\*{3}retired\\*{3}use\\s+", "", df$species)
You can use gsub with a backreference to replace the old name with the new one:
gsub(".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)", "\\1", df$species)
# [1] "Etheostoma spectabile" "Ictalurus furcatus" "Micropterus salmoides" "Micropterus salmoides"
# [5] "Ictalurus punctatus" "Ictalurus punctatus" "Ictalurus punctatus" "Micropterus salmoides"
# [9] "Etheostoma olmstedi" "Noturus insignis" "Lepomis auritus" "Lepomis auritus"
# [13] "Nocomis leptocephalus" "Moxostoma rupiscartes" "Lepomis cyanellus" "Notropis chlorocephalus"
# [17] "Moxostoma cervinum" "Ictalurus punctatus" "Lythrurus ardens" "Moxostoma pappillosum"
# [21] "Micropterus salmoides" "Micropterus salmoides" "Ictalurus punctatus"
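Explanation:
.* matches any character any number of times, followed by ...
\\*\\*\\*retired\\*\\*\\*use\\s ... the literal text ***retired***use and one whitespace character (backslashes are doubled inside an R string), followed by ...
(.*) ... any character any number of times. This is the capture group that the backreference \\1 in the replacement argument of gsub refers to.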
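The pieces of the pattern can be checked on a couple of strings. A small sketch, again with a made-up vector x; sub is used instead of gsub here on the assumption that each string contains the marker at most once:

```r
# Hypothetical input: one clean name, one retired name
x <- c("Lepomis auritus",
       "Scartomyzon cervinus***retired***use Moxostoma cervinum")

# Backslashes are doubled because this is an R string literal
pattern <- ".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)"

# Only the retired entry matches the pattern
matched <- grepl(pattern, x)

# "\\1" keeps only what the capture group (.*) matched;
# strings without a match are returned unchanged
renamed <- sub(pattern, "\\1", x)
```

Strings that do not match pass through untouched, which is why no index is needed with this approach.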