Replace Value & Shift Data Frame If Certain Condition Met

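I scraped data from an online source to create a data frame (df1) with n rows of information about individuals. It arrives as a single string per row, and I split the words into the appropriate columns.

90% of the information is correctly formatted into the proper number of columns (6) in the data frame. However, once in a while there is a row of data with an extra 4th word, counted from the start of the string. Those rows now have 7 columns and are offset from everything else in the data frame.

Here is an example: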

Num Last-Name First-Name Cat. DOB Location

11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA


df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
    "2 Pearl, Sam R 1986-11-04 UK",
    "5 Livingston, Steph LL 1983-12-12 USA",
    "7 Thornton, Mark LR 1982-03-26 USA",
    "10 Silver, John RED LL 1983-09-14 USA")

You can see that item #10 has an extra input: the color "RED" is inserted in the middle of the string.

I started by writing code to evaluate how many characters were present in the 4th word. If it was 3 or more (every value that belongs in the Cat. column is 1-2 characters), I created a new column at the end of the data frame and assigned the value to it; if there was no such value (i.e. the test evaluates to FALSE), I input NA. I'm sure I could create a massive nested ifelse statement in a mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:

Num Last-Name First-Name Cat. DOB Location Color

11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED

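I want to find instances where the 4th word from the start of the string is 3 characters or longer, assign that word to a new column at the end of the data frame, and shift the remaining values in that row to the left so they align with the other rows.

We can use gsub to remove the extra substring: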

# drop the comma and the optional extra word between FirstName and Cat
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
            "\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE, 
 col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
# strip everything except the extra word; rows without one are left as ""
d1$Color <-  trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
                       "", trimws(df1)))
d1
#  Num   LastName FirstName Cat        DOB Location Color
#1  11    Jackson      Adam   L 1982-06-15      USA      
#2   2      Pearl       Sam   R 1986-11-04       UK      
#3   5 Livingston     Steph  LL 1983-12-12      USA      
#4   7   Thornton      Mark  LR 1982-03-26      USA      
#5  10     Silver      John  LL 1983-09-14      USA   RED
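
Note that the output above leaves Color as an empty string where no color was present, while the question's desired output shows NA. A minimal sketch of that last conversion, using an illustrative stand-in vector (the name `color` is an assumption, standing for d1$Color):

```r
# Illustrative stand-in for d1$Color as produced by the gsub approach.
color <- c("", "", "", "", "RED")

# Replace empty strings with a proper missing value.
color[!nzchar(color)] <- NA_character_

color
```

Applied to the real data frame, the same indexing would be done on d1$Color directly.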

Here is a simpler approach:

input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\n", "\\1 NA\n", input, TRUE)

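The first gsub transposes the color to the end of the string. The second gsub uses the fact that unchanged lines now end with a date and a country code (rather than a country code and a color), and simply appends "NA" to them.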
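
The same two steps can be sketched on the question's character vector, one element per row, rather than on a single newline-separated string. This is a variant under stated assumptions, not the code above: the names `rows`, `has_color`, `moved` and `fixed` are illustrative, and it uses perl = TRUE with anchored patterns.

```r
# Sample rows from the question (the leading space on row 1 is in the source).
df1 <- c(" 11 Jackson, Adam L 1982-06-15 USA",
         "2 Pearl, Sam R 1986-11-04 UK",
         "5 Livingston, Steph LL 1983-12-12 USA",
         "7 Thornton, Mark LR 1982-03-26 USA",
         "10 Silver, John RED LL 1983-09-14 USA")

rows <- trimws(df1)

# Which rows carry an extra 4th word of 3+ word characters?
has_color <- grepl("^[^,]+, \\w+ \\w{3,} ", rows, perl = TRUE)

# Step 1: move that word to the end of the row (other rows are unchanged).
moved <- sub("^([^,]+, \\w+) (\\w{3,}) (.*)$", "\\1 \\3 \\2", rows, perl = TRUE)

# Step 2: rows without a color get a literal "NA" appended instead.
fixed <- ifelse(has_color, moved, paste(moved, "NA"))

fixed
```

After this every row has the same seven fields, so it can be split into columns in one pass.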


Using strsplit instead of a regex replacement:

# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = TRUE)

# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
    # if number of items in row is 6, insert NA, else rearrange
    if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))

# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")

df2
#   Num  Last-Name First-Name Cat.        DOB Location Color
# 1  11    Jackson       Adam    L 1982-06-15      USA  <NA>
# 2   2      Pearl        Sam    R 1986-11-04       UK  <NA>
# 3   5 Livingston      Steph   LL 1983-12-12      USA  <NA>
# 4   7   Thornton       Mark   LR 1982-03-26      USA  <NA>
# 5  10     Silver       John   LL 1983-09-14      USA   RED
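
The strsplit answer above keys on the number of fields per row. A sketch of the same idea using the rule stated in the question instead (a 4th word of 3 or more characters is the color), assuming the Cat. values are always 1-2 characters; `parts`, `rows` and `df3` are illustrative names:

```r
# Sample rows from the question.
df1 <- c(" 11 Jackson, Adam L 1982-06-15 USA",
         "2 Pearl, Sam R 1986-11-04 UK",
         "5 Livingston, Steph LL 1983-12-12 USA",
         "7 Thornton, Mark LR 1982-03-26 USA",
         "10 Silver, John RED LL 1983-09-14 USA")

# Split each trimmed row on runs of commas and spaces.
parts <- strsplit(trimws(df1), "[, ]+")

# If the 4th word has 3+ characters, move it to the end; otherwise pad with NA.
rows <- lapply(parts, function(x) {
  if (nchar(x[4]) >= 3) c(x[-4], x[4]) else c(x, NA)
})

# Bind the equal-length rows into a data frame and name the columns.
df3 <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df3) <- c("Num", "LastName", "FirstName", "Cat", "DOB", "Location", "Color")

df3
```

All columns come back as character, so Num and DOB would still need converting if numeric or date types are wanted.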