Label as the same if the first part of string is the same

Question

Some sample data:

    id_trial        
 001_a.txt          
 001_a_t2.txt       
 949482_b.txt       
 949482_b_t2.txt    
 95_c.txt           
 95_c_t2.txt

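Note: the strings vary in length, but within each pair they are the same length apart from the "_t2".

How can I make it so that, if the part of the string before _t2 is the same, both rows get the same label in a new column? That is, I would like something like this: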

    id_trial         subject
 001_a.txt           person_a
 001_a_t2.txt        person_a
 949482_b.txt        person_b
 949482_b_t2.txt     person_b
 95_c.txt            person_c
 95_c_t2.txt         person_c

Even this would work:

    id_trial         subject
 001_a.txt               a
 001_a_t2.txt            a
 949482_b.txt            b
 949482_b_t2.txt         b
 95_c.txt                c
 95_c_t2.txt             c

Any help would be appreciated.

Answer 1

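You could try sub() to extract the prefix. The inner call collapses repeated characters (for example "personn" becomes "person"), and the outer call keeps everything up to one character after the first underscore: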

# Backreferences are written as '\\1' inside an R string literal
df1$subject <- sub('([^_]+_.).*', '\\1',
                   sub('([^_]+)\\1+', '\\1', df1$id_trial))
df1
#        id_trial  subject
#1   personn_a.txt person_a
#2 person_a_t2.txt person_a
#3    person_b.txt person_b
#4 person_b_t2.txt person_b
#5  personnn_c.txt person_c
#6 person_c_t2.txt person_c

If you need a numeric subject:

as.numeric(factor(df1$subject))
#[1] 1 1 2 2 3 3
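
One point that may matter here: factor() sorts its levels, so the numeric codes follow sorted order rather than the order in which subjects appear. The sketch below uses a hypothetical vector (not from the question) to show the difference, with match() against unique() as one possible alternative when order of first appearance is wanted:

```r
# Hypothetical subject labels, deliberately not in alphabetical order
subject <- c("b", "b", "a", "a")

# factor() sorts its levels ("a", "b"), so "b" gets code 2
by_level <- as.integer(factor(subject))
by_level
#[1] 2 2 1 1

# match() against unique() numbers subjects in order of first appearance
by_appearance <- match(subject, unique(subject))
by_appearance
#[1] 1 1 2 2
```

For the data in this question both approaches give the same result, because the subjects already appear in sorted order.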

Update

For the second dataset:

# Capture the letters that follow the leading digits and underscore
df2$subject <- sub('\\d+_([a-z]+).*', '\\1', df2$id_trial)
df2
#         id_trial subject
#1       001_a.txt       a
#2    001_a_t2.txt       a
#3    949482_b.txt       b
#4 949482_b_t2.txt       b
#5        95_c.txt       c
#6     95_c_t2.txt       c
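
If the "person_" style labels from the question are preferred, a rough sketch along the same lines might look like the following. It assumes every file name ends in ".txt" with an optional "_t2" just before it; the names id_trial, stem and subject are only illustrative:

```r
# Sample values mirroring the id_trial column in the question
id_trial <- c("001_a.txt", "001_a_t2.txt", "949482_b.txt",
              "949482_b_t2.txt", "95_c.txt", "95_c_t2.txt")

# Drop the optional "_t2" suffix and the ".txt" extension,
# leaving the stem that both members of a pair share
stem <- sub("(_t2)?\\.txt$", "", id_trial)

# Keep what follows the last underscore and prepend "person_"
subject <- paste0("person_", sub(".*_", "", stem))
subject
#[1] "person_a" "person_a" "person_b" "person_b" "person_c" "person_c"
```

Working from the shared stem rather than a fixed letter pattern means the pairing would still hold if the part after the underscore were longer than one character, though that goes beyond the sample data shown.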

Data

df1 <-  structure(list(id_trial = c("personn_a.txt", "person_a_t2.txt", 
"person_b.txt", "person_b_t2.txt", "personnn_c.txt", "person_c_t2.txt"
)), .Names = "id_trial", class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(id_trial = c("001_a.txt", "001_a_t2.txt", 
"949482_b.txt", 
"949482_b_t2.txt", "95_c.txt", "95_c_t2.txt")), .Names = "id_trial", 
class = "data.frame", row.names = c(NA, -6L))
