Label as the same if the first part of the string is the same
Some sample data:
id_trial
001_a.txt
001_a_t2.txt
949482_b.txt
949482_b_t2.txt
95_c.txt
95_c_t2.txt
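Note: the strings have different lengths, but within each pair the lengths are equal once "_t2" is removed.
How can I make it so that, if the part of the string before _t2 is the same, both rows get the same label in a new column?
That is, I would like something like this: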
id_trial subject
001_a.txt person_a
001_a_t2.txt person_a
949482_b.txt person_b
949482_b_t2.txt person_b
95_c.txt person_c
95_c_t2.txt person_c
Even this would work:
id_trial subject
001_a.txt a
001_a_t2.txt a
949482_b.txt b
949482_b_t2.txt b
95_c.txt c
95_c_t2.txt c
Any help would be appreciated.
You can try sub to extract the prefix part:
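# Inner sub collapses a repeated run (e.g. "nn" -> "n"); outer sub keeps
# everything up to the first underscore plus one more character.
# Backreferences need a doubled backslash inside an R string literal.
df1$subject <- sub('([^_]+_.).*', '\\1', sub('([^_]+)\\1+',
                   '\\1', df1$id_trial))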
df1
# id_trial subject
#1 personn_a.txt person_a
#2 person_a_t2.txt person_a
#3 person_b.txt person_b
#4 person_b_t2.txt person_b
#5 personnn_c.txt person_c
#6 person_c_t2.txt person_c
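The nested call does two separate things, and it may be easier to follow when they are split apart. The following is only a sketch using the same six strings as df1; the intermediate name `collapsed` is introduced here purely for illustration.

```r
# Same strings as df1$id_trial in the Data section
id_trial <- c("personn_a.txt", "person_a_t2.txt", "person_b.txt",
              "person_b_t2.txt", "personnn_c.txt", "person_c_t2.txt")

# Step 1: collapse the first immediately repeated run of non-underscore
# characters, so "personn" and "personnn" both become "person"
collapsed <- sub("([^_]+)\\1+", "\\1", id_trial)

# Step 2: keep the text up to the first underscore plus one more character
# and drop the rest (the "_t2" marker and the ".txt" extension)
subject <- sub("([^_]+_.).*", "\\1", collapsed)
subject
```

Step 1 is only needed because the df1 names contain stray repeated letters; for clean names, step 2 alone should be enough.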
If you need a numeric subject:
as.numeric(factor(df1$subject))
#[1] 1 1 2 2 3 3
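One thing to keep in mind: factor() sorts its levels, so the codes follow alphabetical order rather than the order in which subjects first appear. In the data above the two orders happen to coincide. If the order of appearance matters, matching against unique() is a common alternative. A small sketch with a made-up, reordered vector (not from the question's data):

```r
# Hypothetical vector where "person_b" appears before "person_a"
subject <- c("person_b", "person_b", "person_a", "person_a")

# Codes follow the sorted factor levels ("person_a" is level 1)
by_level <- as.numeric(factor(subject))

# Codes follow the order of first appearance ("person_b" is 1)
by_appearance <- match(subject, unique(subject))

by_level       # 2 2 1 1
by_appearance  # 1 1 2 2
```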
Update
For the second dataset:
# Capture the letters after the leading digits and underscore
df2$subject <- sub('\\d+_([a-z]+).*', '\\1', df2$id_trial)
df2
# id_trial subject
#1 001_a.txt a
#2 001_a_t2.txt a
#3 949482_b.txt b
#4 949482_b_t2.txt b
#5 95_c.txt c
#6 95_c_t2.txt c
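If the "person_a" style labels from the question are preferred for this dataset, the captured letter can be pasted onto a fixed prefix. Another option, closer to the literal wording "same part before _t2", is to strip the optional "_t2" and the extension and use the remaining stem as the label. Both are sketches that assume every name ends in ".txt" with at most one "_t2" right before it; the names `subject` and `stem` are just illustrative.

```r
# Same strings as df2$id_trial in the Data section
id_trial <- c("001_a.txt", "001_a_t2.txt", "949482_b.txt",
              "949482_b_t2.txt", "95_c.txt", "95_c_t2.txt")

# Option 1: letter code after the digits, prefixed with "person_"
subject <- paste0("person_", sub("\\d+_([a-z]+).*", "\\1", id_trial))

# Option 2: drop an optional "_t2" and the ".txt" extension, keep the stem
stem <- sub("(_t2)?\\.txt$", "", id_trial)

data.frame(id_trial, subject, stem)
```

Option 2 keeps the numeric part of the name, so two files with the same letter but different numbers would end up with different labels.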
Data
df1 <- structure(list(id_trial = c("personn_a.txt", "person_a_t2.txt",
"person_b.txt", "person_b_t2.txt", "personnn_c.txt", "person_c_t2.txt"
)), .Names = "id_trial", class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(id_trial = c("001_a.txt", "001_a_t2.txt",
"949482_b.txt",
"949482_b_t2.txt", "95_c.txt", "95_c_t2.txt")), .Names = "id_trial",
class = "data.frame", row.names = c(NA, -6L))