Get specific part of a string for each line
I have some data and I want to extract a specific part of each line:
DoseResponse_Curves/drCurve_AAATT.pdf
DoseResponse_Curves/drCurve_AGMK1.pdf
DoseResponse_Curves/drCurve_AGU.pdf
DoseResponse_Curves/drCurve_ALH1L2.pdf
DoseResponse_Curves/drCurve_ALKB1.pdf
DoseResponse_Curves/drCurve_AS2.pdf
DoseResponse_Curves/drCurve_ANK1.pdf
DoseResponse_Curves/drCurve_ANKRD54.pdf
I only want whatever comes after the second _ and before the file extension. That means the output should look like this:
AAATT
AGMK1
AGU
ALH1L2
ALKB1
AS2
ANK1
ANKRD54
Note: because we are working with gene names, they can contain characters such as c(".", "-").
You can do this with sub and a regular expression.
Files = c(
'DoseResponse_Curves/drCurve_AAATT.pdf',
'DoseResponse_Curves/drCurve_AGMK1.pdf',
'DoseResponse_Curves/drCurve_AGU.pdf',
'DoseResponse_Curves/drCurve_ALH1L2.pdf',
'DoseResponse_Curves/drCurve_ALKB1.pdf',
'DoseResponse_Curves/drCurve_AS2.pdf',
'DoseResponse_Curves/drCurve_ANK1.pdf',
'DoseResponse_Curves/drCurve_ANKRD54.pdf')
sub(".*?_.*?_(.*?)\\..*", "\\1", Files)
[1] "AAATT" "AGMK1" "AGU" "ALH1L2" "ALKB1" "AS2" "ANK1"
[8] "ANKRD54"
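To see what the capture group in a pattern like this actually picks up, it can help to inspect the match directly rather than only the substituted result. The following is a minimal sketch using base R's regexec and regmatches on two of the file names from the question; the variable names pattern, matches and genes are just illustrative.

```r
# Two of the file names from the question
Files <- c('DoseResponse_Curves/drCurve_AAATT.pdf',
           'DoseResponse_Curves/drCurve_ANKRD54.pdf')

# Same pattern as in the sub() call; backslashes are doubled inside an R string
pattern <- ".*?_.*?_(.*?)\\..*"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions back into text
matches <- regmatches(Files, regexec(pattern, Files))

# Element 1 of each result is the whole match, element 2 is capture group 1
genes <- vapply(matches, function(m) m[2], character(1))
print(genes)
```

Because the capture group is lazy and is followed by a literal dot, it stops at the first dot after the second underscore, which is worth keeping in mind for names that contain dots themselves.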
There are many ways to do this:
# Example data with gene names with dots and dashes
Files = c('DoseResponse_Curves/drCurve_ALKB1.pdf',
'DoseResponse_Curves/drCurve_BAC05914.1.pdf',
'DoseResponse_Curves/drCurve_ALDH1L1-AS1.pdf',
'DoseResponse_Curves/drCurve_AL953854.2-002.pdf')
# as the path prefix and the extension are the same for every file, we can replace them with "":
gsub("^DoseResponse_Curves/drCurve_|\\.pdf$", "", Files)
# [1] "ALKB1" "BAC05914.1" "ALDH1L1-AS1" "AL953854.2-002"
# Or, as we are working with path and filenames maybe:
gsub("drCurve_", "", tools::file_path_sans_ext(basename(Files)))
# [1] "ALKB1" "BAC05914.1" "ALDH1L1-AS1" "AL953854.2-002"
# @G5W's answer doesn't handle extra dots in gene names
sub(".*?_.*?_(.*?)\\..*", "\\1", Files)
# [1] "ALKB1" "BAC05914" "ALDH1L1-AS1" "AL953854"
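If a single regular expression is preferred, one possible variation (a sketch, not taken from either answer) is to anchor the pattern and let the capture group run greedily up to the last dot, so that dots inside the gene name are kept. It assumes every file name has at least two underscores before the gene name and ends in an extension.

```r
# Example data with gene names containing dots and dashes
Files <- c('DoseResponse_Curves/drCurve_ALKB1.pdf',
           'DoseResponse_Curves/drCurve_BAC05914.1.pdf',
           'DoseResponse_Curves/drCurve_ALDH1L1-AS1.pdf',
           'DoseResponse_Curves/drCurve_AL953854.2-002.pdf')

# ^[^_]*_[^_]*_  everything up to and including the second underscore
# (.*)           greedy capture of the gene name, inner dots included
# \\.[^.]*$      the last dot and the extension after it
genes <- sub("^[^_]*_[^_]*_(.*)\\.[^.]*$", "\\1", Files)
print(genes)
# [1] "ALKB1"          "BAC05914.1"     "ALDH1L1-AS1"    "AL953854.2-002"
```

Since [^.]* cannot contain a dot, the literal dot in the pattern can only match the final dot of the file name, and everything before it after the second underscore ends up in the capture group.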