Which regex should I use to extract the proper information with stringi in R?

Question

I am trying to extract the name that follows gdac.broadinstitute.org_ in this character string in R:

element <- "<li><a href=\"gdac.broadinstitute.org_BRCA.miRseq_Preprocess.mage-tab.2015020400.0.0.tar.gz.md5\"> gdac.broadinstitute.org_BRCA.miRseq_Preprocess.mage-tab.2015020400.0.0.tar.gz.md5</a></li>"

I am using stri_extract from the stringi package, but it seems I don't understand regular expressions very well. I tried something like this:

stri_extract( element, 
                      regex  = "gdac.broadinstitute.org_")

Can anyone help?

Answer 1

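I'm not familiar with stringi, but this is easy to do with gsub. It isn't obvious where the name is supposed to end, so I'm assuming the name is everything after the underscore up to the closing double quote (").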

# "\\1" is the first capture group; in an R string the backslash itself must be escaped
gsub(".*gdac.broadinstitute.org_(.*)\".*", "\\1", element)
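A slightly stricter variant of the same idea, as a minimal sketch in base R: it assumes the name never contains a double quote, escapes the dots so they match literal periods only, and captures with `[^\"]*` rather than `.*`. The variable `name` is introduced here purely for illustration.

```r
element <- "<li><a href=\"gdac.broadinstitute.org_BRCA.miRseq_Preprocess.mage-tab.2015020400.0.0.tar.gz.md5\"> gdac.broadinstitute.org_BRCA.miRseq_Preprocess.mage-tab.2015020400.0.0.tar.gz.md5</a></li>"

# \\. matches a literal dot; ([^\"]*) captures everything up to the next double quote.
# The surrounding .* consume the rest of the string, so sub() returns only the capture.
name <- sub(".*gdac\\.broadinstitute\\.org_([^\"]*)\".*", "\\1", element)
print(name)
```

Because the whole string is matched and replaced by the capture group, an input that does not contain the prefix is returned unchanged, so it may be worth checking for that case.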

Answer 2

Try this:

# Backslashes must be doubled inside an R string; a dot needs no escape inside [...]
stri_extract_first_regex(element, "(?<=gdac.broadinstitute.org_)[\\w.-]+")

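In general, with a regex of the form (?<=start)[set]+ you can extract everything that follows the expression start, for as long as the characters belong to set. More on ICU regular expressions: http://userguide.icu-project.org/strings/regexp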
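The general pattern can be sketched as follows, assuming the stringi package is installed. The first call applies it to the string from the question, with the dots in the prefix escaped. The second uses a made-up input (`order_id=12345;status=ok`) only to show that start and set can be swapped for anything else.

```r
library(stringi)

element <- "<li><a href=\"gdac.broadinstitute.org_BRCA.miRseq_Preprocess.mage-tab.2015020400.0.0.tar.gz.md5\"> gdac.broadinstitute.org_BRCA.miRseq_Preprocess.mage-tab.2015020400.0.0.tar.gz.md5</a></li>"

# (?<=...) is a lookbehind: the prefix must precede the match but is not part of it.
# [\\w.-]+ then takes word characters, dots and hyphens until the first other character.
name <- stri_extract_first_regex(element, "(?<=gdac\\.broadinstitute\\.org_)[\\w.-]+")
print(name)

# Same technique on a hypothetical string: prefix "order_id=", set = digits only.
id <- stri_extract_first_regex("order_id=12345;status=ok", "(?<=order_id=)[0-9]+")
print(id)
```

The match stops at the double quote that closes the href, because `"` is not in the character set. stri_extract_first_regex returns NA when the prefix is absent.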

regex

r

stringi