R: regexpr() how to use a vector in pattern parameter
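I want to find the positions of dictionary terms within a set of short texts. The problem is in the last few lines of the code below, which is loosely based on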
library(tm)
pkd.names.quotes <- c(
"Mr. Rick Deckard",
"Do Androids Dream of Electric Sheep",
"Roy Batty",
"How much is an electric ostrich?",
"My schedule for today lists a six-hour self-accusatory depression.",
"Upon him the contempt of three planets descended.",
"J.F. Sebastian",
"Harry Bryant",
"goat class",
"Holden, Dave",
"Leon Kowalski",
"Dr. Eldon Tyrell"
)
firstnames <- c("Sebastian", "Dave", "Roy",
"Harry", "Dave", "Leon",
"Tyrell")
dict <- sort(unique(tolower(firstnames)))
corp <- VCorpus(VectorSource(pkd.names.quotes))
# Strangely, Corpus() gives wrong segment numbers for the matches, so VCorpus() is used above.
tdm <- TermDocumentMatrix(corp, control = list(tolower = TRUE, dictionary = dict))
inspect(corp)
inspect(tdm)
View(as.matrix(tdm))
data.frame(
  Name = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = regexpr(
    pattern = rownames(tdm)[tdm$i],
    text = tolower(pkd.names.quotes[tdm$j])
  )
)
The output comes with a warning, and only the first row is correct.
       Name Segment          Content Postion
1       roy       3        Roy Batty       1
2 sebastian       7   J.F. Sebastian      -1
3     harry       8     Harry Bryant      -1
4      dave      10     Holden, Dave      -1
5      leon      11    Leon Kowalski      -1
6    tyrell      12 Dr. Eldon Tyrell      -1
Warning message:
In regexpr(pattern = rownames(tdm)[tdm$i], text = tolower(pkd.names.quotes[tdm$j])) :
argument 'pattern' has length > 1 and only the first element will be used
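I know about the pattern=paste(vector,collapse="|") solution, but my vector can be very long (all popular names).
Is there a simple vectorized version of this command, or a solution that accepts a new pattern argument for each row?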
You can use mapply to vectorize regexpr. From the R documentation:
mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.
Usage:
data.frame(
  Name = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = mapply(regexpr, rownames(tdm)[tdm$i], tolower(pkd.names.quotes[tdm$j]), fixed = TRUE)
)
Result:
               Name Segment          Content Postion
roy             roy       3        Roy Batty       1
sebastian sebastian       7   J.F. Sebastian       6
harry         harry       8     Harry Bryant       1
dave           dave      10     Holden, Dave       9
leon           leon      11    Leon Kowalski       1
tyrell       tyrell      12 Dr. Eldon Tyrell      11
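For anyone who wants to try the pairing behaviour without installing tm, here is a minimal base-R sketch. The vectors `patterns` and `texts` are hypothetical stand-ins for rownames(tdm)[tdm$i] and tolower(pkd.names.quotes[tdm$j]). Wrapping regexpr in an anonymous function and setting USE.NAMES = FALSE are choices made for this sketch; they only strip the match attributes and keep the patterns from showing up as names on the result.

```r
# Hypothetical stand-ins for the term and text vectors built from the tdm.
patterns <- c("roy", "sebastian", "dave")
texts    <- c("roy batty", "j.f. sebastian", "holden, dave")

# mapply calls the function once per (pattern, text) pair:
# patterns[1] with texts[1], patterns[2] with texts[2], and so on.
# as.integer() drops the match.length attributes that regexpr attaches.
pos <- mapply(
  function(p, x) as.integer(regexpr(p, x, fixed = TRUE)),
  patterns, texts,
  USE.NAMES = FALSE
)

print(pos)  # 1 6 9
```

A pattern that does not occur in its paired text would give -1 at that index, the same as a plain regexpr call.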
Alternatively, use str_locate from the stringr package, which is documented as:
Vectorised over string and pattern
It returns:
For str_locate, an integer matrix. First column gives start position of match, and second column gives end position.
Usage:
library(stringr)
str_locate(tolower(pkd.names.quotes[tdm$j]), fixed(rownames(tdm)[tdm$i]))[, 1]
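Note that fixed() and fixed=TRUE are used when the string should be matched against a fixed pattern (i.e. not a regular expression). Otherwise, remove fixed() and fixed=TRUE.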
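To see why the fixed setting matters, here is a small base-R sketch. The sample text is taken from the question's data, and the "." pattern is chosen only for illustration because it is a regular-expression metacharacter.

```r
text <- "j.f. sebastian"

# As a regular expression, "." matches any single character,
# so the first match is the very first character.
regex_pos <- as.integer(regexpr(".", text))

# With fixed = TRUE, "." is a literal dot, which first occurs at position 2.
fixed_pos <- as.integer(regexpr(".", text, fixed = TRUE))

print(c(regex = regex_pos, fixed = fixed_pos))  # regex 1, fixed 2
```

For plain names such as the ones in the question both modes give the same positions. The difference only shows up once a dictionary term contains characters like ".", "(" or "+".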