Counting sentences containing a specific keyword in R
Update
Here is what I have done so far.
library(tm)
library(NLP)
library(SnowballC)
# set directory (forward slashes, because "\U" is an invalid escape in an R string)
setwd("C:/Users/.../Data pretest all TXT")
# create corpus with tm package
pretest <- Corpus(DirSource("/Users/.../Data pretest all TXT"), readerControl = list(language = "en"))
pretest is a Large SimpleCorpus with 36 elements.
My folder contains 36 txt files.
# check what went in
summary(pretest)
# create TDM
pretest.tdm <- TermDocumentMatrix(pretest, control = list(stopwords = TRUE,
tolower = TRUE, stemming = TRUE))
# convert corpus to data frame
dataframePT <- data.frame(text = unlist(sapply(pretest, `[`, "content")),
stringsAsFactors = FALSE)
dataframePT has 36 observations, so I think everything is fine up to this point.
# load stringr library
library(stringr)
# define sentences
v = strsplit(dataframePT[,1], "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)
lapply(v, function(x) (stringr::str_count(x, "gain")))
My output looks like this:
...
[[35]]
[1] NA
[[36]]
[1] NA
So there really are 36 files, which is good. But I don't know why it returns NA.
Thanks in advance for any suggestions.
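One possible reason for the NA values, offered as a guess rather than a confirmed diagnosis: if the elements that sapply() walks over are plain, unnamed character strings rather than documents with a "content" component, then indexing them by the name "content" returns NA, and that NA carries through strsplit() into the counting step. The sketch below reproduces the effect in base R on a made-up string (doc and texts are hypothetical stand-ins, not the pretest data) and shows a quick check for missing text.

```r
# Hypothetical stand-in for one document's text (not the real pretest data)
doc <- "Prices may regain ground. Output fell."

# Indexing an unnamed character vector by name finds nothing and yields NA
by_name <- unname(doc["content"])
is_missing <- is.na(by_name)

# The NA survives the sentence split: the result holds a single NA element
pieces <- strsplit(by_name, "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)

# Quick check on a text column: which entries are NA before splitting?
texts <- c(doc, by_name)          # one real text, one missing
missing_rows <- which(is.na(texts))

print(is_missing)
print(pieces)
print(missing_rows)
```

If which(is.na(dataframePT$text)) lists any rows, the problem lies in the corpus-to-data-frame step rather than in the regular expression.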
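Hi, I'd suggest using the filter function from the dplyr package together with grepl to search for the pattern:
library(dplyr)
pattern <- "word1|word2"
df <- df %>%
  filter(grepl(pattern, column_name))
df is then restricted to the rows that meet the condition. After that, just use nrow to count how many rows are left :)
Example: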
a1<-1:10
a2<-11:20
(data<-data.frame(a1,a2,stringsAsFactors = F))
a1 a2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
(data <- data %>% filter(grepl("5|7", a2)))
a1 a2
1 5 15
2 7 17
(nrow(data))
[1] 2
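The same count can be obtained without dplyr. This is only a minimal base R sketch on the toy data from the example; the names matches and n_matching are made up for illustration.

```r
# Same toy data as in the example above
data <- data.frame(a1 = 1:10, a2 = 11:20, stringsAsFactors = FALSE)

# grepl() returns one TRUE/FALSE per row; summing counts the TRUE values
matches <- grepl("5|7", data$a2)
n_matching <- sum(matches)

print(data[matches, ])
print(n_matching)
```

Summing the logical vector skips the intermediate filtered data frame, which is handy when only the count is needed.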
library(NLP)
library(tm)
library(SnowballC)
Load the data:
data("crude")
crude.tdm <- TermDocumentMatrix(crude, control = list(stopwords = TRUE, tolower = TRUE, stemming= TRUE))
First, convert the corpus to a data frame:
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = F)
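You can also inspect the content with crude[[2]]$content
Now we need to define a sentence. Here I define it as an entity that has at least 10 characters from A-Z, a-z, space and "," and that ends with ".". Then I split the documents by that rule, using a lookbehind for the ".":
z = strsplit(dataframe[,1], "(?<=[A-Za-z ,]{10})\\.", perl = T)
However, this is not needed for the crude corpus, since every sentence there ends with .\n, so you could do:
z = strsplit(dataframe[,1], "\\.\\n", perl = T)
I'll stick with my earlier definition of a sentence, since one would want it to work for more than just the crude corpus. The definition is not perfect, so I'd be glad to hear your thoughts.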
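To make the imperfection concrete, here is a small sketch that applies the same lookbehind rule to a made-up sentence pair (x and parts are illustrative names, not part of the answer's data). The abbreviation "Mr." is preceded by ten allowed characters, so the rule splits there as well.

```r
# Made-up text; the second "." belongs to the abbreviation "Mr."
x <- "Oil prices may regain ground. Analysts said so, Mr. Smith agreed."

# Split at every "." that is preceded by 10 characters from [A-Za-z ,]
parts <- strsplit(x, "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)[[1]]

print(parts)
print(length(parts))
```

The text has two sentences but the rule returns three pieces, the same effect that cuts "David T. Mizrahi" in two in the crude corpus.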
Let's check the output:
z[[2]]
[1] "OPEC may be forced to meet before a\nscheduled June session to readdress its production cutting\nagreement if the organization wants to halt the current slide\nin oil prices, oil industry analysts said"
[2] "\n \"The movement to higher oil prices was never to be as easy\nas OPEC thought"
[3] " They may need an emergency meeting to sort out\nthe problems,\" said Daniel Yergin, director of Cambridge Energy\nResearch Associates, CERA"
[4] "\n Analysts and oil industry sources said the problem OPEC\nfaces is excess oil supply in world oil markets"
[5] "\n \"OPEC's problem is not a price problem but a production\nissue and must be addressed in that way,\" said Paul Mlotok, oil\nanalyst with Salomon Brothers Inc"
[6] "\n He said the market's earlier optimism about OPEC and its\nability to keep production under control have given way to a\npessimistic outlook that the organization must address soon if\nit wishes to regain the initiative in oil prices"
[7] "\n But some other analysts were uncertain that even an\nemergency meeting would address the problem of OPEC production\nabove the 15.8 mln bpd quota set last December"
[8] "\n \"OPEC has to learn that in a buyers market you cannot have\ndeemed quotas, fixed prices and set differentials,\" said the\nregional manager for one of the major oil companies who spoke\non condition that he not be named"
[9] " \"The market is now trying to\nteach them that lesson again,\" he added.\n David T"
[10] " Mizrahi, editor of Mideast reports, expects OPEC\nto meet before June, although not immediately"
[11] " However, he is\nnot optimistic that OPEC can address its principal problems"
[12] "\n \"They will not meet now as they try to take advantage of the\nwinter demand to sell their oil, but in late March and April\nwhen demand slackens,\" Mizrahi said"
[13] "\n But Mizrahi said that OPEC is unlikely to do anything more\nthan reiterate its agreement to keep output at 15.8 mln bpd.\"\n Analysts said that the next two months will be critical for\nOPEC's ability to hold together prices and output"
[14] "\n \"OPEC must hold to its pact for the next six to eight weeks\nsince buyers will come back into the market then,\" said Dillard\nSpriggs of Petroleum Analysis Ltd in New York"
[15] "\n But Bijan Moussavar-Rahmani of Harvard University's Energy\nand Environment Policy Center said that the demand for OPEC oil\nhas been rising through the first quarter and this may have\nprompted excesses in its production"
[16] "\n \"Demand for their (OPEC) oil is clearly above 15.8 mln bpd\nand is probably closer to 17 mln bpd or higher now so what we\nare seeing characterized as cheating is OPEC meeting this\ndemand through current production,\" he told Reuters in a\ntelephone interview"
[17] "\n Reuter"
And the original:
cat(crude[[2]]$content)
OPEC may be forced to meet before a
scheduled June session to readdress its production cutting
agreement if the organization wants to halt the current slide
in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy
as OPEC thought. They may need an emergency meeting to sort out
the problems," said Daniel Yergin, director of Cambridge Energy
Research Associates, CERA.
Analysts and oil industry sources said the problem OPEC
faces is excess oil supply in world oil markets.
"OPEC's problem is not a price problem but a production
issue and must be addressed in that way," said Paul Mlotok, oil
analyst with Salomon Brothers Inc.
He said the market's earlier optimism about OPEC and its
ability to keep production under control have given way to a
pessimistic outlook that the organization must address soon if
it wishes to regain the initiative in oil prices.
But some other analysts were uncertain that even an
emergency meeting would address the problem of OPEC production
above the 15.8 mln bpd quota set last December.
"OPEC has to learn that in a buyers market you cannot have
deemed quotas, fixed prices and set differentials," said the
regional manager for one of the major oil companies who spoke
on condition that he not be named. "The market is now trying to
teach them that lesson again," he added.
David T. Mizrahi, editor of Mideast reports, expects OPEC
to meet before June, although not immediately. However, he is
not optimistic that OPEC can address its principal problems.
"They will not meet now as they try to take advantage of the
winter demand to sell their oil, but in late March and April
when demand slackens," Mizrahi said.
But Mizrahi said that OPEC is unlikely to do anything more
than reiterate its agreement to keep output at 15.8 mln bpd."
Analysts said that the next two months will be critical for
OPEC's ability to hold together prices and output.
"OPEC must hold to its pact for the next six to eight weeks
since buyers will come back into the market then," said Dillard
Spriggs of Petroleum Analysis Ltd in New York.
But Bijan Moussavar-Rahmani of Harvard University's Energy
and Environment Policy Center said that the demand for OPEC oil
has been rising through the first quarter and this may have
prompted excesses in its production.
"Demand for their (OPEC) oil is clearly above 15.8 mln bpd
and is probably closer to 17 mln bpd or higher now so what we
are seeing characterized as cheating is OPEC meeting this
demand through current production," he told Reuters in a
telephone interview.
Reuter
You can clean it up a bit if you like by removing the trailing \n, but your request doesn't need it.
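If you do want the clean-up, one way to do it is sketched below in base R. The two strings are shortened, made-up stand-ins shaped like the split output, and sentences and clean are illustrative names.

```r
# Two sentences in the shape of the split output (shortened, made up)
sentences <- c("OPEC may be forced to meet before a\nscheduled June session",
               "\n    Analysts said the problem is\nexcess oil supply")

# Collapse any run of whitespace (spaces, newlines) into one space,
# then strip leading and trailing whitespace
clean <- trimws(gsub("\\s+", " ", sentences))

print(clean)
```

Applied to the real data, this would go inside lapply(z, ...) so that every document's sentences are cleaned in turn.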
Now you can do all sorts of things, for example:
Which sentences contain the word "gain":
lapply(z, function(x) (grepl("gain", x)))
Or the number of occurrences of the word "gain" in each sentence:
lapply(z, function(x) (stringr::str_count(x, "gain")))
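To get from these per-sentence results to the number of sentences per document that contain the keyword, the logical vector can be summed. The sketch below uses a made-up list z_toy in place of z. Note that the pattern "gain" also matches inside longer words such as "regain" (which occurs in sentence 6 of the output shown earlier); the word-boundary variant is one possible way to restrict the match, assuming whole-word matching is what you want.

```r
# Toy stand-in for z: one character vector of sentences per document
z_toy <- list(
  c("Prices may gain", " They wish to regain the initiative", " Output fell"),
  c("No change", " Gains were small, but a gain is a gain")
)

# Sentences per document that contain "gain" anywhere (also hits "regain")
n_substring <- sapply(z_toy, function(x) sum(grepl("gain", x)))

# Sentences per document that contain "gain" as a whole word
n_word <- sapply(z_toy, function(x) sum(grepl("\\bgain\\b", x, perl = TRUE)))

print(n_substring)
print(n_word)
```

Both patterns are case-sensitive; adding ignore.case = TRUE to grepl() would count "Gain" too.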