How to break a corpus into paragraphs using custom delimiters
I'm scraping New York Times pages to do some natural language processing on them. I want to split each page into paragraphs when building the corpus, so that I can run frequency counts on the words appearing in paragraphs that also contain a keyword or phrase.
The code below works for sentences, but paragraphs are denoted by the NYT with a •, so I need to substitute that for however the corpus reads in paragraphs - does anyone have any ideas? I tried gsub("•","/n",...) and gsub("•","/r/n"), but neither worked.
If anyone knows how to do all of this within a tm corpus, without having to switch between quanteda and tm, that would save some code.
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_c(), str_detect()
library(quanteda) # corpus(), corpus_reshape(), corpus_subset(), texts()

website <- read_html("https://www.nytimes.com/2017/01/03/briefing/asia-australia-briefing.html") # Read URL
# Obtain any text inside the paragraph (<p>) HTML tags
text <- website %>%
  html_nodes("p") %>%
  html_text() %>%
  as.character()
# Collapse into one string, as currently text[1] = para 1, text[2] = para 2, etc.
text <- str_c(text, collapse = " ")
data_corpus_para <- corpus_reshape(corpus(text), to = "paragraphs")
# Phrase that occurs in only one of the paragraphs, as a proof of concept
containstarget <- str_detect(tolower(texts(data_corpus_para)), "pull out of peace talks")
# Keep only the paragraphs that contain the phrase above
data_corpus_para <- corpus_subset(data_corpus_para, containstarget)
data_corpus_para <- corpus_reshape(data_corpus_para, to = "documents")
# There are quanteda corpora and tm corpora, so I have to convert to a data
# frame and then back into a VCorpus... this is very messy
data_corpus_para <- quanteda::convert(data_corpus_para, to = "data.frame")
data_corpus_para_VCorpus <- tm::VCorpus(tm::VectorSource(data_corpus_para$text))
dt.dtm <- tm::DocumentTermMatrix(data_corpus_para_VCorpus)
tm::findFreqTerms(dt.dtm, 1)
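
A note on the gsub() attempts in the question: "/n" is a literal slash-n, not a line break - the escape sequence is "\n". Assuming corpus_reshape() treats a blank line ("\n\n") as the paragraph separator, which is quanteda's convention, replacing the bullet with two real newlines before building the corpus should make the original pipeline work. A minimal sketch:

# Replace the NYT bullet with a blank line so corpus_reshape() can see
# the paragraph boundaries (note "\n", not "/n")
text2 <- gsub("•", "\n\n", text, fixed = TRUE)
data_corpus_para <- corpus_reshape(corpus(text2), to = "paragraphs")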
If the paragraph delimiter is "•", then you can use corpus_segment():
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- "
• This is the first paragraph.
This is still the first paragraph.
• Here is the third paragraph. Last sentence"
corpus(txt) %>%
  corpus_segment(pattern = "•")
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "This is the first paragraph. This is still the first paragra..."
##
## text1.2 :
## "Here is the third paragraph. Last sentence"
Created on 2021-04-10 by the reprex package (v1.0.0)
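
As a follow-up, the frequency counting itself doesn't require tm at all: a document-feature matrix in quanteda plays the role of tm's DocumentTermMatrix. A sketch of a quanteda-only finish, assuming v3 syntax (tokens() then dfm()) and the filtered data_corpus_para from the question:

library(quanteda)
# Tokenize the filtered paragraphs and build a document-feature matrix;
# dfm() lowercases by default, replacing the earlier tolower() step
dfmat <- data_corpus_para %>%
  tokens(remove_punct = TRUE) %>%
  dfm()
# Top 20 terms by frequency, a rough analogue of tm::findFreqTerms()
topfeatures(dfmat, 20)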
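And if staying entirely within tm is preferred, as the question asks, here is a hypothetical sketch (untested against the live page) that splits on the bullet directly and keeps matching paragraphs with tm_filter():

library(tm)
# Split the scraped text on the bullet and drop empty fragments
paras <- trimws(unlist(strsplit(text, "•", fixed = TRUE)))
vc <- VCorpus(VectorSource(paras[paras != ""]))
# Keep only the documents containing the target phrase
vc <- tm_filter(vc, function(doc) {
  any(grepl("pull out of peace talks", tolower(content(doc))))
})
dtm <- DocumentTermMatrix(vc)
findFreqTerms(dtm, 1)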