Text summarization in R language
I have a long text file. With the help of R, I want to summarize the text in at least 10 to 20 lines or short sentences.
How can I summarize a text in at least 10 lines with R?
You can try this (from the LSAfun package):
genericSummary(D,k=1)
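where 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation.)
More information: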
http://search.r-project.org/library/LSAfun/html/genericSummary.html
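For a long text file, the document first has to be read into a single character string before it can be passed as 'D'. A minimal sketch, assuming a plain-text file and an installed LSAfun package; the demo text and temporary file are hypothetical stand-ins for the real file, used only so the snippet runs on its own:

```r
# Hypothetical input: a small demo file stands in for the real long text file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Text summarization selects the most informative sentences of a document.",
  "An extractive summary reuses sentences from the original document unchanged.",
  "A long report often repeats the same main points in several places.",
  "Readers usually want the main points of a report in a few sentences.",
  "The number of sentences in the summary is controlled by the argument k.",
  "A longer document usually needs a larger k to cover its main points."
), path)

# Read all lines and collapse them into one character string.
D <- paste(readLines(path, warn = FALSE), collapse = " ")

# Only call LSAfun if it is installed. k = 2 suits this tiny demo text;
# for a summary of at least 10 sentences, use k = 10 on the real file.
if (requireNamespace("LSAfun", quietly = TRUE)) {
  summary_sentences <- LSAfun::genericSummary(D, k = 2)
  print(summary_sentences)
}
```

With the real file, replace `path` with its location and raise `k` to the number of sentences wanted.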
There is a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough on how to use it, but as a quick example you can test it yourself in R:
#load needed packages
library(xml2)
library(rvest)
library(lexRankr)
#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"
#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))
#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          #only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          #return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."
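The same approach can be applied to a local text file, as in the question, instead of a scraped web page. A minimal sketch, assuming a plain-text file and an installed lexRankr package; the demo file contents and the helper name `order_by_appearance` are hypothetical, and the helper assumes sentence ids of the form docId_sentenceNumber, as the gsub() call above does:

```r
# Hypothetical helper: sort lexRank output rows by their position in the text.
# Assumes sentenceId looks like "<docId>_<sentenceNumber>", e.g. "1_12".
order_by_appearance <- function(ranked) {
  position <- as.integer(sub("^.*_", "", ranked$sentenceId))
  ranked[order(position), "sentence"]
}

# Made-up rows in the shape used above (columns sentenceId and sentence).
mock <- data.frame(
  sentenceId = c("1_12", "1_3", "1_7"),
  sentence = c("third", "first", "second"),
  stringsAsFactors = FALSE
)
mock_ordered <- order_by_appearance(mock)  # "first" "second" "third"

# Hypothetical input: a small demo file stands in for the real long text file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "A summary should keep the main points of the text.",
  "The main points of a text are often repeated in several sentences.",
  "Sentences that share many words with other sentences rank highly.",
  "Highly ranked sentences are returned as the summary of the text.",
  "The summary sentences are then sorted by their position in the text."
), path)

# Read the file and drop empty lines.
text_lines <- readLines(path, warn = FALSE)
text_lines <- text_lines[nzchar(trimws(text_lines))]

# Only call lexRankr if it is installed. n = 3 suits this tiny demo text;
# for a summary of at least 10 sentences, use n = 10 on the real file.
if (requireNamespace("lexRankr", quietly = TRUE)) {
  top <- lexRankr::lexRank(text_lines,
                           docId = rep(1, length(text_lines)),
                           n = 3,
                           continuous = TRUE)
  print(order_by_appearance(top))
}
```

Parsing only the part after the underscore keeps the ordering tied to the sentence number alone, which is a design choice of this sketch rather than something the original answer does.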