How to convert wiki markup into text? R/WikipediR

Question

I am trying to run this code:

library(WikipediR)

wp_content <- page_content("en","wikipedia", page_name = "Aaron Halfaker", as_wikitext = T, clean_response = T)

wp_content <- wp_content$wikitext$`*`

print(wp_content)

But the output is in wiki markup:

[1] "{{Infobox scientist\n| name        = Aaron Halfaker\n| native_name = \n| native_name_lang = \n| image       = File:Halfaker,_Aaron_Sept_2013.jpg\n| image_size  = \n| alt         = \n| caption     = \n| birth_date  = {{birth date and age|1983|12|27}}\n| birth_place = [[Virginia, Minnesota]]<ref>{{Cite web |url=https://twitter.com/halfak/status/826529576906059780 |title=Twitter status |last=Halfaker |first=Aaron |website=Twitter |date=31 January 2017}}</ref>\n| death_date  = \n| death_place = \n| resting_place = \n| resting_place_coordinates =  <!--{{coord|LAT|LONG|type:landmark|display=inline,title}}-->\n| other_names = \n| residence   = \n| citizenship = \n| nationality = \n| fields      = [[Human-Computer Interaction]] <br/> [[computer-supported cooperative work]]\n| workplaces  = [[Wikimedia Foundation]]\n| patrons     = \n| alma_mater  = [[The College of St. Scholastica]] (B.S., 2006)<br/> [[University of Minnesota]] (Ph.D., 2013)<ref name=\"tmn\">{{cite web|url=http://tech.mn/news/2013/12/11/aaron-halfaker-wikimedia-foundation/|title=Wicked Smart: 5 questions with U of M PhD and Wikipedian Aaron Halfaker|date=11 December 2013|publisher=TechMN|accessdate=5 January 2015}}</ref><ref>{{Cite web |url=https://www-users.cs.umn.edu/~halfak/docs/curriculum_vitae |title=Aaron Halfaker Curriculum Vitae}}</ref>\n..."

How can I convert this to plain text, or get it as plain text directly? I also tried as_wikitext = F, but that did not work.

Language: R. Package: WikipediR v1.5.0

Answer 1

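as_wikitext = T downloads the text with wiki markup. By default, page_content downloads the page with HTML markup. Fortunately, there are many HTML parsers available, and one of the best is rvest. The following code downloads the page as HTML, parses it into an HTML structure with rvest::read_html, and then extracts the plain text with rvest::html_text: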

library(WikipediR)
library(rvest)
#> Loading required package: xml2

# Request the page as rendered HTML rather than wiki markup
wp_content <- page_content(language = "en", project = "wikipedia", page_name = "Aaron Halfaker", as_wikitext = FALSE)

html_text(read_html(wp_content$parse$text$`*`))
#> [1] "Aaron HalfakerBorn (1983-12-27) December 27, 1983 (age 36)Virginia, Minnesota[1]Alma materThe College of St. Scholastica (B.S., 2006)University of Minnesota (Ph.D., 2013)..."

Created on 2020-09-02 by the reprex package (v0.3.0)
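
The html_text output above runs the infobox cells, citation markers, and body text together into one string. A minimal sketch of one way to narrow the extraction to paragraph text follows. It assumes the returned HTML puts body text in <p> elements and citation markers in <sup class="reference"> elements, which you should check against the real response. The sample_html string is a made-up stand-in for wp_content$parse$text$`*`, so the example runs without a network call.

```r
library(rvest)  # html_nodes(), html_text(); also re-exports read_html()
library(xml2)   # xml_remove()

# Hypothetical stand-in for wp_content$parse$text$`*` (not real API output).
sample_html <- paste0(
  '<div class="mw-parser-output">',
  '<table class="infobox"><tr><th>Born</th><td>Example cell</td></tr></table>',
  '<p>Example first paragraph.<sup class="reference">[1]</sup></p>',
  '<p>Example second paragraph.</p>',
  '</div>'
)

doc <- read_html(sample_html)

# Assumption: citation markers such as [1] live in <sup class="reference">.
# Remove them from the parsed document before extracting text.
xml_remove(html_nodes(doc, "sup.reference"))

# Keep only paragraph text (one element per <p>), which skips the infobox table.
paragraphs <- html_text(html_nodes(doc, "p"), trim = TRUE)

# Join the paragraphs with blank lines to get readable plain text.
plain_text <- paste(paragraphs, collapse = "\n\n")
cat(plain_text)
```

To apply this to the real page, pass wp_content$parse$text$`*` to read_html in place of sample_html. The selectors may need adjusting if the page structure differs from what is assumed here.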

wikipedia

r

wikipedia-api