How to convert wiki markup into text? R/WikipediR
I am trying to run this code:
library(WikipediR)
wp_content <- page_content("en","wikipedia", page_name = "Aaron Halfaker", as_wikitext = T, clean_response = T)
wp_content <- wp_content$wikitext$`*`
print(wp_content)
But the output is in wiki markup:
[1] "{{Infobox scientist\n| name = Aaron Halfaker\n| native_name = \n| native_name_lang = \n| image = File:Halfaker,_Aaron_Sept_2013.jpg\n| image_size = \n| alt = \n| caption = \n| birth_date = {{birth date and age|1983|12|27}}\n| birth_place = [[Virginia, Minnesota]]<ref>{{Cite web |url=https://twitter.com/halfak/status/826529576906059780 |title=Twitter status |last=Halfaker |first=Aaron |website=Twitter |date=31 January 2017}}</ref>\n| death_date = \n| death_place = \n| resting_place = \n| resting_place_coordinates = <!--{{coord|LAT|LONG|type:landmark|display=inline,title}}-->\n| other_names = \n| residence = \n| citizenship = \n| nationality = \n| fields = [[Human-Computer Interaction]] <br/> [[computer-supported cooperative work]]\n| workplaces = [[Wikimedia Foundation]]\n| patrons = \n| alma_mater = [[The College of St. Scholastica]] (B.S., 2006)<br/> [[University of Minnesota]] (Ph.D., 2013)<ref name=\"tmn\">{{cite web|url=http://tech.mn/news/2013/12/11/aaron-halfaker-wikimedia-foundation/|title=Wicked Smart: 5 questions with U of M PhD and Wikipedian Aaron Halfaker|date=11 December 2013|publisher=TechMN|accessdate=5 January 2015}}</ref><ref>{{Cite web |url=https://www-users.cs.umn.edu/~halfak/docs/curriculum_vitae |title=Aaron Halfaker Curriculum Vitae}}</ref>\n..."
How can I convert this to plain text, or fetch it as plain text in the first place?
I also tried as_wiktext = F, but that did not work.
Language: R.
Package: WikipediR v1.5.0
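as_wikitext = T downloads the text with wiki markup. By default, page_content downloads the page with HTML markup. Fortunately, many HTML parsers are available, and rvest is one of the best. The following code downloads the page as HTML, parses it into an HTML structure with rvest::read_html, and then extracts the plain text with rvest::html_text: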
library(WikipediR)
library(rvest)
#> Loading required package: xml2
wp_content <- page_content(language = 'en', project = 'wikipedia', 'page_name' = 'Aaron Halfaker', as_wikitext = F)
html_text(read_html(wp_content$parse$text$`*`))
#> [1] "Aaron HalfakerBorn (1983-12-27) December 27, 1983 (age 36)Virginia, Minnesota[1]Alma materThe College of St. Scholastica (B.S., 2006)University of Minnesota (Ph.D., 2013)..."
Created on 2020-09-02 by the reprex package (v0.3.0)
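The output above runs the infobox and the body text together ("Aaron HalfakerBorn ..."), because html_text concatenates every text node in the document. If only the article prose is wanted, one possible refinement is to select the paragraph nodes and drop the footnote markers before extracting text. The sketch below is not part of the original answer. It uses a small hypothetical HTML string in place of wp_content$parse$text$`*` so that it runs offline, and it assumes that footnote markers are rendered as sup elements with class "reference".

```r
library(rvest)

# Hypothetical stand-in for wp_content$parse$text$`*` (assumption: the real
# page has a similar shape, with an infobox table followed by <p> elements).
html <- paste0(
  '<div><table class="infobox"><tr><th>Born</th><td>1983</td></tr></table>',
  '<p>Aaron Halfaker is a computer scientist.<sup class="reference">[1]</sup></p>',
  '<p>He worked at the Wikimedia Foundation.</p></div>'
)

doc <- read_html(html)

# Remove footnote markers such as [1] from the parsed document
# (assumption: they are <sup class="reference"> elements).
xml2::xml_remove(html_nodes(doc, "sup.reference"))

# Extract the text of each paragraph separately, then join the
# paragraphs with blank lines so they do not run together.
paragraphs <- html_text(html_nodes(doc, "p"))
plain_text <- paste(paragraphs, collapse = "\n\n")

cat(plain_text)
```

To apply this to the real page, replace html with wp_content$parse$text$`*`. The selectors may need adjusting for other pages, since infoboxes, tables, and navigation boxes are excluded here only because they are not p elements.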