Querying a page and scraping it using Sheets

Question

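I want to use Sheets to query pages from Wikidata and scrape a specific section, but I couldn't find anything specifically about this, and since I'm a beginner I don't know where to start. I have a list of Q identifiers that I want to use to query the pages, then check whether a specific section exists there (or, if possible, scrape the data from it), and otherwise return false. I started from what I found here: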

=ImportXml(concat("https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=wikibase_item&redirects=1&format=xml&titles=",G1),"//@wikibase_item")

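but I used the Wikidata link instead (it generates a valid link, but I'm not sure whether this is equivalent to the API and whether I can query it further to get data), together with the code of the wiki property I want to fetch (date of death, /wiki/Property:P570). For this link, however, I get an "Imported content is empty" error. Ideally I'd like to get the date of death value (20 November 2014), or at least TRUE, meaning the section exists and the person has died.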

=IMPORTXML(CONCAT("https://www.wikidata.org/wiki/",A2),"/wiki/Property:P570")

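Some of my Q links may not have this section/property at all, and for those I should get an error, but I don't know why it isn't working for this one. Do I have to set the XPath to a div, or can I use the wiki property?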

I hope this makes sense. I'll put a sample sheet here. Thanks.

Answer 1

Try:

=QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null")

Or:

=QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null")<>""

If there is no match:

=IFERROR(QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null"), FALSE)

=IFERROR(REGEXEXTRACT(QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A2, "//*"), 
 "select Col2 where Col1 = 'date of birth' and Col2 is not null"), 
 "(.*) \d.*reference.*"), FALSE)

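On the question of whether the page link is equivalent to the API: as far as I know it is not, but Wikidata also serves each item as JSON at Special:EntityData, which can be read outside Sheets. Below is a minimal Python sketch of that route, not part of the formulas above. The helper names `fetch_entity` and `date_of_death`, the User-Agent string, and the hand-written sample item are all assumptions made for illustration.

```python
import json
import urllib.request


def fetch_entity(qid):
    """Download one item's JSON from Wikidata (needs network access)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    # Hypothetical User-Agent value; replace it with your own details.
    req = urllib.request.Request(url, headers={"User-Agent": "sheets-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Take the first entity returned, in case the Q id was a redirect.
    return next(iter(data["entities"].values()))


def date_of_death(entity):
    """Return the first P570 time string, or False if the item has none."""
    for claim in entity.get("claims", {}).get("P570", []):
        snak = claim.get("mainsnak", {})
        # "somevalue"/"novalue" snaks carry no datavalue, so skip them.
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]["time"]
    return False


# Hand-written sample in the shape of Wikidata's entity JSON (not real data).
sample = {
    "claims": {
        "P570": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "datavalue": {"value": {"time": "+2014-11-20T00:00:00Z"}},
                }
            }
        ]
    }
}

print(date_of_death(sample))           # +2014-11-20T00:00:00Z
print(date_of_death({"claims": {}}))   # False
```

With network access, `date_of_death(fetch_entity(qid))` would do the same check for a real Q identifier, returning False where the formulas above fall back to FALSE.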

regex

google-sheets

wikipedia-api

web-scraping

google-sheets-formula