IMPORTXML error alert: "Internal import error" - How to avoid the error by importing the data directly via script?

Question

Solution I'm looking for: since IMPORTXML is failing to import the data, is there a way to import the data directly via script, without creating formulas in the spreadsheet? I cannot use a custom function, because custom functions have usage limits. I will call it many times, so I would always hit the limit, and that must not happen.

The XPath is completely correct, but this error sometimes appears for no apparent reason, and I cannot work out what might be happening.

=IMPORTXML("https://int.soccerway.com"&
IMPORTXML(A1,
"//*[@class='container left']//*[@class='last-five']/a[1]/@href"),
"//*[@class='playerstats lineups table']//@href")

Error
Internal import error

Link to the spreadsheet:
https://docs.google.com/spreadsheets/d/1nA3NFKhrON8wJgBiX0XcyEQqk6OZVoB3VrGaROsNPjc/edit?usp=sharing

Answer 1

Try:

=IFERROR(IMPORTXML("https://int.soccerway.com"&
 IMPORTXML("https://int.soccerway.com/matches/2020/06/28/austria/bundesliga/lask-linz/wolfsberger-athletik-club/3246469/",
 "//*[@class='container right']//*[@class='last-five']/a[2]/@href"),
 "//*[@class='playerstats lineups table']//@href"), 
 IMPORTXML("https://int.soccerway.com"&
 IMPORTXML("https://int.soccerway.com/matches/2020/06/28/austria/bundesliga/lask-linz/wolfsberger-athletik-club/3246469/",
 "//*[@class='container right']//*[@class='last-five']/a[2]/@href"),
 "//*[@class='playerstats lineups table']//@href"))

Answer 2

As a complement, you can shorten the first XPath to:

(//a[@title][2])[2]/@href

EDIT: Since this XPath sometimes fails, stick with:

//div[@class='container right']/div[@class='last-five']/a[2]/@href

To select only the players (not the coaches), including the substitutes who actually came on, you can use:

//div[@class="combined-lineups-container"]//a[@href[contains(.,"players")]][not(parent::p[@class="substitute substitute-out"] or count(ancestor::td/p)=1)]/@href

EDIT:

Here is a workbook that works with IMPORTXML or the IMPORTFROMWEB add-on (the number of requests is limited on the free plan).

The first sheet uses IMPORTXML (one-liner). Formula:

=IMPORTXML("https://int.soccerway.com/"&IMPORTXML(C1;"//div[@class='container right']/div[@class='last-five']/a[2]/@href");"//div[@class='combined-lineups-container']//a[@href[contains(.,'players')]]/@href")

The second sheet uses IMPORTFROMWEB (in two steps). XPaths used (to get the match URL, the player URLs, and the URLs of the players who actually played):

//div[@class="container right"]/div[@class="last-five"]/a[2]/@href
//div[@class="combined-lineups-container"]//a[@href[contains(.,"players")]]/@href
//div[@class="combined-lineups-container"]//a[@href[contains(.,"players")]][not(parent::p[@class="substitute substitute-out"] or count(ancestor::td/p)=1)]/@href

The third sheet uses IMPORTFROMWEB (one-liner). Formula used:

=IMPORTFROMWEB("https://int.soccerway.com/"&IMPORTFROMWEB(C1;"//div[@class='container right']/div[@class='last-five']/a[2]/@href");"//div[@class='combined-lineups-container']//a[@href[contains(.,'players')]]/@href")

An alternative if IMPORTXML or IMPORTFROMWEB fails: IMPORTDATA + regular expressions.

To build the second URL from the starting URL, use something like:

="https://int.soccerway.com"&REGEXEXTRACT(INDEX(QUERY(IMPORTDATA(A2);"select * WHERE Col1 ENDS WITH '>D</a>' or Col1 ENDS WITH '>W</a>' or Col1 ENDS WITH '>L</a>'");7;1);"href=""(.*?)""")

The QUERY could be optimized with "matches".

To get the player names (Players v1), use:

=ARRAYFORMULA(REGEXEXTRACT(QUERY(IMPORTDATA(B2);"select Col1 WHERE Col1 STARTS WITH '<a' and Col1 CONTAINS 'flag_16 left' and Col1 CONTAINS 'players'");"href=""(.*?)"""))

You can refer to my sheet here.

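Cells with a blue background contain formulas (mostly ARRAYFORMULA).
Cells with a yellow background: a shortcut to get the data.
Cells with a pink background: another way to filter the players who actually played (somewhat complex; could be optimized).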

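EDIT 2: A "Lineups" sheet has been added to the IMPORTDATA workbook. It is an example of extracting the lineup URLs (22 players) from the last 3 matches of the starting home and away teams. Example: Lugano vs. Basel - 1 July 2020.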

Sometimes Soccerway has no lineups. In that case, "No lineups" is returned.

Answer 3

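"Mysterious" errors are common when using IMPORTXML for web scraping, and there is nothing we can do to prevent them, especially when the data source belongs to a third party. We can only put contingency measures in place.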

To do this, you can use the exponential backoff algorithm. In short, it works like this:

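Use a loop to try to fetch the data, and exit as soon as the data has been obtained. Include in each iteration a delay that grows each time. You have to decide whether to set a limit on the number of iterations or to keep trying until the data is obtained.

Usually you should set a limit and some kind of alert, so that you can investigate what is happening.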
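
The loop described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the function name fetchWithBackoff, its parameters and the default values are all hypothetical. The fetch and sleep steps are passed in as functions so that the retry logic does not depend on any particular service; in Apps Script they would typically wrap UrlFetchApp.fetch and Utilities.sleep.

```javascript
/**
 * Retry fetchFn with exponentially growing delays between attempts.
 * fetchFn:     function that returns the data or throws on failure
 * sleepFn:     function that pauses for the given number of milliseconds
 * maxAttempts: iteration limit (assumed default: 5)
 * baseDelayMs: delay before the second attempt (assumed default: 1000)
 */
function fetchWithBackoff(fetchFn, sleepFn, maxAttempts, baseDelayMs) {
  maxAttempts = maxAttempts || 5;
  baseDelayMs = baseDelayMs || 1000;
  var lastError = null;
  for (var attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return fetchFn(); // data obtained: leave the loop
    } catch (err) {
      lastError = err;
    }
    // Wait baseDelayMs, then 2x, 4x, ... but not after the final attempt.
    if (attempt < maxAttempts - 1) {
      sleepFn(baseDelayMs * Math.pow(2, attempt));
    }
  }
  // Limit reached: this is where an alert (e-mail, log entry) could be raised.
  throw new Error('Gave up after ' + maxAttempts + ' attempts: ' + lastError);
}

// Assumed usage inside Apps Script (url is a placeholder variable):
// var html = fetchWithBackoff(
//   function () { return UrlFetchApp.fetch(url).getContentText(); },
//   function (ms) { Utilities.sleep(ms); });
```

Adding some random jitter to each delay is a common refinement; it is left out here to keep the sketch short.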

On the other hand, Google Apps Script does not include a good tool for parsing HTML.

There is XmlService, but it only works with well-formed XHTML.
Regular expressions can be used to extract some text, but that is hacky.
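
To illustrate the regular-expression route, here is a minimal sketch that pulls player links out of a raw HTML string. The function name extractPlayerLinks and the sample markup are hypothetical; the real Soccerway pages may be structured differently, and a pattern like this breaks as soon as the markup changes. In Apps Script the HTML string would typically come from UrlFetchApp.fetch(url).getContentText().

```javascript
/**
 * Return the unique href values of <a> tags whose link contains "/players/".
 * Regex-based and therefore fragile: a sketch, not a real HTML parser.
 */
function extractPlayerLinks(html) {
  // Capture the href of each anchor tag, keeping only player links.
  var pattern = /<a\b[^>]*\bhref="([^"]*\/players\/[^"]*)"/g;
  var links = [];
  var seen = {};
  var match;
  while ((match = pattern.exec(html)) !== null) {
    var href = match[1];
    if (!seen[href]) { // skip duplicates, keep first-seen order
      seen[href] = true;
      links.push(href);
    }
  }
  return links;
}

// Hypothetical sample markup, for illustration only.
var sampleHtml =
  '<td><a href="/players/john-doe/123/" class="flag_16 left">J. Doe</a></td>' +
  '<td><a href="/teams/some-club/45/">Some Club</a></td>' +
  '<td><a href="/players/jane-roe/678/">J. Roe</a></td>';
var playerLinks = extractPlayerLinks(sampleHtml);
```

The resulting array could then be written to a sheet, one link per row, for example with Range.setValues, so that no IMPORTXML formula is needed in the spreadsheet.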
