Google Sheets importxml filtering specific xpath items with variable subitems

Question

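In a Google spreadsheet, I want to filter the output of the importxml() function. The XML source document is not well structured (some child elements are only sometimes present), so my columns are badly aligned and different kinds of data end up mixed together. The query syntax I used has no effect. I have tried selecting, excluding, and both. Perhaps you can help. Many thanks.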

=importxml(A1;"*/*[name()='NameId'] | */*[name()='ScientificName'] | */*[name()!='DisplayDate'] | */*[name()!='Family'] | */*[name()!='RankAbbreviation'] | */*[name()!='NomenclatureStatusID'] | */*[name()!='NomenclatureStatusName'] | */*[name()!='TotalRows']")

Here is the sheet (French-locale parameters): https://docs.google.com/spreadsheets/d/1wY_rt9ZRIMesXDFX_DoN-FTwcBkhNaugGgWd0X2bE4g/edit?usp=sharing
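
A note on the XPath above, which may explain why it appears to have no effect: `|` is a node-set union, so an element is kept if any one branch selects it. Each `name()!='X'` branch selects every element except X, which means two different `!=` branches together already select everything. Below is a minimal JavaScript sketch of that set logic. The list of element names is a hypothetical stand-in for the children of one record, not the real XML.

```javascript
// Hypothetical child element names of one record (stand-ins for XML nodes).
const names = ["NameId", "ScientificName", "Family", "DisplayDate", "TotalRows"];

// Each branch mimics one XPath predicate: [name()='X'] or [name()!='X'].
const eq = x => names.filter(n => n === x);
const ne = x => names.filter(n => n !== x);

// "|" is a union: a node is kept if ANY branch selects it.
const union = (...branches) =>
  names.filter(n => branches.some(b => b.includes(n)));

// Two "!=" branches re-admit everything: Family passes
// name()!='DisplayDate', and DisplayDate passes name()!='Family'.
const asWritten = union(eq("NameId"), ne("DisplayDate"), ne("Family"));

// Excluding several names needs ONE predicate joined with "and", e.g.
// */*[name()!='DisplayDate' and name()!='Family' and name()!='TotalRows']
const excluded = ["DisplayDate", "Family", "TotalRows"];
const intended = names.filter(n => excluded.every(x => n !== x));

console.log(asWritten); // all five names
console.log(intended);  // NameId and ScientificName only
```

Joining the exclusions with `and` would only change which elements are selected. It would not pad records whose children are missing, so the column alignment problem would likely remain.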

Answer 1

Try:

=QUERY(IMPORTXML(A1;
 "*/*[name()='NameId'] | 
  */*[name()='ScientificName'] |
  */*[name()!='DisplayDate'] |
  */*[name()!='Family'] |
  */*[name()!='RankAbbreviation'] |
  */*[name()!='NomenclatureStatusID'] |
  */*[name()!='NomenclatureStatusName'] |
  */*[name()!='TotalRows']");
 "where Col10 matches '^\d+$'"; 0)

Answer 2

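Issue and workaround:

Looking at the XML data at the URL, the set of child elements differs from one item to the next, and I think that is the cause of your issue. Unfortunately, with IMPORTXML and XPath it seems that a missing element cannot be directly replaced with an empty value. So, as a workaround for achieving your goal, I would like to propose a custom function written in Google Apps Script using XmlService, instead of IMPORTXML.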

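The flow of the sample script is as follows.

1. Retrieve the XML data from the URL.
2. Parse the XML data.
3. Retrieve the header row.
4. Create a result array and fill it with the values.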

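Reflecting this flow in a sample script gives the following.

Sample script:

Please copy and paste the following script into the script editor of the spreadsheet, then enter the custom formula =SAMPLE("http://services.tropicos.org/Name/Search?name=adonis&type=wildcard&apikey=7602bfa6-cd59-4029-a28d-3aeb0ff8836e&format=xml") in a cell. All values are then parsed and shown in the cells.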

function SAMPLE(url, ignoreHeaders, orderedHeaders) {
  // 1. Retrieve XML data from URL.
  const res = UrlFetchApp.fetch(url);
  
  // 2. Parse XML data.
  const xmlObj = XmlService.parse(res.getContentText());
  const root = xmlObj.getRootElement();
  const names = root.getChildren();

  // 3. Retrieve header row.
  let h = [];
  if (orderedHeaders) {
    h = orderedHeaders.split(",").map(e => e.trim());
  } else {
    const hObj = names.reduce((o, e) => {
      e.getChildren().forEach(f => {
        o[f.getName()] = true
      });
      return o;
    }, {});
    if (ignoreHeaders) {
      ignoreHeaders.split(",").forEach(e => {
        if (hObj[e.trim()]) delete hObj[e.trim()];
      });
    }
    h = Object.keys(hObj);
  }

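  // 4. Create a result array: header row first, then one row per child element.
  const ns = root.getNamespace();
  const result = names.reduce((ar, e) => {
    const temp = h.map(name => {
      const t = e.getChild(name, ns);
      if (t) {
        const v = t.getValue();
        // Convert numeric text to a number; keep blank text as-is,
        // because Number("") would otherwise turn it into 0.
        return v.trim() === "" || isNaN(v) ? v : Number(v);
      }
      return ""; // Child missing from this element: leave the cell empty.
    });
    ar.push(temp);
    return ar;
  }, [h]);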
  return result;
}

When =SAMPLE("URL") is put in a cell, all values are retrieved.
ignoreHeaders lists the headers you want to ignore. It corresponds to the */*[name()!='DisplayDate'] | */*[name()!='Family'] | */*[name()!='RankAbbreviation'] | */*[name()!='NomenclatureStatusID'] | */*[name()!='NomenclatureStatusName'] | */*[name()!='TotalRows'] part of your xpath.
- To ignore the headers DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows, enter the custom function as, for example, =SAMPLE("URL";"DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows"). Those headers are then left out of the retrieved values.
orderedHeaders lists the headers in a custom order. When it is given, the script uses exactly that list and does not apply ignoreHeaders.
- To retrieve the values with the headers in a custom order such as NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows, enter =SAMPLE("URL";;"NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows").
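
The header handling and row alignment can also be tried outside Apps Script. The following is a minimal sketch in plain JavaScript, not the script itself. The `records` array and the `toTable` name are hypothetical: each object stands in for one child of the root element and carries only the fields that are actually present. Unlike the script, this sketch leaves all values as strings.

```javascript
// Hypothetical records: each object stands in for one child of the root
// element, holding only the sub-elements that are present.
const records = [
  { NameId: "100", ScientificName: "Adonis", Family: "Ranunculaceae" },
  { NameId: "200", ScientificName: "Adonis aestivalis" }, // no Family
];

function toTable(records, ignoreHeaders, orderedHeaders) {
  const split = s => s.split(",").map(e => e.trim());
  let headers;
  if (orderedHeaders) {
    // An explicit order wins; ignoreHeaders is not applied in this case.
    headers = split(orderedHeaders);
  } else {
    // Union of all field names across records, in first-seen order.
    headers = [...new Set(records.flatMap(r => Object.keys(r)))];
    if (ignoreHeaders) {
      const ignored = new Set(split(ignoreHeaders));
      headers = headers.filter(h => !ignored.has(h));
    }
  }
  // A missing field becomes "", so every row has the same width.
  const rows = records.map(r => headers.map(h => (h in r ? r[h] : "")));
  return [headers, ...rows];
}

console.log(toTable(records));
console.log(toTable(records, "Family"));
console.log(toTable(records, null, "ScientificName, NameId"));
```

The point of the design is that the header row is fixed before any values are read, so a record with a missing child gets an empty cell rather than shifting the remaining values to the left.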

Result:

With the URL http://services.tropicos.org/Name/Search?name=adonis&type=wildcard&apikey=7602bfa6-cd59-4029-a28d-3aeb0ff8836e&format=xml in cell "A1":

Pattern 1:

Putting =SAMPLE(A1) in cell "B1" retrieves all values, with one column per header.

Pattern 2:

Putting =SAMPLE(A1;"DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows") in cell "B1" retrieves the values without the listed headers.

Pattern 3:

Putting =SAMPLE(A1;;"NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows") in cell "B1" retrieves the values with the columns in the given order.

Answer 3

To get past the 30-second limit that Google Sheets allows and still use the excellent solution Tanaike gave, I added a call from an onOpen() function, with the URL taken from cell A1. Thanks again.

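function onOpen() {
  var classeur = SpreadsheetApp.getActive().getActiveSheet();
  var xmlOK = classeur.getRange(2, 1);
  // Only write the formula when cell A2 is still empty.
  if (xmlOK.isBlank()) {
    var url = classeur.getRange(1, 1).getValue(); // URL from cell A1
    var ignoreHeaders = "RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows,Symbol";
    // FILTERXML is presumably the custom function from Answer 2 saved under
    // this name instead of SAMPLE. The ";" separator matches the French-locale sheet.
    var formule = '=FILTERXML("' + url + '";"' + ignoreHeaders + '")';
    classeur.getRange(2, 1).setFormula(formule);
  }
}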
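
The formula above is built by plain string concatenation, so a double quote inside the URL would break it, and the ";" separator is tied to the sheet's locale. As a rough sketch, a small helper could handle both. `buildFormula` and the example URL are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: builds a custom-function formula string.
// `sep` is the argument separator; ";" matches the French-locale sheet
// in the question, while many other locales use ",".
function buildFormula(fnName, args, sep = ";") {
  // Inside a Sheets string literal, a double quote is written as "".
  const quoted = args.map(a => '"' + String(a).replace(/"/g, '""') + '"');
  return "=" + fnName + "(" + quoted.join(sep) + ")";
}

const formula = buildFormula("FILTERXML", [
  "http://example.com/data?format=xml", // placeholder URL
  "TotalRows,Symbol",
]);
console.log(formula);
// =FILTERXML("http://example.com/data?format=xml";"TotalRows,Symbol")
```

In onOpen() the result would be passed to setFormula() in place of the concatenated string.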

