Removing duplicates: unexpected SPARQL SAMPLE behaviour, missing result

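I am querying the IdRef SPARQL endpoint to get a researcher's co-authors. To get more complete results, I run a federated query against the HAL endpoint.

My query works well but produces duplicates, and my goal is to remove them using authority identifiers (ORCID, ISNI, or others).

So far I have implemented the query below, but now my problem is that one result is missing.

My query is: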

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT distinct ?aut ?auturi
WHERE {
  SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
  WHERE {
    {
      ?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
      ?uri ?relcontrib ?auturi. #other with a link to these entities
      ?auturi a foaf:Person. #filter for persons
      ?auturi skos:prefLabel ?aut. #get authors' name
      FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
      OPTIONAL {
        ?auturi owl:sameAs ?ids. #get authors' identifiers
      }
    } UNION {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
      }
    }
  }
}

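As you can see, I am using a subquery with SAMPLE to do the "deduplication", but it does not work as expected (or at least not as I expected): one result is dropped. The un-sampled subquery returns an extra result matching this URI: https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin.rdf

At first I thought this was because that result has no matching owl:sameAs object, but another result in the set has none either and still appears in the final result set.

I am quite confused by this behaviour, and I suspect it is because I do not fully understand how SAMPLE works. Perhaps there is a more accurate way to achieve what I am looking for.

Edit: the results (with duplicates) are as follows: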

# auturi  aut
1   http://www.idref.fr/057577889/id Lantenois, Annick (1956-....)
2   http://www.idref.fr/033888760/id Cubaud, Pierre
3   http://www.idref.fr/028984838/id Suber, Peter
4   http://www.idref.fr/165836652/id Cramer, Florian (1969-....)
5   http://www.idref.fr/050447823/id Mounier, Pierre (1970-....)
6   http://www.idref.fr/174428006/id Ena, Alexandra (19..-....)
7   http://www.idref.fr/052212807/id Lebert, Marie
8   https://data.archives-ouvertes.fr/author/pierre-mounier Pierre Mounier
9   https://data.archives-ouvertes.fr/author/patrice-bellot Patrice Bellot
10 https://data.archives-ouvertes.fr/author/marlene-delhaye Marlène Delhaye
11 https://data.archives-ouvertes.fr/author/denis-bertin Denis Bertin
12 https://data.archives-ouvertes.fr/author/emma-bester Emma Bester
13 https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin Marie Masclet de Barbarin

Basically the only duplicates are #5 and #8. They can be identified as such because they share a common ?ids object (omitted from the results above for clarity).

Marie Masclet de Barbarin is hidden precisely because there is another person, Emma Bester, who also has no owl:sameAs value. Consider this query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?auturi ?aut ?ids
  WHERE {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
  }
}

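This yields 12 results.

Note that many of these people have several owl:sameAs values, and none of those values is shared with another person. Marie and Emma, however, have no value at all, so ?ids is left unbound for them and shows up as a 'null' value.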
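To make the effect of OPTIONAL concrete, here is a minimal Python sketch that mimics it as a left join. It is only a toy model, not what the endpoint runs: the names come from the question, while the identifiers `id:A` and `id:B` and the helper `optional_same_as` are made up for illustration.

```python
# Toy stand-in for the SERVICE result. Names are from the question;
# the identifier values are hypothetical placeholders.
authors = ["Pierre Mounier", "Emma Bester", "Marie Masclet de Barbarin"]
same_as = {"Pierre Mounier": ["id:A", "id:B"]}

def optional_same_as(authors, same_as):
    """Mimic OPTIONAL { ?auturi owl:sameAs ?ids }: keep every author,
    emit one row per identifier, or one row with None if there is none."""
    rows = []
    for aut in authors:
        ids = same_as.get(aut, [])
        if ids:
            rows.extend((aut, i) for i in ids)
        else:
            rows.append((aut, None))  # ?ids stays unbound
    return rows

rows = optional_same_as(authors, same_as)
for row in rows:
    print(row)
```

Both Emma and Marie end up with the same `None` in the identifier column, which is what matters for the grouping step that follows.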

So, when we sample the author name and URI (grouping by ?ids), we can use the following query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
WHERE {
  <http://www.idref.fr/139753753/id> owl:sameAs ?id.
  FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
  BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
  SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
    ?idHal foaf:publications ?uri. #same as above
    ?auturi foaf:publications ?uri.
    ?auturi foaf:name ?aut.
    FILTER (?idHal != ?auturi)
    OPTIONAL {
      ?auturi owl:sameAs ?ids.
    }
  }
}

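However, this returns only 11 results, and Marie is missing.

Why? Because ?ids is null for two different authors, they fall into the same group, and SAMPLE asks for only one author from each group, so the second one is skipped.

So why is Marie skipped 100% of the time rather than 50%? This is most likely determined by the order in which the triples were loaded into the store, so SAMPLE is deterministic for a given load order. In other words, if you took the data and loaded it into a different triplestore, Emma might be the one that gets skipped.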
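The grouping effect can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `sample_per_group` is a hypothetical helper that keeps the first whole row seen per key (a real SAMPLE may return any value from the group, and it samples each column independently), and `id:bellot` is a made-up identifier.

```python
def sample_per_group(rows, key_index):
    """Mimic grouping on one column plus SAMPLE on the others:
    keep the first row seen for each key (None is a key like any other)."""
    picked = {}
    for row in rows:
        picked.setdefault(row[key_index], row)
    return list(picked.values())

rows = [
    ("Patrice Bellot", "id:bellot"),      # hypothetical identifier
    ("Emma Bester", None),                # no owl:sameAs value
    ("Marie Masclet de Barbarin", None),  # no owl:sameAs value
]

kept = sample_per_group(rows, key_index=1)
kept_reversed = sample_per_group(list(reversed(rows)), key_index=1)

print([name for name, _ in kept])           # Marie is dropped
print([name for name, _ in kept_reversed])  # Emma is dropped instead
```

Feeding the same rows in the opposite order drops Emma rather than Marie, which matches the idea that the surviving author depends on the order in which the store happens to see the data.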

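How can this be fixed? The hard part is that Pierre Mounier exists as pretty much two different entities, with two ?ids and even two textual names, "Pierre Mounier" and "Mounier, Pierre (1970-....)". So the obvious solution of sampling ?auturi and grouping by ?aut would show Marie, but would not deduplicate Pierre.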
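A small Python sketch of that "obvious" approach shows why it falls short. It uses four rows from the result table in the question and a plain dictionary keyed on the name; this only models the grouping, it is not the query itself.

```python
rows = [
    # (auturi, aut), taken from the result table in the question
    ("http://www.idref.fr/050447823/id", "Mounier, Pierre (1970-....)"),
    ("https://data.archives-ouvertes.fr/author/pierre-mounier", "Pierre Mounier"),
    ("https://data.archives-ouvertes.fr/author/emma-bester", "Emma Bester"),
    ("https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin",
     "Marie Masclet de Barbarin"),
]

by_name = {}
for auturi, aut in rows:
    by_name.setdefault(aut, auturi)  # keep the first URI per distinct name

for aut, auturi in by_name.items():
    print(aut, auturi)
```

All four rows survive: Marie is visible, but the two spellings of Pierre Mounier are different keys and so are never merged.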

A better solution is to use COALESCE to bind ?ids to something distinct for each author, rather than leaving both as null. Here is how to do that:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?auturi ?aut ?idsClean
  WHERE {
<http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
    BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?idsClean)
  }
}

This returns:

Applying this approach to the larger query, we get:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT distinct ?aut ?auturi
WHERE {
  SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids_clean
  WHERE {
    {
      ?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
      ?uri ?relcontrib ?auturi. #other with a link to these entities
      ?auturi a foaf:Person. #filter for persons
      ?auturi skos:prefLabel ?aut. #get authors' name
      FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
      OPTIONAL {
        ?auturi owl:sameAs ?ids. #get authors' identifiers
      }
    BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
    } UNION {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
        BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
      }
    }
  }
}

This yields the correct 12 results.
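As a rough cross-check of the idea, the COALESCE fallback can be modelled in plain Python. The sketch assumes each author carries at most one identifier, `id:shared` is a made-up placeholder for the identifier the two Pierre Mounier records have in common, and `coalesce` is a hypothetical helper standing in for the SPARQL function.

```python
def coalesce(*values):
    """Return the first value that is not None, similar to SPARQL
    COALESCE returning its first bound argument."""
    for v in values:
        if v is not None:
            return v
    return None

rows = [
    # (auturi, aut, ids); "id:shared" is a hypothetical shared identifier
    ("http://www.idref.fr/050447823/id", "Mounier, Pierre (1970-....)", "id:shared"),
    ("https://data.archives-ouvertes.fr/author/pierre-mounier", "Pierre Mounier", "id:shared"),
    ("https://data.archives-ouvertes.fr/author/emma-bester", "Emma Bester", None),
    ("https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin",
     "Marie Masclet de Barbarin", None),
]

groups = {}
for auturi, aut, ids in rows:
    key = coalesce(ids, "No ID: " + aut)   # same fallback as ?ids_clean
    groups.setdefault(key, (auturi, aut))  # keep one row per key

for key, (auturi, aut) in groups.items():
    print(key, "->", aut)
```

With the fallback key, Emma and Marie each get their own group, while the two Pierre Mounier rows still collapse into one because they share an identifier.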