Select the value that occurs most frequently by group
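I have RDF data about hospital patients, including their dates of birth. There are often multiple triples about a patient's date of birth, and some of those triples may be wrong. My group has decided to use this rule: whichever date occurs most frequently will provisionally be considered correct. It's clear how to do this outside SPARQL, in any programming language we choose.
Is it possible to aggregate an aggregate in SPARQL?
I have read the similar question "SPARQL selecting MAX value of a counter", but I'm not quite there yet.
Given these triples: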
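For reference, here is a minimal sketch of that rule outside SPARQL. It assumes the triples have already been flattened into (patient, date) pairs; the names observations and most_frequent_dates are made up for illustration, and the values mirror the sample triples given further down.

```python
from collections import Counter, defaultdict

# Hypothetical (patient, birth date) pairs mirroring the sample triples.
observations = [
    ("b6be95364ec943af2ef4ab161c11c855", "1971-12-30"),
    ("b6be95364ec943af2ef4ab161c11c855", "1971-12-30"),
    ("b6be95364ec943af2ef4ab161c11c855", "1971-12-30"),
    ("6e200ca0d5150282787464a2bda55814", "2000-04-04"),
    ("6e200ca0d5150282787464a2bda55814", "2000-04-04"),
    ("6e200ca0d5150282787464a2bda55814", "2000-04-05"),
]


def most_frequent_dates(pairs):
    """Return {patient: (date, count)} for the most frequently asserted date."""
    counts = defaultdict(Counter)
    for patient, date in pairs:
        counts[patient][date] += 1
    # most_common(1) yields one (date, count) pair per patient; on a tie it
    # keeps whichever date was counted first.
    return {patient: c.most_common(1)[0] for patient, c in counts.items()}


consensus = most_frequent_dates(observations)
```

Note that this sketch silently picks one date when two dates tie for the highest count, which may or may not be what the rule intends.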
@prefix turbo: <http://example.org/ontologies/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/ontologies/b6be95364ec943af2ef4ab161c11c855>
a <http://example.org/ontologies/StudyPartWithBBDonation> ;
turbo:hasBirthDateO turbo:3950b2b6-f575-4074-b0e8-f9fa3378f3be, turbo:4250aafa-4b0c-4f73-92b6-7639f427b61d, turbo:a3e6676e-a214-4af4-b8ef-34a8e20170bf .
turbo:3950b2b6-f575-4074-b0e8-f9fa3378f3be turbo:hasDateValue "1971-12-30"^^xsd:date .
turbo:4250aafa-4b0c-4f73-92b6-7639f427b61d turbo:hasDateValue "1971-12-30"^^xsd:date .
turbo:a3e6676e-a214-4af4-b8ef-34a8e20170bf turbo:hasDateValue "1971-12-30"^^xsd:date .
turbo:6e200ca0d5150282787464a2bda55814
a turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO turbo:b09519f5-b123-40d5-bb4a-737ec9f8b9a8, turbo:06c56881-a6c7-4d1d-993b-add8862dffd7, turbo:12ef184d-c8d6-4d93-a558-a3ba47bb56ca .
turbo:b09519f5-b123-40d5-bb4a-737ec9f8b9a8 turbo:hasDateValue "2000-04-04"^^xsd:date .
turbo:06c56881-a6c7-4d1d-993b-add8862dffd7 turbo:hasDateValue "2000-04-04"^^xsd:date .
turbo:12ef184d-c8d6-4d93-a558-a3ba47bb56ca turbo:hasDateValue "2000-04-05"^^xsd:date .
This query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX turbo: <http://example.org/ontologies/>
SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
gives the following:
+----------------------------------------+------------------------+------------------+
| part                                   | xsddate                | datecount        |
+----------------------------------------+------------------------+------------------+
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-05"^^xsd:date | "1"^^xsd:integer |
| turbo:b6be95364ec943af2ef4ab161c11c855 | "1971-12-30"^^xsd:date | "3"^^xsd:integer |
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-04"^^xsd:date | "2"^^xsd:integer |
+----------------------------------------+------------------------+------------------+
I only want to see the date with the highest count for each patient taking part in the study:
+----------------------------------------+------------------------+------------------+
| part                                   | xsddate                | datecount        |
+----------------------------------------+------------------------+------------------+
| turbo:b6be95364ec943af2ef4ab161c11c855 | "1971-12-30"^^xsd:date | "3"^^xsd:integer |
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-04"^^xsd:date | "2"^^xsd:integer |
+----------------------------------------+------------------------+------------------+
I think I'm getting close. Now I need to get the count and the max count on the same row!
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX turbo: <http://example.org/ontologies/>
SELECT ?part ?xsddate ?datecount ?countmax
WHERE
{ { SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
}
UNION
{ SELECT ?part (MAX(?datecount) AS ?countmax)
WHERE
{ SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
}
GROUP BY ?part
}
}
gives
+----------------------------------------+------------------------+------------------+------------------+
| part                                   | xsddate                | datecount        | countmax         |
+----------------------------------------+------------------------+------------------+------------------+
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-05"^^xsd:date | "1"^^xsd:integer |                  |
| turbo:b6be95364ec943af2ef4ab161c11c855 | "1971-12-30"^^xsd:date | "3"^^xsd:integer |                  |
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-04"^^xsd:date | "2"^^xsd:integer |                  |
| turbo:6e200ca0d5150282787464a2bda55814 |                        |                  | "2"^^xsd:integer |
| turbo:b6be95364ec943af2ef4ab161c11c855 |                        |                  | "3"^^xsd:integer |
+----------------------------------------+------------------------+------------------+------------------+
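Essentially, you just need to replace UNION with . in your query (or simply delete the UNION, as @AKSW points out in the comments below). The two subqueries are then joined on ?part instead of being concatenated, so each count ends up on the same row as its group's maximum.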
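To see why the join helps, here is a rough Python sketch of the two operators applied to the rows from the tables above. The union and join helpers are simplified stand-ins written for this illustration, not how any SPARQL engine is actually implemented.

```python
# Rows as the two subqueries would return them (values from the tables above).
counts = [
    {"part": "6e200ca0d5150282787464a2bda55814", "xsddate": "2000-04-05", "datecount": 1},
    {"part": "b6be95364ec943af2ef4ab161c11c855", "xsddate": "1971-12-30", "datecount": 3},
    {"part": "6e200ca0d5150282787464a2bda55814", "xsddate": "2000-04-04", "datecount": 2},
]
maxima = [
    {"part": "6e200ca0d5150282787464a2bda55814", "countmax": 2},
    {"part": "b6be95364ec943af2ef4ab161c11c855", "countmax": 3},
]


def union(left, right):
    """UNION keeps both sides' rows as they are, so the other side's columns stay unbound."""
    return left + right


def join(left, right):
    """A join merges every pair of rows that agree on all shared variables (here only 'part')."""
    merged = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[k] == r_row[k] for k in shared):
                merged.append({**l_row, **r_row})
    return merged


unioned = union(counts, maxima)  # five rows, none with both datecount and countmax
joined = join(counts, maxima)    # three rows, each with datecount and countmax
# An equality filter on the joined rows keeps only the most frequent date(s).
winners = [row for row in joined if row["datecount"] == row["countmax"]]
```

With the join, an equality test between the count and the maximum is enough to pick out the wanted rows; if two dates tie for the maximum, both are kept.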
However, in GraphDB you will get an error:
Variable ?datecount is already used in a previous projection. Bindings are not propagated through projections since Sesame 2.8, so this may lead to logical errors in the query.
So change your query like this, renaming the variable in one of the subqueries:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX turbo: <http://example.org/ontologies/>
SELECT ?part ?xsddate ?datecount_ ?countmax
WHERE
{ { SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount_)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
}
.
{ SELECT ?part (MAX(?datecount) AS ?countmax)
WHERE
{ SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
}
GROUP BY ?part
}
}
In Blazegraph, you can use named subqueries:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX turbo: <http://example.org/ontologies/>
SELECT ?part ?xsddate ?datecount ?countmax
WITH
{ SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
} AS %sub
WHERE
{ { SELECT ?part (MAX(?datecount) AS ?countmax)
WHERE { INCLUDE %sub } GROUP BY ?part
}
INCLUDE %sub
}
My elaboration on Stanislav's excellent answer:
- renamed ?datecount in one of the {} patterns
- added a filter
- inserted the consensus DOB, as triples, into a named graph
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX turbo: <http://example.org/ontologies/>
INSERT {
GRAPH turbo:DOB_conclusions {
?part turbo:hasBirthDateO ?DOBconc .
?DOBconc turbo:hasDateValue ?xsddate .
?DOBconc turbo:conclusionated true .
?DOBconc rdf:type <http://www.ebi.ac.uk/efo/EFO_0004950> .
}
}
WHERE
{ { SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
}
.
{ SELECT ?part (MAX(?datecount2) AS ?countmax)
WHERE
{ SELECT ?part ?xsddate (COUNT(?xsddate) AS ?datecount2)
WHERE
{ ?part rdf:type turbo:StudyPartWithBBDonation ;
turbo:hasBirthDateO ?dob .
?dob turbo:hasDateValue ?xsddate
}
GROUP BY ?part ?xsddate
}
GROUP BY ?part
}
FILTER ( ?datecount = ?countmax )
BIND(uri(concat("http://transformunify.org/ontologies/", struuid())) AS ?DOBconc)
}