Extracting RDF triples from Wikidata
I am following this guide on querying Wikidata.
If I know an entity's code, I can get the entity with:
from wikidata.client import Client
client = Client()
entity = client.get('Q20145', load=True)
entity
>>><wikidata.entity.Entity Q20145 'IU'>
entity.description
>>>m'South Korean singer-songwriter, record producer, and actress'
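But how can I get the RDF triples of that entity? That is, all the outgoing and incoming edges in the form (subject, predicate, object).
It looks like this SO question managed to get the triples, but only from a data dump here. I am trying to get them from the library itself.
If you only need the outgoing edges, you can retrieve them directly by requesting https://www.wikidata.org/wiki/Special:EntityData/Q20145.nt: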
from rdflib import Graph
g = Graph()
g.parse('https://www.wikidata.org/wiki/Special:EntityData/Q20145.nt', format="nt")
for subj, pred, obj in g:
    print(subj, pred, obj)
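The EntityData export typically contains more than the entity's own claims: statement nodes, references and similar helper triples are included as well. Here is a minimal sketch, assuming you only want the direct ("truthy") claims of the entity itself. It accepts any iterable of (subject, predicate, object) triples, such as the rdflib graph above. The name `direct_claims` and the sample data (including the made-up statement ID) are illustrative and not part of any library.

```python
ENTITY_NS = "http://www.wikidata.org/entity/"
DIRECT_NS = "http://www.wikidata.org/prop/direct/"


def direct_claims(triples, qid):
    """Keep triples whose subject is the entity and whose predicate is a direct property."""
    entity_uri = ENTITY_NS + qid
    kept = []
    for subj, pred, obj in triples:
        # str() makes this work for rdflib terms as well as plain strings.
        subj, pred, obj = str(subj), str(pred), str(obj)
        if subj == entity_uri and pred.startswith(DIRECT_NS):
            kept.append((subj, pred, obj))
    return kept


# Hand-made sample; the statement node ID is invented for illustration.
sample = [
    (ENTITY_NS + "Q20145", DIRECT_NS + "P31", ENTITY_NS + "Q5"),
    (ENTITY_NS + "Q20145", "http://schema.org/description", "South Korean singer-songwriter"),
    (ENTITY_NS + "statement/Q20145-abc", "http://www.wikidata.org/prop/statement/P31", ENTITY_NS + "Q5"),
]
print(direct_claims(sample, "Q20145"))
```

With the graph from the snippet above, the call would be `direct_claims(g, "Q20145")`.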
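To get both incoming and outgoing edges, you need to query the database. On Wikidata, this is done with the Wikidata Query Service and the query language SPARQL. The SPARQL expression to get all edges is simply DESCRIBE wd:Q20145.
Using Python, you can retrieve the query results with the following code: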
import requests
endpoint_url = "https://query.wikidata.org/sparql"
# Identify your client with a User-Agent header
headers = { 'User-Agent': 'MyBot' }
payload = {
    'query': 'DESCRIBE wd:Q20145',
    'format': 'json'
}
r = requests.get(endpoint_url, params=payload, headers=headers)
results = r.json()
triples = []
for result in results["results"]["bindings"]:
    triples.append((result["subject"], result["predicate"], result["object"]))
print(triples)
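This gives you the complete result as it comes from the complex underlying data model. If you want to query incoming and outgoing edges separately, replace DESCRIBE wd:Q20145 with CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?s) ?s ?p ?o} to get only the outgoing edges, or with CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?o) ?s ?p ?o} to get only the incoming edges.
Depending on your goal, you may want to filter out some triples, e.g. statement triples, and you may want to simplify others.
One way to get a cleaner result is to replace the last four lines with: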
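If you already have the DESCRIBE result, the two directions can also be separated locally instead of sending two more queries. This is a rough sketch, assuming the JSON bindings have the subject/predicate/object shape used in the code above. The function `split_edges` and the sample bindings (including the placeholder ID Q999999999) are hypothetical.

```python
ENTITY_NS = "http://www.wikidata.org/entity/"


def split_edges(bindings, qid):
    """Partition DESCRIBE-style bindings into (outgoing, incoming) triple lists."""
    entity_uri = ENTITY_NS + qid
    outgoing, incoming = [], []
    for b in bindings:
        triple = (b["subject"]["value"], b["predicate"]["value"], b["object"]["value"])
        if triple[0] == entity_uri:
            outgoing.append(triple)
        if triple[2] == entity_uri:
            # A self-referencing triple lands in both lists.
            incoming.append(triple)
    return outgoing, incoming


# Hand-made bindings; Q999999999 is a placeholder, not a real lookup result.
sample_bindings = [
    {"subject": {"value": ENTITY_NS + "Q20145"},
     "predicate": {"value": "http://www.wikidata.org/prop/direct/P31"},
     "object": {"value": ENTITY_NS + "Q5"}},
    {"subject": {"value": ENTITY_NS + "Q999999999"},
     "predicate": {"value": "http://www.wikidata.org/prop/direct/P175"},
     "object": {"value": ENTITY_NS + "Q20145"}},
]
out_edges, in_edges = split_edges(sample_bindings, "Q20145")
print(len(out_edges), "outgoing,", len(in_edges), "incoming")
```

Triples that mention the entity in neither position, such as those describing statement nodes, are dropped by this split.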
triples = []
for result in results["results"]["bindings"]:
    # Strip the namespace prefixes so only IDs such as Q20145 or P31 remain
    subject = result["subject"]["value"].replace('http://www.wikidata.org/entity/', '')
    obj = result["object"]["value"].replace('http://www.wikidata.org/entity/', '')  # "obj" avoids shadowing the built-in "object"
    predicate = result["predicate"]["value"].replace('http://www.wikidata.org/prop/direct/', '')
    # Skip triples that involve statement nodes
    if 'statement/' in subject or 'statement/' in obj:
        continue
    triples.append((subject, predicate, obj))
print(triples)
But how can I get the RDF triples of that entity?
By using a SPARQL DESCRIBE query (source), you get a single result RDF graph containing all the outgoing and incoming edges in the form of (subject, predicate, object). This can be achieved using the following Python example code (source):
from SPARQLWrapper import SPARQLWrapper, JSON  # JSON is needed by setReturnFormat below
queryString = """DESCRIBE wd:Q20145"""
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(queryString)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
    print(result)
If you want only the outgoing edges, use CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?s) ?s ?p ?o}, and for the incoming edges, use CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?o) ?s ?p ?o} (thanks to @UninformedUser).
Example code:
from SPARQLWrapper import SPARQLWrapper, JSON  # JSON is needed by setReturnFormat below
queryString = """CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?s) ?s ?p ?o}"""
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(queryString)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
    print(result)
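The three query shapes above differ only in where the entity is bound. This small sketch builds the query string for a given item ID. The name `build_query` and its `direction` argument are illustrative and not part of SPARQLWrapper. The ID check is an assumption that only item IDs of the form Q123 are wanted.

```python
import re


def build_query(qid, direction="both"):
    """Return a SPARQL query for all, outgoing-only, or incoming-only edges of an item."""
    # Guard against accidental query injection; only item IDs like Q20145 are accepted here.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    if direction == "both":
        return f"DESCRIBE wd:{qid}"
    if direction == "out":
        var = "s"  # entity bound as subject
    elif direction == "in":
        var = "o"  # entity bound as object
    else:
        raise ValueError("direction must be 'both', 'out' or 'in'")
    # Doubled braces are literal braces inside an f-string.
    return f"CONSTRUCT {{?s ?p ?o}} WHERE {{BIND(wd:{qid} AS ?{var}) ?s ?p ?o}}"


print(build_query("Q20145", "in"))
```

The returned string can be passed to `sparql.setQuery(...)` in the example above.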