How to use DBpedia properties to build a topic hierarchy?
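I am trying to build a topic hierarchy by following the two DBpedia properties below.
- the skos:broader property
- the dcterms:subject property
My goal is to identify the topics of a given term. For example, given the term 'support vector machine', I want to identify topics such as classification algorithm, machine learning, etc.
However, I am sometimes unsure how to build the topic hierarchy, because I get more than 5 URIs for subject and many URIs for broader. Is there a way to measure strength or something similar, so that I can discard the extra URIs I get from DBpedia and assign only the most probable one?
There seem to be two questions here.
- How to limit the number of DBpedia Spotlight results.
- How to limit the number of subjects and categories for a particular result.
My current code is as follows: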
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

# Initial constants
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
CONFIDENCE = '0.5'
SUPPORT = '120'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

# Annotate the text with DBpedia Spotlight and collect the entity URIs
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])

# For each entity, follow skos:broader or dct:subject one step
for url in all_urls:
    sparql.setQuery("""
        SELECT * WHERE {<"""
        + url +
        """> skos:broader|dct:subject ?resource
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for result in results["results"]["bindings"]:
        print('resource ---- ', result['resource']['value'])
I am happy to provide more examples if needed.
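It seems you are trying to retrieve the Wikipedia categories relevant to a given paragraph.
Minor suggestions
First, I suggest performing a single request, collecting the DBpedia Spotlight results into VALUES, for example like this: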
values = '(<{0}>)'.format('>) (<'.join(all_urls))
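As a rough illustration, the snippet below shows what that VALUES string looks like for two example URIs and how it might be spliced into a single query. The helper name build_values and the sample URIs are made up for this sketch, and nothing here contacts an endpoint.

```python
def build_values(uris):
    """Format URIs as SPARQL VALUES rows: (<uri1>) (<uri2>) ..."""
    return ' '.join('(<{0}>)'.format(uri) for uri in uris)


# Sample URIs chosen for illustration only
sample_urls = [
    'http://dbpedia.org/resource/Berlin',
    'http://dbpedia.org/resource/Germany',
]

values = build_values(sample_urls)

# One query for all entities instead of one query per entity
query = """
SELECT DISTINCT ?s ?resource WHERE {
    VALUES (?s) { %s }
    ?s dct:subject ?resource
}
""" % values
print(query)
```

For a non-empty list this produces the same string as the one-liner above; for an empty list it yields an empty VALUES block rather than the malformed row (<>).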
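Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.
These two suggestions are somewhat incompatible. Virtuoso is very inefficient when a query contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators).
Below I use the dct:subject/skos:broader property path, i.e. I retrieve the 'next-level' categories.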
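If more than one level is needed, one option that avoids the * and + operators is to spell out a fixed-length path. A small sketch, assuming a hypothetical helper category_path; the depth argument is my own parameter, not something DBpedia or SPARQL defines:

```python
def category_path(depth):
    """Build a fixed-length property path from a resource to its categories.

    depth=0 gives the direct categories, depth=1 their parent categories,
    depth=2 the grandparent categories, and so on.
    """
    if depth < 0:
        raise ValueError('depth must be non-negative')
    return '/'.join(['dct:subject'] + ['skos:broader'] * depth)


for depth in range(3):
    print(depth, category_path(depth))
```

The resulting string can be placed between ?s and ?resource in the queries below; depth=1 gives exactly the path used there.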
Approach 1
The first approach is to order the resources by their general popularity, e.g. their PageRank:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """PREFIX vrank:<http://purl.org/voc/vrank#>
       SELECT DISTINCT ?resource ?rank
       FROM <http://dbpedia.org>
       FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
       WHERE {
           VALUES (?s) {""" + values +
    """    }
           ?s dct:subject/skos:broader ?resource .
           ?resource vrank:hasRank/vrank:rankValue ?rank.
       } ORDER BY DESC(?rank)
       LIMIT 10
    """)
The results are:
dbc:Member_states_of_the_United_Nations
dbc:Country_subdivisions_of_Europe
dbc:Republics
dbc:Demography
dbc:Population
dbc:Countries_in_Europe
dbc:Third-level_administrative_country_subdivisions
dbc:International_law
dbc:Former_countries_in_Europe
dbc:History_of_the_Soviet_Union_and_Soviet_Russia
Approach 2
The second approach is to count how often each category occurs for the given text...
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """SELECT ?resource count(?resource) AS ?count WHERE {
           VALUES (?s) {""" + values +
    """    }
           ?s dct:subject ?resource
       } GROUP BY ?resource
       # https://github.com/openlink/virtuoso-opensource/issues/254
       HAVING (count(?resource) > 1)
       ORDER BY DESC(count(?resource))
       LIMIT 10
    """)
The results are:
dbc:Wars_by_country
dbc:Wars_involving_the_states_and_peoples_of_Europe
dbc:Wars_involving_the_states_and_peoples_of_Asia
dbc:Wars_involving_the_states_and_peoples_of_North_America
dbc:20th_century_in_Germany
dbc:Modern_history_of_Germany
dbc:Wars_involving_the_Balkans
dbc:Decades_in_Germany
dbc:Modern_Europe
dbc:Wars_involving_the_states_and_peoples_of_South_America
With dct:subject instead of dct:subject/skos:broader, the results are better:
dbc:Former_polities_of_the_Cold_War
dbc:Former_republics
dbc:States_and_territories_established_in_1949
dbc:20th_century_in_Germany_by_period
dbc:1930s_in_Germany
dbc:Modern_history_of_Germany
dbc:1990_disestablishments_in_West_Germany
dbc:1933_disestablishments_in_Germany
dbc:1949_establishments_in_West_Germany
dbc:1949_establishments_in_Germany
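To work with the frequencies in Python rather than just printing them, the bindings could be read along these lines. This is a sketch that assumes the JSON return format and the ?resource and ?count variable names from the query above; the function name category_counts is hypothetical, and the sample dictionary is hand-made with invented counts.

```python
def category_counts(results):
    """Turn SPARQL JSON results into (category URI, count) pairs."""
    pairs = []
    for binding in results['results']['bindings']:
        uri = binding['resource']['value']
        count = int(binding['count']['value'])  # literal values arrive as strings
        pairs.append((uri, count))
    # Most frequent categories first
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hand-made sample in the SPARQL JSON results layout; the counts are invented
sample = {
    'results': {
        'bindings': [
            {'resource': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Category:Former_republics'},
             'count': {'type': 'typed-literal', 'value': '2'}},
            {'resource': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Category:Modern_history_of_Germany'},
             'count': {'type': 'typed-literal', 'value': '3'}},
        ]
    }
}

for uri, count in category_counts(sample):
    print(count, uri)
```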
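Conclusion
The results are not very good. I see two reasons: DBpedia categories are quite random, and the tools are quite primitive. Combining Approach 1 and Approach 2 might give better results. In any case, experiments with a large corpus are needed.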
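One possible way to combine the two approaches is to normalise each score and take a weighted sum. The sketch below is only an assumption about how such a combination might look: the function combine_scores, the weight alpha and the sample numbers are all made up, and nothing here has been evaluated on real data.

```python
def combine_scores(freq, rank, alpha=0.5):
    """Rank categories by a weighted sum of normalised frequency and PageRank.

    freq:  dict mapping category -> count in the text (Approach 2)
    rank:  dict mapping category -> PageRank value (Approach 1)
    alpha: weight of the frequency score, between 0 and 1 (an assumption)
    """
    # Fall back to 1 so that empty or all-zero inputs do not divide by zero
    max_freq = max(freq.values(), default=0) or 1
    max_rank = max(rank.values(), default=0) or 1
    combined = {}
    for category in set(freq) | set(rank):
        f = freq.get(category, 0) / max_freq
        r = rank.get(category, 0) / max_rank
        combined[category] = alpha * f + (1 - alpha) * r
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Invented numbers, for illustration only
freq = {'dbc:Modern_history_of_Germany': 3, 'dbc:Former_republics': 2}
rank = {'dbc:Former_republics': 10.0, 'dbc:Republics': 40.0}
for category, score in combine_scores(freq, rank):
    print(category, round(score, 3))
```

A category missing from one of the two inputs simply scores zero on that side, so alpha controls how much a category that is frequent in the text can outweigh one that is merely popular overall.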