如何加速找到两篇维基百科文章之间最短路径的程序

Question

我最近编写了一个程序，用于查找两篇维基百科文章之间的最短路径。问题是从页面获取所有链接并将它们放入图表中需要很长时间。找到路径是比较容易的部分。基本上我正在做的是：

startingPage = 'Lisbon'
target = 'Adolf Hitler'
graph = nx.DiGraph()
graph.add_node(startingPage)
found = pages.import_page(graph, startingPage)

while found != True:
    for node in list(graph):
        if graph.out_degree(node) == 0:
            found = pages.import_page(graph, node)
        if found == True:
            break;

我的 import_page 函数是这个：

def import_page(graph, starting, target):
    general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
    data_str = general_str + starting
    encoded_data_str = data_str.encode('utf-8') #Sanitize input
    response = url.urlopen(encoded_data_str)
    data = json.loads(response.read())
    pageId = data['query']['pages'].keys()
    print starting
    if data['query']['pages'].keys()[0] == '-1': #Check if the page doesn't exist in Wikipedia
        return False
    elif data['query']['pages'][pageId[0]].keys()[2] != 'links': #Check if the page has no links in it
        return False

    for jsonObject in data['query']['pages'][pageId[0]]['links']:

        graph.add_node(jsonObject['title'])
        graph.add_edge(starting, jsonObject['title'])

        if jsonObject['title'] == target:
            return True

    while data.keys()[0] != 'batchcomplete':

        continueId = data['continue']['plcontinue']
        continue_str = data_str + '&plcontinue=' + continueId
        encoded_continue_str = continue_str.encode('utf-8') #Sanitize input
        response = url.urlopen(encoded_continue_str)
        data = json.loads(response.read())

        for jsonObject in data['query']['pages'][pageId[0]]['links']:
            graph.add_node(jsonObject['title'])
            graph.add_edge(starting, jsonObject['title'])
            if jsonObject['title'] == target:
                return True

    return False

问题是任何大于 2/3 链接的距离都会花费大量时间。关于如何加快速度的任何想法？

Answer 1

使用简单的算法和网络几乎不可能确定地找到最短路径API。如果最短路径有 N 步，则需要遍历所有长度为 N-1 或更短的可能路径才能确定。有数百万篇文章和数十到数百个链接，除非你真的很幸运并且最短路径只有 1-2 个链接，否则这是不可行的。如果说 10 步之外，您将不得不发出数十亿次请求，这将需要数年时间。

如果您大多数时候只是想找到一个合理的短路径，您可以尝试类似 A* search algorithm with a good heuristic. For example, you could hypothesize some sort of small-world property 的方法，并尝试识别与其他主题中心接近的主题中心以及那个话题。或者，您可以根据与目标处于同一主题或同一历史时期的候选人进行评分。

Answer 2

我使用了@Tgr 指出的方法，利用了一个小世界。如果您使用 加权网络 ，您可以将搜索限制在足够大以包含相关中心的子图中，并且足够小以在网络中处理 RESTful API.

您可能需要检查 iGraph 模块而不是 networkx，以减少内存占用。

使用我向您建议的方法，我已经能够获得连接多达 5 条查询的维基百科文章的最短路径，实时创建的子图占用内存高达 100MB。两个主题之间的最短路径不到1秒。

我很乐意与我的项目分享一个 link，它实际上计算维基百科的加权知识网络以允许搜索多个主题之间的联系 - 它会破坏 SO 政策还是对OP 和关于他的问题的讨论？

编辑

感谢@Tgr 对政策的汇报。

Nifty.works是一个寻找跨学科领域之间联系的原型平台。知识图谱是与英文维基百科配对的维基数据的一个子集。

作为 OP 的示例，此示例显示 最短路径 在 5 篇维基百科文章 之间查询：subgraph for connections between articles: "Shortest Path Problem", "A star search", "networkx", "knowledge graph" and "semantic network"

我将维基百科的知识图计算为一个加权网络。该网络具有小世界属性。通过划分知识图的一部分（子图）来查询文章之间的连接（路径）。

使用这种方法，即使在小型服务器计算机上，也可以足够快地提供图形搜索以提供知识发现方面的见解。

在这里您可以找到 examples of gamification of shortest paths between two articles of English Wikipedia, each pair has a distance bigger than 3 links - that is, they are not first neighbours: e.g. "Machine Learning" and "Life" -here a json of the queried subgraph)。

您甚至可能想添加参数来调整加权子图的大小，以获得不同的结果。例如，请参阅以下之间的差异：

最后，也看看这个问题：

如何加速找到两篇维基百科文章之间最短路径的程序

How to speed up program that finds the shortest path between two wikipedia articles

python

algorithm

graph

wikipedia-api

networkx