Ways to improve searching efficiency

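I need some feedback. For one of my projects I am building Six Degrees of Wikipedia. To keep it short: I have done all the data cleaning and inserted the data into a table in MSSQL. So far everything works. I can search for a connection from the start page to the finish page up to three degrees, but beyond that the processing takes too long. I am looking for ways to change the code to make it more efficient. I am new to this and it is my first attempt, so this is probably not the best way to do it (I suspect it may even be the worst way I could have done it).

Any feedback would be greatly appreciated.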

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

import pyodbc
import time
#import re

#rex = re.compile('(\(\'[a-zA-Z0-9]+\', \')(\w\))')
start_time = time.time()


listinit = []
listseconditeration = []
listthirditeration = []
listfourthiteration = []
listfifthiteration = []
listsixthiteration = []

start = input ("Select start location :")
finish = input ("Select finish location :")
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=johndoe-PC\SQLEXPRESS;DATABASE=master;UID=-----;PWD=------;Trusted_Connection=yes')
cursor = cnxn.cursor()
cursor.execute("select * from join_table where link1 like '%s'" % (start))

rows = cursor.fetchall()
for row in rows:
    listinit.append(row)

for element in listinit:
    var1 = str(element)
    var1 = var1.replace("'","")
    var1 = var1.replace("(","")
    var1 = var1.replace(")","")
    var1 = var1.replace(",","")
    var1 = var1.replace(" ","")
    var1 = var1.replace(start,"")
    listseconditeration.append(var1)


if (finish) in (listseconditeration):
    print("one degree away")
    print("%s minutes" % (time.time() - start_time))


for element in listseconditeration:
    var2 = str(element)

    cursor.execute("select * from join_table where link1 like '%s'" % (var2))
    rows1 = cursor.fetchall()

    for row in rows1:

        listthirditeration.append(row)

        for element in listthirditeration:
            var3 = str(element)
            var3 = var3.replace("'","")
            var3 = var3.replace("(","")
            var3 = var3.replace(")","")
            var3 = var3.replace(",","")
            var3 = var3.replace(" ","")
            var3 = var3.replace(var2, "")
            listfourthiteration.append(var3)



if (finish) in (listfourthiteration):
    print("two degree away")
    print("%s minutes" % (time.time() - start_time))



for element in listfourthiteration:
    var4 = str(element)

    cursor.execute("select * from join_table where link1 like '%s'" % (var4))
    rows2 = cursor.fetchall()

    for row in rows2:

        listfifthiteration.append(row)

        for element in listfifthiteration:
            var5 = str(element)
            var5 = var5.replace("'","")
            var5 = var5.replace("(","")
            var5 = var5.replace(")","")
            var5 = var5.replace(",","")
            var5 = var5.replace(" ","")
            var5 = var5.replace(var4, "")
            listsixthiteration.append(var5)
        print(row)


if (finish) in (listsixthiteration):
    print("three degree away")

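There are a couple of problems with the code.

The first and most important problem is that the code does nothing to prevent reprocessing elements that have already been visited. For example, when you go chicago->autumn, nothing in your routine stops the search from going back to chicago and looping. This makes the amount of searching grow very quickly.

One solution is to maintain a set of the elements already visited: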

visited = set()

...

    for element in listthirditeration:
        var3 = str(element)
        var3 = var3.replace("'","")
        var3 = var3.replace("(","")
        var3 = var3.replace(")","")
        var3 = var3.replace(",","")
        var3 = var3.replace(" ","")
        var3 = var3.replace(var2, "")

        if var3 not in visited:
            visited.add(var3)
            listfourthiteration.append(var3)

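...repeat for the other similar loops...

The second problem is that the code makes copies of the element lists and lets them grow longer than they need to be.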
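To show the visited-set idea in one place, here is a minimal sketch of a level-by-level search. It is not your code: the `links` dict is a hypothetical in-memory stand-in for `join_table`, and the function name and `max_degree` parameter are made up for illustration.

```python
def degrees_of_separation(links, start, finish, max_degree=6):
    """Return the number of links from start to finish, or None if not found.

    `links` is a hypothetical stand-in for join_table: a dict that maps
    each page to the list of pages it links to.
    """
    if start == finish:
        return 0
    visited = {start}      # every page that has already been queued
    frontier = [start]     # pages exactly (degree - 1) links from start
    for degree in range(1, max_degree + 1):
        next_frontier = []
        for page in frontier:
            for neighbor in links.get(page, ()):
                if neighbor == finish:
                    return degree
                if neighbor not in visited:
                    # First time this page is reached, so queue it once.
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
        if not frontier:
            break
    return None


# Toy data (made up): chicago and autumn link to each other.
links = {
    "chicago": ["autumn", "illinois"],
    "autumn": ["chicago", "season"],
    "illinois": ["chicago", "usa"],
    "season": ["weather"],
}
print(degrees_of_separation(links, "chicago", "weather"))
```

Because each page enters `visited` only once, the chicago->autumn->chicago loop is never expanded a second time, and one loop replaces the separate per-degree lists.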

cursor.execute("select * from join_table where link1 like '%s'" % (var2))

rows1 = cursor.fetchall()           # <<== fetchall() returns a list of results

for row in rows1:

    listthirditeration.append(row)  # <<== this makes a second copy of the results
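One way to avoid the extra copies is to loop over the cursor directly and read the column you need by index, rather than appending every row to a second list and then stripping punctuation from its string form. The sketch below uses `sqlite3` from the standard library only so that it runs anywhere; your code uses pyodbc, whose cursors can be iterated the same way and which also takes `?` parameter markers. The column name `link2` is an assumption (the question only shows `link1`), and I have swapped `like` for `=` on the assumption that an exact match is intended.

```python
import sqlite3

# Stand-in database so the sketch is runnable; the question connects to
# SQL Server through pyodbc instead. The column name link2 is assumed.
cnxn = sqlite3.connect(":memory:")
cursor = cnxn.cursor()
cursor.execute("create table join_table (link1 text, link2 text)")
cursor.executemany(
    "insert into join_table values (?, ?)",
    [("chicago", "autumn"), ("chicago", "illinois"), ("autumn", "season")],
)


def neighbors(cursor, page):
    """Return the pages that `page` links to."""
    # Passing the value as a parameter avoids building the SQL string by hand.
    cursor.execute("select link2 from join_table where link1 = ?", (page,))
    # Iterate the cursor itself and take the first column of each row;
    # there is no intermediate list and no str()/replace() clean-up.
    return [row[0] for row in cursor]


print(neighbors(cursor, "chicago"))
```

Selecting only the column you need also means there is nothing left over to strip out afterwards.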

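A useful trick when doing a breadth-first search like this is to run two searches, one from the start towards the finish and the other from the finish towards the start, and let them meet in the middle.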
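Here is a rough sketch of that meet-in-the-middle idea, again over a hypothetical in-memory dict rather than your table. All names are made up for illustration. Since links are directed, the backward search has to follow links in reverse; against the database that would presumably mean looking up rows by the target column instead of `link1`.

```python
from collections import defaultdict


def reverse_links(links):
    """Build the reversed graph: page -> pages that link to it."""
    reverse = defaultdict(list)
    for page, targets in links.items():
        for target in targets:
            reverse[target].append(page)
    return reverse


def expand_level(frontier, graph, dist_here, dist_there):
    """Advance one search by a full level.

    Returns the new frontier and the shortest total path length found
    through a page the other search has already reached (or None).
    """
    next_frontier = []
    best = None
    for page in frontier:
        for neighbor in graph.get(page, ()):
            if neighbor in dist_there:
                total = dist_here[page] + 1 + dist_there[neighbor]
                if best is None or total < best:
                    best = total
            if neighbor not in dist_here:
                dist_here[neighbor] = dist_here[page] + 1
                next_frontier.append(neighbor)
    return next_frontier, best


def bidirectional_degrees(forward, backward, start, finish):
    """Return the number of links from start to finish, or None."""
    if start == finish:
        return 0
    dist_f, dist_b = {start: 0}, {finish: 0}   # these double as visited sets
    frontier_f, frontier_b = [start], [finish]
    while frontier_f and frontier_b:
        # Expand whichever side currently has the smaller frontier.
        if len(frontier_f) <= len(frontier_b):
            frontier_f, best = expand_level(frontier_f, forward, dist_f, dist_b)
        else:
            frontier_b, best = expand_level(frontier_b, backward, dist_b, dist_f)
        if best is not None:
            return best
    return None


# Toy data (made up for the example).
links = {
    "chicago": ["autumn", "illinois"],
    "autumn": ["chicago", "season"],
    "illinois": ["chicago", "usa"],
    "season": ["weather"],
}
print(bidirectional_degrees(links, reverse_links(links), "chicago", "weather"))
```

The payoff is that each search only has to go about half as deep, and the number of pages at each level tends to grow very quickly with depth, so two shallow searches usually touch far fewer pages than one deep one.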