Why does query time decrease, but overall code time increase when implementing indexes?

Question

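I am iterating over a large number of messages, each with a unixtime and a userid, and I want to find the number of messages each user sent within a 24-hour period. I posted my code on codereview to get some help. From there I optimized the

cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')

query, because I found that it accounted for 6.7 of the 7.2 seconds my code took to run. I optimized it by creating an index on the unixtime column with CREATE INDEX unixtime_times ON MessageType1 (unixtime asc). The query now takes 0.00117 seconds instead of 6.7 seconds, but the total time to run my code has increased from 7.2 seconds to 15.8 seconds. Nothing changed except the index. On closer inspection, it appears that messages = cur.fetchall() takes 15.3 seconds after adding the index. Does anyone know why? Thanks in advance!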

import sqlite3 as lite

con = lite.connect(databasepath)
userID = []
messages = []
messageFrequency = []
with con:
    cur = con.cursor()
    #Get all UserID
    cur.execute('SELECT DISTINCT userid FROM MessageType1')
    userID = cur.fetchall()
    userID = {index:x[0] for index,x in enumerate(userID)}
    #For each UserID
    for index in range(len(userID)):
        messageFrequency.append(0)
        #Get all MSG with UserID = UserID sorted by UNIXTIME
        cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')
        messages = cur.fetchall()
        messages = {index:x[0] for index,x in enumerate(messages)}
        #Loop through every MSG
        for messageIndex in range(len(messages)):
            frequency = 0
            message = messages[messageIndex]
            #Loop through every message that is within 24 hours
            for nextMessageIndex in range(messageIndex+1, len(messages)):
                nextmessage = messages[nextMessageIndex]
                if nextmessage < message+(24*60*60):
                    #Count the number of occurrences
                    frequency += 1
                else:
                    break
            #Add best benchmark for every message to a list that should be plotted.
            if messageFrequency[-1]<frequency:
                messageFrequency[-1] = frequency

Answer 1

The best index for this query:

SELECT unixtime
FROM MessageType1
WHERE userID ='+str(userID[index])+'
ORDER BY unixtime asc

is MessageType1(UserId, unixtime).
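As a minimal sketch, assuming the table has integer columns userID and unixtime (the schema is not shown in the question, so the table definition, the sample rows, and the index name idx_user_time are made up for illustration), the composite index could be created and used from Python like this:

```python
import sqlite3

# In-memory database with an assumed schema; the real table may differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MessageType1 (userID INTEGER, unixtime INTEGER)")
con.executemany(
    "INSERT INTO MessageType1 (userID, unixtime) VALUES (?, ?)",
    [(1, 300), (2, 150), (1, 100), (2, 50), (1, 200)],
)

# Composite index: the equality column first, then the ORDER BY column.
# The index name is hypothetical.
con.execute("CREATE INDEX idx_user_time ON MessageType1 (userID, unixtime)")


def times_for_user(con, user_id):
    """Return one user's timestamps in ascending order."""
    cur = con.execute(
        "SELECT unixtime FROM MessageType1 WHERE userID = ? ORDER BY unixtime ASC",
        (user_id,),
    )
    return [row[0] for row in cur.fetchall()]


print(times_for_user(con, 1))
```

The query here uses a ? placeholder instead of string concatenation; that is a choice made for the sketch, not something the index requires.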

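With an index only on unixtime, the database basically has two possible execution plans:

It can ignore the index, read the rows "sequentially", filter them, and then sort.
It can read the rows from the index, in sorted order, and then filter the output.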
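Which of these two plans SQLite actually picks can be checked with EXPLAIN QUERY PLAN. The sketch below assumes a two-column schema (not shown in the question) and uses a hypothetical helper, query_plan. The wording of the plan text varies between SQLite versions, so it is only printed here, not interpreted.

```python
import sqlite3

QUERY = "SELECT unixtime FROM MessageType1 WHERE userID = ? ORDER BY unixtime ASC"


def query_plan(con, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each row.
    return [row[-1] for row in rows]


# Assumed schema; the real table may have more columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MessageType1 (userID INTEGER, unixtime INTEGER)")

print("no index:     ", query_plan(con, QUERY, (1,)))

# The index from the question.
con.execute("CREATE INDEX unixtime_times ON MessageType1 (unixtime ASC)")
print("unixtime only:", query_plan(con, QUERY, (1,)))
```

Running the same helper against the real database, before and after creating an index, shows whether the planner scans the table or walks the index.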

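My guess, based on your timings, is that it chose the second approach. The execute step returns as soon as the query has started, which is why it looks so fast. The fetch then has to read the entire table to get the results you want.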
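To see where the time actually goes, execute() and fetchall() can be timed separately. This is a rough sketch with an assumed schema and synthetic data; timed_fetch is a hypothetical helper, and the absolute numbers will depend on the machine and the table size.

```python
import sqlite3
import time


def timed_fetch(con, sql, params=()):
    """Run a query; return (rows, seconds in execute, seconds in fetchall)."""
    cur = con.cursor()
    t0 = time.perf_counter()
    cur.execute(sql, params)
    t1 = time.perf_counter()
    rows = cur.fetchall()
    t2 = time.perf_counter()
    return rows, t1 - t0, t2 - t1


# Assumed schema and synthetic data: 1000 users, 50 messages each.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MessageType1 (userID INTEGER, unixtime INTEGER)")
con.executemany(
    "INSERT INTO MessageType1 (userID, unixtime) VALUES (?, ?)",
    ((i % 1000, i * 60) for i in range(50_000)),
)

rows, t_exec, t_fetch = timed_fetch(
    con,
    "SELECT unixtime FROM MessageType1 WHERE userID = ? ORDER BY unixtime ASC",
    (7,),
)
print(f"{len(rows)} rows, execute: {t_exec:.6f}s, fetchall: {t_fetch:.6f}s")
```

When comparing indexes, the sum of the two timings is the figure that matters, not the time spent in execute() alone.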

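Because of locality issues, this approach can take longer than just reading the data sequentially. Without the index, it reads the first page and all the records on that page, then the next page, and so on. With the index, each record is on a random page, so there is no locality. This can be particularly problematic when the table is larger than the memory available for the page cache, and you end up in a situation called "thrashing".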
