从 python 中的压缩数据库过滤数据

Filter data from zipped databases in python

我有以下问题。我有两个大型数据库。第一个数据库称为 an_visit_ab。它很大,压缩为 an_visit_ab.sql.bz2。第二个数据库也很大,但我可以将它存储在我的计算机上 an_visit.db.

数据库结构:

an_visit_abvisit_id升序排列:

visit_id    ab_test_id  variant
594817698   1   0
594817704   1   1
594817705   1   1
594817706   1   0
594817709   1   0

an_visit也是按visit_id升序排列的:

visit_id    datetime_add    is_bot
1   2017-06-21 18:10:38 0
3   2017-06-21 18:10:39 1

我需要从 an_visit_ab 中过滤掉所有 visit_id,其中 is_botan_visit 中的 1。通常,我会在 SQL 中进行简单的 JOIN。但是因为我什至无法解压缩 an_visit_ab.sql.bz2 我需要一个不同的策略。

到目前为止我得到的是:

  1. 我需要逐行阅读 an_visit_ab.sql.bz2。因为两个数据库都是按 visit_id 排序的,所以我可以查看 an_visit 对应的 visit_id 并检查 is_bot 是否为 1.
  2. 我在 python 做。我有以下代码逐行读取 an_visit_ab.sql.bz2
from bz2 import BZ2File as bzopen

AN_VISIT_AB = "an_visit_ab.sql.bz2"

with bzopen(AN_VISIT_AB, "r") as bzfile:
    for i, line in enumerate(bzfile):
        print(line)

输出如下所示:

b'-- MySQL dump 10.13  Distrib 5.7.33, for Linux (x86_64)\n'
b'--\n'
b'-- Host: localhost    Database: XXX\n'
b'-- ------------------------------------------------------\n'
b'-- Server version\t5.7.33-0ubuntu0.16.04.1\n'
b'\n'
b'/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;\n'
b'/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;\n'
b'/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;\n'
b'/*!40101 SET NAMES utf8 */;\n'
b'/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;\n'
b"/*!40103 SET TIME_ZONE='+00:00' */;\n"
b'/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;\n'
b'/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;\n'
b"/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;\n"
b'/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;\n'
b'\n'
b'--\n'
b'-- Table structure for table `an_visit_ab`\n'
b'--\n'
b'\n'
b'DROP TABLE IF EXISTS `an_visit_ab`;\n'
b'/*!40101 SET @saved_cs_client     = @@character_set_client */;\n'
b'/*!40101 SET character_set_client = utf8 */;\n'
b'CREATE TABLE `an_visit_ab` (\n'
b'  `visit_id` int(11) unsigned NOT NULL,\n'
b'  `ab_test_id` int(10) unsigned NOT NULL,\n'
b'  `variant` tinyint(4) NOT NULL,\n'
b'  PRIMARY KEY (`visit_id`,`ab_test_id`,`variant`),\n'
b'  UNIQUE KEY `visit_id_ab_test_id` (`visit_id`,`ab_test_id`),\n'
b'  KEY `ab_test_id_ab_test_variant` (`ab_test_id`,`variant`),\n'
b'  KEY `ab_test_id` (`ab_test_id`),\n'
b'  CONSTRAINT `an_visit_ab_ibfk_1` FOREIGN KEY (`visit_id`) REFERENCES `an_visit` (`visit_id`),\n'
b'  CONSTRAINT `an_visit_ab_ibfk_4` FOREIGN KEY (`ab_test_id`, `variant`) REFERENCES `an_new_ab_test_variant` (`ab_test_id`, `variant`)\n'
b') ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;\n'
b'/*!40101 SET character_set_client = @saved_cs_client */;\n'
b'\n'
b'--\n'
b'-- Dumping data for table `an_visit_ab`\n'
b'--\n'
b'\n'
b'LOCK TABLES `an_visit_ab` WRITE;\n'
b'/*!40000 ALTER TABLE `an_visit_ab` DISABLE KEYS */;\n'

(594878944,1,1),(594878945,1,0),(594878946,1,0),(594878947,1,1),(594878948,1,0),(594878949,1,1),(594878950,1,0),(594878951,1,0),(594878952,1,1),(594878953,1,1),(594878954,1,1),(594878955,1,1),(594878956,1,0),(594878957,1,1),(594878958,1,1),(594878959,1,1),(594878960,1,0),(594878961,1,0),(594878962,1,0),(594878963,1,1),(594878964,1,1),(594878965,1,0),(594878966,1,1),(594878967,1,0),(594878968,1,1),(594878969,1,1),(594878970,1,1),(594878971,1,0),(594878972,1,1),(594878973,1,0),(594878974,1,0),(594878975,1,1),(594878976,1,1),(594878977,1,0),(594878978,1,1),(594878979,1,1),(594878980,1,0),(594878981,1,1),(594878982,1,1),(594878983,1,0),(594878984,1,1),(594878985,1,1),(594878986,1,0),(594878987,1,1),(594878988,1,1),(594878989,1,0),(594878990,1,1),(594878991,1,0),(594878992,1,0),(594878993,1,1),(594878994,1,0),(594878995,1,1),(594878996,1,1),(594878997,1,1),(594878998,1,1),(594878999,1,0),(594879000,1,0),(594879001,1,0),(594879002,1,0),(594879003,1,0),(594879004,1,1),(594879005,1,1),(594879006,1,0),(594879007,1,1),(594879008,1,1),(594879009,1,1),(594879010,1,0),(594879011,1,1),(594879012,1,1),(594879013,1,0),(594879014,1,0),(594879015,1,0),(594879016,1,1),(594879017,1,0),(594879018,1,0),(594879019,1,1),(594879020,1,1),(594879021,1,1),(594879022,1,1),(594879023,1,0),(594879024,1,1),(594879025,1,0),(594879026,1,0),(594879027,1,1),(594879028,1,0),(594879029,1,1),(594879030,1,0),(594879031,1,1),(594879032,1,0),(594879033,1,1),(594879034,1,1),(594879035,1,0),(594879036,1,1),(594879037,1,1),(594879038,1,0),(594879039,1,1),(594879040,1,1),(594879041,1,1),(594879042,1,1),(594879043,1,0),(594879044,1,0),(594879045,1,0),(594879046,1,0),(594879047,1,1),(594879048,1,0),(594879049,1,1),(594879050,1,0),(594879051,1,1),(594879052,1,1),(594879053,1,1),(594879054,1,0),(594879055,1,1),(594879056,1,0),(594879057,1,0),(594879058,1,0),(594879059,1,0),(594879060,1,1),(594879061,1,1),(594879062,1,0),(594879063,1,0),(594879064,1,0),(594879065,1,1),(594879066,1,1),(594879067,1,0),(594879068,1,1),(594879069,1,0),(594879070,1,0),(594879071,1,0),(594879072,1,0),(594879073,1,0),(594879074,1,0),(594879075,1,0),(594879076,1,0),(594879077,1,0),(594879078,1,1),(594879079,1,1),(594879080,1,1),(594879081,1,1),(594879082,1,1),(594879083,1,0),(594879084,1,0),(594879085,1,0),(594879086,1,0),(594879087,1,1),(594879088,1,1),(594879089,1,0),(594879090,1,1),(594879091,1,0),(594879092,1,1),(594879093,1,0),(594879094,1,0),(594879095,1,0),(594879096,1,1),(594879097,1,1),(594879098,1,0),(594879099,1,1),(594879100,1,0),(594879101,1,1),(594879102,1,1),(594879103,1,0),(594879104,1,0),(594879105,1,1),(594879106,1,1),(594879107,1,0),(594879108,1,1),(594879109,1,1),(594879110,1,0),(594879111,1,1),(594879112,1,1),(594879113,1,0),(594879114,1,0),(594879115,1,1),(594879116,1,1),(594879117,1,1),(594879118,1,1),(594879119,1,0),(594879120,1,1),(594879121,1,1),(594879122,1,0),(594879123,1,1),(594879124,1,1),(594879125,1,0),(594879126,1,1),(594879127,1,1),(594879128,1,1),(594879129,1,1),(594879130,1,0),(594879131,1,1),(594879132,1,0),(594879133,1,0),(594879134,1,0),(594879135,1,1),(594879136,1,0),(594879137,1,0),(594879138,1,0),(594879139,1,1),(594879140,1,1),(594879141,1,0),(594879142,1,0),(594879143,1,1),(594879144,1,0),(594879145,1,0),(594879146,1,1),(594879147,1,0),(594879148,1,0),(594879149,1,1),(594879150,1,0),(594879151,1,0),(594879152,1,0),(594879153,1,1),(594879154,1,1),(594879155,1,1),(594879156,1,1),(594879157,1,1),(594879158,1,1),(594879159,1,1),(594879160,1,1),(594879161,1,0),(594879162,1,1),(594879163,1,1),(594879164,1,0),(594879165,1,0),(594879166,1,0),(594879167,1,1),(594879168,1,1),(594879169,1,1),(594879170,1,0),(594879171,1,0),(594879172,1,1),(594879173,1,0),(594879174,1,1),(594879175,1,1),(594879176,1,1),(594879177,1,0),(594879178,1,1),(594879179,1,1),(594879180,1,0),(594879181,1,1),(594879182,1,0),(594879183,1,1),(594879184,1,0),(594879185,1,0),(594879186,1,0),(594879187,1,1),(594879188,1,0),(594879189,1,0),(594879190,1,1),(594879191,1,0),(594879192,1,0),(594879193,1,1),(594879194,1,1),(594879195,1,1),(594879196,1,1),(594879197,1,1),(594879198,1,0),(594879199,1,1),(594879200,1,1),(594879201,1,0),(594879202,1,0),(594879203,1,0),(594879204,1,1),(594879205,1,1),(594879206,1,1),(594879207,1,0),(594879208,1,1),(594879209,1,1),(594879210,1,1),(594879211,1,0),(594879212,1,0),(594879213,1,0),(594879214,1,0),(594879215,1,0),(594879216,1,1),(594879217,1,1),(594879218,1,0),(594879219,1,1),(594879220,1,0),(594879221,1,1),(594879222,1,0),(594879223,1,0),(594879224,1,1),(594879225,1,1),(594879226,1,0),(594879227,1,0),(594879228,1,0),(594879229,1,1),(594879230,1,1),(594879231,1,1),(594879232,1,1),(594879233,1,1),(594879234,1,0),(594879235,1,0),(594879236,1,0),(594879237,1,1),(594879238,1,0),(594879239,1,0),(594879240,1,0),(594879241,1,1),(594879242,1,0),(594879243,1,0),(594879244,1,1),(594879245,1,0),(594879246,1,1),(594879247,1,0),(594879248,1,1),(594879249,1,1),(594879250,1,0),(594879251,1,0),(594879252,1,1),(594879253,1,0),(594879254,1,1),(594879255,1,0),(594879256,1,0),(594879257,1,1),(594879258,1,1),(594879259,1,1),(594879260,1,0),(594879261,1,1),(594879262,1,0),(594879263,1,0),(594879264,1,0);\n'

但是我卡在这里了,我不知道如何继续。我想问你:

  1. 我的策略好吗?或者有什么比逐行阅读更好的方法吗?
  2. 你能告诉我,我怎样才能从 an_visit_ab.sql.bz2 中筛选出 is_bot,就像我在上面尝试过的那样?

拜托,如果我的问题有什么不清楚的地方,请不要立即降级,给我机会更宝贵地解释一下:-)。谢谢

第 1 步:这是您如何首先处理第二个 MySQL 转储的示例,an_visit.db 通过逐行阅读并创建 visit_id 条目的 set 具有 is_bot 设置为 1,假设没有太多这样的条目以至于集合无法放入内存。我假设 MySQL 转储是使用 INSERT INTO 语句创建的,其中列名提到一次,行值用逗号分隔并分成多行(您没有打印足够的转储来验证这个):

def create_id_set():
    """
    Create a set of visit_id values whose is_bot value is 1.
    """

    id_set = set()
    state = 0
    # I am assuming that the file is not compressed.
    # If the file is compressed, then replace the regular call to open with:
    # with bz2.open('an.visit.db', 'rt') as f:
    with open('an_visit.db') as f:
        for line in f:
            if state == 0: # We have not seen the INSERT statement yet
                if line.startswith('INSERT INTO'):
                    state = 1
            elif state == 1: # We are processing the rows:
                line = line.strip()
                if line.endswith(';'):
                    state = 2 # Our next state
                    line = line.rstrip(';')
                else:
                    line.rstrip(',')
                line = line.replace('),(', ')|(') # Replace so we can split
                values_list = line.split('|')
                for row in values_list:
                    columns = row[1:-1].split(', ') # Skip outer parentheses
                    if columns[2] == '1': # is_bot is 1?
                        id_set.add(columns[0]) # Add visit_id
            elif state == 2: # We are done
                return id_set

id_set = create_id_set()

如果条目太多无法创建集合,那么我会将此函数转换为生成下一个 visit_id 的生成器。然后在下面描述的步骤 2 中,您将并行迭代此迭代器,这会使下一步稍微复杂一些。

第 2 步:完成设置后,您现在需要打开第一个 MySQL 输入(最好在文本模式下使用 bz2.open(path, 'rt')) dump 和第二个 bz2 压缩输出文件,它将包含过滤后的输出(你不能就地更新),也在文本模式下使用标志 'wt')。使用类似于我在步骤 1 中使用的逻辑,您将需要读取每一行并写出所有行的输出,包括 INSERT INTO 语句和最后一行值(状态 == 2)之后的所有行。您需要通过保存原始行值来更改逻辑,包括在拆分之前将 comma-separated 值括起来的左括号和右括号。然后,您将根据步骤 1 中返回的 set 测试 visit_id 值,以确定是否应输出该行值。您需要在此行值之后输出逗号或分号,具体取决于这是否是最终行值或未输出。你也应该在每写N个值后写一个换行符。