SQLite: Create Directory Structure Table from a List of Paths
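I want to create a directory structure table, as described in this question, where:
Directory = "Primary Key" id field, typically an integer
Directory_Parent = "Foreign Key" id field, which points to the id of another Directory in the same table
Value = string containing the directory/folder name
Given Tree/Fruit/Apples/, the table would contain: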
Directory | Directory_Parent | Value
0         | null             | Root
1         | 0                | Tree
2         | 1                | Fruit
3         | 2                | Apples
A Root folder has been created at primary key 0 with a null parent.
My paths are imported from a CSV and currently sit in a table with 2 columns:
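To make the target layout concrete, here is a minimal sketch (not part of the original post) that loads the four example rows into an in-memory SQLite database through Python's standard `sqlite3` module and walks the `Directory_Parent` links back up to rebuild a path. The helper name `full_path` is hypothetical and exists only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dir_struct (
    Directory        INTEGER PRIMARY KEY,
    Directory_Parent INTEGER REFERENCES dir_struct(Directory),
    Value            TEXT
);
INSERT INTO dir_struct VALUES
    (0, NULL, 'Root'), (1, 0, 'Tree'), (2, 1, 'Fruit'), (3, 2, 'Apples');
""")

def full_path(conn, directory_id):
    """Follow Directory_Parent upward and join the names (Root is left out)."""
    rows = conn.execute("""
        WITH RECURSIVE up(id, parent, name, depth) AS (
            SELECT Directory, Directory_Parent, Value, 0
            FROM dir_struct WHERE Directory = ?
            UNION ALL
            SELECT d.Directory, d.Directory_Parent, d.Value, up.depth + 1
            FROM dir_struct d JOIN up ON d.Directory = up.parent
        )
        -- the Root row is the only one with a NULL parent, so this drops it
        SELECT name FROM up WHERE parent IS NOT NULL ORDER BY depth DESC
    """, (directory_id,)).fetchall()
    return "/".join(name for (name,) in rows)

print(full_path(conn, 3))
```

Each row stores only its own name and its parent's id, so a full path is never stored twice; it is recovered on demand by the upward walk.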
FileID Path
1 videos/gopro/father/mov001.mp4
2 videos/gopro/father/mov002.mp4
3 pictures/family/father/Oldman.jpg
4 pictures/family/father/Oldman2.jpg
5 documents/legal/father/estate/will.doc
6 documents/legal/father/estate/will2.doc
7 documents/legal/father/estate/newyork/albany/will.doc
8 video/gopro/father/newyork/albany/holiday/christmas/2002/mov001.mp4
9 pictures/family/father/newyork/albany/holiday/christmas/2002/july/Oldman.jpg
10 pictures/family/father/newyork/albany/holiday/christmas/2002/june/Oldman2.jpg
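This table contains 1 million file entries.
What is a fast, optimized way to parse this data and move the folder structure into a new table as described above?
In this demo, the folders are split on "/" and moved into new columns, if that helps.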
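SQL lacks the flexibility and tooling of a programming language, which would provide a fast and optimized solution to this problem.
Moreover, SQLite is the weakest of the databases when it comes to string manipulation, because it does not support very useful functions such as SQL Server's STRING_SPLIT() or MySQL's SUBSTRING_INDEX().
Nevertheless, the problem is interesting, so I gave it a shot.
I created the table dir_struct with this statement: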
CREATE TABLE dir_struct (
Directory INTEGER PRIMARY KEY,
Directory_Parent INTEGER REFERENCES dir_struct(Directory),
Value TEXT
);
Then I inserted the 'root' row:
INSERT INTO dir_struct (Directory, Directory_Parent, Value) VALUES (0, null, 'root');
I also turned OFF foreign key enforcement with:
PRAGMA foreign_keys = OFF;
It is off by default, but I set it explicitly just in case.
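As an illustration of what this pragma changes, the following sketch (an addition, run through Python's `sqlite3` module) inserts a row whose parent id does not exist, first with enforcement off and then with it on. It assumes an SQLite build with foreign key support compiled in, which is the usual case. Note the `commit()` before the pragma is switched: SQLite ignores `PRAGMA foreign_keys` while a transaction is open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dir_struct (
        Directory        INTEGER PRIMARY KEY,
        Directory_Parent INTEGER REFERENCES dir_struct(Directory),
        Value            TEXT
    )
""")

# Enforcement off: a row pointing at a missing parent (id 1) is accepted.
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("INSERT INTO dir_struct VALUES (2, 1, 'Fruit')")
conn.commit()  # the pragma below is a no-op inside an open transaction

# Enforcement on: the same kind of row is now rejected.
conn.execute("PRAGMA foreign_keys = ON")
fk_enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]
try:
    conn.execute("INSERT INTO dir_struct VALUES (4, 3, 'Apples')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(fk_enabled, rejected)
```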
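First, you need a recursive CTE that splits each path into its individual directories (much like the answer to your previous question).
Then, in a second CTE, conditional aggregation puts each directory into its own column (with a limit of 10 directories at most).
The 3rd CTE expands every path into all of its parent prefixes and removes duplicates, and the 4th CTE assigns a unique ID to each directory with the ROW_NUMBER() window function.
Finally, the rows are inserted into the table by self-joining the results of the 4th CTE: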
WITH
split AS (
SELECT 0 idx,
FileDataID,
SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
FROM listfile
UNION ALL
SELECT idx + 1,
FileDataID,
SUBSTR(value, 1, INSTR(value, '/') - 1),
SUBSTR(value, INSTR(value, '/') + 1)
FROM split
WHERE value LIKE '%_/_%'
),
cols AS (
SELECT DISTINCT
MAX(CASE WHEN idx = 0 THEN item END) path0,
MAX(CASE WHEN idx = 1 THEN item END) path1,
MAX(CASE WHEN idx = 2 THEN item END) path2,
MAX(CASE WHEN idx = 3 THEN item END) path3,
MAX(CASE WHEN idx = 4 THEN item END) path4,
MAX(CASE WHEN idx = 5 THEN item END) path5,
MAX(CASE WHEN idx = 6 THEN item END) path6,
MAX(CASE WHEN idx = 7 THEN item END) path7,
MAX(CASE WHEN idx = 8 THEN item END) path8,
MAX(CASE WHEN idx = 9 THEN item END) path9
FROM split
GROUP BY FileDataID
),
paths AS (
SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, path9 FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, path6, path7, null, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, path6, null, null, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, null, null, null, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, null, null, null, null, null FROM cols UNION
SELECT path0, path1, path2, path3, null, null, null, null, null, null FROM cols UNION
SELECT path0, path1, path2, null, null, null, null, null, null, null FROM cols UNION
SELECT path0, path1, null, null, null, null, null, null, null, null FROM cols UNION
SELECT path0, null, null, null, null, null, null, null, null, null FROM cols
),
ids AS (
SELECT *,
ROW_NUMBER() OVER (ORDER BY path0, path1, path2, path3, path4, path5, path6, path7, path8, path9) nr,
COALESCE(path9, path8, path7, path6, path5, path4, path3, path2, path1, path0) last_child,
path0 || COALESCE('/' || path1, '') ||
COALESCE('/' || path2, '') ||
COALESCE('/' || path3, '') ||
COALESCE('/' || path4, '') ||
COALESCE('/' || path5, '') ||
COALESCE('/' || path6, '') ||
COALESCE('/' || path7, '') ||
COALESCE('/' || path8, '') ||
COALESCE('/' || path9, '') full_path
FROM paths
)
INSERT INTO dir_struct(Directory, Directory_Parent, Value)
SELECT i1.nr, COALESCE(i2.nr, 0), i1.last_child
FROM ids i1 LEFT JOIN ids i2
ON i1.full_path = i2.full_path || '/' || i1.last_child
On my test dataset of 187365 rows, the rows were inserted in (on average) 9.5-10 minutes; for a larger dataset this will take longer.
See the demo.
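To see what the first CTE produces on its own, here is a small sketch that runs only the split step against a single sample path, using Python's `sqlite3` module. The one-row `listfile` table is an assumption made for this illustration, and the redundant `SUBSTR(Path, 1)` wrappers are dropped, which does not change the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listfile (FileDataID INTEGER PRIMARY KEY, Path TEXT)")
conn.execute("INSERT INTO listfile VALUES (1, 'videos/gopro/father/mov001.mp4')")

split_rows = conn.execute("""
    WITH RECURSIVE split AS (
        -- anchor: peel off the first directory, keep the remainder in value
        SELECT 0 idx, FileDataID,
               SUBSTR(Path, 1, INSTR(Path, '/') - 1) item,
               SUBSTR(Path, INSTR(Path, '/') + 1) value
        FROM listfile
        UNION ALL
        -- recursion: keep peeling while the remainder still contains a '/'
        SELECT idx + 1, FileDataID,
               SUBSTR(value, 1, INSTR(value, '/') - 1),
               SUBSTR(value, INSTR(value, '/') + 1)
        FROM split
        WHERE value LIKE '%_/_%'
    )
    SELECT idx, item, value FROM split ORDER BY idx
""").fetchall()

for row in split_rows:
    print(row)
```

The recursion stops once `value` holds only the file name, so the file itself never becomes an `item`.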
Interestingly, the simpler code below performs worse (but you can test it yourself):
WITH
split AS (
SELECT Path,
0 parent_len,
SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
FROM listfile
UNION ALL
SELECT Path,
parent_len + LENGTH(item) + 1,
SUBSTR(value, 1, INSTR(value, '/') - 1),
SUBSTR(value, INSTR(value, '/') + 1)
FROM split
WHERE value LIKE '%_/_%'
),
row_numbers AS (
SELECT parent_path, item,
ROW_NUMBER() OVER (ORDER BY parent_path, item) rn
FROM (SELECT DISTINCT SUBSTR(Path, 1, parent_len) parent_path, item FROM split)
)
INSERT INTO dir_struct(Directory, Directory_Parent, Value)
SELECT r1.rn, COALESCE(r2.rn, 0) rn_parent, r1.item
FROM row_numbers r1 LEFT JOIN row_numbers r2
ON r1.parent_path = r2.parent_path || r2.item || '/'
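This query assigns different IDs to the directories than the first solution does, but they are correct and unique.
It runs in (on average) 14-15 minutes.
See the demo.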
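Since the two queries number the directories differently, one way to check a result is to test the structure rather than the ids. The sketch below is an addition, not part of the original answer: the hypothetical helper `orphan_rows` lists every row whose `Directory_Parent` does not match an existing `Directory`. Uniqueness of the ids is already guaranteed by the primary key.

```python
import sqlite3

def orphan_rows(conn):
    """Return rows whose non-null parent id matches no Directory in the table."""
    return conn.execute("""
        SELECT c.Directory, c.Directory_Parent, c.Value
        FROM dir_struct c
        LEFT JOIN dir_struct p ON p.Directory = c.Directory_Parent
        WHERE c.Directory_Parent IS NOT NULL AND p.Directory IS NULL
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = OFF")  # same setting as in the answer
conn.execute("""
    CREATE TABLE dir_struct (
        Directory        INTEGER PRIMARY KEY,
        Directory_Parent INTEGER REFERENCES dir_struct(Directory),
        Value            TEXT
    )
""")
conn.executemany(
    "INSERT INTO dir_struct VALUES (?, ?, ?)",
    [(0, None, "root"), (1, 0, "Tree"), (2, 1, "Fruit")],
)
clean = orphan_rows(conn)   # consistent tree: nothing to report

# A deliberately broken row: parent 7 does not exist.
conn.execute("INSERT INTO dir_struct VALUES (9, 7, 'Lost')")
broken = orphan_rows(conn)

print(clean, broken)
```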
In conclusion, if this is a one-time job, maybe you can use it, but I would not recommend it as a solution for this requirement.
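For comparison, here is a rough sketch of the programming-language route mentioned at the start of this answer. It is not the method used above and has not been timed against it: it assumes the paths are read into Python, splits each one on "/", and keeps a dictionary from directory prefix to id so that every directory is inserted exactly once. The function name `build_dir_struct` and the sample list are hypothetical.

```python
import sqlite3

def build_dir_struct(conn, paths):
    """Insert one dir_struct row per distinct directory found in paths."""
    ids = {(): 0}  # tuple of directory names -> Directory id; () is the root
    rows = []
    for path in paths:
        parts = path.split("/")[:-1]  # drop the trailing file name
        for depth in range(1, len(parts) + 1):
            key = tuple(parts[:depth])
            if key not in ids:
                ids[key] = len(ids)  # next unused id
                # the parent prefix was registered on an earlier iteration
                rows.append((ids[key], ids[key[:-1]], key[-1]))
    conn.executemany(
        "INSERT INTO dir_struct (Directory, Directory_Parent, Value) "
        "VALUES (?, ?, ?)",
        rows,
    )
    return ids

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dir_struct (
        Directory        INTEGER PRIMARY KEY,
        Directory_Parent INTEGER REFERENCES dir_struct(Directory),
        Value            TEXT
    )
""")
conn.execute("INSERT INTO dir_struct VALUES (0, NULL, 'root')")

sample = [
    "videos/gopro/father/mov001.mp4",
    "videos/gopro/father/mov002.mp4",
    "pictures/family/father/Oldman.jpg",
]
ids = build_dir_struct(conn, sample)
conn.commit()

for row in conn.execute("SELECT * FROM dir_struct ORDER BY Directory"):
    print(row)
```

Keying the dictionary on the whole prefix rather than on the folder name alone keeps same-named folders apart, such as the two `father` directories in the sample. There is also no fixed limit on nesting depth.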