Why would the same LOAD CSV in Neo4j take exponentially longer when run a second time (on a clean db)?
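I'm new to Neo4j, so my first attempt at loading a database was a learning experiment. When I tried some of the queries I wanted to run, I realized I hadn't created the right model. I cleared the database from the command line using rm -rf data/* and started over (stopping the database beforehand and starting it again afterwards). The first three loads were almost identical to my first data import: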
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (p:Provider {pid:line.pid});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (c:Credential {name:credential});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MATCH (p:Provider {pid:line.pid});
MATCH (c:Credential {name:credential});
MATCH (p)-[:IS_A]->(c);
The only difference the second time I ran these load statements was that the first node was given two labels:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (p:Provider:Person {pid:line.pid});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (c:Credential {name:credential});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MATCH (p:Person {pid:line.pid});
MATCH (c:Credential {name:credential});
MATCH (p)-[:IS_A]->(c);
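The first time I ran these 3 imports, it took about 20 minutes. The second time, however, it has been running for 3 days. The first two loads were still very fast, maybe 5 minutes each. The relationship load has been running ever since. I don't understand why it is taking so long.
Do you have indexes on the properties?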
CREATE INDEX ON :Person(pid)
CREATE INDEX ON :Credential(name)
Or, if you want to enforce uniqueness on the property, you should create a constraint (which also creates an index):
CREATE CONSTRAINT ON (n:Person) ASSERT n.pid IS UNIQUE
CREATE CONSTRAINT ON (n:Credential) ASSERT n.name IS UNIQUE
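Without an index, every CSV row makes the MATCH scan all nodes carrying that label, so the cost grows roughly as rows times nodes (quadratic rather than literally exponential). One way to check whether a lookup uses the index is to prefix a single lookup with EXPLAIN, assuming your Neo4j version supports it (2.2 or later). This is only a sketch; the pid value below is a made-up placeholder:

```cypher
// Hypothetical pid value; substitute one that exists in your CSV.
// EXPLAIN shows the plan without executing the query.
EXPLAIN
MATCH (p:Person {pid: '12345'})
RETURN p;
```

A plan containing NodeIndexSeek (or NodeUniqueIndexSeek when a constraint is in place) means the index is being used; NodeByLabelScan means every :Person node is scanned for each lookup.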
Also, I think your last MATCH should be a MERGE, and you shouldn't have the semicolons in the middle of the statement:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MATCH
(p:Person {pid:line.pid}),
(c:Credential {name:credential})
MERGE (p)-[:IS_A]->(c);