Why would the same LOAD CSV in Neo4j take exponentially longer when run a second time (on a clean database)?

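I'm new to Neo4j, so my first attempt at loading the database was a learning experiment. When I tried some of the queries I wanted to run, I realized I had not created the right model. I cleared the database from the command line with rm -rf data/* and started over (stopping the database beforehand and starting it again afterwards). The 3 loads were almost exactly the same as the first time I imported the data: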

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (p:Provider {pid:line.pid});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (c:Credential {name:credential});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MATCH (p:Provider {pid:line.pid});
MATCH (c:Credential {name:credential});
MATCH (p)-[:IS_A]->(c);

The only difference the second time I ran these load statements was that the first node was given two labels:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (p:Provider:Person {pid:line.pid});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MERGE (c:Credential {name:credential});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MATCH (p:Person {pid:line.pid});
MATCH (c:Credential {name:credential});
MATCH (p)-[:IS_A]->(c);

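The first time I ran these 3 imports, it took about 20 minutes. The second time, however, it has already been running for 3 days. The first two loads are still very fast, maybe 5 minutes each. The relationship load has been running ever since. I don't understand why it is taking so long.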

Do you have indexes on the properties?

CREATE INDEX ON :Person(pid)
CREATE INDEX ON :Credential(name)

Or, if you want to enforce uniqueness on the column, you should create a constraint (which also creates an index):

CREATE CONSTRAINT ON (n:Person) ASSERT n.pid IS UNIQUE
CREATE CONSTRAINT ON (n:Credential) ASSERT n.name IS UNIQUE
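
One way to check whether a lookup actually uses the index is to ask for the query plan. This is a minimal sketch, assuming a Neo4j version that supports EXPLAIN; the pid value below is made up for illustration and is not taken from the question's data.

```cypher
// Show the plan without executing the query.
// '12345' is a hypothetical pid; LOAD CSV reads every field as a string.
EXPLAIN
MATCH (p:Person {pid: '12345'})
RETURN p;
```

With the index or constraint in place, the plan should typically show a NodeIndexSeek (or NodeUniqueIndexSeek) operator. If it shows NodeByLabelScan followed by a Filter, every CSV row scans all :Person nodes, which would be consistent with a relationship load that never seems to finish.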

Also, I think your last MATCH should be a MERGE, and you shouldn't have the semicolons between the clauses:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS from 'file' AS line
WITH line
MATCH
  (p:Person {pid:line.pid}),
  (c:Credential {name:line.credential}) // assumes the CSV column is named "credential"
MERGE (p)-[:IS_A]->(c);