Titan database: performance issue when iterating over thousands of vertices in Java code

Question

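I am using Titan (version 1.0.0) with a Cassandra storage backend. My database is very large (millions of vertices and edges), and I use Elasticsearch for indexing. The index works well: queries return thousands (~40,000) of vertices fairly easily and quickly. The performance problem appears when I then iterate over those vertices and read basic data stored in their properties. That takes almost 1 minute!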

Using Java 8 parallel streams improves performance significantly, but not enough (10 seconds instead of 1 minute).

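Suppose I have thousands of vertices, each with a location property and a timestamp. I want to retrieve only the vertices whose location (a Geoshape) lies within the queried area and collect their distinct timestamps.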

Here is part of my Java code, using Java 8 parallel streams:

TitanTransaction tt = titanWraper.getNewTransaction();
PropertyKey timestampKey = tt.getPropertyKey(TIME_STAMP);
TitanGraphQuery graphQuery = tt.query().has(LOCATION, Geo.WITHIN, cLocation);
Spliterator<TitanVertex> locationsSpl = graphQuery.vertices().spliterator();

Set<String> locationTimestamps = StreamSupport.stream(locationsSpl, true)
        .map(locVertex -> {//map location vertices to timestamp String
            String timestamp = locVertex.valueOrNull(timestampKey);

            //this iteration takes about 10 sec to iterate over 40000 vertices
            return timestamp;
         })
         .distinct()
         .collect(Collectors.toSet());
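
To see in isolation why a parallel stream helps here but only up to a point, the following is a minimal, Titan-free sketch. It assumes (this is an assumption, not something measured in the question) that each property read blocks for roughly a millisecond on the storage backend. `ParallelIterableSketch` and `slowTimestamp` are hypothetical names introduced for illustration. The `Iterable` is built from a lambda so that it only offers the default, unsized spliterator, which is what any `Iterable` without its own `spliterator()` override provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

final class ParallelIterableSketch {
    // Hypothetical stand-in for a property read that blocks on the backend (~1 ms assumed).
    static String slowTimestamp(long id) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "t" + (id % 100); // 100 distinct timestamps in total
    }

    // Same shape as the code above: Iterable -> (optionally parallel) stream -> Set.
    static Set<String> collect(Iterable<Long> ids, boolean parallel) {
        return StreamSupport.stream(ids.spliterator(), parallel)
                .map(ParallelIterableSketch::slowTimestamp)
                .collect(Collectors.toSet()); // a Set already drops duplicates
    }

    public static void main(String[] args) {
        List<Long> backing = new ArrayList<>();
        for (long id = 0; id < 2_000; id++) backing.add(id);
        Iterable<Long> ids = backing::iterator; // plain Iterable, default spliterator

        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            Set<String> result = collect(ids, parallel);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println((parallel ? "parallel:   " : "sequential: ")
                    + result.size() + " timestamps in " + millis + " ms");
        }
    }
}
```

Under this assumption, parallelism overlaps the waits but does not reduce the number of reads, and the speedup is bounded by the size of the common fork-join pool. That would be consistent with the drop from about 1 minute to about 10 seconds rather than to well under a second.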

The same code using standard Java iteration:

TitanTransaction tt = titanWraper.getNewTransaction();
PropertyKey timestampKey = tt.getPropertyKey(TIME_STAMP);
TitanGraphQuery graphQuery = tt.query().has(LOCATION, Geo.WITHIN, cLocation);
Set<String> locationTimestamps = new HashSet<>();
for(TitanVertex locVertex : (Iterable<TitanVertex>) graphQuery.vertices()) {
    String timestamp = locVertex.valueOrNull(timestampKey);
    locationTimestamps.add(timestamp);        
    //this iteration takes about 45 sec to iterate over 40000 vertices            
}

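I am very disappointed by this performance. It would be even worse if the result were around 1 million vertices. I am trying to understand the cause of the problem. I would expect iterating over all the vertices to take less than 1 second.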
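
One plausible cause, stated here as an assumption rather than a diagnosis, is that the index query only returns vertex handles and each `valueOrNull` call then triggers its own read against the storage backend. At roughly 1 ms per read, 40,000 reads would take about 40 seconds, which is close to the 45 seconds reported. The sketch below is Titan-free and only counts simulated round trips; `FakeBackend` and `RoundTripSketch` are hypothetical names, not Titan classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a remote storage backend; it only counts calls.
final class FakeBackend {
    private final Map<Long, String> timestampsById = new HashMap<>();
    private int roundTrips = 0;

    FakeBackend(int vertexCount, int distinctTimestamps) {
        for (long id = 0; id < vertexCount; id++) {
            timestampsById.put(id, "t" + (id % distinctTimestamps));
        }
    }

    // One simulated network call per vertex.
    String fetchOne(long id) {
        roundTrips++;
        return timestampsById.get(id);
    }

    // One simulated network call for a whole batch of vertices.
    List<String> fetchMany(List<Long> ids) {
        roundTrips++;
        List<String> out = new ArrayList<>(ids.size());
        for (Long id : ids) out.add(timestampsById.get(id));
        return out;
    }

    int roundTrips() {
        return roundTrips;
    }
}

final class RoundTripSketch {
    // Lazy, per-vertex loading: as many round trips as vertices.
    static Set<String> perVertex(FakeBackend backend, List<Long> ids) {
        Set<String> result = new HashSet<>();
        for (Long id : ids) result.add(backend.fetchOne(id));
        return result;
    }

    // Batched loading: one round trip per batch of ids.
    static Set<String> batched(FakeBackend backend, List<Long> ids, int batchSize) {
        Set<String> result = new HashSet<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            result.addAll(backend.fetchMany(ids.subList(from, to)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 40_000; id++) ids.add(id);

        FakeBackend lazy = new FakeBackend(40_000, 500);
        FakeBackend batch = new FakeBackend(40_000, 500);
        System.out.println(perVertex(lazy, ids).size() + " timestamps, "
                + lazy.roundTrips() + " round trips");
        System.out.println(batched(batch, ids, 1_000).size() + " timestamps, "
                + batch.roundTrips() + " round trips");
    }
}
```

With the values in `main`, the per-vertex strategy makes 40,000 simulated round trips and the batched one makes 40, for the same 500 timestamps. Titan 1.0's configuration reference lists a `query.batch` option for batching traversal queries against the storage backend; whether it helps in this particular case would need to be measured.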

Answer 1

The same query, written as a Gremlin traversal instead of a graph query, performs better and is shorter:

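TitanTransaction tt = graph.newTransaction();
// Geo.geoWithin is Titan's geo predicate; TinkerPop's P.within tests collection membership
Set<String> locationTimestamps = tt.traversal().V().has(LOCATION, Geo.geoWithin(cLocation))
    .<String>values(TIME_STAMP) // read the timestamp property of each matching vertex
    .dedup()                    // drop duplicate timestamp values
    .toSet();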
