Neo4j and Java: fast and random sample of Iterable<Relationship>
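I wrote a traversal in Java that returns an Iterable<Relationship>. In the worst case the iterable contains 850784 relationships.
Objective: I want to sample (without replacement) only 20 relationships, and I want to do it fast.
Solution 1: Calling toList() or otherwise converting to some kind of Collection takes too long (> 1 minute). I know I could then use shuffle() and so on, but that delay is unacceptable.
Solution 2: So, to work directly on the Iterable, I used the Guava collect library. For each of the 3 steps below I have included the elapsed time in milliseconds (measured with System.nanoTime() and divided by 1000000). I need the size of the Iterable for the random number generator, and that is the real bottleneck.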
/* TRAVERSAL: 5 ms */
Iterable<Relationship> simrels = traversal1.traverse(user).relationships();

/* GET ITERABLE SIZE: 74669 ms */
int simrelssize = com.google.common.collect.Iterables.size(simrels);

/* RANDOM SAMPLE OF 20: 28321 ms */
long seed = System.nanoTime();
int[] idxs = new int[20];
Random randomGenerator = new XSRandom(seed);
for (int i = 0; i < idxs.length; ++i) {
    int randomInt = randomGenerator.nextInt(simrelssize);
    idxs[i] = randomInt;
}
Arrays.sort(idxs);
List<Relationship> simrelslist2 = new ArrayList<Relationship>();
for (int i = 0; i < idxs.length; ++i) {
    if (i > 0) {
        int pos = idxs[i] - idxs[i-1];
        simrelslist2.add(com.google.common.collect.Iterables.get(simrels, pos));
    }
    else {
        simrelslist2.add(com.google.common.collect.Iterables.get(simrels, idxs[i]));
    }
}
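How can I optimize this code to make it run faster?
Note: I have a Windows 8.1 PC with an i5 at 2.30GHz, 16GB of RAM, and a 1TB hard disk.
As Michal requested, here are the contents of the following files: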
neo4j-wrapper
#********************************************************************
# Property file references
#********************************************************************
wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties
#********************************************************************
# JVM Parameters
#********************************************************************
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
# Remote JMX monitoring, uncomment and adjust the following lines as needed.
# Also make sure to update the jmx.access and jmx.password files with appropriate permission roles and passwords,
# the shipped configuration contains only a read only role called 'monitor' with password 'Neo4j'.
# For more details, see: http://download.oracle.com/javase/7/docs/technotes/guides/management/agent.html
# On Unix based systems the jmx.password file needs to be owned by the user that will run the server,
# and have permissions set to 0600.
# For details on setting these file permissions on Windows see:
# http://docs.oracle.com/javase/7/docs/technotes/guides/management/security-windows.html
#wrapper.java.additional=-Dcom.sun.management.jmxremote.port=3637
#wrapper.java.additional=-Dcom.sun.management.jmxremote.authenticate=true
#wrapper.java.additional=-Dcom.sun.management.jmxremote.ssl=false
#wrapper.java.additional=-Dcom.sun.management.jmxremote.password.file=conf/jmx.password
#wrapper.java.additional=-Dcom.sun.management.jmxremote.access.file=conf/jmx.access
# Some systems cannot discover host name automatically, and need this line configured:
#wrapper.java.additional=-Djava.rmi.server.hostname=$THE_NEO4J_SERVER_HOSTNAME
# Uncomment the following lines to enable garbage collection logging
#wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
#wrapper.java.additional=-XX:+PrintGCDetails
#wrapper.java.additional=-XX:+PrintGCDateStamps
#wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
#wrapper.java.additional=-XX:+PrintPromotionFailure
#wrapper.java.additional=-XX:+PrintTenuringDistribution
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=8192
wrapper.java.maxmemory=10240
#********************************************************************
# Wrapper settings
#********************************************************************
# path is relative to the bin dir
wrapper.pidfile=../data/neo4j-server.pid
#********************************************************************
# Wrapper Windows NT/2000/XP Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
# using this configuration file has been installed as a service.
# Please uninstall the service before modifying this section. The
# service can then be reinstalled.
# Name of the service
wrapper.name=neo4j
# User account to be used for linux installs. Will default to current
# user if not set.
wrapper.user=
neo4j.properties
# Enable this to be able to upgrade a store from an older version.
#allow_store_upgrade=true
# The amount of memory to use for mapping the store files, either in bytes or
# as a percentage of available memory. This will be clipped at the amount of
# free memory observed when the database starts, and automatically be rounded
# down to the nearest whole page. For example, if "500MB" is configured, but
# only 450MB of memory is free when the database starts, then the database will
# map at most 450MB. If "50%" is configured, and the system has a capacity of
# 4GB, then at most 2GB of memory will be mapped, unless the database observes
# that less than 2GB of memory is free when it starts.
#mapped_memory_total_size=50%
# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0
# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true".
#keep_logical_logs=7 days
# Autoindexing
# Enable auto-indexing for nodes, default is false.
#node_auto_indexing=true
# The node property keys to be auto-indexed, if enabled.
#node_keys_indexable=name,age
# Enable auto-indexing for relationships, default is false.
#relationship_auto_indexing=true
# The relationship property keys to be auto-indexed, if enabled.
#relationship_keys_indexable=name,age
# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# The network interface IP the shell will listen on (use 0.0.0.0 for all interfaces).
#remote_shell_host=127.0.0.1
# The port the shell will listen on, default is 1337.
#remote_shell_port=1337
# The type of cache to use for nodes and relationships.
#cache_type=hpc
# Maximum size of the heap memory to dedicate to the cached nodes.
#node_cache_size=
# Maximum size of the heap memory to dedicate to the cached relationships.
#relationship_cache_size=
# Enable online backups to be taken from this database.
online_backup_enabled=true
# Port to listen to for incoming backup requests.
online_backup_server=127.0.0.1:6362
# Uncomment and specify these lines for running Neo4j in High Availability mode.
# See the High availability setup tutorial for more details on these settings
# http://neo4j.com/docs/2.2.0-M02/ha-setup-tutorial.html
# ha.server_id is the number of each instance in the HA cluster. It should be
# an integer (e.g. 1), and should be unique for each cluster instance.
#ha.server_id=
# ha.initial_hosts is a comma-separated list (without spaces) of the host:port
# where the ha.cluster_server of all instances will be listening. Typically
# this will be the same for all cluster instances.
#ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001
# IP and port for this instance to listen on, for communicating cluster status
# information with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.cluster_server=192.168.0.1:5001
# IP and port for this instance to listen on, for communicating transaction
# data with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.server=192.168.0.1:6001
# The interval at which slaves will pull updates from the master. Comment out
# the option to disable periodic pulling of updates. Unit is seconds.
ha.pull_interval=10
# Amount of slaves the master will try to push a transaction to upon commit
# (default is 1). The master will optimistically continue and not fail the
# transaction even if it fails to reach the push factor. Setting this to 0 will
# increase write performance when writing through master but could potentially
# lead to branched data (or loss of transaction) if the master goes down.
#ha.tx_push_factor=1
# Strategy the master will use when pushing data to slaves (if the push factor
# is greater than 0). There are two options available "fixed" (default) or
# "round_robin". Fixed will start by pushing to slaves ordered by server id
# (highest first) improving performance since the slaves only have to cache up
# one transaction at a time.
#ha.tx_push_strategy=fixed
# Policy for how to handle branched data.
#branched_data_policy=keep_all
# Clustering timeouts
# Default timeout.
#ha.default_timeout=5s
# How often heartbeat messages should be sent. Defaults to ha.default_timeout.
#ha.heartbeat_interval=5s
# Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
#heartbeat_timeout=11s
You have a Traverser:
Traverser traverser = traversal1.traverse(user);
int size = traverser.metadata().getNumberOfRelationshipsTraversed();
Iterable<Relationship> simrels = traverser.relationships();
Now that you have the size, you can optimize the random selection.
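As a rough sketch of what that could look like: once a size is known, one option is to draw the distinct positions first and then collect them in a single forward pass over one iterator, instead of calling Iterables.get repeatedly (each such call starts a fresh iteration). The class and method names below are hypothetical, and the sketch assumes `size` really is the number of elements the iterable will yield.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

// Hypothetical helper, not part of Neo4j or Guava.
final class KnownSizeSampler {
    private KnownSizeSampler() {}

    /** Picks up to k distinct positions in [0, size) and collects them in one forward pass. */
    static <T> List<T> sample(Iterable<T> source, int size, int k, Random rnd) {
        int wanted = Math.max(0, Math.min(k, size));
        // A TreeSet rejects duplicates (sampling without replacement)
        // and iterates its positions in ascending order.
        TreeSet<Integer> positions = new TreeSet<>();
        while (positions.size() < wanted) {
            positions.add(rnd.nextInt(size));
        }
        List<T> picked = new ArrayList<>(wanted);
        Iterator<T> it = source.iterator();
        int current = 0; // index of the element the next it.next() call would return
        for (int target : positions) {
            // Skip forward on the same iterator rather than restarting from the beginning.
            while (current < target && it.hasNext()) {
                it.next();
                current++;
            }
            if (!it.hasNext()) {
                break; // the assumed size was larger than the real one
            }
            picked.add(it.next());
            current++;
        }
        return picked;
    }
}
```

A call such as `KnownSizeSampler.sample(simrels, size, 20, new Random())` would return the sampled elements in iteration order and stops iterating as soon as the largest chosen position has been reached.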
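Neo4j returns an Iterable because it performs the traversal as you iterate. To sample, I'm afraid you have to "visit" every relationship. Yes, you can skip some, but at the end of the day you still have to traverse all of them.
We use the "reservoir sampling" algorithm for this, implemented here. Because of the above, I'm not sure it will perform any better. That said, you should be able to sample 1M relationships in under 1 second with a warm cache. If it takes longer than that, you may need to tune your memory settings a bit.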
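For readers unfamiliar with the technique, here is a minimal sketch of reservoir sampling (the classic "Algorithm R") over a plain Iterable. It is not the implementation linked above; the class name is hypothetical and the code is only meant to illustrate that the sample can be drawn in a single pass without knowing the size in advance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper illustrating reservoir sampling; not the linked implementation.
final class ReservoirSampler {
    private ReservoirSampler() {}

    /** Returns up to k elements chosen uniformly without replacement, in one pass. */
    static <T> List<T> sample(Iterable<T> source, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>();
        if (k <= 0) {
            return reservoir;
        }
        int seen = 0; // number of elements consumed so far
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // The first k elements fill the reservoir directly.
                reservoir.add(item);
            } else {
                // Keep the new element with probability k / seen by drawing a
                // slot in [0, seen) and replacing only if it falls inside the reservoir.
                int slot = rnd.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

With the question's variables this would be called as `ReservoirSampler.sample(simrels, 20, new Random())`. It still consumes the whole iterable, which is the cost described above, but it does so exactly once and keeps only 20 elements in memory.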