Nutch pointed at Cassandra, yet it asks for Hadoop
Windows 10
Nutch 2.3.1
Cassandra 3.11.1
I extracted and built Nutch under my Cygwin home directory.
I believe the Cassandra server is running:
INFO [main] 2018-02-23 16:20:41,077 StorageService.java:1442 - JOINING: Finish joining ring
INFO [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509 - Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='test')
INFO [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node localhost/127.0.0.1 state jump to NORMAL
INFO [main] 2018-02-23 16:20:43,049 NativeTransportService.java:75 - Netty using Java NIO event loop
INFO [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a, netty-codec=netty-codec-4.0.44.Final.452812a, netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, netty-codec-http=netty-codec-http-4.0.44.Final.452812a, netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, netty-common=netty-common-4.0.44.Final.452812a, netty-handler=netty-handler-4.0.44.Final.452812a, netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, netty-transport=netty-transport-4.0.44.Final.452812a, netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
INFO [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
INFO [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
I ran the following checks:
apache-cassandra-3.11.1\bin>nodetool status
Datacenter: datacenter1
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 273.97 KiB 256 100.0% dab932f2-d138-4a1a-acd4-f63cbb16d224 rack1
cqlsh connection:
apache-cassandra-3.11.1\bin>cqlsh
WARNING: console codepage must be set to cp65001 to support utf-8 encoding on Windows platforms.
If you experience encoding problems, change your console codepage with 'chcp 65001' before starting cqlsh.
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
WARNING: pyreadline dependency missing. Install to enable tab completion.
cqlsh> describe keyspaces
system_schema system_auth system system_distributed test system_traces
I followed the tutorial “Setting up NUTCH 2.x with CASSANDRA” and added the corresponding entries to the properties and XML files.
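For anyone reproducing this, a quick way to confirm that both files really name the Cassandra store is a small grep check. This is only a sketch: the function name `check_cassandra_config` is made up, and it assumes the entries are `gora.datastore.default` in `gora.properties` and a `CassandraStore` class value in `nutch-site.xml`, and that the files live under `conf/` in the Nutch directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when both config files mention the
# Cassandra store class, 1 otherwise. Property names are assumptions
# based on the usual Nutch 2.x + Cassandra setup.
check_cassandra_config() {
  conf_dir="$1"
  store='org.apache.gora.cassandra.store.CassandraStore'
  # -F: match the string literally; -q: no output, exit status only
  grep -qF "gora.datastore.default=$store" "$conf_dir/gora.properties" 2>/dev/null || return 1
  grep -qF "$store" "$conf_dir/nutch-site.xml" 2>/dev/null || return 1
  return 0
}

# Assumed install location; adjust to your own path.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-2.3.1}"
if check_cassandra_config "$NUTCH_HOME/conf"; then
  echo "Cassandra store entries found"
else
  echo "Cassandra store entries not found"
fi
```

Note that this is a plain text match, so a commented-out entry would also pass. It only tells you the entries are present, not that Nutch picked them up.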
I then went to the Cygwin prompt and tried to crawl. Instead of using Cassandra, it asks for Hadoop (possibly HBase):
/home/apache-nutch-2.3.1
$ ./runtime/deploy/bin/crawl urls/ crawl/ 1
No SOLRURL specified. Skipping indexing.
which: no hadoop in (<dump of the classpath entries>)
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
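Nutch 2 still needs Hadoop to run; it only lets you store the data somewhere other than HDFS.
The only way to run Nutch without Hadoop is local mode, which is recommended only for testing. To do that, run ./runtime/local/bin/crawl.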
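As a rough illustration of the difference, the sketch below picks the runtime directory depending on whether a `hadoop` executable is on the PATH, which is the same condition the deploy script complained about. The helper name `pick_nutch_runtime` and the default `NUTCH_HOME` are assumptions for the example; the crawl arguments are the ones from the question. It only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "deploy" when a hadoop executable is on
# the PATH, otherwise "local" (the Hadoop-free, testing-only mode).
pick_nutch_runtime() {
  if command -v hadoop >/dev/null 2>&1; then
    echo "deploy"
  else
    echo "local"
  fi
}

# Assumed install location; adjust to your own path.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-2.3.1}"
mode="$(pick_nutch_runtime)"

# Show which crawl script would be used, without starting a crawl.
echo "Would run: $NUTCH_HOME/runtime/$mode/bin/crawl urls/ crawl/ 1"
```

On the setup in the question, with no Hadoop installed, this would select `runtime/local`.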