Nutch 2.3 not storing crawl data correctly in Cassandra
I am running a crawler using mostly the default options of Nutch 2.3 with a Cassandra backend. The seed list is a file containing 71 URLs, and I am crawling with the following command:
bin/crawl ~/dev/urls/ crawlid1 5
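The keys are stored in Cassandra and the f, p and sc column families are created. However, when I try to read the WebPage objects, the content and text fields are empty, even though the output indicates that the fetch job and, supposedly, the parser job ran.
Also, no new links are added to the link database, although the default value of db.update.additions.allowed is true.
When the crawl has finished, I try to read out the crawled data with the code below. It shows only some of the fields being populated. Looking at the code in FetcherJob and ParserJob, I don't see any reason why the content or text fields should be empty. I am probably missing some basic setting, but googling my problem did not turn up anything. I also set breakpoints in ParserMapper and FetcherMapper, and they do seem to get executed.
Does anyone know how to store fetched/parsed content in Cassandra with Nutch 2?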
import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.gora.util.GoraException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Reads the rows from a {@link DataStore} as a {@link WebPage}.
 *
 * @author Jeroen Vlek, jv@datamantics.com Created: Feb 25, 2015
 *
 */
public class NutchWebPageReader implements Closeable {

    private static final Logger LOGGER = LoggerFactory.getLogger(NutchWebPageReader.class);

    DataStore<String, WebPage> dataStore;

    /**
     * Initializes the datastore field with the {@link Configuration} as defined
     * in gora.properties in the classpath.
     */
    public NutchWebPageReader() {
        try {
            dataStore = DataStoreFactory.getDataStore(String.class, WebPage.class, new Configuration());
        } catch (GoraException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        Map<String, WebPage> pages = null;
        try (NutchWebPageReader pageReader = new NutchWebPageReader()) {
            pages = pageReader.getAllPages();
        } catch (IOException e) {
            LOGGER.error("Could not close page reader.", e);
        }
        LOGGER.info("Found {} results.", pages.size());

        for (Entry<String, WebPage> entry : pages.entrySet()) {
            String key = entry.getKey();
            WebPage page = entry.getValue();
            String content = "null";
            if (page.getContent() != null) {
                // Assign the decoded bytes; without the assignment the log
                // line below always prints "null".
                content = new String(page.getContent().array(), UTF_8);
            }
            LOGGER.info("{} with content {}", key, content);
        }
    }

    /**
     * @return all rows in the data store, keyed by row key
     */
    public Map<String, WebPage> getAllPages() {
        Query<String, WebPage> query = dataStore.newQuery();
        Result<String, WebPage> result = query.execute();
        Map<String, WebPage> resultMap = new HashMap<>();
        try {
            while (result.next()) {
                // Fetch each row again by key instead of using result.get().
                resultMap.put(result.getKey(), dataStore.get(result.getKey()));
            }
        } catch (Exception e) {
            LOGGER.error("Something went wrong while processing the query result.", e);
        }
        return resultMap;
    }

    /*
     * (non-Javadoc)
     *
     * @see java.io.Closeable#close()
     */
    @Override
    public void close() throws IOException {
        dataStore.close();
    }
}
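A side note on decoding the content field in the reader above: ByteBuffer.array() returns the whole backing array regardless of the buffer's position and limit, and it throws for read-only or direct buffers. The following is a minimal sketch of a decode that only reads the bytes between position and limit, using nothing but the JDK. The class and method names (ContentDecoder, decode) are hypothetical and not part of Nutch or Gora, and it assumes the content really is UTF-8, as the reader above does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecoder {

    /** Decodes the readable bytes of a buffer as UTF-8 without moving its position. */
    public static String decode(ByteBuffer buffer) {
        if (buffer == null) {
            return "null";
        }
        // duplicate() shares the bytes but has its own position and limit,
        // so the caller's buffer is left untouched.
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] raw = "xx<html>ok</html>yy".getBytes(StandardCharsets.UTF_8);
        // A buffer that exposes only the middle part of the backing array.
        ByteBuffer slice = ByteBuffer.wrap(raw, 2, raw.length - 4);
        System.out.println(decode(slice));
        // For comparison: array() ignores position and limit.
        System.out.println(new String(slice.array(), StandardCharsets.UTF_8));
    }
}
```

In the reader, a call such as ContentDecoder.decode(page.getContent()) could then replace the new String(...) expression. This does not address the empty fields themselves; it only rules out one source of confusing output once content is actually stored.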
Here is my nutch-site.xml:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.cassandra.store.CassandraStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>Nibbler</value>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
<description>If true, fetcher will log more verbosely.</description>
</property>
<property>
<name>fetcher.parse</name>
<value>true</value>
<description>If true, fetcher will parse content. NOTE: previous
releases would
default to true. Since 2.0 this is set to false as a safer default.</description>
</property>
<property>
<name>http.content.limit</name>
<value>999999999</value>
</property>
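EDIT
I was using Cassandra 2.0.12, but I just tried it with 2.0.2 and that did not solve the problem. So the versions I am using are:
- Nutch: 2.3 (git clone, checked out at tag "release-2.3")
- Gora: 0.5 (in Nutch)
- Cassandra: 2.0.2
Changing result.get() to dataStore.get(result.getKey()) results in some of the fields actually being filled, but content and text are still empty.
Some output: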
[jvlek@orochimaru nutch]$ runtime/local/bin/nutch inject ~/dev/urls/
InjectorJob: starting at 2015-03-02 18:34:29
InjectorJob: Injecting urlDir: /home/jvlek/dev/urls
InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 69
Injector: finished at 2015-03-02 18:34:32, elapsed: 00:00:02
[jvlek@orochimaru nutch]$ runtime/local/bin/nutch readdb -url http://www.wired.com/
key: http://www.wired.com/
baseUrl: null
status: 0 (null)
fetchTime: 1425317669727
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ??
[jvlek@orochimaru nutch]$ runtime/local/bin/nutch generate -batchId 1
GeneratorJob: starting at 2015-03-02 18:34:50
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2015-03-02 18:34:54, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1 containing 66 URLs
[jvlek@orochimaru nutch]$ runtime/local/bin/nutch readdb -url http://www.wired.com/
key: http://www.wired.com/
baseUrl: null
status: 0 (null)
fetchTime: 1425317669727
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker _gnmrk_ : 1
marker dist : 0
reprUrl: null
batchId: 1
metadata _csh_ : ??
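The fetchTime and fetchInterval values in the readdb output above are raw numbers. As a small sketch, assuming fetchTime is in epoch milliseconds and fetchInterval is in seconds (which is consistent with Nutch's default fetch interval of 2592000 seconds, i.e. 30 days), they can be made readable with java.time. The class and method names here are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

public class FetchTimeDemo {

    /** Interprets a readdb fetchTime value as epoch milliseconds. */
    public static Instant toInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    /** Interprets a readdb fetchInterval value as seconds and returns whole days. */
    public static long toDays(long intervalSeconds) {
        return Duration.ofSeconds(intervalSeconds).toDays();
    }

    public static void main(String[] args) {
        // Values copied from the readdb output above.
        System.out.println(toInstant(1425317669727L)); // 2015-03-02T17:34:29.727Z
        System.out.println(toDays(2592000L));          // 30
    }
}
```

The resulting instant, 17:34:29 UTC, lines up with the InjectorJob start time of 18:34:29 if the machine's clock is at UTC+1, so the page is due for fetching as soon as it is injected.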
Which version of Gora are you using?
Could you please drop the database and execute:
nutch inject ~/dev/urls/
nutch generate -batchId 1
nutch fetch 1
Then
nutch readdb -url <some known url> -content
Does it show the right information? If the answer is yes, then do:
nutch parse 1
nutch updatedb
nutch readdb -url <some known url> -content
This is a bug in Gora. A blocker ticket has been opened: