Integrating Nutch 1.17 with Eclipse (Ubuntu 18.04)
I don't know whether the guide is simply outdated or whether I did something wrong.
I'm new to Nutch; I've already integrated it with Solr from the terminal and crawled/indexed a few websites.
Now I'm trying to use them in a Java application, so I've been following the tutorial here:
https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-RunningNutchinEclipse
I installed Subclipse, IvyDE, and m2e through Eclipse, and I downloaded Ant, so I should have all the prerequisites.
The m2e link in the tutorial is broken, so I found it elsewhere; it turned out Eclipse already shipped with it.
When I run 'ant eclipse' in the terminal, I get a huge wall of error messages.
Because of the character limit, I've put a link to the full error output in a pastebin
here
I'm really not sure what I did wrong.
The directions aren't that complicated, so I honestly don't know where I messed up.
Just in case, here is the nutch-site.xml we are told to modify.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>plugin.folders</name>
<value>/home/user/trunk/build/plugins</value>
</property>
<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>MarketDataCrawler</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value></value>
<description>Any other agents, apart from 'http.agent.name', that the robots
parser would look for in robots.txt. Multiple agents can be provided using
comma as a delimiter. eg. mybot,foo-spider,bar-crawler
The ordering of agents does NOT matter and the robots parser would make
decision based on the agent which matches first to the robots rules.
Also, there is NO need to add a wildcard (ie. "*") to this string as the
robots parser would smartly take care of a no-match situation.
If no value is specified, by default HTTP agent (ie. 'http.agent.name')
would be used for user agent matching by the robots parser.
</description>
</property>
</configuration>
A large chunk of the errors are related to Ivy; I don't know whether the Ivy version Nutch uses is compatible with the plugin installed in Eclipse.
Following the guidance in the log file:
[ivy:resolve] SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.pom
[ivy:resolve] SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.jar
[ivy:resolve] SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.pom
You should use updated repository URLs in ivy/ivy.xml. One option is to change each URL from http to https in ivy.xml.
I suspect you are using an older version; otherwise this issue would already have been fixed.
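As a rough sketch of that fix, the stale `repo1.maven.org` URLs from the log can be rewritten in place. The file paths below are assumptions based on a typical Nutch 1.x source checkout; adjust them to wherever the resolver URLs actually appear in your tree, and keep the `.bak` backups in case something goes wrong.

```shell
# Hypothetical sketch: switch Maven Central resolver URLs from http to
# https in Nutch's Ivy configuration. The listed files are assumptions;
# -i.bak edits in place and keeps a backup copy of each file.
sed -i.bak 's|http://repo1\.maven\.org|https://repo1.maven.org|g' \
    ivy/ivy.xml ivy/ivysettings.xml
```

After the change, re-running `ant eclipse` should let Ivy resolve the SLF4J artifacts over HTTPS instead of failing with "SERVER ERROR: HTTPS Required".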