Crawling video with Apache Nutch
How can I get Apache Nutch to fetch the sources of a video tag like this:
<video width="320" height="240" controls>
<source src="video/video.mp4" type="video/mp4">
<source src="video/video.ogg" type="video/ogg">
Your browser does not support the video tag.
</video>
Apache Nutch picks up image tags, but it does not pick up the sources inside a video tag. Can anyone guide me?
Thanks for your help.
You need to add the following to parse-plugins.xml:
<mimeType name="video/mp4">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="video/ogg">
<plugin id="parse-tika" />
</mimeType>
Then add parse-tika to the plugin.includes property in nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|urlnormalizer-(pass|regex|basic)</value>
</property>
I solved this by adding the source tag to the link parameters in the parse-html plugin (DOMContentUtils.java):
linkParams.put("frame", new LinkParams("frame", "src", 0));
linkParams.put("iframe", new LinkParams("iframe", "src", 0));
linkParams.put("script", new LinkParams("script", "src", 0));
linkParams.put("link", new LinkParams("link", "href", 0));
linkParams.put("img", new LinkParams("img", "src", 0));
linkParams.put("source", new LinkParams("source", "src", 0));
Then rebuild with ant.
Hope this helps someone else.
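For anyone wondering why registering the tag makes a difference, here is a toy sketch of the idea: a map from tag name to link attribute decides which elements yield outlinks, so a tag missing from the map is ignored. This is not Nutch's code. The class LinkParamsSketch and the method extractLinks are hypothetical names, and it uses a simple regex scan purely for illustration, whereas the parse-html plugin works on a parsed DOM.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration only (hypothetical names); not Nutch's DOMContentUtils. */
final class LinkParamsSketch {

    // Matches an opening tag, capturing its name and its raw attribute text.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b([^>]*)>");

    /** Returns, in document order, the link attribute value of every registered tag. */
    static List<String> extractLinks(String html, Map<String, String> tagToAttr) {
        List<String> links = new ArrayList<>();
        Matcher tag = OPEN_TAG.matcher(html);
        while (tag.find()) {
            String attr = tagToAttr.get(tag.group(1).toLowerCase());
            if (attr == null) {
                continue; // tag is not registered, so it contributes no link
            }
            // Look for attr="value" (double-quoted only) inside this tag.
            Matcher value = Pattern
                    .compile("\\b" + Pattern.quote(attr) + "\\s*=\\s*\"([^\"]*)\"")
                    .matcher(tag.group(2));
            if (value.find()) {
                links.add(value.group(1));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<img src=\"logo.png\">"
                + "<video width=\"320\" height=\"240\" controls>"
                + "<source src=\"video/video.mp4\" type=\"video/mp4\">"
                + "<source src=\"video/video.ogg\" type=\"video/ogg\">"
                + "</video>";

        Map<String, String> params = new LinkedHashMap<>();
        params.put("img", "src");
        System.out.println(extractLinks(html, params)); // only the image link

        params.put("source", "src"); // the extra registration from the answer above
        System.out.println(extractLinks(html, params)); // now includes both video sources
    }
}
```

With only img registered, the video sources are skipped. Once source is registered as well, both src values show up as links, which mirrors the effect of the extra linkParams.put line.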