StormCrawler: crawling only the main content of a page

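By default, the crawler indexes the whole page, including the header and footer that are common to every page. Our requirement is that the crawler should index only the main content of the page (everything under div#body-wrapper).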

We implemented this with the following entry in parsefilters.json:

{
  "class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter",
  "name": "ContentFilter",
  "params": {
    "pattern": "//DIV[@id=\"body-wrapper\"]",
    "pattern2": "//DIV[@itemprop=\"articleBody\"]",
    "pattern3": "//ARTICLE"
  }
}

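After updating parsefilters.json, the crawler indexes only that div, but the content now includes all the whitespace, line breaks, JS, CSS code and so on, as shown below.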

"content" : "\n\t\t\t\n\n\t\t\t\t\n\t\t\t\t\t Grow your business ..................... \n\n\n\n\n\n\t\n\t\t\n\t\t\t [...] \n\t\n\n\t\n\t\t\n\t\n.landing-page-indicators { \n\ttop:inherit !important;\n}\n\n\t.slide-share .slide-share-indicators li {\n\t width:10px;\n\t height:10px;\n\t border-radius: 10px;\n\t border: none;\n\t margin: 0px 0 0 14px;\n}\n.slide-share .cta-btn-inline { \n margin-left:0px;\n}\n .slide-share .slide-share-indicators .active {\n\t background-color: #f33;\n}\n .slide-share .slide-share-item-img {\n\t width:100%;\n\t height:360px;\n\t max-height:370px;\n\t background-size:cover;\n\t background-position: center;\n}\n .slide-share .carousel-indicators {\n\t margin-bottom: 0px;\n\t bottom: 24px;\n}\n .slide-share .slide-share-item-caption {\n\t width: 100%;\n\t -webkit-transition: height 0.4s ease;\n\t transition: height 0.4s ease;\n\t padding: 24px 16px;\n\t padding-bottom:0px;\n\t position: absolute;\n\t bottom: 5%;\n\t display: block;\n\t color:black;\n}\n .slide-share .slide-share-item-caption:hover {\n\t text-decoration: none;\n}\n .slide-share .slide-share-item-desc {\n\t max-width: 992px;\n\t width:100%;\n\t position:relative;\n\t margin:0 auto;\n}\n .slide-share .slide-share-item-desc h2 { \n\t margin-bottom: 8px;\n\t font-size: 36px;\n\t font-weight: 700;\n}\n .slide-share .slide-share-item-desc p {\n\t line-height: 1.5;\n\t margin-bottom: 24px;\n\t font-size:24px;\n\t font-weight: 400;\n\t width:60%;\n}\n .slide-share .slide-share-arrows {\n\t top:50px;\n\t margin:30px;\n\t width:0;\n\t align-items: initial;\n}\n .slide-share .slide-share-arrow-icon {\n\t color:#fff;\n\t font-size:25px;\n\t margin-top: 75px;\n}\n.slide-share .slide-share-item-desc {\n background-color: transparent;\n}\n .slide-share .slide-share-arrow-icon:hover {\n\t color: #ee1818;\n\t font-size: 25px;\n}\n\n.slide-share .carousel-item .shade { \n width:60%;\n height:100%;\n position:absolute;\n background-image: linear-gradient(to right, #2e2e2e, transparent);\n opacity: .6;\n \n}\n\n @media (max-width: 991px) and (min-width: 768px) { \n\t .slide-share .slide-share-item-desc h2 {\n\t\t width: 100%;\n\t}\n\t .slide-share .slide-share-item-desc p {\n\t\t width: 100%;\n\t}\n}\n @media (max-width: 768px) {\n\t .slide-share .slide-share-item-desc h2 {\n\t\t width: 100%;\n\t\t font-size: 24px;\n\t\t margin-bottom: 16px;\n\n\t}\n\t .slide-share .slide-share-item-desc p {\n\t\t font-size:16px;\n\t\t display:none;\n\t}\n\t.slide-share-item-img.left-center {\n\tbackground-position: left center;\n\t} \n\n\t.slide-share-item-img.right-center {\n\tbackground-position: right center;\n\t} \n\t.slide-share-item-img.center-center {\n\tbackground-position: center center;\n\t}\n}\n \n\n\n\n\n \n\t\n\t\t\n\t\t\t\t\n\t\t\t\t\t [...]

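However, when the crawler indexes the whole page (the default configuration), it does not add the whitespace, line breaks, JS, CSS code and so on.

How can we index only part of the page, without the whitespace, line breaks, JS, CSS, etc.?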

Please advise.

Thanks.

As of StormCrawler 1.13, the ContentFilter has been deprecated in favour of the TextExtractor.

According to the release notes:

[...] the main new feature is the addition of the TextExtractor (#678) for the JsoupParserBolt. Unlike the ContentParseFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter as it operates directly on the objects generated by Jsoup. The TextExtractor allows restricting the text to specific elements to avoid boilerplate code and navigation elements but provides a far cleaner text content compared to the ContentParseFilter which merges some tokens. The TextExtractor can also be used to define exclusion zones which will be applied either to the restricted zones or the whole document if no such zone were defined or found. This is useful for instance to remove SCRIPT or STYLE elements.

The configuration generated by the archetype uses the TextExtractor, set up to do much the same as the ContentFilter used to do.
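For the div#body-wrapper case in the question, the equivalent TextExtractor setup might look like the sketch below. It assumes the key names textextractor.include.pattern and textextractor.exclude.tags as they appear in the archetype's crawler-conf.yaml, and that the include patterns are Jsoup-style selectors rather than the XPath expressions used in parsefilters.json; check both against the configuration shipped with your StormCrawler version.

```yaml
# crawler-conf.yaml (main configuration, not parsefilters.json)
config:
  # Restrict the extracted text to these elements.
  # The first entry mirrors the div from the question; the other two
  # are carried over from the ContentFilter patterns.
  textextractor.include.pattern:
    - DIV[id="body-wrapper"]
    - DIV[itemprop="articleBody"]
    - ARTICLE

  # Exclusion zones: drop these elements from the extracted text,
  # which keeps inline CSS and JS out of the "content" field.
  textextractor.exclude.tags:
    - STYLE
    - SCRIPT
```

With this in place, the ContentFilter entry in parsefilters.json should no longer be needed.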