What should a morphline for MapReduceIndexerTool look like?

Question

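I want to search a large volume of logs efficiently (about 1 TB, spread across multiple machines).

To do that, I want to build an infrastructure made up of Flume, Hadoop and Solr. Flume will collect the logs from several machines and put them into HDFS.

Next, I want to index these logs with a MapReduce job so that I can search them using Solr. I found that MapReduceIndexerTool does this for me, but it requires a morphline.

I know that a morphline generally performs a set of operations on the data it receives, but which operations should it perform if I want to use it with MapReduceIndexerTool?

I can't find any example of a morphline written for this MapReduce job.

Thank you.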

Answer 1

Cloudera has a guide whose morphline section describes a use case almost identical to yours.

In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
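The readLine, grok and loadSolr chain described in the passage above could be written roughly as the morphline below. This is only a sketch under assumptions, not the guide's exact file: the collection name and ZooKeeper address are placeholders, the `dictionaryFiles` path is hypothetical and must point at a directory containing the standard grok pattern dictionaries, and the extracted field names (`priority`, `timestamp`, `hostname`, `program`, `pid`, `msg`) are only indexed if your Solr schema defines them. The `generateUUID` step is added here on the assumption that the log lines carry no natural unique key.

```hocon
# Placeholder Solr connection details; adjust to your cluster.
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Emit one record per line of the input file; the text lands in the "message" field.
      { readLine { charset : UTF-8 } }

      # Split a syslog-style line into named fields.
      {
        grok {
          # Hypothetical path to the grok pattern dictionaries.
          dictionaryFiles : [/local/path/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
          }
        }
      }

      # Assumption: the schema's unique key is "id" and the logs have none, so generate one.
      { generateUUID { field : id } }

      # Drop any record field that the Solr schema does not define.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # Hand the record to Solr for indexing.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

For log formats other than syslog, only the grok expression should need to change; the read, sanitize and load steps stay the same.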

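The guide presents this as the typical use case: production tools such as MapReduceIndexerTool, the Apache Flume Morphline Solr Sink, the Apache Flume MorphlineInterceptor and the Morphline Lily HBase Indexer run a morphline as part of their operation, as outlined in the guide's figure.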

Answer 2

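In general, a morphline only needs to read your data, convert it into Solr documents, and then call loadSolr to create the index.

For example, this is the morphline file I use with MapReduceIndexerTool to load Avro data into Solr: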

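# Solr connection details shared by the commands below
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        # Parse the Avro container file and emit one record per Avro object
        readAvroContainer {}
      }
      {
        # Map Avro paths (right) to Solr document fields (left)
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }
      {
        # Remove record fields that are not defined in the Solr schema
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      {
        # Load the record into Solr
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]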

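When run, it reads the Avro container, maps the Avro fields to Solr document fields, drops all other fields, and uses the provided Solr connection details to create the index. It is based on this tutorial.

This is the command I use to index the files and merge them into the running collection: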

sudo -u hdfs hadoop --config /etc/hadoop/conf \
jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file /local/path/morphlines_file  \
--output-dir hdfs://localhost/mrit/out \
--zk-host localhost:2181/solr \
--collection collection1 \
--go-live \
hdfs:/mrit/in/my-avro-file.avro

Solr must be configured to work with HDFS, and the collection must already exist.

This whole setup works for me with Solr 4.10 on CDH 5.7 Hadoop.

hadoop

mapreduce

morphline