How to read data from sequence files using Spark Streaming

Question

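I have sequence files whose keys are Text and whose values are a custom data type.

However, I cannot get Spark Streaming to read data from these sequence files.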

JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
          @Override
          public Boolean call(Path v1) throws Exception {
            return Boolean.TRUE;
          }
        }, false);

The IDE reports the following error.

Bound mismatch: The generic method fileStream(String, Class<K>, Class<V>, Class<F>, Function<Path,Boolean>, boolean) of type JavaStreamingContext is not applicable for the arguments (String, Class<Text>, Class<DeltaCounter>, Class<SequenceFileInputFormat>, new Function<Path,Boolean>(){}, boolean). The inferred type SequenceFileInputFormat is not a valid substitute for the bounded parameter <F extends InputFormat<K,V>>

How can I read sequence files in Spark Streaming?

Answer 1

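You need to import SequenceFileInputFormat from the right package. fileStream expects an InputFormat from the new org.apache.hadoop.mapreduce API, and you are probably importing the one from the old org.apache.hadoop.mapred package. Use this import instead: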

import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
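
To illustrate why the compiler rejects the old class, here is a minimal sketch that uses hypothetical stand-in types rather than the real Hadoop or Spark classes. It assumes that the old and new APIs each define their own unrelated InputFormat type, and that fileStream bounds F by the new one, as the error message suggests. All names in it (BoundMismatchSketch, OldInputFormat, NewInputFormat, OldSequenceFormat, NewSequenceFormat, satisfiesBound) are made up for the example.

```java
// Hypothetical stand-ins only: none of these are the real Hadoop or Spark classes.
class BoundMismatchSketch {

    // Models the old API (org.apache.hadoop.mapred), where InputFormat is an interface.
    interface OldInputFormat<K, V> {}
    static class OldSequenceFormat implements OldInputFormat<String, Long> {}

    // Models the new API (org.apache.hadoop.mapreduce), where InputFormat is an abstract class.
    static abstract class NewInputFormat<K, V> {}
    static class NewSequenceFormat extends NewInputFormat<String, Long> {}

    // Same shape of bound as in the error message: F must extend the new-API InputFormat<K, V>.
    static <K, V, F extends NewInputFormat<K, V>> String fileStream(
            Class<K> keyClass, Class<V> valueClass, Class<F> formatClass) {
        return formatClass.getSimpleName();
    }

    // Runtime mirror of the compile-time bound check, so the difference can be observed.
    static boolean satisfiesBound(Class<?> candidate) {
        return NewInputFormat.class.isAssignableFrom(candidate);
    }

    public static void main(String[] args) {
        // Compiles: NewSequenceFormat extends NewInputFormat<String, Long>.
        System.out.println(fileStream(String.class, Long.class, NewSequenceFormat.class));

        // Would not compile (bound mismatch), because OldSequenceFormat is outside the bound:
        // fileStream(String.class, Long.class, OldSequenceFormat.class);

        System.out.println(satisfiesBound(NewSequenceFormat.class));
        System.out.println(satisfiesBound(OldSequenceFormat.class));
    }
}
```

The two stand-in format classes have similar names but no common supertype that satisfies the bound, which is the situation the "Bound mismatch" message describes. Changing only the import is enough because the call site itself stays the same.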

apache-spark

spark-streaming