Apache Tika 1.16 TXTParser fails to detect character encoding in an sbt assembly build

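I am building a project in Eclipse using sbt assembly. My build.sbt file is very large and complicated because I had to resolve a lot of dependency conflicts.

With the PDF, OOXML and OpenDocument parsers from Tika 1.16, everything works for pdf, pptx, odt and docx files. However, when I try to parse a txt file (UTF-8 encoded) with TXTParser, I get the following error: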

org.apache.tika.exception.TikaException: Failed to detect the character encoding of a document
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:77)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:108)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:114)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:79)

It is thrown by this line in my Scala code:

val content = theParser.parse(stream.open(), chandler, meta, pContext)

where stream is a PortableDataStream, chandler is a new BodyContentHandler, meta is a new Metadata, and pContext is a new ParseContext.

If I use AutoDetectParser instead, I get the following error:

org.apache.jena.shared.SyntaxError: unknown
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:73)
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:58)
    at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)

It is thrown by this line in my Scala code:

val response = model.read(stream, null, "N-TRIPLES")

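where stream is an InputStream.

I think this is caused by an empty response from Tika (so it is the same underlying problem).

I am fairly sure this is a dependency problem in my overcomplicated build.sbt file, but after many hours of trying I really need help.

On the positive side, everything works perfectly when the input contains no txt files, so this may be my last issue!

Finally, here is my build.sbt file, which I build with sbt clean assembly: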

scalaVersion := "2.11.8"
version      := "1.0.0"
name := "crawldocs"
conflictManager := ConflictManager.strict
mainClass in assembly := Some("com.addlesee.crawling.CrawlHiccup")
libraryDependencies ++= Seq(
  "org.apache.tika" % "tika-core" % "1.16",
  "org.apache.tika" % "tika-parsers" % "1.16" excludeAll(
    ExclusionRule(organization = "*", name = "guava")
  ),
  "com.blazegraph" % "bigdata-core" % "2.0.0" excludeAll(
    ExclusionRule(organization = "*", name = "collection-0.7"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "commons-logging"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "httpmime"),
    ExclusionRule(organization = "*", name = "jackson-annotations"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-cmds"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "jena-tdb"),
    ExclusionRule(organization = "*", name = "jsonld-java"),
    ExclusionRule(organization = "*", name = "libthrift"),
    ExclusionRule(organization = "*", name = "log4j"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "xercesImpl"),
    ExclusionRule(organization = "*", name = "xml-apis")
  ),
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.apache.jena" % "apache-jena" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.jena" % "apache-jena-libs" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.noggit" % "noggit" % "0.6",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2" excludeAll(
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.spark" % "spark-core_2.11" % "2.2.0" excludeAll(
    ExclusionRule(organization = "*", name = "breeze_2.11"),
    ExclusionRule(organization = "*", name = "hadoop-hdfs"),
    ExclusionRule(organization = "*", name = "hadoop-annotations"),
    ExclusionRule(organization = "*", name = "hadoop-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-app"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-core"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-jobclient"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-shuffle"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-client"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-web-proxy"),
    ExclusionRule(organization = "*", name = "activation"),
    ExclusionRule(organization = "*", name = "hive-exec"),
    ExclusionRule(organization = "*", name = "scala-compiler"),
    ExclusionRule(organization = "*", name = "spire_2.11"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "guava"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "bcprov-jdk15on"),
    ExclusionRule(organization = "*", name = "jul-to-slf4j"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "curator-framework")
  ),
  "org.scala-lang" % "scala-xml" % "2.11.0-M4",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "netty")
  ),
  "org.apache.hadoop" % "hadoop-common" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-math3"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jets3t"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "commons-net"),
    ExclusionRule(organization = "*", name = "curator-recipes"),
    ExclusionRule(organization = "*", name = "jsr305")
  )
)
assemblyMergeStrategy in assembly := {
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}

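The code above ends up calling the old N-Triples parser, which exists only for legacy reasons. That old reader is ASCII only, so UTF-8 input breaks it.

Either apache-jena-libs (which is type=pom) is not being processed, or you are repackaging the jars and have not handled META-INF/services, the directory where Java's ServiceLoader keeps its registration files. Jena uses these files for initialization. You must merge the META-INF/services/* files by concatenating files that have the same name.

Details: https://jena.apache.org/documentation/notes/jena-repack.html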
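As a rough sketch of the concatenation described above, the merge strategy could treat META-INF/services separately from the rest of META-INF. This assumes the sbt-assembly plugin and the same `in assembly` syntax used in the question; it is not the fix the poster ended up with, and it has not been tested against this particular dependency set. `MergeStrategy.filterDistinctLines` is an alternative to `MergeStrategy.concat` if duplicate provider lines are a concern.

```scala
// build.sbt fragment (sbt-assembly); a sketch under the assumptions above
assemblyMergeStrategy in assembly := {
  // ServiceLoader registration files: concatenate files that share a name
  // so that every provider from every jar stays registered.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  // Everything else under META-INF (manifests, signature files) is dropped.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // For any other conflicting path, keep the first copy found.
  case _ => MergeStrategy.first
}
```

The order of the cases matters: the services case has to come before the general META-INF case, otherwise the registration files are discarded before they can be merged.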

Finally fixed it...

I added case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate above the discard line of the merge strategy. The following assemblyMergeStrategy at the bottom of build.sbt solved my problem:

assemblyMergeStrategy in assembly := {
 case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}
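
To check whether the registration file actually survived the assembly, one option is to ask the class loader for it at runtime. The sketch below uses only the JDK and the Scala standard library; the object name ServiceFileCheck and its helper methods are made up for illustration, and the service name is the Tika interface that the merge rule above matches on.

```scala
import java.net.URL
import scala.collection.JavaConverters._
import scala.io.Source

// Hypothetical helper, not part of Tika or sbt-assembly.
object ServiceFileCheck {

  // Path under which java.util.ServiceLoader-style registrations are stored.
  def serviceFilePath(serviceName: String): String =
    s"META-INF/services/$serviceName"

  // Every copy of the registration file visible to this class loader.
  def locations(serviceName: String): List[URL] =
    getClass.getClassLoader.getResources(serviceFilePath(serviceName)).asScala.toList

  // Provider class names listed in those files, with comments and blank lines removed.
  def providers(serviceName: String): List[String] =
    locations(serviceName).flatMap { url =>
      val src = Source.fromURL(url, "UTF-8")
      try src.getLines().map(_.takeWhile(_ != '#').trim).filter(_.nonEmpty).toList
      finally src.close()
    }.distinct

  def main(args: Array[String]): Unit = {
    val name = "org.apache.tika.detect.EncodingDetector"
    val found = providers(name)
    if (found.isEmpty) println(s"No providers registered for $name")
    else found.foreach(p => println(s"Registered provider: $p"))
  }
}
```

Running this from inside the assembled jar should list at least one provider if the merge strategy kept the file; an empty result would be consistent with the "Failed to detect the character encoding" error from the question.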