Find out the MIME type of compressed files downloaded from S3 in Java

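Clients are supposed to upload a compressed file to an S3 folder. The compressed file is then downloaded and decompressed so that various operations can be performed on the files it contains. Initially we told our client to compress its files into a ZIP file, but that turned out to be too difficult for our client. Instead, it submitted a RAR file with a ZIP extension... how clever. For obvious reasons, one cannot decompress a RAR file with a ZIP decompression algorithm.

So I am looking for a way to find out the file type of a file downloaded from S3. I am using Amazon's SDK in a Java project running on Linux. Depending on the detected file type, I will decide how to decompress the file.

I have looked at many Stack Overflow questions, such as this one, but judging by the answers (and their comments), none seems to work 100% of the time.

What is the best way to find out the type of a compressed file?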

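TL;DR

When uploading a file to Amazon S3 programmatically, one can specify the object's Content-Type. If none is specified, as @Michael-bot clarified, the default value assigned is binary/octet-stream. Alternatively, if one uploads the file through Amazon S3's GUI, the file gets its Content-Type from its file extension (unfortunately, not from its content). If you trust whoever uploads the file to set the Content-Type correctly, go ahead and look at ObjectMetadata, but if you cannot (like me), you need another solution.

So, if you are looking for a solution that works for the most common compression formats, Files.probeContentType, Apache Tika and SimpleMagic seem to be acceptable solutions.

In the end I chose Files.probeContentType because it requires no additional library and works well on a Linux machine (as long as the file does not have the wrong extension, for which there is a workaround: remove the file extension and let it do its magic).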


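Test setup

At first one would think that the response object obtained when downloading a file from Amazon S3 contains the file type. It does contain this information, but a problem arises when the file's extension does not match its content.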

import com.amazonaws.services.s3.model.S3Object;

final S3Object s3Object = ...;
final String contentType = s3Object.getObjectMetadata().getContentType();

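Even if the content of the file is a RAR file, this code returns application/zip. So this solution did not work for me.

For this reason, I took the time to build a sample project that tests various scenarios using different approaches and available libraries. By the way, I am using Java 8.

The file types tested are:

  • A Zip file with a Zip extension and without an extension
  • A Rar file with a Rar extension, with a Zip extension, and without an extension
  • A 7z file with a 7z extension, with a Zip extension, and without an extension
  • A Tar.xz file with a Tar.xz extension, with a Zip extension, and without an extension
  • A Tar.gz file with a Tar.gz extension, with a Zip extension, and without an extension

Note that the implementations presented here are for testing purposes only. They are in no way endorsed for use in production code, since they do not take into account file locking issues and other problems my imagination was too lazy to think of. =)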
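
The sample project itself is not shown in this post. As a rough illustration of how such a comparison could be wired up, here is a minimal sketch; the class name DetectionHarness, its methods, and the sample paths are hypothetical and not taken from the original project. It runs one detector over a set of labelled files and prints lines in the same layout as the result listings in this post.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness; not the author's actual sample project.
final class DetectionHarness {

    // Formats one result line; the label column is padded to 33 characters,
    // which matches the alignment of the result listings in this post.
    static String formatLine(final String label, final String mimeType) {
        return String.format("%-33s%s", label + " is:", mimeType);
    }

    // Runs one detector over every sample and returns the formatted report.
    static String report(final Map<String, Path> samples,
                         final Function<Path, String> detector) {
        final StringBuilder out = new StringBuilder();
        for (final Map.Entry<String, Path> sample : samples.entrySet()) {
            String result;
            try {
                // String.valueOf turns a null result into the text "null".
                result = String.valueOf(detector.apply(sample.getValue()));
            } catch (final RuntimeException exception) {
                result = "<EXCEPTION: " + exception.getMessage() + ">";
            }
            out.append(formatLine(sample.getKey(), result))
               .append(System.lineSeparator());
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        // Assumed sample locations; adjust to wherever the test files live.
        final Map<String, Path> samples = new LinkedHashMap<>();
        samples.put("Rar with Zip extension", Paths.get("samples/rar-as-zip.zip"));
        samples.put("Zip without extension", Paths.get("samples/zip-no-ext"));

        // Placeholder detector; swap in any of the approaches being compared.
        System.out.print(report(samples, path -> "application/octet-stream"));
    }
}
```

Passing the detector as a Function keeps each approach a one-line swap, so every candidate is exercised against exactly the same files.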


MimetypesFileTypeMap

Implementation

import java.io.File;
import javax.activation.MimetypesFileTypeMap;

final File file = new File(basePath + "/" + fileName);
try {
    return MimetypesFileTypeMap.getDefaultFileTypeMap().getContentType(file);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/octet-stream
Rar with Zip extension is:       application/octet-stream
Zip with Zip extension is:       application/octet-stream
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/octet-stream
Rar without extension is:        application/octet-stream
Zip without extension is:        application/octet-stream
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/octet-stream

Conclusion

The value this method returns when it cannot recognize the file type is application/octet-stream. It fails in every scenario, so we should discard this approach.


URLConnection.guessContentTypeFromStream

Implementation

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URLConnection;

final File file = new File(basePath + "/" + fileName);
// BufferedInputStream supports mark/reset, which guessContentTypeFromStream requires;
// try-with-resources closes the stream afterwards.
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
    return URLConnection.guessContentTypeFromStream(inputStream);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       null
Rar with Zip extension is:       null
Zip with Zip extension is:       null
7z with 7z extension is:         null
7z with Zip extension is:        null
Tar.xz with Tar.xz extension is: null
Tar.xz with Zip extension is:    null
Tar.gz with Tar.gz extension is: null
Tar.gz with Zip extension is:    null
Rar without extension is:        null
Zip without extension is:        null
7z without extension is:         null
Tar.xz without extension is:     null
Tar.gz without extension is:     null

Conclusion

Again, this method fails in all scenarios. It seems its support is very limited.
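
The idea behind this method, looking at the first bytes of a stream, can also be applied by hand to the formats tested here. The following is only a minimal sketch, not something the tests in this post cover: the class ArchiveSniffer and its methods are hypothetical, and it assumes the well-known leading signatures of ZIP, RAR, 7z, xz and gzip. Like guessContentTypeFromStream, it returns null when nothing matches.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: matches leading bytes against known archive signatures.
final class ArchiveSniffer {

    private static final int[][] SIGNATURES = {
        {0x50, 0x4B, 0x03, 0x04},             // ZIP local file header (empty ZIPs start differently)
        {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07}, // RAR: "Rar!" 0x1A 0x07
        {0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C}, // 7z
        {0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00}, // xz
        {0x1F, 0x8B},                         // gzip
    };

    private static final String[] MIME_TYPES = {
        "application/zip",
        "application/x-rar-compressed",
        "application/x-7z-compressed",
        "application/x-xz",
        "application/gzip",
    };

    // Peeks at up to six bytes and rewinds, so the caller can still read the stream.
    static String sniff(final InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        final byte[] head = new byte[6];
        int read = 0;
        in.mark(head.length);
        try {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (read < head.length
                    && (n = in.read(head, read, head.length - read)) != -1) {
                read += n;
            }
        } finally {
            in.reset();
        }
        return match(head, read);
    }

    // Returns the MIME type of the first matching signature, or null if none matches.
    static String match(final byte[] head, final int length) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(head, length, SIGNATURES[i])) {
                return MIME_TYPES[i];
            }
        }
        return null;
    }

    private static boolean startsWith(final byte[] head, final int length, final int[] signature) {
        if (length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((head[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] gzipHeader = {(byte) 0x1F, (byte) 0x8B, 0x08};
        final InputStream in = new BufferedInputStream(new ByteArrayInputStream(gzipHeader));
        System.out.println(sniff(in)); // prints application/gzip
    }
}
```

Note that a Tar.gz or Tar.xz file is only recognized by its outer compression layer here, which matches what the content-based tools later in this post report for files without an extension.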


Files.probeContentType

Implementation

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

try {
    final Path path = Paths.get(basePath + "/" + fileName);
    return Files.probeContentType(path);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/vnd.rar
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: application/x-xz-compressed-tar
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/x-compressed-tar
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        application/vnd.rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

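Conclusion

This method works surprisingly well, but do not be fooled: there are scenarios where it fails consistently. If the file has the wrong extension (one that does not match its content), it reports the file type implied by the extension. This should not happen often, but if you are picky, do not use this method.

Also, some warn that this approach doesn't work well on Windows.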

Workaround: If one manages to remove the extension from the filename, this would return the proper value for all the given scenarios.
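
One way this workaround could be sketched is shown below. It is only an illustration under assumptions: the class ExtensionlessProbe and its methods are hypothetical, the file is copied into a temporary directory under a name without an extension, and what Files.probeContentType returns still depends on the file type detectors available on the platform (it may return null).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for the "remove the extension" workaround.
final class ExtensionlessProbe {

    // Returns the name up to (not including) its first dot, so
    // "upload.tar.gz" becomes "upload". A leading dot (hidden file) is kept.
    static String stripExtension(final String fileName) {
        final int dot = fileName.indexOf('.', 1);
        return dot < 0 ? fileName : fileName.substring(0, dot);
    }

    // Copies the file to a temporary, extension-less name and probes the copy,
    // so the detector cannot rely on the (possibly wrong) extension.
    static String probeWithoutExtension(final Path file) throws IOException {
        final Path tempDir = Files.createTempDirectory("probe");
        final Path copy = tempDir.resolve(stripExtension(file.getFileName().toString()));
        try {
            Files.copy(file, copy, StandardCopyOption.REPLACE_EXISTING);
            return Files.probeContentType(copy);
        } finally {
            // Clean up the copy first, then the now-empty directory.
            Files.deleteIfExists(copy);
            Files.deleteIfExists(tempDir);
        }
    }
}
```

Copying the whole archive just to probe it can be expensive for large files. If the downloaded file is not needed under its original name, renaming it in place before probing avoids the extra copy.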


Apache Tika (tika-eval 1.18)

There seem to be many flavors of this library (app, server, eval, etc.), but many people on the web complain that it is a bit "dependency-heavy".

Implementation

import java.io.File;
import org.apache.tika.Tika;

try {
    return new Tika().detect(new File(basePath + "/" + fileName));
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar-compressed
Rar with Zip extension is:       application/x-rar-compressed
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar-compressed
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

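Conclusion

All files were identified correctly, but it has both pros and cons.

Pros:

  • Maintained by Apache.
  • Not fooled by the extension.

Cons:

  • Really heavy, especially if one only wants to check the file type. The tika-eval JAR weighs over 40 MB.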

URLConnection

Implementation

import java.net.URL;
import java.net.URLConnection;

try {
    final URL url = new URL("file://" + basePath + "/" + fileName);
    final URLConnection urlConnection = url.openConnection();
    return urlConnection.getContentType();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       content/unknown
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         content/unknown
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: content/unknown
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        content/unknown
Zip without extension is:        content/unknown
7z without extension is:         content/unknown
Tar.xz without extension is:     content/unknown
Tar.gz without extension is:     content/unknown

Conclusion

It recognizes almost none of the compression formats, and it is guided by the extension rather than by the content.


SimpleMagic 1.14

This project seems to be updated at least once a year.

Implementation

import com.j256.simplemagic.ContentInfo;
import com.j256.simplemagic.ContentInfoUtil;

try {
    final ContentInfoUtil util = new ContentInfoUtil();
    final ContentInfo info = util.findMatch(basePath + "/" + fileName);

    return info.getMimeType();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: <EXCEPTION: null>
Tar.xz with Zip extension is:    <EXCEPTION: null>
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     <EXCEPTION: null>
Tar.gz without extension is:     application/x-gzip

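Conclusion

It works for almost all of our scenarios, but it seems unable to detect the more "obscure" compression formats, such as Tar.xz (and throws an exception in the process).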


MimeUtil 2.1.3

This project has not been modified since 2010, so do not expect support or updates. It is listed here only for completeness.

Implementation

import eu.medsea.mimeutil.MimeUtil2;

try {
    final MimeUtil2 mimeUtil = new MimeUtil2();
    mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");

    return MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(basePath + "/" + fileName)).toString();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/x-gzip

Conclusion

It recognizes some of the most popular file types, but fails on Tar.xz and 7z.


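file - Command line

Not the prettiest solution, but it had to be tried: the Ubuntu file command.

Implementation

import java.io.BufferedReader;
import java.io.InputStreamReader;

try {
    // Passing the arguments separately keeps paths that contain spaces intact.
    final Process process = new ProcessBuilder("file", "--mime-type", basePath + "/" + fileName).start();

    final StringBuilder text = new StringBuilder();
    // try-with-resources closes the reader once the output has been consumed.
    try (BufferedReader stdInput = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
        String s;
        while ((s = stdInput.readLine()) != null) {
            text.append(s);
        }
    }

    // The output looks like "<path>: <mime type>"; keep the part after the separator.
    return text.toString().split(": ")[1];
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}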

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

Conclusion

It works for all of our scenarios, but again, it relies on the file command being present on the system that runs the code.
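
One practical detail visible in the results above is that the tools report different MIME strings for the same format (for example application/x-rar, application/x-rar-compressed and application/vnd.rar). Before choosing a decompression routine, it may help to normalize them. The following is a minimal sketch; the class ArchiveFormats and its Format enum are hypothetical, and the mapping contains only the strings that appear in the results of this post.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps the MIME strings seen in the results to one format each.
final class ArchiveFormats {

    enum Format { ZIP, RAR, SEVEN_ZIP, GZIP, XZ, UNKNOWN }

    private static final Map<String, Format> BY_MIME = new HashMap<>();

    static {
        BY_MIME.put("application/zip", Format.ZIP);
        BY_MIME.put("application/x-rar", Format.RAR);
        BY_MIME.put("application/x-rar-compressed", Format.RAR);
        BY_MIME.put("application/vnd.rar", Format.RAR);
        BY_MIME.put("application/x-7z-compressed", Format.SEVEN_ZIP);
        BY_MIME.put("application/gzip", Format.GZIP);
        BY_MIME.put("application/x-gzip", Format.GZIP);
        BY_MIME.put("application/x-compressed-tar", Format.GZIP);
        BY_MIME.put("application/x-xz", Format.XZ);
        BY_MIME.put("application/x-xz-compressed-tar", Format.XZ);
    }

    // Returns UNKNOWN for null, unrecognized, or generic types such as
    // application/octet-stream, so the caller can reject the upload.
    static Format fromMimeType(final String mimeType) {
        if (mimeType == null) {
            return Format.UNKNOWN;
        }
        final String key = mimeType.trim().toLowerCase(Locale.ROOT);
        return BY_MIME.getOrDefault(key, Format.UNKNOWN);
    }

    public static void main(final String[] args) {
        System.out.println(fromMimeType("application/vnd.rar")); // prints RAR
    }
}
```

GZIP and XZ here describe only the outer compression layer; a Tar.gz or Tar.xz upload still needs the tar archive inside to be unpacked afterwards.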