File Format Detection in Azure Search

We have a large number of blobs in Azure that we want to add to an Azure Search index. The blobs are in a variety of formats (PDF, DOC, RTF, etc.), but none of them have file extensions.

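Because of this, Azure Search chokes during indexing, since it appears to rely only on the file extension to detect the file format. We get the following error, and because all of our files have these "invalid" extensions, it occurs no matter what threshold we set for indexing errors: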

Import configuration failed, error creating Indexer: "Error with data source: Document 'https://XXXXXXX.blob.core.windows.net/folder/filename.00001' has unsupported content type 'unsupported'. To index only the blob metadata and ignore its content, set the 'dataToExtract' indexer configuration property to 'storageMetadata'. See https://aka.ms/azsearchblobdatatoextract. To ignore this error and continue indexing blobs with unsupported content types, set the 'failOnUnsupportedContentType' switch in indexer configuration to false. For more information, see https://aka.ms/blob-indexer-parameters-for-extraction. Please adjust your data source definition in order to proceed."

Is there any way to make Azure Search detect the file format from the file content, or at least use the metadata on the blob?

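Azure Search already does content-based content type detection, but some blobs are problematic. Such blobs can be skipped while the indexer is running (with a warning, so you know what happened), but if one is encountered during indexer creation, creation fails with the error you are seeing.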
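The error message quoted in the question names two indexer configuration properties, `failOnUnsupportedContentType` and `dataToExtract`. As a rough sketch of where they go, the snippet below builds an indexer definition as a plain dictionary and prints it as JSON. The helper name `build_indexer_definition` and the indexer, data source, and index names are placeholders of mine, not anything from your setup, and the snippet only constructs the request body; it does not call the service.

```python
import json


def build_indexer_definition(name, data_source_name, target_index_name,
                             metadata_only=False):
    """Return an indexer definition dict (hypothetical helper, illustration only)."""
    configuration = {
        # Property named in the error message: continue indexing when a blob
        # has an unsupported content type instead of failing.
        "failOnUnsupportedContentType": False,
    }
    if metadata_only:
        # Also named in the error message: index only the blob metadata
        # and ignore the blob content.
        configuration["dataToExtract"] = "storageMetadata"
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {"configuration": configuration},
    }


if __name__ == "__main__":
    # Placeholder names; substitute your own indexer, data source, and index.
    definition = build_indexer_definition(
        "blob-indexer", "blob-datasource", "blob-index"
    )
    print(json.dumps(definition, indent=2))
```

Passing `metadata_only=True` adds the `dataToExtract` setting, which is the other option the error message offers if the blob content itself is not needed.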

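If you remove the problematic blob (or skip it using blob metadata), do most of your other blobs work as expected? I suspect the Azure Search team would be interested in looking at the problematic blob, if you are able to share it.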
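For skipping via blob metadata, my understanding is that the blob indexer ignores any blob whose metadata contains `AzureSearch_Skip` set to `"true"`. The sketch below only prepares the headers and URL for a Set Blob Metadata request against the Blob storage REST API; authentication is left out, the helper names are hypothetical, and the example URL is a placeholder. It assumes the blob URL has no query string yet. Since setting metadata replaces whatever metadata the blob already has, the existing pairs are passed back in alongside the new one.

```python
def skip_metadata_headers(existing_metadata=None):
    """Build x-ms-meta-* headers that mark a blob to be skipped by the indexer.

    Hypothetical helper: existing metadata is re-sent because setting blob
    metadata overwrites all metadata already stored on the blob.
    """
    headers = {}
    for key, value in (existing_metadata or {}).items():
        headers[f"x-ms-meta-{key}"] = value
    # Metadata property the blob indexer checks before processing a blob.
    headers["x-ms-meta-AzureSearch_Skip"] = "true"
    return headers


def set_metadata_url(blob_url):
    """Return the URL for a metadata update (assumes no existing query string)."""
    return blob_url + "?comp=metadata"


if __name__ == "__main__":
    # Placeholder URL; authentication headers or a SAS token are not shown.
    url = set_metadata_url("https://example.blob.core.windows.net/folder/filename.00001")
    print("PUT", url)
    for name, value in skip_metadata_headers({"source": "import"}).items():
        print(f"{name}: {value}")
```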

azure

azure-cognitive-search