I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?
I could split the file contents into separate search documents, but then I would have to identify this manually in the results and show the user only one result - otherwise it looks like 2 files match their search when in fact there is only one.
Also, the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the file up into separate documents. In response to my concerns regarding the relevancy scoring and multiple hits, they said the following:
You do raise two very valid concerns here for your more challenging use case. With regard to relevance, you already face a significant problem in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If your documents are much like reports or whitepapers, a potential workaround may be to index the first X characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate the document's subject matter without manual review.
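To make the chunking approach concrete, here is a minimal sketch of preparing an upload batch in CloudSearch's document batch (JSON "add" operation) format. The field names `parent_id`, `thesis`, and `content`, the chunk-size margin, and the 500-character thesis length are all assumptions for illustration, not part of the support answer:

```python
import json

MAX_DOC_BYTES = 1_000_000  # CloudSearch's 1 MB per-document limit
THESIS_CHARS = 500         # first X characters used as the "thesis" field (assumed value)

def chunk_text(text, max_bytes=MAX_DOC_BYTES // 2):
    """Split text into line-aligned chunks whose UTF-8 size stays well under the limit."""
    chunks, buf, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if size + line_size > max_bytes and buf:
            chunks.append("".join(buf))
            buf, size = [], 0
        buf.append(line)
        size += line_size
    if buf:
        chunks.append("".join(buf))
    return chunks

def build_batch(parent_id, text):
    """Build a CloudSearch 'add' batch: one document per chunk, all sharing
    the same parent_id and the same weighted 'thesis' excerpt."""
    thesis = text[:THESIS_CHARS]
    return [
        {
            "type": "add",
            "id": f"{parent_id}_{i}",
            "fields": {"parent_id": parent_id, "thesis": thesis, "content": chunk},
        }
        for i, chunk in enumerate(chunk_text(text))
    ]

batch_json = json.dumps(build_batch("report-42", "First paragraph acts as the thesis.\nBody text follows.\n"))
```

The resulting JSON can then be uploaded via the domain's document endpoint; keeping each chunk well under 1 MB leaves headroom for the repeated thesis field and JSON overhead.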
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field that holds a unique "Parent" id shared by every chunk of the whole document. The post-processing can check whether this "Parent" id has already been returned (the first result should be seen as the most relevant) and, if it has, filter out the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that filters on all matches within that particular Parent id.
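The de-duplication step they describe can be sketched as a small pass over the hit list. This assumes hits arrive in relevance order and that each hit exposes a `parent_id` field as indexed above; the `refine_url` query string is a hypothetical example of the refinement link they suggest:

```python
def dedupe_hits(hits):
    """Keep only the first (most relevant) hit per parent_id, attaching a
    refinement link that would filter on all chunks of that parent document."""
    seen = set()
    unique = []
    for hit in hits:
        pid = hit["fields"]["parent_id"]
        if pid in seen:
            continue  # a more relevant chunk of this document was already shown
        seen.add(pid)
        hit = dict(hit)  # avoid mutating the caller's result objects
        hit["refine_url"] = f"/search?fq=parent_id:'{pid}'"  # hypothetical link format
        unique.append(hit)
    return unique
```

The first chunk seen for each parent wins, which matches the support suggestion that the first result be treated as the most relevant one for that document.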