In Azure Cognitive Search, can I add multiple blobs into a collection of a single record in an index?
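I have a blob container in which each folder represents an item that I index in Azure Cognitive Search (ACS). The folder name is the key of the item in the ACS index. Imagine the following container structure: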
container {
item1 {
blob1,
blob2
},
item2 {
blob3
},
item3 {
blob4,
blob5,
blob6
}
}
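I would like to run an indexer against the container and use skills such as OcrSkill, KeyPhrases, and EntityRecognition to extract insights from the blobs.
I know I can use a ShaperSkill to shape the information from a single blob/document into the format I want. For example: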
List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
inputMappings.Add(new InputFieldMappingEntry(
name: "content",
source: "/document/content"));
inputMappings.Add(new InputFieldMappingEntry(
name: "languageCode",
source: "/document/languageCode"));
inputMappings.Add(new InputFieldMappingEntry(
name: "keyPhrases",
source: "/document/keyPhrases"));
inputMappings.Add(new InputFieldMappingEntry(
name: "organizations",
source: "/document/organizations"));
inputMappings.Add(new InputFieldMappingEntry(
name: "name",
source: "/document/name"));
List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
outputMappings.Add(new OutputFieldMappingEntry(
name: "output",
targetName: "myDoc"));
ShaperSkill shaperSkill = new ShaperSkill(
description: "Shape to myDoc",
context: "/document",
name: "Doc Shaper",
inputs: inputMappings,
outputs: outputMappings);
For the indexer itself, I can extract the folder name from metadata_storage_path like this:
List<FieldMapping> fieldMappings = new List<FieldMapping>();
fieldMappings.Add(new FieldMapping(
sourceFieldName: "metadata_storage_path",
targetFieldName: "key",
mappingFunction: FieldMappingFunction.ExtractTokenAtPosition("/", 4)));
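What I don't know how to do (or whether it is possible at all) is to reference the /document/myDoc output field multiple times and put multiple entries into a collection in my ACS index. The output I want looks like this (only the relevant fields are shown):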
{
"value": [
{
"key": "item1",
"myDocs": [
{
"name": "blob1",
"content": "<content from blob1>",
"languageCode": "<languageCode from blob1>",
"keyPhrases": "<keyPhrases from blob1>",
"organizations": "<organizations from blob1>"
},
{
"name": "blob2",
"content": "<content from blob2>",
"languageCode": "<languageCode from blob2>",
"keyPhrases": "<keyPhrases from blob2>",
"organizations": "<organizations from blob2>"
}
]
},
{
"key": "item2",
"myDocs": [
{
"name": "blob3",
"content": "<content from blob3>",
"languageCode": "<languageCode from blob3>",
"keyPhrases": "<keyPhrases from blob3>",
"organizations": "<organizations from blob3>"
}
]
},
{
"key": "item3",
"myDocs": [
{
"name": "blob4",
"content": "<content from blob4>",
"languageCode": "<languageCode from blob4>",
"keyPhrases": "<keyPhrases from blob4>",
"organizations": "<organizations from blob4>"
},
{
"name": "blob5",
"content": "<content from blob5>",
"languageCode": "<languageCode from blob5>",
"keyPhrases": "<keyPhrases from blob5>",
"organizations": "<organizations from blob5>"
},
{
"name": "blob6",
"content": "<content from blob6>",
"languageCode": "<languageCode from blob6>",
"keyPhrases": "<keyPhrases from blob6>",
"organizations": "<organizations from blob6>"
}
]
}
]
}
Does anyone know how I can do this?
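Indexers do not support aggregating multiple documents into a single index field, because the indexer's change tracking may process a blob more than once, which would produce non-deterministic results. The workaround is to create two indexes: one for the blobs and one for the parent records. Then either use an external process that reads the blob index and updates the parent index in batches (the aggregation logic should be simpler, but you have to manage an external trigger), or use a Custom Web API skill to update the parent index as blobs are processed. The aggregation logic in the custom skill may be more complex, because it must add a child blob to the parent record only if that blob is not already there. Check out the examples for setting up Azure Functions and connecting a skill to a function.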
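The aggregation step itself can be sketched independently of any Azure SDK call. The following is a minimal illustration only, assuming the per-blob index yields records that carry the folder key and the blob name; the types BlobDoc and ParentRecord and the class ParentAggregator are hypothetical names invented for this sketch, not part of any Azure library. Aggregate corresponds to the batch (external process) option, and Upsert to the selective add that a custom skill would need so that a blob processed twice does not appear twice in the parent record.

```csharp
using System.Collections.Generic;

// Hypothetical shape of one record read from the per-blob (child) index.
public record BlobDoc(string Key, string Name, string Content);

// Hypothetical shape of one record in the parent index.
public class ParentRecord
{
    public string Key { get; init; } = "";
    public List<BlobDoc> MyDocs { get; } = new();
}

public static class ParentAggregator
{
    // Batch option: fold every child record into its parent, grouped by key.
    public static Dictionary<string, ParentRecord> Aggregate(IEnumerable<BlobDoc> blobDocs)
    {
        var parents = new Dictionary<string, ParentRecord>();
        foreach (var doc in blobDocs)
        {
            Upsert(parents, doc);
        }
        return parents;
    }

    // Incremental option: add the child only if it is not already present.
    // A blob that is processed again replaces its earlier entry, so repeated
    // processing leaves exactly one entry per blob name.
    public static void Upsert(Dictionary<string, ParentRecord> parents, BlobDoc doc)
    {
        if (!parents.TryGetValue(doc.Key, out var parent))
        {
            parent = new ParentRecord { Key = doc.Key };
            parents[doc.Key] = parent;
        }

        int existing = parent.MyDocs.FindIndex(d => d.Name == doc.Name);
        if (existing >= 0)
        {
            parent.MyDocs[existing] = doc;   // reprocessed blob: overwrite in place
        }
        else
        {
            parent.MyDocs.Add(doc);          // first time this blob is seen
        }
    }
}
```

The resulting parent records would then be uploaded to the parent index with whatever upload call your SDK version provides. Matching children on the blob name is an assumption made here; any identifier that is stable across reprocessing would serve the same purpose.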