Parsing Google Custom Search API for Elasticsearch Documents

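After retrieving results from the Google Custom Search API and writing them to JSON, I want to parse that JSON into valid Elasticsearch documents. You can configure a parent-child relationship for nested results. However, this relationship does not seem to be inferred from the data structure itself. I tried loading the data automatically, but got no results.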

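The sample input below leaves out things like id or index, because I am trying to focus on creating the correct data structure. I have tried modifying graph algorithms such as depth-first search, but I keep running into problems with the different data structures (dicts, lists, and strings).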

Here is some sample input:

# mock data structure
google = {"content": "foo", 
          "results": {"result_one": {"persona": "phone",
                                     "personb":  "phone",
                                     "personc":  "phone"
                                    },
                      "result_two": ["thing1",
                                     "thing2",
                                     "thing3"
                                    ],
                      "result_three": "none"
                     },
          "query": ["Taylor Swift", "Bob Dole", "Rocketman"]
}

# correctly formatted documents for _source of elasticsearch entry
correct_documents = [
    {"content":"foo"},
    {"results": ["result_one", "result_two", "result_three"]},
    {"result_one": ["persona", "personb", "personc"]},
    {"persona": "phone"},
    {"personb": "phone"},
    {"personc": "phone"},
    {"result_two":["thing1","thing2","thing3"]},
    {"result_three": "none"},
    {"query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
]

Here is my current approach, which is still a work in progress:

def recursive_dfs(graph, start, path=None):
    '''recursive depth first search from start'''
    # Build a new list on every call instead of relying on a mutable default.
    path = (path or []) + [start]
    for node in graph[start]:
        if node not in path:
            path = recursive_dfs(graph, node, path)
    return path

def branching(google):
    """ Get branches as a starting point for dfs"""
    # Iterate over the items directly: dict.keys() cannot be indexed in Python 3.
    for branch, (key, value) in enumerate(google.items()):

        # isinstance() tests the type of the value; `value is dict` is always False.
        if isinstance(value, dict):

            #recursive_dfs(google, value)
            pass

        else:
            print("branch {}: result {}\n".format(branch, value))

branching(google)

As you can see, recursive_dfs() still needs to be modified to handle string and list data structures.
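One way that modification might look is sketched below. It is only an illustration, not the finished approach: the names `walk` and `sample` are hypothetical, and the sketch assumes the data contains nothing but dicts, lists, and scalar values. The key difference from recursive_dfs() is that strings are treated as leaves, so they are never iterated character by character.

```python
def walk(node, path=()):
    """Depth-first traversal of nested dicts and lists.

    Yields (path, value) pairs, where path is a tuple of dict keys and
    list indexes. Strings and other scalars are treated as leaves.
    """
    yield path, node
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))


# Hypothetical, shortened version of the mock data structure.
sample = {"content": "foo", "results": {"result_two": ["thing1", "thing2"]}}
for path, value in walk(sample):
    print(path, value)
```

Each yielded path records how a value was reached, which is the information a parent-child mapping would need.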

I will keep working on this, but any ideas, suggestions, or solutions would be much appreciated. Thank you for your time.

Here is a possible answer to your question.

def myfunk(inHole, outHole):
    """Flatten the nested dict inHole into single-key documents appended to outHole."""
    for key, value in inHole.items():
        if isinstance(value, dict):
            # A dict becomes a document listing its child keys, then we recurse into it.
            outHole.append({key: list(value.keys())})
            myfunk(value, outHole)
        else:
            # Lists and scalar values are stored unchanged.
            outHole.append({key: value})
    # list.sort() returns None, and dicts cannot be ordered in Python 3,
    # so return the list itself in traversal order.
    return outHole
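
The flattened documents still lack the id and index that the question set aside. As a rough sketch of that last step, each document can be wrapped in a dict with `_index`, `_id`, and `_source` keys, which is the action shape I believe the bulk helper of the official Python client accepts. The function name `to_actions`, the index name `google_results`, and the sequential ids are all assumptions made for illustration, and the sketch does not set up any parent-child mapping.

```python
def to_actions(documents, index_name="google_results"):
    """Wrap each flattened document in an action dict.

    index_name and the sequential ids are placeholders; use whatever
    index and id scheme your cluster actually needs.
    """
    for doc_id, document in enumerate(documents):
        yield {"_index": index_name, "_id": doc_id, "_source": document}


# Two documents taken from the expected output in the question.
documents = [{"content": "foo"}, {"result_three": "none"}]
actions = list(to_actions(documents))
print(actions[0])
```

Because to_actions() is a generator, it can be passed on to a bulk loader without building the full list in memory first.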