How to merge the AWS Comprehend batch_detect_key_phrases() ResultList and ErrorList

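I have a data frame containing tweets, one tweet per row. I can get the key phrases using AWS Comprehend batch_detect_key_phrases(). batch_detect_key_phrases() returns a ResultList and an ErrorList in the payload. To merge the key phrase results back into the data frame, they need to line up with the original tweets, so I need to keep the ResultList and ErrorList aligned.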

The code here, at line 267, processes the ErrorList and the ResultList separately.

According to the Python Boto docs, "ErrorList (list) -- A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list..."
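
To make the role of the Index field concrete, here is a minimal sketch. It makes no AWS call: `tweets` and `response` are hand-written, hypothetical stand-ins for the input batch and for a batch_detect_key_phrases() payload. It relies only on Index being the zero-based position of a document in the input list.

```python
# Hypothetical input batch; in practice this would be the tweet column of the data frame.
tweets = ["tweet 0", "tweet 1", "tweet 2"]

# Hand-written stand-in for a batch_detect_key_phrases() response (assumed shape, no AWS call).
response = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": []},
        {"Index": 2, "KeyPhrases": []},
    ],
    "ErrorList": [
        {"Index": 1, "ErrorCode": "123", "ErrorMessage": "example error"},
    ],
}

# Each Index is a position in the input list, so it maps straight back to a document.
succeeded = [tweets[entry["Index"]] for entry in response["ResultList"]]
failed = [tweets[entry["Index"]] for entry in response["ErrorList"]]

print("succeeded:", succeeded)
print("failed:", failed)
```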

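The code I wrote below uses the ResultList and ErrorList Index numbers to make sure they are merged correctly into the keyPhrases list, which is then merged back into the original data frame. Essentially, keyPhrases[0] holds the key phrases associated with row 0 of the data frame. If there was an error processing a tweet, a placeholder error message is added for that row of the data frame.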

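The only other way I can think of to keep the ResultList and ErrorList aligned is to merge the 2 lists into one larger list, sorted in ascending order by their respective Index values. I would then process that one larger list.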

Is there an easier way to process the ResultList and ErrorList so that they stay aligned?

keyphraseResults = {'ResultList': [
            {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
            {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
            {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
            {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
              {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

# Holds the ordered list of key phrases that correspond to the data frame. 
keyPhrases = []

# Default to an arbitrarily large number so that, if ErrorList below is empty,
# there is still a number to compare against.
errIndexlist = [9999]

# This will be inserted for the rows corresponding to the ErrorList. 
ErrorMessage = "* Error processing keyphrases"

# Since the rows of the response need to be kept in alignment with the rows of the dataframe,
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErroresults = keyphraseResults["ErrorList"]
    errIndexlist = []

    for entry in batchErroresults:
        errIndexlist.append(entry["Index"])
        print(entry)

# Sort the indices to ensure they are in ascending order, since that order is
# important for the logic below.
errIndexlist.sort(reverse = False)

if 'ResultList' in keyphraseResults:

    batchResults = keyphraseResults["ResultList"]

    for entry in batchResults:

        # Insert a placeholder for every error whose Index precedes this result.
        # A loop is needed so that consecutive errors are each given a row.
        while len(errIndexlist) > 0 and errIndexlist[0] < entry['Index']:
            keyPhrases.append(ErrorMessage)
            errIndexlist.remove(errIndexlist[0])

        # THEN add the key phrases for the current result as one
        # comma-separated string.
        results = ", ".join(textDict['Text'] for textDict in entry["KeyPhrases"])
        keyPhrases.append(results)

    # Note: errors whose Index follows the last result are not handled here.

print("\nFinal results are:")
for text in keyPhrases:
    print(text)

I came up with this based on this SO post.

In summary: merge the ResultList and ErrorList, sort the merged list on Index, then process the merged list sequentially.

from operator import itemgetter

keyphraseResults = {'ResultList': [
        {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
        {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
        {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
        {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
        'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
          {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
        'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b',   'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

keyPhrases = []

# Placeholder inserted for the rows that appear in ErrorList (use an empty string if preferred).
ErrorMessage = "* Error processing keyphrases"

# Concatenating the two lists builds a new list, so the response itself is not modified.
# An empty ResultList or ErrorList needs no special handling.
processResults = keyphraseResults["ResultList"] + keyphraseResults["ErrorList"]

processResults = sorted(processResults, key=itemgetter('Index'), reverse = False)

for entry in processResults:

    if 'ErrorCode' in entry:
        keyPhrases.append(ErrorMessage)

    elif 'KeyPhrases' in entry:
        # Join the phrase texts into a single comma-separated string.
        results = ", ".join(textDict['Text'] for textDict in entry["KeyPhrases"])

        keyPhrases.append(results)

print("\nFinal results are:")
for text in keyPhrases:
    print(text)
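
A possible alternative that avoids merging and sorting is sketched below. It assumes the number of documents sent in the batch is known, which it is, since it is the length of the input list. The idea is to pre-fill a list with the placeholder and overwrite only the positions that appear in ResultList. The helper name `align_key_phrases`, its parameters, and the `sample` payload are hypothetical, made up for illustration. They are not part of boto3.

```python
def align_key_phrases(response, batch_size, placeholder="* Error processing keyphrases"):
    """Return one string per input document, in input order.

    `response` is assumed to have the shape of a batch_detect_key_phrases()
    payload; `batch_size` is the number of documents that were sent.
    """
    # Start with the placeholder everywhere; documents that failed simply keep it.
    aligned = [placeholder] * batch_size

    # Index is the zero-based position of the document in the input list,
    # so each successful result can be written straight into its slot.
    for entry in response.get("ResultList", []):
        aligned[entry["Index"]] = ", ".join(p["Text"] for p in entry["KeyPhrases"])

    return aligned


# Hand-written sample payload (hypothetical data, no AWS call is made).
sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "financial status"}, {"Text": "my job"}]},
        {"Index": 2, "KeyPhrases": [{"Text": "USA"}]},
    ],
    "ErrorList": [
        {"Index": 1, "ErrorCode": "123", "ErrorMessage": "First error goes here"},
        {"Index": 3, "ErrorCode": "456", "ErrorMessage": "Second error goes here"},
    ],
}

aligned = align_key_phrases(sample, batch_size=4)
for text in aligned:
    print(text)
```

Because every row starts out as the placeholder, ErrorList never needs to be read for alignment. It is only needed if the actual error codes or messages should be reported.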