Python-Textract-Boto3 - Trying to pass result of a method call as an argument to the same method, and loop

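I have a multi-page PDF on AWS S3 and I am using Textract to extract all of its text. I get the response in batches: the first response gives me a 'NextToken', which I need to pass as an argument to the get_document_analysis method.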

How can I avoid running get_document_analysis by hand each time, manually pasting in the NextToken value received from the previous run?

This is what I tried:

import boto3

client = boto3.client('textract')

# Get my JobId
my_job_id_ref = client.start_document_text_detection(DocumentLocation={'S3Object': {'Bucket': 'myawsbucket', 'Name': 'mymuli-page-pdf-file.pdf'}})['JobId']

def my_output():
    my_ls = []
    
    # I need to repeat the following call until the break condition further below
    while True:

        # This returns a dictionary with a key named NextToken, whose value needs to be passed as an argument on the next iteration
        x = client.get_document_analysis(JobId=my_job_id_ref)

        # Assigning the value of NextToken to a variable
        next_token = x['NextToken']

        # Calling the method again, this time with next_token passed as an argument
        x = client.get_document_analysis(JobId=my_job_id_ref, NextToken=next_token)

        # Need to keep calling the method until there is no token. The token is normally a string, hence the length check
        if len(next_token) <1:
            break
        
        my_ls.append(x)
        
    return my_ls

The trick is to use a while condition that checks whether NextToken is empty.

my_ls = []

# Get the analysis once to see if there is a need to loop in the first place
x = client.get_document_analysis(JobId=my_job_id_ref)
next_token = x.get('NextToken')
my_ls.append(x)

# Now repeat until we have the last page, passing the latest token each time
while next_token is not None:
    x = client.get_document_analysis(JobId=my_job_id_ref, NextToken=next_token)
    next_token = x.get('NextToken')
    my_ls.append(x)

The value of next_token keeps being overwritten until it is None, at which point we break out of the loop.

Note that I am using x.get(..) to check whether the response dictionary contains NextToken. It may not be set in the first place, in which case .get(..) simply returns None. (If NextToken is not set, x["NextToken"] would raise a KeyError.)
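If you want this as a reusable function, here is a minimal sketch of the same idea. The names collect_all_pages and FakeTextractClient are my own, not part of boto3; the fake client only exists so the loop can be tried without an AWS account. The sketch assumes the job has already finished and that it was started with start_document_analysis, which is the call that pairs with get_document_analysis (a job started with start_document_text_detection is read back with get_document_text_detection instead).

```python
def collect_all_pages(client, job_id):
    """Collect every response page of a Textract analysis job.

    `client` is anything exposing get_document_analysis(JobId=..., NextToken=...),
    for example boto3.client('textract').
    """
    pages = []
    kwargs = {"JobId": job_id}
    while True:
        response = client.get_document_analysis(**kwargs)
        pages.append(response)
        # .get() returns None when the key is absent, i.e. on the last page
        next_token = response.get("NextToken")
        if next_token is None:
            break
        # Ask for the following page on the next iteration
        kwargs["NextToken"] = next_token
    return pages


class FakeTextractClient:
    """Hypothetical stand-in for the real client, for offline testing only."""

    def __init__(self):
        # Three canned pages; only the last one has no NextToken
        self._pages = {
            None: {"Blocks": ["page-1"], "NextToken": "t1"},
            "t1": {"Blocks": ["page-2"], "NextToken": "t2"},
            "t2": {"Blocks": ["page-3"]},
        }

    def get_document_analysis(self, JobId, NextToken=None):
        return self._pages[NextToken]


pages = collect_all_pages(FakeTextractClient(), "hypothetical-job-id")
print(len(pages))  # prints 3
```

With the real service you would call collect_all_pages(boto3.client('textract'), my_job_id_ref) instead. NextToken is only added to the keyword arguments once a token has been received, so the first request goes out without it.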