Python-Textract-Boto3 - Trying to pass the result of a method call as an argument to the same method, in a loop
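I have a multi-page PDF on AWS S3 and I am using Textract to extract all of its text. The response comes back in batches: the first response gives me a 'NextToken', which I need to pass as an argument to the get_document_analysis method.
How can I avoid running get_document_analysis by hand each time and pasting in the NextToken value received from the previous run?
My attempt: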
import boto3
client = boto3.client('textract')
# Get my JobId
my_job_id_ref = client.start_document_text_detection(DocumentLocation = {'S3Object': {'Bucket':'myawsbucket', 'Name':'mymuli-page-pdf-file.pdf'}})['JobId']
def my_output():
    my_ls = []
    # I need to repeat the following call until the break condition further below
    while True:
        # This returns a dictionary with a key named NextToken, whose value needs to be passed as an arg to the next call
        x = client.get_document_analysis(JobId = my_job_id_ref)
        # Assigning the value of NextToken to a variable
        next_token = x['NextToken']
        # Running the call again, this time with next_token passed as an argument
        x = client.get_document_analysis(JobId = my_job_id_ref, NextToken = next_token)
        # Need to keep repeating the call until there is no token. The token is normally a string, hence
        if len(next_token) < 1:
            break
        my_ls.append(x)
    return my_ls
The trick is to use the while condition to check whether NextToken is empty.
my_ls = []
# Get the analysis once to see if there is a need to loop in the first place
x = client.get_document_analysis(JobId = my_job_id_ref)
next_token = x.get('NextToken')
my_ls.append(x)
# Now repeat until we have the last page, passing the token from the previous response
while next_token is not None:
    x = client.get_document_analysis(JobId = my_job_id_ref, NextToken = next_token)
    next_token = x.get('NextToken')
    my_ls.append(x)
The value of next_token keeps being overwritten until it is None, at which point we break out of the loop.
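To make the token hand-off concrete, here is a minimal sketch of the same loop written as a generator. `iter_pages` and `FakeTextractClient` are hypothetical names introduced only for this illustration; the fake client stands in for `boto3.client('textract')` so the example runs without AWS credentials, and its page contents are made up.

```python
from typing import Any, Dict, Iterator, Optional


def iter_pages(client: Any, job_id: str) -> Iterator[Dict[str, Any]]:
    """Yield every response page for job_id, following NextToken."""
    next_token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        # Only send NextToken once a previous response has supplied one.
        if next_token is not None:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        yield page
        # .get returns None on the last page, which ends the loop.
        next_token = page.get("NextToken")
        if next_token is None:
            break


class FakeTextractClient:
    """Hypothetical stand-in for boto3.client('textract'); illustration only."""

    def __init__(self) -> None:
        # Maps the token sent by the caller to the page returned for it.
        self._pages: Dict[Optional[str], Dict[str, Any]] = {
            None: {"Blocks": [{"Text": "page 1"}], "NextToken": "t1"},
            "t1": {"Blocks": [{"Text": "page 2"}], "NextToken": "t2"},
            "t2": {"Blocks": [{"Text": "page 3"}]},  # last page: no NextToken
        }

    def get_document_analysis(
        self, JobId: str, NextToken: Optional[str] = None
    ) -> Dict[str, Any]:
        return self._pages[NextToken]


pages = list(iter_pages(FakeTextractClient(), "hypothetical-job-id"))
texts = [block["Text"] for page in pages for block in page["Blocks"]]
print(texts)
```

With a real client you would pass `boto3.client('textract')` and the JobId returned when the job was started. Note that the Textract API pairs `start_document_text_detection` with `get_document_text_detection`, so the getter may need to be swapped to match however the job was started.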
Note that I am using x.get(..) to check whether the response dictionary contains NextToken. It might not be set in the first place, in which case .get(..) returns None. (If NextToken is not set, x["NextToken"] raises a KeyError.)
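The difference between the two lookups can be seen on plain dictionaries. This is a small sketch; `last_page` and `middle_page` are made-up responses, not real Textract output.

```python
# Hypothetical responses: the final page carries no NextToken key.
middle_page = {"Blocks": [], "NextToken": "abc"}
last_page = {"Blocks": []}

# dict.get returns the value if the key exists, otherwise None.
token_middle = middle_page.get("NextToken")
token_last = last_page.get("NextToken")

# Indexing with [] raises KeyError when the key is missing.
try:
    last_page["NextToken"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(token_middle, token_last, raised_key_error)
```

This is why the loop condition compares against None instead of checking the length of the token string.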