为 boto3 MTurk 构建 HTMLQuestion XML

Building HTMLQuestion XML for boto3 MTurk

我正在尝试构建 XML 以使用 HTMLQuestion 数据结构和 boto3 的 create_hit function. According to the documentation, the XML should be formatted like this.

提交到亚马逊的 Mechanical Turks 服务

我创建了一个 class TurkTaskAssembler,它具有生成 xml 并通过 API 将此 XML 传递到 Mechanical Turks 平台的方法.我使用 boto3 库来处理与 Amazon 的通信。

我生成的 XML 似乎格式不正确,因为当我尝试通过 API 传递此 XML 时,我收到验证错误,例如所以:

>>> tta = TurkTaskAssembler("What color is the sky?")
>>> response = tta.create_hit_task()
>>> ParamValidationError: Parameter validation failed: Invalid type for parameter Question, value: <Element HTMLQuestion at 0x1135f68c0>, type: <type 'lxml.etree._Element'>, valid types: <type 'basestring'>

然后我修改了 create_question_xml 方法,使用 tostring 方法将 XML 信封转换为字符串,但这会产生不同的错误:

>>> tta = TurkTaskAssembler("What color is the sky?")
>>> tta.create_hit_task()
>>> ClientError: An error occurred (ParameterValidationError) when calling the CreateHIT operation: There was an error parsing the XML question or answer data in your request.  Please make sure the data is well-formed and validates against the appropriate schema. Details: cvc-elt.1.a: Cannot find the declaration of element 'HTMLQuestion'. (1508611228659 s)

我真的不确定自己做错了什么,而且 XML 经验很少。

以下是所有相关代码:

import os
import boto3
from lxml.etree import Element, SubElement, CDATA, tostring
from .settings import mturk_access_key_id, mturk_access_secret_key

xml_schema_url = 'http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd'


class TurkTaskAssembler(object):

    def __init__(self, question):
        self.client = boto3.client(
            service_name='mturk',
            region_name='us-east-1',
            endpoint_url='https://mturk-requester-sandbox.us-east-1.amazonaws.com',
            aws_access_key_id=mturk_access_key_id,
            aws_secret_access_key=mturk_access_secret_key
        )
        self.question = question

    def create_question_xml(self):
        # questionFile = open(os.path.join(__location__, "question.xml"), "r")
        # question = questionFile.read()
        # return question
        XHTML_NAMESPACE = xml_schema_url
        XHTML = "{%s}" % XHTML_NAMESPACE
        NSMAP = {
            None : XHTML_NAMESPACE,
            'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
            ''
            }
        envelope = Element("HTMLQuestion", nsmap=NSMAP)

        html =  """
            <!DOCTYPE html>
            <html>
             <head>
              <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
              <script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
             </head>
             <body>
              <form name='mturk_form' method='post' id='mturk_form' action='https://www.mturk.com/mturk/externalSubmit'>
              <input type='hidden' value='' name='assignmentId' id='assignmentId'/>
              <h1>Answer this question</h1>
              <p>{question}</p>
              <p><textarea name='comment' cols='80' rows='3'></textarea></p>
              <p><input type='submit' id='submitButton' value='Submit' /></p></form>
              <script language='Javascript'>turkSetAssignmentID();</script>
             </body>
            </html>
            """.format(question=self.question)

        html_content = SubElement(envelope, 'HTMLContent')
        html_content.text = CDATA(html)
        xml_meta = """<?xml version="1.1" encoding="utf-8"?>"""
        return xml_meta + tostring(envelope, encoding='utf-8')

    def create_hit_task(self):
        response = self.client.create_hit(
            MaxAssignments=1,
            AutoApprovalDelayInSeconds=10800,
            LifetimeInSeconds=10800,
            AssignmentDurationInSeconds=300,
            Reward='0.05',
            Title='a title',
            Keywords='some keywords',
            Description='a description',
            Question=self.create_question_xml(),
        )
        return response

为什么不简单地将 XML 数据放入一个单独的 XML 文件中(就像您所做的那样,但注释掉了)?这将使您不必合并多个模块和大量代码。

使用您描述的模板 here,创建 question.xml:

<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
<!DOCTYPE html>
<html>
 <head>
  <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
  <script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
 </head>
 <body>
  <form name='mturk_form' method='post' id='mturk_form' action='https://www.mturk.com/mturk/externalSubmit'>
  <input type='hidden' value='' name='assignmentId' id='assignmentId'/>
  <h1>Answer this question</h1>
  <p>{question}</p>
  <p><textarea name='comment' cols='80' rows='3'></textarea></p>
  <p><input type='submit' id='submitButton' value='Submit' /></p></form>
  <script language='Javascript'>turkSetAssignmentID();</script>
 </body>
</html>
]]>
  </HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>

然后在您的 create_question_xml() 函数中:

def create_question_xml(self):
    question_file = open("question.xml", "r").read()
    xml = question_file.format(question=self.question)
    return xml

这应该就是您所需要的。

我认为您对亚马逊建议您使用的 3 种格式有点困惑。 据我所知,您选择了 HTMLQuestion. (Other two are: ExternalQuestion and QuestionFormData)。

要以 HTMLQuestion 格式保留问题,只需使用文档中提供的简单示例,无需将其包装在 XML 中。这是固定函数:

def create_question_html(self):
    # you can extract template into a file, 
    # as @Mangohero1 suggested which would simplify code a bit.

    return """
    <HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
      <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html>
     <head>
      <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
      <script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
     </head>
     <body>
      <form name='mturk_form' method='post' id='mturk_form' action='https://www.mturk.com/mturk/externalSubmit'>
      <input type='hidden' value='' name='assignmentId' id='assignmentId'/>
      <h1>{question}</h1>
      <p><textarea name='comment' cols='80' rows='3'></textarea></p>
      <p><input type='submit' id='submitButton' value='Submit' /></p></form>
      <script language='Javascript'>turkSetAssignmentID();</script>
     </body>
    </html>
    ]]>
      </HTMLContent>
      <FrameHeight>450</FrameHeight>
    </HTMLQuestion>
    """.format(question=self.question)