Get result from Amazon Transcribe directly (serverless)

I use a serverless Lambda function to transcribe speech to text with Amazon Transcribe. My current script can transcribe a file from S3 and store the result as a JSON file, also in S3.

Is it possible to get the result directly? I want to store it in a database (PostgreSQL on AWS RDS).

Thanks for any hints.

serverless.yml

...
provider:
  name: aws
  runtime: nodejs10.x
  region: eu-central-1
  memorySize: 128
  timeout: 30
  environment:
    S3_AUDIO_BUCKET: ${self:service}-${opt:stage, self:provider.stage}-records
    S3_TRANSCRIPTION_BUCKET: ${self:service}-${opt:stage, self:provider.stage}-transcriptions
    LANGUAGE_CODE: de-DE
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
        - s3:GetObject
      Resource:
        - 'arn:aws:s3:::${self:provider.environment.S3_AUDIO_BUCKET}/*'
        - 'arn:aws:s3:::${self:provider.environment.S3_TRANSCRIPTION_BUCKET}/*'
    - Effect: Allow
      Action:
        - transcribe:StartTranscriptionJob
      Resource: '*'

functions:

  transcribe:
    handler: handler.transcribe
    events:
      - s3:
          bucket: ${self:provider.environment.S3_AUDIO_BUCKET}
          event: s3:ObjectCreated:*

  createTextinput:
    handler: handler.createTextinput
    events:
      - http:
          path: textinputs
          method: post
          cors: true
...

resources:
  Resources:
    S3TranscriptionBucket:
      Type: 'AWS::S3::Bucket'
      Properties:
        BucketName: ${self:provider.environment.S3_TRANSCRIPTION_BUCKET}  
...

handler.js

const db = require('./db_connect');

const awsSdk = require('aws-sdk');

const transcribeService = new awsSdk.TranscribeService();

module.exports.transcribe = (event, context, callback) => {
  const records = event.Records;

  const transcribingPromises = records.map((record) => {
    const recordUrl = [
      'https://s3.amazonaws.com',
      process.env.S3_AUDIO_BUCKET,
      record.s3.object.key,
    ].join('/');

    // create random filename to avoid conflicts in amazon transcribe jobs

    function makeid(length) {
       var result           = '';
       var characters       = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
       var charactersLength = characters.length;
       for ( var i = 0; i < length; i++ ) {
          result += characters.charAt(Math.floor(Math.random() * charactersLength));
       }
       return result;
    }

    const TranscriptionJobName = makeid(7);

    return transcribeService.startTranscriptionJob({
      LanguageCode: process.env.LANGUAGE_CODE,
      Media: { MediaFileUri: recordUrl },
      MediaFormat: 'wav',
      TranscriptionJobName,
      //MediaSampleRateHertz: 8000, // normally 8000 if you are using wav file
      OutputBucketName: process.env.S3_TRANSCRIPTION_BUCKET,
    }).promise();
  });

  Promise.all(transcribingPromises)
    .then(() => {
      callback(null, { message: 'Start transcription job successfully' });
    })
    .catch(err => callback(err, { message: 'Error start transcription job' }));
};

module.exports.createTextinput = (event, context, callback) => {
  context.callbackWaitsForEmptyEventLoop = false;
  const data = JSON.parse(event.body);
  db.insert('textinputs', data)
    .then(res => {
      callback(null,{
        statusCode: 200,
        body: "Textinput Created! id: " + res
      })
    })
    .catch(e => {
      callback(null,{
        statusCode: e.statusCode || 500,
        body: "Could not create a Textinput " + e
      })
    }) 
};

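Amazon Transcribe currently only supports storing transcriptions in S3, as described in the API definition for StartTranscriptionJob. There is one special case, though: if you don't want to manage your own S3 bucket for transcriptions, you can omit OutputBucketName, and the transcription will be stored in an AWS-managed S3 bucket. In that case you get a pre-signed URL that lets you download the transcription.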
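To make the special case concrete, here is a minimal sketch of how the job parameters differ with and without your own bucket, and where the pre-signed URL appears once the job is done. The helper names `buildJobParams` and `transcriptUriFrom` are made up for illustration; the field names follow the StartTranscriptionJob and GetTranscriptionJob request and response shapes.

```javascript
// Build the parameters for TranscribeService.startTranscriptionJob.
// outputBucket is optional: leaving it out makes Transcribe keep the result
// in a service-managed bucket instead of one of your own.
function buildJobParams({ jobName, mediaUri, languageCode, outputBucket }) {
  const params = {
    TranscriptionJobName: jobName,
    LanguageCode: languageCode,
    MediaFormat: 'wav',
    Media: { MediaFileUri: mediaUri },
  };
  if (outputBucket) {
    params.OutputBucketName = outputBucket;
  }
  return params;
}

// Pull the download URL out of a getTranscriptionJob response.
// Returns null while the job is still running or if it has failed.
function transcriptUriFrom(response) {
  const job = response.TranscriptionJob;
  if (!job || job.TranscriptionJobStatus !== 'COMPLETED') {
    return null;
  }
  return job.Transcript.TranscriptFileUri;
}
```

With the aws-sdk client from the question, the response would come from something like `transcribeService.getTranscriptionJob({ TranscriptionJobName }).promise()`. Note that the Lambda role would then also need the `transcribe:GetTranscriptionJob` permission, which the serverless.yml above does not grant.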

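Since transcription happens asynchronously, I suggest you create a second AWS Lambda function, triggered either by the CloudWatch Event that is emitted when the status of your transcription changes (as described in Using Amazon CloudWatch Events with Amazon Transcribe) or by an S3 notification (as described in Using AWS Lambda with Amazon S3). This AWS Lambda function can then fetch the finished transcription from S3 and store its contents in PostgreSQL.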
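For the CloudWatch Events route, the second function receives an event whose `source` is `aws.transcribe` and whose `detail` carries `TranscriptionJobName` and `TranscriptionJobStatus`. A rough sketch of the filtering step follows; `parseJobStateChange`, `makeHandler` and the `onCompleted` callback are hypothetical names, and the callback stands in for whatever fetch-and-store logic you write.

```javascript
// Reduce a "Transcribe Job State Change" event to the two fields we need.
// Returns null for events that did not come from Amazon Transcribe.
function parseJobStateChange(event) {
  if (!event || event.source !== 'aws.transcribe' || !event.detail) {
    return null;
  }
  return {
    jobName: event.detail.TranscriptionJobName,
    status: event.detail.TranscriptionJobStatus,
  };
}

// Hypothetical handler factory: onCompleted(jobName) is assumed to download
// the transcript and write it to PostgreSQL.
function makeHandler(onCompleted) {
  return async (event) => {
    const change = parseJobStateChange(event);
    if (!change || change.status !== 'COMPLETED') {
      // FAILED jobs and unrelated events are ignored in this sketch
      return { handled: false };
    }
    await onCompleted(change.jobName);
    return { handled: true, jobName: change.jobName };
  };
}
```

Passing the storage step in as a callback keeps the event handling testable without AWS credentials; in a real handler you would probably also log or alert on `FAILED` jobs.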

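I think your best option is to trigger a Lambda from the S3 event that fires when the transcription is stored, and then post the data to your database. As Dunedan mentioned, you can't go directly from Transcribe to a database.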

You can add the event to a Lambda with Serverless like this:

storeTranscriptonInDB:
  handler: index.storeTransciptInDB
  events:
    - s3:
        bucket: ${self:provider.environment.S3_TRANSCRIPTION_BUCKET}
        rules:
          - suffix: .json

The S3 key of the transcript file will be event.Records[#].s3.object.key. I would loop through the records to be thorough, and do something like this for each one:

const awsSdk = require('aws-sdk');
const s3 = new awsSdk.S3();

module.exports.storeTransciptInDB = async (event) => {
  for (const record of event.Records) {
    // object keys in S3 event notifications are URL-encoded (spaces arrive as '+')
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const params = {
      Bucket: record.s3.bucket.name,
      Key: key,
    };
    // download the transcript JSON that Amazon Transcribe wrote to the bucket
    const transcriptFile = await s3.getObject(params).promise();
    const transcriptObject = JSON.parse(transcriptFile.Body.toString('utf-8'));
    const transcriptResults = transcriptObject.results.transcripts;
    let transcript = '';
    transcriptResults.forEach(result => (transcript += result.transcript + ' '));
    // at this point you can post the transcript variable to your database
  }
};
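The final comment in the loop can be filled in roughly as follows. This is only a sketch: `extractTranscript` and `toTextinputRow` are illustrative helpers, the column names `job_name` and `text` are assumptions about your table, and the `db.insert` call is assumed to behave like the one in the question's `createTextinput` handler.

```javascript
// Join all transcript segments of a Transcribe result file into one string.
function extractTranscript(transcriptObject) {
  const results = transcriptObject.results || {};
  const segments = results.transcripts || [];
  return segments.map(segment => segment.transcript).join(' ').trim();
}

// Shape a row for the database; the column names are hypothetical.
function toTextinputRow(transcriptObject) {
  return {
    job_name: transcriptObject.jobName,
    text: extractTranscript(transcriptObject),
  };
}
```

Inside the loop you could then write `await db.insert('textinputs', toTextinputRow(transcriptObject));`, assuming `db` is the `./db_connect` module from the question and that `insert` returns a promise as it does there.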