How to concatenate/join audio buffer arrays (text-to-speech results) into one on nodejs?
I want to convert many pieces of text into a single audio file, but I'm not sure how to concatenate multiple audio clips into one file (because of the 5K-characters-per-request limit, a long text cannot be converted in a single request).
My current code is below. It produces multiple audio byte arrays, but merging the MP3 audio fails because it ignores the header/metadata information. Is LINEAR16 recommended in the TTS field? I'd be glad to hear any suggestions. Thanks.
const client = new textToSpeech.TextToSpeechClient();
const promises = ['hi', 'world'].map(text => {
  const requestBody = {
    audioConfig: {
      audioEncoding: 'MP3'
    },
    input: {
      text: text,
    },
    voice: {
      languageCode: 'en-US',
      ssmlGender: 'NEUTRAL'
    },
  };
  return client.synthesizeSpeech(requestBody);
});
const responses = await Promise.all(promises);
console.log(responses);
const audioContents = responses.map(res => res[0].audioContent);
const audioContent = audioContents.join(); // this line has a problem
Standard output:
[
[
{
audioContent: <Buffer ff f3 44 c4 00 12 a0 01 24 01 40 00 01 7c 06 43 fa 7f 80 38 46 63 fe 1f 00 33 3f c7 f0 03 03 33 1f c1 f0 0c eb fa 3f 03 20 7e 63 f3 78 03 ba 64 73 e0 ... 2638 more bytes>
},
null,
null
],
[
{
audioContent: <Buffer ff f3 44 c4 00 12 58 05 24 01 41 00 01 1e 02 23 9e 1f e0 1f 83 83 df ef 80 e8 ff 99 f0 0c 00 e8 7f c3 68 03 cf fd f8 8f ff 0f 3c 7f 88 f8 8c 87 e0 23 ... 2926 more bytes>
},
null,
null
]
]
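For reference, the marked line's failure can be reproduced in isolation: `Array.prototype.join()` coerces each Buffer to a UTF-8 string, destroying the binary data, whereas `Buffer.concat` preserves the raw bytes. A minimal sketch:

```javascript
// join() stringifies each Buffer (UTF-8 decode + comma separators),
// so the result is no longer valid audio bytes.
const a = Buffer.from([0xff, 0xf3, 0x44]);
const b = Buffer.from([0xff, 0xf3, 0x45]);

const joined = [a, b].join();         // a corrupted string, not audio
const merged = Buffer.concat([a, b]); // 6 raw bytes, in order

console.log(typeof joined, Buffer.isBuffer(merged), merged.length); // string true 6
```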
Solution 1
As I mentioned in the comments, there is a google-tts-concat-ssml package for Node that meets your requirement (it is not an official Google package). It automatically issues multiple requests around the API's 5K-character limit and concatenates the resulting audio into a single audio file. Before running the code, install the following client libraries:
npm install @google-cloud/text-to-speech
npm install google-text-to-speech-concat --save
Try the code below, adding fewer than 5K characters between each pair of <p></p> tags per request. For example, if you have 9K characters, you need to split them across two or more requests: put the first 5K characters between one pair of <p></p> tags, then put the remaining 4K characters between a new pair of <p></p> tags. The google-text-to-speech-concat package then concatenates the audio files returned by the API into a single audio file.
const textToSpeech = require('@google-cloud/text-to-speech');
const testSynthesize = require('google-text-to-speech-concat');
const fs = require('fs');
const path = require('path');

(async () => {
  const request = {
    voice: {
      languageCode: 'en-US',
      ssmlGender: 'FEMALE'
    },
    input: {
      ssml: `
        <speak>
          <p>add less than 5k chars between paragraph tags</p>
          <p>add less than 5k chars between paragraph tags</p>
        </speak>`
    },
    audioConfig: {
      audioEncoding: 'MP3'
    }
  };
  try {
    // Create your Text To Speech client
    // More on that here: https://cloud.google.com/docs/authentication/production#providing_credentials_to_your_application
    const textToSpeechClient = new textToSpeech.TextToSpeechClient({
      keyFilename: path.join(__dirname, 'google-cloud-credentials.json')
    });
    // Synthesize the text, resulting in an audio buffer
    const buffer = await testSynthesize.synthesize(textToSpeechClient, request);
    // Handle the buffer
    // For example write it to a file or directly upload it to storage, like S3 or Google Cloud Storage
    const outputFile = path.join(__dirname, 'Output.mp3');
    // Write the file
    fs.writeFile(outputFile, buffer, 'binary', (err) => {
      if (err) throw err;
      console.log('Got audio!', outputFile);
    });
  } catch (err) {
    console.log(err);
  }
})();
Solution 2
Try the code below: it splits the whole text into chunks of roughly 5K characters and sends each chunk to the API for conversion. As you'd expect, this creates multiple audio files. Before running the code, create a folder named result in your current working directory to store the output audio files.
const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');

// Creates a client
const client = new textToSpeech.TextToSpeechClient();

(async function () {
  // The text to synthesize
  var text = fs.readFileSync('./text.txt', 'utf8');
  // Split into sentences (each match ends at a period)
  var newArr = text.match(/[^\.]+\./g);
  var charCount = 0;
  var textChunk = "";
  var index = 0;
  for (var n = 0; n < newArr.length; n++) {
    charCount += newArr[n].length;
    textChunk = textChunk + newArr[n];
    console.log(charCount);
    if (charCount > 4600 || n == newArr.length - 1) {
      console.log(textChunk);
      // Construct the request
      const request = {
        input: {
          text: textChunk
        },
        // Select the language and SSML voice gender (optional)
        voice: {
          languageCode: 'en-US',
          ssmlGender: 'MALE',
          name: "en-US-Wavenet-B"
        },
        // Select the type of audio encoding
        audioConfig: {
          effectsProfileId: [
            "headphone-class-device"
          ],
          pitch: -2,
          speakingRate: 1.1,
          audioEncoding: "MP3"
        },
      };
      // Performs the text-to-speech request
      const [response] = await client.synthesizeSpeech(request);
      console.log(response);
      // Write the binary audio content to a local file
      const writeFile = util.promisify(fs.writeFile);
      await writeFile('result/Output' + index + '.mp3', response.audioContent, 'binary');
      console.log('Audio content written to file: Output' + index + '.mp3');
      index++;
      charCount = 0;
      textChunk = "";
    }
  }
}());
To merge the output audio files into one, you can use the audioconcat package, which is not an official Google package; other similar packages can concatenate audio files as well.
The audioconcat library requires the ffmpeg application to already be installed (the system tool, not the ffmpeg npm package), built with libmp3lame support for MP3 output. So, before running the concatenation code, install the ffmpeg tool for your OS (note that --enable-libmp3lame is an ffmpeg build-time flag, not an npm option) and install the following client library:
npm install audioconcat
Try the code below: it concatenates all the audio files in the output directory and stores a single merged output.mp3 audio file in the current working directory.
const audioconcat = require('audioconcat');
const fs = require('fs');

const testFolder = 'result/';
var array = [];
fs.readdirSync(testFolder).forEach(songs => {
  array.push(testFolder + songs);
  console.log(songs);
});

audioconcat(array)
  .concat('output.mp3')
  .on('start', function (command) {
    console.log('ffmpeg process started:', command);
  })
  .on('error', function (err, stdout, stderr) {
    console.error('Error:', err);
    console.error('ffmpeg stderr:', stderr);
  })
  .on('end', function (output) {
    console.log('Audio successfully created', output);
  });
For both solutions I tested code from various GitHub links and adapted it to your requirements. The links are below for your reference.
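On the LINEAR16 question from the original post: raw PCM in a WAV container is easier to merge in pure Node than MP3, because only the 44-byte RIFF header needs handling. Below is a hedged sketch, assuming every clip was synthesized with the same audioConfig so the canonical headers match; makeWavHeader is only a stand-in for real API responses, not part of any library:

```javascript
// Build a canonical 44-byte RIFF/WAVE header for 16-bit mono PCM.
// (Stand-in for what the TTS API prepends to LINEAR16 audioContent.)
function makeWavHeader(dataLen, sampleRate = 24000) {
  const h = Buffer.alloc(44);
  h.write('RIFF', 0);
  h.writeUInt32LE(36 + dataLen, 4);    // ChunkSize
  h.write('WAVE', 8);
  h.write('fmt ', 12);
  h.writeUInt32LE(16, 16);             // fmt chunk size
  h.writeUInt16LE(1, 20);              // PCM format
  h.writeUInt16LE(1, 22);              // mono
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  h.writeUInt16LE(2, 32);              // block align
  h.writeUInt16LE(16, 34);             // bits per sample
  h.write('data', 36);
  h.writeUInt32LE(dataLen, 40);        // Subchunk2Size (PCM byte count)
  return h;
}

// Merge WAV buffers: keep one header, strip the rest, patch the sizes.
function mergeWavBuffers(buffers) {
  const HEADER = 44;
  const pcm = Buffer.concat(buffers.map(b => b.subarray(HEADER)));
  const header = Buffer.from(buffers[0].subarray(0, HEADER));
  header.writeUInt32LE(36 + pcm.length, 4); // patch RIFF ChunkSize
  header.writeUInt32LE(pcm.length, 40);     // patch data Subchunk2Size
  return Buffer.concat([header, pcm]);
}

// Demo with two synthetic clips (stand-ins for response.audioContent):
const clipA = Buffer.concat([makeWavHeader(4), Buffer.from([1, 2, 3, 4])]);
const clipB = Buffer.concat([makeWavHeader(6), Buffer.from([5, 6, 7, 8, 9, 10])]);
const merged = mergeWavBuffers([clipA, clipB]);
console.log(merged.length, merged.readUInt32LE(40)); // 54 10
```

This avoids the ffmpeg dependency entirely, at the cost of larger (uncompressed) files; you can still transcode the merged WAV to MP3 afterwards if needed.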