使用 Microsoft Azure Text To Speech with Unity 时播放声音的开头和结尾会出现断音

There will be broken sounds at the beginning and end of the playing sound when using Microsoft Azure Text To Speech with Unity

我正在使用 Microsoft Azure Text To Speech with Unity。但是播放声音的开头和结尾会有断音。这是正常现象,还是 result.AudioData 坏了。下面是代码。

    public AudioSource audioSource;
    void Start()
    {
        SynthesisToSpeaker("你好世界");
    }
    public void SynthesisToSpeaker(string text)
    {
        var config = SpeechConfig.FromSubscription("[redacted]", "southeastasia");
        config.SpeechSynthesisLanguage = "zh-CN";
        config.SpeechSynthesisVoiceName = "zh-CN-XiaoxiaoNeural";

        // Creates a speech synthesizer.
        // Make sure to dispose the synthesizer after use!       
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
        Task<SpeechSynthesisResult> task = synthesizer.SpeakTextAsync(text);
        StartCoroutine(CheckSynthesizer(task, config, synthesizer));
    }
    private IEnumerator CheckSynthesizer(Task<SpeechSynthesisResult> task,
        SpeechConfig config,
        SpeechSynthesizer synthesizer)
    {
        yield return new WaitUntil(() => task.IsCompleted);
        var result = task.Result;
        // Checks result.
        string newMessage = string.Empty;
        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            var sampleCount = result.AudioData.Length / 2;
            var audioData = new float[sampleCount];
            for (var i = 0; i < sampleCount; ++i)
            {
                audioData[i] = (short)(result.AudioData[i * 2 + 1] << 8
                        | result.AudioData[i * 2]) / 32768.0F;
            }
            // The default output audio format is 16K 16bit mono
            var audioClip = AudioClip.Create("SynthesizedAudio", sampleCount,
                    1, 16000, false);
            audioClip.SetData(audioData, 0);
            audioSource.clip = audioClip;
            audioSource.Play();

        }
        else if (result.Reason == ResultReason.Canceled)
        {
            var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
        }
        synthesizer.Dispose();
    }

默认的音频格式是Riff16Khz16BitMonoPcm,在result.AudioData的开头有一个riff header。如果您将 audioData 传递给 audioClip,它将播放 header,然后您会听到一些噪音。

您可以通过 speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm); 将格式设置为没有 header 的原始格式,详情请参阅 this sample