How To Make iOS Speech-To-Text Persistent

I'm doing preliminary research for a potential new product. Part of the product requires Speech-To-Text on iPhone and iPad to stay on until the user turns it off. While using it myself, I've noticed that it either shuts off automatically after about 30 seconds, whether or not the user has stopped speaking, or it shuts off after the speaker utters some number of questionable words. Either way, the product needs it to stay on until explicitly told to stop. Has anyone worked with this before? And yes, I've tried searching thoroughly; I can't seem to find anything of substance, especially anything written in the right language. Thanks, friends!

import Speech

let recognizer = SFSpeechRecognizer()
// audioFileURL is assumed to point at the audio file to transcribe.
let request = SFSpeechURLRecognitionRequest(url: audioFileURL)
// On-device recognition only appears to work on a physical device, not the simulator.
#if targetEnvironment(simulator)
request.requiresOnDeviceRecognition = false
#else
request.requiresOnDeviceRecognition = true
#endif
recognizer?.recognitionTask(with: request) { result, error in
    print(result?.bestTranscription.formattedString ?? "")
}

The snippet above, when run on a physical device, transcribes continuously ("persistently") using Apple's Speech framework.

The magic line here is request.requiresOnDeviceRecognition = ...

If request.requiresOnDeviceRecognition is true and SFSpeechRecognizer#supportsOnDeviceRecognition is true, then the audio will be transcribed continuously until the battery dies, the user cancels transcription, or some other error/terminating condition occurs. That was the case in my trials, at least.
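As a concrete sketch of that pairing (iOS 13+; the helper name makePersistentRequest is my own, and it assumes a live-audio SFSpeechAudioBufferRecognitionRequest rather than the file-URL request above):

import Speech

// Minimal sketch: only opt in to on-device recognition when the recognizer
// reports support for it; this is the combination that kept transcription
// running indefinitely in the trials described above.
func makePersistentRequest(for recognizer: SFSpeechRecognizer) -> SFSpeechAudioBufferRecognitionRequest {
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.shouldReportPartialResults = true
    if #available(iOS 13, *), recognizer.supportsOnDeviceRecognition {
        request.requiresOnDeviceRecognition = true
    }
    return request
}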

Documentation:

https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio

I found a tutorial here that shows speech recognition in action. But see the notes:

- Apple limits recognition per device. The limit is not known, but you can contact Apple for more information.
- Apple limits recognition per app.
- If you routinely hit limits, make sure to contact Apple; they can probably resolve it.
- Speech recognition uses a lot of power and data.
- Speech recognition only lasts about a minute at a time (a small pre-flight check is sketched after this list).
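Given those limits, it's worth verifying that recognition is even authorized and available before starting a session at all. A small pre-flight sketch (the helper name canTranscribe is my own):

import Speech

// Verify that the user has authorized speech recognition and that the
// recognizer is currently usable before kicking off a session.
func canTranscribe(with recognizer: SFSpeechRecognizer?) -> Bool {
    guard SFSpeechRecognizer.authorizationStatus() == .authorized else { return false }
    guard let recognizer = recognizer, recognizer.isAvailable else { return false }
    return true
}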

Edit

This answer was written for iOS 10. I expected iOS 12, due in October 2018, to change this, but Apple still says:

Plan for a one-minute limit on audio duration. Speech recognition can place a relatively high burden on battery life and network usage. In iOS 10, utterance audio duration is limited to about one minute, which is similar to the limit for keyboard-related dictation.

See: https://developer.apple.com/documentation/speech

There are no API changes in the Speech framework for iOS 11 or 12. You can review the full set of API changes, especially for iOS 12, in detail thanks to Paul Hudson: iOS 12 APIs Diffs

So my answer should still hold.

This will start recording again automatically every 40 seconds, even if you say nothing. If you speak and then go silent for 2 seconds, it stops and the didFinishTalk function is called.

@objc func startRecording() {
    self.fullsTring = ""
    audioEngine.reset()

    // Cancel any in-flight task before starting a new one.
    if recognitionTask != nil {
        recognitionTask?.cancel()
        recognitionTask = nil
    }

    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.record)
        try audioSession.setMode(.measurement)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        try audioSession.setPreferredSampleRate(44100.0)

        if audioSession.isInputGainSettable {
            try audioSession.setInputGain(1.0)
        } else {
            print("Cannot set input gain")
        }
    } catch {
        print("audioSession properties weren't set because of an error: \(error)")
    }

    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    let inputNode = audioEngine.inputNode
    guard let recognitionRequest = recognitionRequest else {
        fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
    }
    recognitionRequest.shouldReportPartialResults = true

    // Force a restart every 40 seconds, even if nothing is said.
    self.timer4 = Timer.scheduledTimer(timeInterval: 40, target: self, selector: #selector(againStartRec), userInfo: nil, repeats: false)

    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
        var isFinal = false

        if let result = result {
            // Every partial result resets the 2-second silence timer and updates
            // the UI, so both are pushed back onto the main thread.
            DispatchQueue.main.async {
                self.timer.invalidate()
                self.timer = Timer.scheduledTimer(timeInterval: 2.0, target: self, selector: #selector(self.didFinishTalk), userInfo: nil, repeats: false)

                self.fullsTring = result.bestTranscription.formattedString
                self.inputContainerView.inputTextField.text = self.fullsTring
            }
            isFinal = result.isFinal
        }

        if isFinal {
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
        }

        if error != nil {
            URLCache.shared.removeAllCachedResponses()
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)

            guard let task = self.recognitionTask else {
                return
            }
            task.cancel()
            task.finish()
        }
    }

    audioEngine.reset()
    inputNode.removeTap(onBus: 0)

    let recordingFormat = AVAudioFormat(standardFormatWithSampleRate: 44100, channels: 1)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        print("audioEngine couldn't start because of an error.")
    }

    self.hasrecorded = true
}

@objc func againStartRec() {
    self.inputContainerView.uploadImageView.setBackgroundImage(#imageLiteral(resourceName: "microphone"), for: .normal)
    self.inputContainerView.uploadImageView.alpha = 1.0
    self.timer4.invalidate()
    self.timer.invalidate()

    if self.audioEngine.isRunning {
        self.audioEngine.stop()
        self.recognitionRequest?.endAudio()
        self.recognitionTask?.finish()
    }

    // Give the engine a moment to tear down, then start a fresh session.
    self.timer2 = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(startRecording), userInfo: nil, repeats: false)
}

@objc func didFinishTalk() {
    if self.fullsTring != "" {
        self.timer4.invalidate()
        self.timer.invalidate()
        self.timer2.invalidate()

        if self.audioEngine.isRunning {
            self.audioEngine.stop()
            guard let task = self.recognitionTask else {
                return
            }
            task.cancel()
            task.finish()
        }
    }
}

///
/// Code lightly adapted from https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio?language=swift
///
/// Modifications from the original:
/// - The color of the text changes every time a new "chunk" of text is transcribed
/// -- This was a feature I added while playing with my nephews (ages 2 and 6). They loved it; we kept saying "rainbow"
/// - I added a bit of logic to scroll to the end of the text once new chunks were added
/// - I formatted the code using swiftformat
///

import Speech
import UIKit

public class ViewController: UIViewController, SFSpeechRecognizerDelegate {
  private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!

  private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?

  private var recognitionTask: SFSpeechRecognitionTask?

  private let audioEngine = AVAudioEngine()

  @IBOutlet var textView: UITextView!

  @IBOutlet var recordButton: UIButton!

  let colors: [UIColor] = [.red, .orange, .yellow, .green, .blue, .purple]

  var colorIndex = 0

  override public func viewDidLoad() {
    super.viewDidLoad()

    textView.textColor = colors[colorIndex]
    // Disable the record buttons until authorization has been granted.
    recordButton.isEnabled = false
  }

  override public func viewDidAppear(_ animated: Bool) {
    super.viewDidAppear(animated)
    // Configure the SFSpeechRecognizer object already
    // stored in a local member variable.
    speechRecognizer.delegate = self

    // Asynchronously make the authorization request.
    SFSpeechRecognizer.requestAuthorization { authStatus in

      // Divert to the app's main thread so that the UI
      // can be updated.
      OperationQueue.main.addOperation {
        switch authStatus {
        case .authorized:
          self.recordButton.isEnabled = true

        case .denied:
          self.recordButton.isEnabled = false
          self.recordButton.setTitle("User denied access to speech recognition", for: .disabled)

        case .restricted:
          self.recordButton.isEnabled = false
          self.recordButton.setTitle("Speech recognition restricted on this device", for: .disabled)

        case .notDetermined:
          self.recordButton.isEnabled = false
          self.recordButton.setTitle("Speech recognition not yet authorized", for: .disabled)

        default:
          self.recordButton.isEnabled = false
        }
      }
    }
  }

  private func startRecording() throws {
    // Cancel the previous task if it's running.
    recognitionTask?.cancel()
    recognitionTask = nil

    // Configure the audio session for the app.
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    let inputNode = audioEngine.inputNode

    // Create and configure the speech recognition request.
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    ////////////////////////////////////////////////////////////////////////////////
    ////////////////////////////////////////////////////////////////////////////////
    /// The below lines are responsible for keeping the recording active longer
    /// than just short bursts. I've had the recording going all day in somewhat
    /// rudimentary attempts.
    ////////////////////////////////////////////////////////////////////////////////
    ////////////////////////////////////////////////////////////////////////////////
    if #available(iOS 13, *) {
      let supportsOnDeviceRecognition = speechRecognizer.supportsOnDeviceRecognition
      if !supportsOnDeviceRecognition {
        fatalError("On device transcription not supported on this device. It is safe to remove this error but I wanted to add it as a warning that you'd actually see.")
      }
      recognitionRequest!.requiresOnDeviceRecognition = /* only appears to work on device; not simulator */ supportsOnDeviceRecognition
    }

    guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
    recognitionRequest.shouldReportPartialResults = true

    // Create a recognition task for the speech recognition session.
    // Keep a reference to the task so that it can be canceled.
    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
      var isFinal = false

      if let result = result {
        // Update the text view with the results.
        self.colorIndex = (self.colorIndex + 1) % self.colors.count
        self.textView.text = result.bestTranscription.formattedString
        self.textView.textColor = self.colors[self.colorIndex]
        self.textView.scrollRangeToVisible(NSRange(location: max(0, result.bestTranscription.formattedString.count - 1), length: 0)) // guard against an empty transcription
        isFinal = result.isFinal
        print("Text \(result.bestTranscription.formattedString)")
      }

      if error != nil || isFinal {
        // Stop recognizing speech if there is a problem.
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)

        self.recognitionRequest = nil
        self.recognitionTask = nil

        self.recordButton.isEnabled = true
        self.recordButton.setTitle("Start Recording", for: [])
      }
    }

    // Configure the microphone input.
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, _: AVAudioTime) in
      self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()

    // Let the user know to start talking.
    textView.text = "(Go ahead, I'm listening)"
  }

  // MARK: SFSpeechRecognizerDelegate

  public func speechRecognizer(_: SFSpeechRecognizer, availabilityDidChange available: Bool) {
    if available {
      recordButton.isEnabled = true
      recordButton.setTitle("Start Recording", for: [])
    } else {
      recordButton.isEnabled = false
      recordButton.setTitle("Recognition Not Available", for: .disabled)
    }
  }

  // MARK: Interface Builder actions

  @IBAction func recordButtonTapped() {
    if audioEngine.isRunning {
      audioEngine.stop()
      recognitionRequest?.endAudio()
      recordButton.isEnabled = false
      recordButton.setTitle("Stopping", for: .disabled)
    } else {
      do {
        try startRecording()
        recordButton.setTitle("Stop Recording", for: [])
      } catch {
        recordButton.setTitle("Recording Not Available", for: [])
      }
    }
  }
}


Notes:

I originally tried to edit this answer [0], but I wanted to add so much detail that I felt it would completely hijack the original answer. Instead, I'm keeping my own answer, ideally as a canonical quickstart for speech transcription on Apple platforms: a way to carry this approach into SwiftUI and the Composable Architecture (adapting their example [1]).

0:

1: https://github.com/pointfreeco/swift-composable-architecture/tree/main/Examples/SpeechRecognition/SpeechRecognition
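As a rough taste of that direction, here is a sketch (my own, untested, and not the Composable Architecture version) of wrapping the same persistent-transcription idea in an ObservableObject for SwiftUI. The Transcriber type is hypothetical, it assumes iOS 14+, and audio-session configuration and permission prompts are omitted for brevity:

import AVFoundation
import Speech
import SwiftUI

// Hypothetical wrapper: publishes the running transcript so a SwiftUI view
// can observe it. Assumes speech/microphone permissions are already granted.
final class Transcriber: ObservableObject {
    @Published var transcript = ""

    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func start() throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        if #available(iOS 13, *), recognizer.supportsOnDeviceRecognition {
            request.requiresOnDeviceRecognition = true // the persistence trick from above
        }
        self.request = request

        // Feed microphone buffers into the recognition request.
        let inputNode = audioEngine.inputNode
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
            request.append(buffer)
        }

        task = recognizer.recognitionTask(with: request) { [weak self] result, _ in
            guard let result = result else { return }
            DispatchQueue.main.async {
                self?.transcript = result.bestTranscription.formattedString
            }
        }

        audioEngine.prepare()
        try audioEngine.start()
    }

    func stop() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.cancel()
    }
}

struct TranscriptView: View {
    @StateObject private var transcriber = Transcriber()

    var body: some View {
        Text(transcriber.transcript.isEmpty ? "(Go ahead, I'm listening)" : transcriber.transcript)
            .padding()
            .onAppear { try? transcriber.start() }
            .onDisappear { transcriber.stop() }
    }
}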