Python's SoundFile: soundfile.write and clipping

Question

Suppose I read a WAV file with Python's soundfile:

import soundfile
x, fs = soundfile.read("test.wav")

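The array x holds floating-point samples (soundfile.read returns float64 by default) with max(x) = 1 and min(x) = -1; that is, every sample in x is a float between -1 and 1. I do some processing on it and get y. Now I want to save y to a WAV file. Suppose y now has values greater than 1 (and/or less than -1), and I use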

soundfile.write("processed.wav", y, fs)

How does SoundFile handle the out-of-range values? Does it clip (if y[t] > 1, take y[t] = 1), normalize (divide the whole signal by max(abs(y))), or do something else?

I did not find the answer in the documentation: https://pysoundfile.readthedocs.io/en/latest/#soundfile.write

import numpy as np 
import soundfile as sf

x = np.array([0,0.5,0.75, 1, 2]) 
sf.write("x.wav", x, 1)
y, fs = sf.read("x.wav")

print(y)

The output is:

[0.         0.5        0.75       0.99996948 0.99996948]

So it seems that it does clip, but I want to be sure. Can I control how soundfile.write handles out-of-range values?

Answer 1

The important question to answer here is not just what soundfile does, but how to confirm the behaviour.

Let's keep this neat little example program, with a few extra comments:

import numpy as np 
import soundfile as sf

x = np.array([0,0.5,0.75, 1, 2]) # x.dtype is 'float64'
sf.write("x.wav", x, 1) # a wav at sampling rate 1 Hz

y, fs = sf.read("x.wav")

print(y)

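WAVs come in several flavours, differing in both sample rate and data format (or bit depth). One possible source of odd behaviour is the 1 Hz sampling rate. Thankfully it has no effect in this case, but in general it is a good idea to avoid potential problems caused by unusual values. Stick to standard sample rates until you have pinned down the behaviour.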

Soundfile's documentation is not opaque as such, but you do have to chase the information down. For the write() method, we see

subtype (str, optional) – See default_subtype() for the default value and available_subtypes() for all possible values.

However, another important piece of information is actually under the data parameter:

The data type of data does not select the data type of the written file. Audio data will be converted to the given subtype. Writing int values to a float file will not scale the values to [-1.0, 1.0). If you write the value np.array([42], dtype='int32'), to a subtype='FLOAT' file, the file will then contain np.array([42.], dtype='float32').

In short, the file's data type is not inferred from the sample data; the data is converted to the subtype.

When we look at default_subtype(), we find that the default for WAV is 16-bit PCM.

The tricky bit is: what does soundfile do when the data is read back with read()?

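Good practice is to use something else to confirm the behaviour. If a second method of reading the data reports the same information, then bingo, we have cracked it. If not, at least one of the methods is altering the data, so you have to try a third method (and so on).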

A good way to read the data and be sure it has not been altered is to open the file in a hex editor.

At this point, let's remind ourselves that the output of soundfile.read() was:

[0.         0.5        0.75       0.99996948 0.99996948]

The example above creates a file which, viewed in hex, is:

52494646 2E000000 57415645 666D7420 10000000 01000100 01000000 02000000 02001000 64617461 0A000000 00000040 0060FF7F FF7F

We know the samples are 16-bit, so the last 10 bytes are the ones we are interested in (2 bytes per sample, 5 samples in total).
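If you would rather not read the header by eye, the standard-library wave module can act as a second, independent reader. This is only a sketch: it feeds the hex dump listed above (copied verbatim, spaces removed) to wave through an in-memory buffer and reports the header fields and the raw data bytes.

```python
import io
import wave

# Hex dump of x.wav exactly as listed above.
WAV_HEX = (
    "52494646" "2E000000" "57415645" "666D7420" "10000000" "01000100"
    "01000000" "02000000" "02001000" "64617461" "0A000000" "00000040"
    "0060FF7F" "FF7F"
)
wav_bytes = bytes.fromhex(WAV_HEX)

with wave.open(io.BytesIO(wav_bytes), "rb") as w:
    channels = w.getnchannels()
    sample_width = w.getsampwidth()  # bytes per sample
    frame_rate = w.getframerate()
    n_frames = w.getnframes()
    payload = w.readframes(n_frames)  # raw, unconverted sample bytes

print(channels, sample_width, frame_rate, n_frames)  # 1 2 1 5
print(payload.hex())  # 000000400060ff7fff7f
```

wave does no float conversion at all, so whatever it returns is what is actually stored in the file.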

16-bit samples are signed, so we have a swing of ±2^{15}, i.e. 32768 (pedants, don't worry, I'll get to that in a moment).

0000 0040 0060 FF7F FF7F

Ah, but that is in little-endian format. So let's flip the bytes to make it easier to read:

0000 4000 6000 7FFF 7FFF

Taking each one in turn:

0000 is 0, nice and easy: [0.0]
4000 is 16384, or 32768 * 0.5: [0.5]
6000 is 24576, or 32768 * 0.75: [0.75]
7FFF is 32767, exactly the largest positive amplitude that can be represented.

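A full-scale 1.0 would map to 32768, which does not fit in a signed 16-bit integer, so it is capped at 32767. That is why there is a slight error when the data is read back: 32767 / 32768 is 0.99996948 (after a little rounding).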
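The same decoding can be done in a couple of lines with the standard-library struct module. This is a sketch that assumes, as the printed values suggest, that reading back divides each integer by 32768.

```python
import struct

# The 10 data bytes from the hex dump, in file (little-endian) order.
payload = bytes.fromhex("0000 0040 0060 FF7F FF7F")

# "<5h" means: little-endian, five signed 16-bit integers.
ints = struct.unpack("<5h", payload)
floats = [v / 32768 for v in ints]

print(ints)    # (0, 16384, 24576, 32767, 32767)
print(floats)  # [0.0, 0.5, 0.75, 0.999969482421875, 0.999969482421875]
```

The last two values are what soundfile.read() printed as 0.99996948.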

Let's confirm the behaviour by flipping the last two samples to negative values.

import numpy as np 
import soundfile as sf

x = np.array([0,0.5,0.75, -1, -2]) # x.dtype is 'float64'
sf.write("x.wav", x, 1) # a wav at sampling rate 1 Hz

y, fs = sf.read("x.wav")

print(y)

In big-endian format, our hex data is now

0000 4000 6000 8000 8000

8000 is -32768 as a signed 16-bit integer.
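If two's complement arithmetic is not something you do in your head, the big-endian words above can be decoded with int.from_bytes. This is a small sketch, again assuming a scale factor of 32768 when converting back to float.

```python
# The big-endian 16-bit words listed above.
words = ["0000", "4000", "6000", "8000", "8000"]

# signed=True interprets each word as a two's complement integer.
ints_neg = [int.from_bytes(bytes.fromhex(w), "big", signed=True) for w in words]
floats_neg = [v / 32768 for v in ints_neg]

print(ints_neg)    # [0, 16384, 24576, -32768, -32768]
print(floats_neg)  # [0.0, 0.5, 0.75, -1.0, -1.0]
```

Note the asymmetry: -1.0 is stored exactly, whereas +1.0 has to settle for 32767.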

From this we can confirm that our data is being clipped (not normalized or wrapped).
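To make that elimination explicit, here is a toy model of the three candidate behaviours compared against the integers found in the two files. The helper names and the 32768 scale factor are my own assumptions for illustration; this is not the library's actual conversion code.

```python
FULL_SCALE = 32768  # assumed float-to-int16 scale factor (2**15)
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_clip(samples):
    """Scale, then cap anything outside the int16 range."""
    return [max(INT16_MIN, min(INT16_MAX, round(s * FULL_SCALE))) for s in samples]


def to_int16_wrap(samples):
    """Scale, then let overflow wrap around as two's complement would."""
    return [(round(s * FULL_SCALE) + 32768) % 65536 - 32768 for s in samples]


def to_int16_normalize(samples):
    """Divide by the peak so the loudest sample lands on full scale."""
    peak = max(abs(s) for s in samples)
    return [round(s / peak * INT16_MAX) for s in samples]


x_pos = [0, 0.5, 0.75, 1, 2]
x_neg = [0, 0.5, 0.75, -1, -2]
observed_pos = [0, 16384, 24576, 32767, 32767]    # from the first hex dump
observed_neg = [0, 16384, 24576, -32768, -32768]  # from the second hex dump

for name, model in [("clip", to_int16_clip),
                    ("wrap", to_int16_wrap),
                    ("normalize", to_int16_normalize)]:
    matches = model(x_pos) == observed_pos and model(x_neg) == observed_neg
    print(name, matches)
```

Only the clipping model reproduces both files.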


python

audio

soundfile