Why is the GPU 3.5 times slower than the CPU on an Apple M1 Mac?

I built a simple network with Keras on an M1 MacBook Air and installed the officially recommended tensorflow-metal plugin, hoping for faster training and prediction. However, GPU prediction turned out to be 3.5 times slower than the CPU, which confuses me. Here is my code, followed by the output with and without the GPU enabled:

import time

import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers


class CNNModel(object):
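    """Small 1-D CNN classifier used here to compare CPU and GPU speed."""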
    def __init__(self, input_shape=(29, 1), num_classes=6, model_path=None):
        self.model = keras.Sequential(
            [
                keras.Input(shape=input_shape),
                layers.Conv1D(16, kernel_size=3, activation="relu"),
                layers.MaxPooling1D(pool_size=3),
                layers.Conv1D(32, kernel_size=3, activation="relu"),
                layers.MaxPooling1D(pool_size=3),
                layers.Flatten(),
                layers.Dropout(0.5),
                layers.Dense(32, activation="sigmoid"),
                layers.Dense(num_classes, activation='softmax')
            ]
        )
        self.model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
        if model_path is not None:
            self.model.load_weights(model_path)

    def predict(self, x):
        preds = self.model.predict(x)
        preds = np.argmax(preds, axis=1)
        return preds

    def fit(self, x, y, model_save_path, batch_size=64, epochs=30):
        history = self.model.fit(x, y, batch_size=batch_size, epochs=epochs, validation_split=0.2,
                                 callbacks=[ModelCheckpoint(filepath=model_save_path, save_weights_only=True,
                                                            monitor='val_accuracy', mode='max', save_best_only=True)])


if __name__ == '__main__':
    model_path = "test.h5"
    sample_size = 20000
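    # Synthetic data: 29 random features per sample; labels drawn from 12 classes.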
    data_x, data_y = np.random.random((sample_size, 29)), np.random.randint(0, 12, size=(sample_size, 1))
    class_num = np.unique(data_y).shape[0]
    data_y = keras.utils.to_categorical(data_y, class_num)
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_x, data_y, test_size=0.2)
    model = CNNModel(input_shape=(Xtrain.shape[1], 1), num_classes=class_num)
    model.fit(Xtrain, Ytrain, batch_size=512, epochs=10, model_save_path=model_path)
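    # Reload the best checkpoint and time prediction on the test split.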
    model = CNNModel(input_shape=(Xtrain.shape[1], 1), num_classes=class_num, model_path=model_path)
    since = time.time()
    preds = model.predict(Xtest)
    end = time.time()
    print(f'Predict {Xtest.shape[0]} samples in {end - since : .9f}s, {(end - since) / Xtest.shape[0]: .9f}s on avg')

With the GPU enabled, I get the following output:

Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

2022-01-10 21:07:47.974952: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-01-10 21:07:47.975053: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-01-10 21:07:48.039236: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/10
2022-01-10 21:07:48.206631: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
23/25 [==========================>...] - ETA: 0s - loss: 2.5483 - accuracy: 0.0828
2022-01-10 21:07:48.674379: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
25/25 [==============================] - 1s 18ms/step - loss: 2.5446 - accuracy: 0.0839 - val_loss: 2.4955 - val_accuracy: 0.0850
Epoch 2/10
25/25 [==============================] - 0s 15ms/step - loss: 2.4923 - accuracy: 0.0870 - val_loss: 2.4852 - val_accuracy: 0.0875
Epoch 3/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4864 - accuracy: 0.0863 - val_loss: 2.4851 - val_accuracy: 0.0866
Epoch 4/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4866 - accuracy: 0.0841 - val_loss: 2.4851 - val_accuracy: 0.0862
Epoch 5/10
25/25 [==============================] - 0s 14ms/step - loss: 2.4863 - accuracy: 0.0826 - val_loss: 2.4849 - val_accuracy: 0.0869
Epoch 6/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4855 - accuracy: 0.0909 - val_loss: 2.4850 - val_accuracy: 0.0800
Epoch 7/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4861 - accuracy: 0.0843 - val_loss: 2.4848 - val_accuracy: 0.0884
Epoch 8/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4852 - accuracy: 0.0848 - val_loss: 2.4852 - val_accuracy: 0.0803
Epoch 9/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4848 - accuracy: 0.0880 - val_loss: 2.4846 - val_accuracy: 0.0866
Epoch 10/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4846 - accuracy: 0.0871 - val_loss: 2.4851 - val_accuracy: 0.0875
2022-01-10 21:07:51.840891: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
Predict 4000 samples in 0.259644985s, 0.000064911s on avg

After uninstalling the plugin with python -m pip uninstall tensorflow-metal, I get this instead:

Epoch 1/10
25/25 [==============================] - 0s 6ms/step - loss: 2.6182 - accuracy: 0.0824 - val_loss: 2.5252 - val_accuracy: 0.0878
Epoch 2/10
25/25 [==============================] - 0s 3ms/step - loss: 2.5025 - accuracy: 0.0863 - val_loss: 2.4898 - val_accuracy: 0.0791
Epoch 3/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4901 - accuracy: 0.0848 - val_loss: 2.4873 - val_accuracy: 0.0766
Epoch 4/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4894 - accuracy: 0.0844 - val_loss: 2.4865 - val_accuracy: 0.0847
Epoch 5/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4891 - accuracy: 0.0802 - val_loss: 2.4869 - val_accuracy: 0.0797
Epoch 6/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4876 - accuracy: 0.0811 - val_loss: 2.4876 - val_accuracy: 0.0828
Epoch 7/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4866 - accuracy: 0.0847 - val_loss: 2.4873 - val_accuracy: 0.0822
Epoch 8/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4867 - accuracy: 0.0841 - val_loss: 2.4867 - val_accuracy: 0.0838
Epoch 9/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4870 - accuracy: 0.0860 - val_loss: 2.4867 - val_accuracy: 0.0787
Epoch 10/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4860 - accuracy: 0.0883 - val_loss: 2.4870 - val_accuracy: 0.0744
Predict 4000 samples in 0.073775768s, 0.000018444s on avg

I ran into the same problem last week, and it confused me too. In my case, training took about 7 seconds on the CPU and about 100 seconds on the GPU, so the GPU was 14 times slower! That was a simple ANN; on a CNN, however, I found the GPU to be about 20% faster than the CPU.

I think it depends on your input size. Individual GPU cores are much slower than CPU cores, but the main advantage of a GPU is that it can run thousands of threads simultaneously. On a CPU you are limited by the number of cores, and although the M1 has 8 cores, only 4 of them are high-performance cores.

So if your training batches are small enough, you won't benefit from the GPU, because too few of its threads will be in use, and due to the GPU architecture the device can't make up for that by processing separate batches in parallel. I suggest testing GPU and CPU performance over a few epochs and then choosing the faster unit, as in the sketch below.
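As a rough illustration of such a test (a minimal hypothetical sketch, not from the original answer: the model mirrors the question's CNN, and the batch sizes tried are arbitrary), you could time predict() at a few batch sizes and run the same script once with tensorflow-metal installed and once without:

import time

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A model of the same shape as the one in the question.
model = keras.Sequential([
    keras.Input(shape=(29, 1)),
    layers.Conv1D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=3),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=3),
    layers.Flatten(),
    layers.Dense(12, activation="softmax"),
])

x = np.random.random((20000, 29, 1)).astype("float32")

# Small batches leave most GPU threads idle; larger batches amortize the
# per-batch dispatch overhead, which is where a GPU can start to win.
for batch_size in (32, 512, 4096):
    start = time.time()
    model.predict(x, batch_size=batch_size, verbose=0)
    print(f"batch_size={batch_size}: {time.time() - start:.3f}s")

On a machine like the one in the question, the gap between CPU and GPU should narrow as the batch size grows.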

You don't need to uninstall tensorflow-metal just to run on the CPU. You can simply call

tf.config.set_visible_devices([], 'GPU')

before compiling the NN. This call removes all GPUs from TensorFlow's set of visible devices, so training will use the CPU only.
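For completeness, a minimal self-contained sketch of that approach (assuming TensorFlow 2.x; the set_visible_devices call has to run before any model is built):

import tensorflow as tf

# Hide every GPU from TensorFlow; all subsequent ops land on the CPU.
tf.config.set_visible_devices([], 'GPU')

# Sanity check: no GPU should be listed among the visible devices now.
print(tf.config.get_visible_devices())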