Why are Xcode and Time Profiler reporting higher CPU usage for faster iOS devices?

Question

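I've written an app that emulates a classic computer. Although it has been on the App Store for a few years, I regularly try to reduce its demand on the CPU cores by testing with Time Profiler in Instruments. When I compare results between real devices with significantly different specifications, the CPU % utilisation shows a contrary trend.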

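Annotated Xcode screenshots show the device specification comparison and the contradictory CPU usage. At the time of writing, Xcode 10.2.1 is being used and both devices have iOS 12.2.1 installed. Compiler optimisations are applied even when running in debug mode. The same trend can be seen between other devices. Time Profiler shows the same percentages as Xcode. Interestingly, when using File > Recording Options... > Record Waiting Threads, the iPad Mini 2 drops to ~22% and the iPhone XS Max drops to ~28%.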

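Implementation details:

The app has two concurrent processing threads for two different tasks:

CPU simulation thread - processes the emulated computer's instructions
CRT display simulation thread - processes the raw emulated video signals and converts them into vector graphics

To avoid the expensive overhead of repeatedly creating the two threads whenever a task has work to do, dispatch semaphores are used to control when each thread sleeps.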

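Stripped-down sample code:

The code below demonstrates some of the principles for the purposes of this post. On my test devices, the difference in CPU usage % is not as pronounced as the ~120% reported for the iPad Mini 2 and iPhone XS Max, but it is still contradictory: I would expect the value for the more modern iPhone to be much lower.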

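When recording waiting threads, the values are again lower, but this time they are more in keeping with the generation of each device: iPad Mini 2 = ~48% vs iPhone XS Max = ~35%. Even so, given the difference in processors, the gap is still smaller than I would expect.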

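Each time this demo code is run, the average result can deviate by at least 5% for no apparent reason. This makes me doubt the general accuracy of the CPU usage %.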

final class ViewController: UIViewController {

    let processorDispatchSemaphore = DispatchSemaphore(value: 0)
    let videoDispatchSemaphore = DispatchSemaphore(value: 0)
    fileprivate var stopEmulation = false
    fileprivate var lastTime: CFTimeInterval = 0.0
    fileprivate var accumulatedCycles = 0

    final var pretendVideoData: [Int] = []
    final var pretendDisplayData: [Int] = []

    override func viewDidLoad() {
        super.viewDidLoad()

        let displayLink = CADisplayLink(target: self, selector: #selector(displayUpdate))
        displayLink.add(to: .main, forMode: RunLoop.Mode.common)

        let concurrentEmulationQueue = DispatchQueue.global(qos: .userInteractive)

        // CPU simulation thread 
        concurrentEmulationQueue.async() {

            repeat {

                // pause until a display refresh
                self.processorDispatchSemaphore.wait()

                // calculate the number of simulated computer clock
                // cycles that would have been executed in the
                // same time
                let currentTime = displayLink.timestamp
                let delta: CFTimeInterval = currentTime - self.lastTime
                self.lastTime = currentTime

                // Z80A Microprocessor clocked at 3.25MHz = 3,250,000 per second
                // 1 second / 3250000 = 0.000000307692308
                var emulationCyclesRequired = Int((delta / 0.000000307692308).rounded())

                // safeguard: 
                // Time delay every 1/60th (0.0166667) of a second
                // 0.0166667 / 0.000000307692308 = 54167 cycles
                // let's say that no more than 3 times that should 
                // be allowed = 54167 * 3 = 162501
                if emulationCyclesRequired > 162501 {
                    // even on slow devices the thread only needs to
                    // cap cycles whilst the CADisplayLink takes
                    // time to kick in - so after less than a second
                    // the app need not apply this safeguard
                    emulationCyclesRequired = 162501
                    print("emulation cycles capped")
                }

                // do some simulated work
                // **** fake process filling code ****
                for cycle in 0...emulationCyclesRequired {

                    if cycle % 4 == 0 {
                        self.pretendVideoData.append(cycle &+ cycle)
                    }
                    self.accumulatedCycles = self.accumulatedCycles &+ 1

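                    if self.accumulatedCycles > 40000 {
                        // unpause the CRT display simulation thread
                        self.videoDispatchSemaphore.signal()
                        self.pretendVideoData.removeAll(keepingCapacity: true)
                        // assumption: reset the counter so the display thread
                        // is signalled once per 40,000 cycles rather than on
                        // every cycle after the first 40,000
                        self.accumulatedCycles = 0
                    }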
                }
                // **** **** ****

            // thread is allowed to finish when app goes to the
            // background or a non-simulation screen.
            } while !self.stopEmulation
        }

        let concurrentDisplayQueue = DispatchQueue.global(qos: .userInteractive)

        // CRT display simulation thread
        // (edit) see comment to Rob - concurrentEmulationQueue.async(flags: .barrier) {
        concurrentDisplayQueue.async(flags: .barrier) {

            repeat {
                self.videoDispatchSemaphore.wait()

                // do some simulated work
                // **** fake process filling code ****
                for index in 0...1000 {
                    self.pretendDisplayData.append(~index)
                }

                self.pretendDisplayData.removeAll(keepingCapacity: true)
                // **** **** ****

            // thread is allowed to finish when app goes to the
            // background or a non-simulation screen.
            } while !self.stopEmulation

        }
    }

    @objc fileprivate func displayUpdate() {
        // unpause the CPU simulation thread
        processorDispatchSemaphore.signal()
    }

}

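Questions:

Why might the CPU usage % be higher for devices with faster CPUs? Any reason to think the results are not accurate?
How could I better interpret the figures or get better benchmarks between devices?
Why does Record Waiting Threads result in lower CPU usage percentages (but still not significantly different and sometimes higher for the faster device)?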

Answer 1

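I wrote a routine that performs a consistent calculation (computing π by summing the Gregory-Leibniz series, throttled to only 1.2m iterations every 60th of a second, with a semaphore/display-link dance similar to the one in your example). Both the iPad mini 2 and the iPhone Xs Max were able to sustain the target 60 fps (the iPad mini 2 only barely), and I saw CPU usage values more in line with what one would expect. Specifically, CPU usage was 47% on the iPhone Xs Max (iOS 13), but 102% on the iPad mini 2 (iOS 12.3.1):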

iPhone Xs Max:

iPad mini 2:

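I then ran it through the “Time Profiler” in Instruments with the following settings:

“High Frequency” sampling;
“Record Waiting Threads”;
“Deferred” or “Windowed” capture; and
the call tree changed to sort by “State”.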

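For a representative time sample, the iPhone Xs Max reported that this thread was running 48.2% of the time (basically, it was just waiting more than half the time):

Whereas on the iPad mini 2, this thread was running 95.7% of the time (almost no spare bandwidth, computing nearly the whole time):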

Bottom line, this suggests that this particular queue on the iPhone Xs Max could probably do roughly twice as much as it could on the iPad mini 2.
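As a rough check on that estimate, the two running fractions reported above can be turned into a headroom ratio. This is only back-of-the-envelope arithmetic on the sampled figures, and the function name `headroomRatio` is invented for illustration.

```swift
import Foundation

/// Ratio of the busy fraction on the slower device to the busy fraction on
/// the faster one. A value near 2 suggests the faster device could absorb
/// roughly twice the same workload on that thread.
func headroomRatio(busyFractionSlow: Double, busyFractionFast: Double) -> Double {
    precondition(busyFractionFast > 0, "busy fraction must be positive")
    return busyFractionSlow / busyFractionFast
}

// Running fractions taken from the Time Profiler samples quoted above.
let ratio = headroomRatio(busyFractionSlow: 0.957, busyFractionFast: 0.482)
print(String(format: "headroom ratio: %.2f", ratio))
```

The ratio says nothing about other threads or about thermal throttling; it only compares how busy this one thread was on each device.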

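You can see that the Xcode debugger's CPU graph and Instruments' “Time Profiler” tell us fairly consistent stories. And both are consistent with our expectation that the iPhone Xs Max will be significantly less taxed by the exact same task than the iPad mini 2.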

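In the interest of full disclosure, when I reduced the workload (e.g. from 1.2m iterations per 60th of a second down to only 800k), the difference in CPU utilization was less stark: CPU usage was 48% on the iPhone Xs Max and 59% on the iPad mini 2. Still, the more powerful iPhone was using less CPU than the iPad.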

You asked:

Why might the CPU usage % be higher for devices with faster CPUs? Any reason to think the results are not accurate?

A few observations:

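I'm not sure you're comparing apples-to-apples here. If you're going to make this sort of comparison, make absolutely sure that the work done on each thread on each device is exactly the same. (I love the quote I heard in a WWDC presentation a few years ago; to paraphrase, “in theory, there is no difference between theory and practice; in practice, there is a world of difference”.)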

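If you have dropped frames or other time-based differences that might split up the calculations differently, the numbers may not be comparable, because other factors such as context switches may come into play. I would make 100% sure that the calculations on the two devices are identical, otherwise the comparison will be misleading.

The debugger's CPU “Percent Used” is, IMHO, just an interesting barometer. That is, when you have nothing going on, you want to see that the meter is nice and low, to make sure there is no rogue task floating around. Conversely, when doing something massively parallelized and computationally intensive, you can use it to make sure that some mistake isn't preventing the device from being fully utilized.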

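But this debugger “Percent Used” is not a number I'd hang my hat on, in general. It is always more illuminating to look at Instruments, identify blocked threads, look at utilization by CPU core, and so on.

In your example, you put a lot of emphasis on the debugger's report of CPU “Percent Used” of 47% on the iPad mini 2 vs 85% on the iPhone Xs Max. You're apparently ignoring that on the iPad mini this is roughly ¼ of the overall capacity, but only around ⅙ for the iPhone Xs Max. Bottom line, the share of total capacity matters more than these simple percentages.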
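To make the "share of overall capacity" point concrete, here is a minimal sketch that divides a debugger-style percentage (where 100% means one fully busy core) by the number of cores. The core counts used (2 for the iPad mini 2, 6 for the iPhone Xs Max) are my assumption rather than figures from the post, and `shareOfTotalCapacity` is a hypothetical helper name.

```swift
import Foundation

/// Expresses a debugger-style CPU percentage (100% = one fully busy core)
/// as a fraction of the whole device's capacity.
func shareOfTotalCapacity(percentUsed: Double, coreCount: Int) -> Double {
    precondition(coreCount > 0, "coreCount must be positive")
    return percentUsed / (100.0 * Double(coreCount))
}

// Assumed core counts: 2 for the iPad mini 2, 6 for the iPhone Xs Max.
let iPadShare = shareOfTotalCapacity(percentUsed: 47, coreCount: 2)
let iPhoneShare = shareOfTotalCapacity(percentUsed: 85, coreCount: 6)
print(String(format: "iPad mini 2: %.3f, iPhone Xs Max: %.3f", iPadShare, iPhoneShare))

// On a real device the count can be read at run time instead of hard-coded.
let localCoreCount = ProcessInfo.processInfo.activeProcessorCount
print("cores on this machine: \(localCoreCount)")
```

This treats every core as equal, which is crude on chips that mix performance and efficiency cores, so the result is only a rough normalization.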

How could I better interpret the figures or get better benchmarks between devices?

Yes, Instruments is always going to give you more meaningful, more actionable results.

Why does Record Waiting Threads result in lower CPU usage percentages (but still not significantly different and sometimes higher for the faster device)?

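I'm not sure which “percentages” you're talking about. Most of the generic call tree percentages are useful for answering “when my code is running, what percentage of the time is spent where”, but without “Record Waiting Threads” you are missing a big part of the equation, namely where your code is waiting on something else. Both are important questions, but by including “Record Waiting Threads” you capture a more complete picture (i.e. where the app is slow).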

FWIW, here is the code that generated the results above:

class ViewController: UIViewController {

    @IBOutlet weak var fpsLabel: UILabel!
    @IBOutlet weak var piLabel: UILabel!

    let calculationSemaphore = DispatchSemaphore(value: 0)
    let displayLinkSemaphore = DispatchSemaphore(value: 0)
    let queue = DispatchQueue(label: Bundle.main.bundleIdentifier! + ".pi", qos: .userInitiated)
    var times: [CFAbsoluteTime] = []

    override func viewDidLoad() {
        super.viewDidLoad()

        let displayLink = CADisplayLink(target: self, selector: #selector(handleDisplayLink(_:)))
        displayLink.add(to: .main, forMode: .common)

        queue.async {
            self.calculatePi()
        }
    }

    /// Calculate pi using Gregory-Leibniz series
    ///
    /// I wouldn’t generally hardcode the number of iterations, but this is just what I empirically verified I could bump it up to without starting to see too many dropped frames on the iPad. I wanted to max out the iPad mini 2, while not pushing it over the edge where the numbers might no longer be comparable.
    func calculatePi() {
        var iterations = 0
        var i = 1.0
        var sign = 1.0
        var value = 0.0
        repeat {
            iterations += 1
            if iterations % 1_200_000 == 0 {
                displayLinkSemaphore.signal()
                DispatchQueue.main.async {
                    self.piLabel.text = "\(value)"
                }
                calculationSemaphore.wait()
            }
            value += 4.0 / (sign * i)
            i += 2
            sign *= -1
        } while true
    }

    @objc func handleDisplayLink(_ displayLink: CADisplayLink) {
        displayLinkSemaphore.wait()
        calculationSemaphore.signal()
        times.insert(displayLink.timestamp, at: 0)
        let count = times.count
        if count > 60 {
            let fps = 60 / (times.first! - times.last!)
            times = times.dropLast(count - 60)
            fpsLabel.text = String(format: "%.1f", fps)
        }
    }
}

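Bottom line, given that my experiment above seems to correlate with our expectations whereas yours does not, I wonder whether your calculations really do exactly the same work every 60th of a second regardless of device, as the code above does. Once you have any dropped frames, different calculations for different time intervals, and so on, all sorts of other variables come into play and invalidate the comparison.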

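For what it's worth, the above includes all of the semaphore and display link logic. When I simplified it to just sum 50 million values of the series on a single thread as quickly as possible, the iPhone Xs Max did it in 0.12 seconds, whereas the iPad mini 2 did it in 0.38 seconds. Clearly, with a simple calculation and no timers or semaphores, the difference in hardware performance comes into stark relief. Bottom line, I would not be inclined to rely on any CPU usage calculation in the debugger or Instruments to determine what theoretical performance you can achieve.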
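A minimal sketch of that kind of single-threaded measurement might look like the following. It is not the exact code used for the timings quoted above; the function name `leibnizPi` and the use of `Date` for timing are assumptions made for illustration.

```swift
import Foundation

/// Sums the first `terms` terms of the Gregory-Leibniz series for pi:
/// 4/1 - 4/3 + 4/5 - 4/7 + ...
func leibnizPi(terms: Int) -> Double {
    var value = 0.0
    var denominator = 1.0
    var sign = 1.0
    for _ in 0..<terms {
        value += 4.0 / (sign * denominator)
        denominator += 2
        sign *= -1
    }
    return value
}

// Time the whole summation on the current thread: no display link,
// no semaphores, just raw computation.
let start = Date()
let estimate = leibnizPi(terms: 50_000_000)
let elapsed = Date().timeIntervalSince(start)
print("pi is approximately \(estimate), elapsed: \(elapsed) s")
```

Build it with optimizations enabled and run it on each device; comparing the elapsed times gives a more direct view of relative single-core speed than a CPU usage percentage does.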
