Why is locking in Go much slower than in Java? Lots of time spent in Mutex.Lock() and Mutex.Unlock()
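I've written a small Go library (go-patan) that collects a running min/max/avg/stddev of certain variables. I compared it to an equivalent Java implementation (patan), and to my surprise the Java implementation is much faster. I would like to understand why.
The library basically consists of a simple data store with a lock that serializes reads and writes. This is a snippet of the code: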
type Store struct {
	durations map[string]*Distribution
	counters  map[string]int64
	samples   map[string]*Distribution
	lock      *sync.Mutex
}

func (store *Store) addSample(key string, value int64) {
	store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
	store.addToStore(store.durations, key, value)
}

func (store *Store) addToCounter(key string, value int64) {
	store.lock.Lock()
	defer store.lock.Unlock()
	store.counters[key] = store.counters[key] + value
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
	store.lock.Lock()
	defer store.lock.Unlock()
	distribution, exists := destination[key]
	if !exists {
		distribution = NewDistribution()
		destination[key] = distribution
	}
	distribution.addSample(value)
}
I have benchmarked the Go and Java implementations (go-benchmark-gist, java-benchmark-gist), and Java wins by far, but I don't understand why:
Go Results:
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
100 threads with 200000 items took 17900 millis
Java Results:
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis
10 threads with 200000 items takes 311 millis
100 threads with 200000 items takes 3067 millis
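I have profiled the program with Go's pprof and generated a call graph (call-graph). It shows that essentially all of the time is spent in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
The top 20 calls according to the profiler: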
(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
flat flat% sum% cum cum%
8900ms 12.04% 12.04% 8900ms 12.04% runtime.futex
7270ms 9.84% 21.88% 7270ms 9.84% runtime/internal/atomic.Xchg
7020ms 9.50% 31.38% 7020ms 9.50% runtime.procyield
4560ms 6.17% 37.56% 4560ms 6.17% sync/atomic.CompareAndSwapUint32
4400ms 5.95% 43.51% 4400ms 5.95% runtime/internal/atomic.Xadd
4210ms 5.70% 49.21% 22040ms 29.83% runtime.lock
3650ms 4.94% 54.15% 3650ms 4.94% runtime/internal/atomic.Cas
3260ms 4.41% 58.56% 3260ms 4.41% runtime/internal/atomic.Load
2220ms 3.00% 61.56% 22810ms 30.87% sync.(*Mutex).Lock
1870ms 2.53% 64.10% 1870ms 2.53% runtime.osyield
1540ms 2.08% 66.18% 16740ms 22.66% runtime.findrunnable
1430ms 1.94% 68.11% 1430ms 1.94% runtime.freedefer
1400ms 1.89% 70.01% 1400ms 1.89% sync/atomic.AddUint32
1250ms 1.69% 71.70% 1250ms 1.69% github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
1240ms 1.68% 73.38% 3140ms 4.25% runtime.deferreturn
1070ms 1.45% 74.83% 6520ms 8.82% runtime.systemstack
1010ms 1.37% 76.19% 1010ms 1.37% runtime.newdefer
1000ms 1.35% 77.55% 1000ms 1.35% runtime.mapaccess1_faststr
950ms 1.29% 78.83% 15660ms 21.19% runtime.semacquire
860ms 1.16% 80.00% 50220ms 67.97% main.Benchmrk.func1
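Can someone explain why locking seems to be so much slower in Go than in Java, and what am I doing wrong? I've also written a channel-based implementation in Go, but it is even slower.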
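It is best to avoid defer in tiny functions that need high performance, since it is expensive. In most other cases there is no need to avoid it, because the cost of defer is outweighed by the code around it.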
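As a minimal sketch of that advice, the counter update from the question could unlock explicitly instead of deferring. The counterStore type, its constructor, and the get helper are hypothetical stand-ins for the Store above, not code from go-patan, and the explicit unlock is only safe because the critical section has a single exit path and no code expected to panic.

```go
package main

import (
	"fmt"
	"sync"
)

// counterStore is a hypothetical, stripped-down stand-in for the Store
// in the question; it keeps the pointer mutex to isolate the defer change.
type counterStore struct {
	lock     *sync.Mutex
	counters map[string]int64
}

func newCounterStore() *counterStore {
	return &counterStore{lock: &sync.Mutex{}, counters: make(map[string]int64)}
}

// addToCounter unlocks explicitly rather than with defer. The critical
// section is a single map update with one exit path.
func (s *counterStore) addToCounter(key string, value int64) {
	s.lock.Lock()
	s.counters[key] += value
	s.lock.Unlock()
}

// get reads a counter under the same lock.
func (s *counterStore) get(key string) int64 {
	s.lock.Lock()
	v := s.counters[key]
	s.lock.Unlock()
	return v
}

func main() {
	s := newCounterStore()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // defer is fine here: it runs once per goroutine
			for j := 0; j < 1000; j++ {
				s.addToCounter("hits", 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(s.get("hits")) // prints 10000
}
```

Whether this makes a measurable difference depends on the Go version and on how contended the lock is, so it is worth benchmarking before and after.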
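I would also recommend declaring the field as lock sync.Mutex instead of using a pointer. The pointer creates a small amount of extra work for the programmer (initialization, nil bugs) and a small amount of extra work for the garbage collector.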
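A sketch of the value-mutex layout might look like the following. The valueStore name and constructor are assumptions for illustration; the only change relative to the question's Store is that the mutex is held by value, so its zero value is ready to use and needs no separate allocation.

```go
package main

import (
	"fmt"
	"sync"
)

// valueStore is a hypothetical variant of the question's Store that
// holds the mutex by value instead of through a pointer.
type valueStore struct {
	lock     sync.Mutex // the zero value is an unlocked mutex
	counters map[string]int64
}

// newValueStore only has to initialize the map; the mutex needs no setup.
func newValueStore() *valueStore {
	return &valueStore{counters: make(map[string]int64)}
}

// addToCounter uses a pointer receiver so the mutex is never copied.
func (s *valueStore) addToCounter(key string, value int64) {
	s.lock.Lock()
	s.counters[key] += value
	s.lock.Unlock()
}

func main() {
	s := newValueStore()
	s.addToCounter("hits", 1)
	s.addToCounter("hits", 2)
	fmt.Println(s.counters["hits"]) // prints 3
}
```

One caveat with this layout: a sync.Mutex must not be copied after first use, so the store should always be passed around as a pointer, as the pointer receivers here do.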
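I also posted this question on the golang-nuts group. The reply from Jesper Louis Andersen explains quite well that Java uses synchronization optimization techniques such as escape analysis, lock elision, and lock coarsening.
The Java JIT might be taking the lock and allowing multiple updates at once within the lock to increase performance. I ran the Java benchmark with -Djava.compiler=NONE, which changed the performance dramatically, but that is not a fair comparison.
I assume that many of these optimization techniques have less impact in a production environment.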
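Go's compiler does not coarsen locks for you, but the effect described above can be approximated by hand. The following is a rough sketch only: batchStore and addBatch are hypothetical names that do not exist in go-patan, and it assumes callers can collect several values before touching the shared store.

```go
package main

import (
	"fmt"
	"sync"
)

// batchStore is a hypothetical store used to illustrate manual lock
// coarsening; it is not part of go-patan.
type batchStore struct {
	lock     sync.Mutex
	counters map[string]int64
}

func newBatchStore() *batchStore {
	return &batchStore{counters: make(map[string]int64)}
}

// addBatch applies many increments under one lock acquisition instead of
// locking and unlocking once per value.
func (s *batchStore) addBatch(key string, values []int64) {
	s.lock.Lock()
	for _, v := range values {
		s.counters[key] += v
	}
	s.lock.Unlock()
}

// get reads a counter under the lock.
func (s *batchStore) get(key string) int64 {
	s.lock.Lock()
	v := s.counters[key]
	s.lock.Unlock()
	return v
}

func main() {
	s := newBatchStore()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Collect values locally, then take the lock once.
			batch := make([]int64, 0, 100)
			for j := 0; j < 100; j++ {
				batch = append(batch, 1)
			}
			s.addBatch("hits", batch)
		}()
	}
	wg.Wait()
	fmt.Println(s.get("hits")) // prints 400
}
```

The trade-off is that each lock is held longer and updates become visible later, so this only helps when the lock itself, rather than the work under it, is the bottleneck.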