Question: potential deadlock with goroutine stuck on internal lock? #15

Thegaram opened this issue Apr 25, 2019 · 6 comments

Hey, first of all thanks for the hard work on this great lib!

I'm having trouble interpreting the output below. It suggests goroutine 77264 holds lock 0xc4202a60e0 for a long time, preventing others (like goroutine 77325 and many more) from acquiring it.

However, the output suggests that goroutine 77264 actually got stuck during unlock: raft.go:688 is a deferred mu.Unlock(), and deadlock.go:330 is actually a lock-acquire statement in this lib.

Does this mean that the (potential) deadlock is coming from this lib in this case? What would make goroutine 77264 get stuck on that internal lock? (I reproduced the same output with a 1m30s lock timeout.)

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 77264 lock 0xc4202a60e0
../raft/raft.go:618 raft.(*Raft).AppendEntriesRPCHandler { rf.mu.Lock() } <<<<<
/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:509 runtime.call32 { CALLFN(·call32, 32) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:434 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:302 reflect.Value.Call { return v.call("Call", in) }
../labrpc/labrpc.go:478 labrpc.(*Service).dispatch { function.Call([]reflect.Value{svc.rcvr, args.Elem(), replyv}) }
../labrpc/labrpc.go:402 labrpc.(*Server).dispatch { return service.dispatch(methodName, req) }
../labrpc/labrpc.go:229 labrpc.(*Network).ProcessReq.func1 { r := server.dispatch(req) }

Have been trying to lock it again for more than 30s
goroutine 77325 lock 0xc4202a60e0
../raft/raft.go:618 raft.(*Raft).AppendEntriesRPCHandler { rf.mu.Lock() } <<<<<
/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:509 runtime.call32 { CALLFN(·call32, 32) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:434 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:302 reflect.Value.Call { return v.call("Call", in) }
../labrpc/labrpc.go:478 labrpc.(*Service).dispatch { function.Call([]reflect.Value{svc.rcvr, args.Elem(), replyv}) }
../labrpc/labrpc.go:402 labrpc.(*Server).dispatch { return service.dispatch(methodName, req) }
../labrpc/labrpc.go:229 labrpc.(*Network).ProcessReq.func1 { r := server.dispatch(req) }

Here is what goroutine 77264 doing now
goroutine 77264 [semacquire]:
sync.runtime_SemacquireMutex(0xc420084744, 0xa900000000)
    /usr/local/Cellar/go/1.9/libexec/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420084740)
    /usr/local/Cellar/go/1.9/libexec/src/sync/mutex.go:134 +0x14c
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc420084740, 0x1288e00, 0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:330 +0x3f
github.com/sasha-s/go-deadlock.postUnlock(0x1288e00, 0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:167 +0x5f
github.com/sasha-s/go-deadlock.(*Mutex).Unlock(0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:97 +0x7f
raft.(*Raft).AppendEntriesRPCHandler(0xc4202a60e0, 0xc420beb840, 0xc421834620)
    .../src/raft/raft.go:688 +0xcc8

...

sasha-s commented Apr 26, 2019

Hi.
Looking at the output, it seems you are right.
Somehow the l.mu got locked, so we are stuck on deadlock.go:330.
I do not see how that could happen.
I would look at the dump of the other goroutines to see if any of them are doing something in deadlock.go (under the l.mu lock).
Or, if you have something reproducible, I'd love to look at it.
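
(A minimal sketch, not the library's actual code, of the shape the traces above imply: all of the wrapped lock operations funnel through one detector-internal mutex, called l.mu here. That the Lock path also touches it is an assumption, inferred from the remark about goroutines doing something in deadlock.go under the l.mu lock.)

package main

import "sync"

// lockOrder stands in for the detector's shared bookkeeping in deadlock.go;
// its mu plays the role of the internal lock that goroutine 77264 is parked
// on at deadlock.go:330.
type lockOrder struct {
	mu sync.Mutex
	// held-lock and ordering maps elided
}

var l lockOrder // one shared instance for the whole process

// Mutex stands in for a deadlock-aware wrapper around sync.Mutex.
type Mutex struct{ mu sync.Mutex }

func (m *Mutex) Lock() {
	l.mu.Lock() // bookkeeping before acquiring (assumption)
	// record the pending acquisition
	l.mu.Unlock()
	m.mu.Lock()
}

func (m *Mutex) Unlock() {
	// Bookkeeping for the release: this is the acquire that shows up as
	// deadlock.go:330 in the traces. Whether the real unlock happens before
	// or after it is an implementation detail; either way Unlock cannot
	// return until l.mu is free.
	l.mu.Lock()
	// forget that m is held
	l.mu.Unlock()
	m.mu.Unlock()
}

func main() {
	var m Mutex
	m.Lock()
	m.Unlock()
}

With that shape, any goroutine that stalls while holding l.mu holds up every other wrapped Lock and Unlock at once, which matches a goroutine sitting in semacquire inside postUnlock while the report still lists it as the holder of the user lock.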


sasha-s commented Apr 27, 2019

I found one bug in the library and hopefully fixed it in 1595213.
Not sure it will solve your problem, but could you pull and try again?
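
(On a Go-modules setup, one way to test that specific commit is to request it by its short hash; the toolchain resolves it to a pseudo-version:)

go get github.com/sasha-s/go-deadlock@1595213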

@zhiqiangxu

I just implemented a deadlock detector with easy-to-prove correctness:

https://github.com/zhiqiangxu/util/tree/master/deadlock


dunglas commented Jul 8, 2020

I'm facing the same problem while debugging Mercure:

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 5212824 lock 0xc0001aa780
hub/bolt_transport.go:120 hub.(*BoltTransport).Dispatch { t.Lock() } <<<<<
hub/subscribe.go:201 hub.(*Hub).dispatchSubscriptionUpdate { h.transport.Dispatch(u) }
hub/subscribe.go:180 hub.(*Hub).shutdown { h.dispatchSubscriptionUpdate(s, false) }
hub/subscribe.go:49 hub.(*Hub).SubscribeHandler { return }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 http.HandlerFunc.ServeHTTP { f(w, r) }
../go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 mux.(*Router).ServeHTTP { handler.ServeHTTP(w, req) }
../go/pkg/mod/github.com/gorilla/[email protected]/cors.go:54 handlers.(*cors).ServeHTTP { ch.h.ServeHTTP(w, r) }
../go/pkg/mod/github.com/unrolled/[email protected]/secure.go:177 secure.(*Secure).Handler.func1 { h.ServeHTTP(w, r) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 http.HandlerFunc.ServeHTTP { f(w, r) }
../go/pkg/mod/github.com/gorilla/[email protected]/logging.go:45 handlers.loggingHandler.ServeHTTP { h.handler.ServeHTTP(logger, req) }
../go/pkg/mod/github.com/gorilla/[email protected]/recovery.go:78 handlers.recoveryHandler.ServeHTTP { h.handler.ServeHTTP(w, req) }
../go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 mux.(*Router).ServeHTTP { handler.ServeHTTP(w, req) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2807 http.serverHandler.ServeHTTP { handler.ServeHTTP(rw, req) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:1895 http.(*conn).serve { serverHandler{c.server}.ServeHTTP(w, w.req) }

Have been trying to lock it again for more than 30s
goroutine 5373639 lock 0xc0001aa780

Here is what goroutine 5212824 doing now
goroutine 5212824 [semacquire]:
sync.runtime_SemacquireMutex(0xc00000ee24, 0x0, 0x1)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00000ee20)
	/usr/local/Cellar/go/1.14.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/Cellar/go/1.14.3/libexec/src/sync/mutex.go:81
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc00000ee20, 0x16feaa0, 0xc0001aa780)
	/Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/[email protected]/deadlock.go:330 +0xa6
github.com/sasha-s/go-deadlock.postUnlock(...)
	/Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/[email protected]/deadlock.go:167
github.com/sasha-s/go-deadlock.(*RWMutex).Unlock(0xc0001aa780)
	/Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/[email protected]/deadlock.go:129 +0x66
github.com/dunglas/mercure/hub.(*BoltTransport).Dispatch(0xc0001aa780, 0xc001522600, 0x0, 0x0)
	/Users/dunglas/workspace/mercure/hub/bolt_transport.go:133 +0x251
github.com/dunglas/mercure/hub.(*Hub).dispatchSubscriptionUpdate(0xc000298420, 0xc002c1f5e0, 0xc000f81000)
	/Users/dunglas/workspace/mercure/hub/subscribe.go:201 +0x2f2
github.com/dunglas/mercure/hub.(*Hub).shutdown(0xc000298420, 0xc002c1f5e0)
	/Users/dunglas/workspace/mercure/hub/subscribe.go:180 +0x4c
github.com/dunglas/mercure/hub.(*Hub).SubscribeHandler(0xc000298420, 0x2bb4068, 0xc00102f260, 0xc000fae900)
	/Users/dunglas/workspace/mercure/hub/subscribe.go:49 +0x66e
net/http.HandlerFunc.ServeHTTP(0xc000280ea0, 0x2bb4068, 0xc00102f260, 0xc000fae900)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001dc180, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0xe2
github.com/gorilla/handlers.(*cors).ServeHTTP(0xc0001942d0, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/[email protected]/cors.go:54 +0x102a
github.com/unrolled/secure.(*Secure).Handler.func1(0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/unrolled/[email protected]/secure.go:177 +0xdf
net/http.HandlerFunc.ServeHTTP(0xc00000ff80, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 +0x44
github.com/gorilla/handlers.loggingHandler.ServeHTTP(0x18849a0, 0xc000010020, 0x1885560, 0xc00000ff80, 0x17b87b0, 0x18907a0, 0xc002c1f500, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/[email protected]/logging.go:45 +0x243
github.com/gorilla/handlers.recoveryHandler.ServeHTTP(0x18850a0, 0xc000298de0, 0x1883d00, 0xc0002a1b20, 0x1, 0x18907a0, 0xc002c1f500, 0xc002d4f500)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/[email protected]/recovery.go:78 +0xce
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001dc0c0, 0x18907a0, 0xc002c1f500, 0xc000a19e00)
	/Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0xe2
net/http.serverHandler.ServeHTTP(0xc0001c62a0, 0x18907a0, 0xc002c1f500, 0xc000a19e00)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2807 +0xa3
net/http.(*conn).serve(0xc0035ebcc0, 0x1892f60, 0xc002980200)
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:1895 +0x86c
created by net/http.(*Server).Serve
	/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2933 +0x35c

I'm using the same version of the library. Thanks for writing such a useful tool btw!


ghostiam commented Mar 22, 2021

v0.2.1 (1595213)

The blocking occurs in a caching layer that is heavily loaded, at about 100,000 requests per second.
The blocking does not appear immediately: sometimes it takes up to a week, and sometimes it happens the day after a restart.

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 1595446618 lock 0xc00048e7d0
<project>/pkg/stracing/mutex.go:48 stracing.(*Mutex).Lock ??? <<<<<
...

Have been trying to lock it again for more than 5m0s
goroutine 1595163670 lock 0xc00048e7d0

Here is what goroutine 1595446618 doing now
goroutine 1595446618 [semacquire]:
sync.runtime_SemacquireMutex(0xc0001108a4, 0xd00, 0x1)
 runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0001108a0)
 sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
 sync/mutex.go:81
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc0001108a0, 0xc54900, 0xc00048e7d0)
 github.com/sasha-s/[email protected]/deadlock.go:330 +0xa6
github.com/sasha-s/go-deadlock.postUnlock(...)
 github.com/sasha-s/[email protected]/deadlock.go:167
github.com/sasha-s/go-deadlock.(*Mutex).Unlock(0xc00048e7d0)
 github.com/sasha-s/[email protected]/deadlock.go:97 +0x65
...


ghostiam commented Mar 22, 2021

But in this repro there is a very small timeout, while in my real code it is 5 minutes.

package main

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

func main() {
	// A deliberately short timeout to surface the report quickly.
	deadlock.Opts.DeadlockTimeout = 1 * time.Second

	var mx deadlock.Mutex

	// Each call just takes and immediately releases the same mutex.
	fn := func() {
		mx.Lock()
		mx.Unlock()
	}

	// Hammer the mutex from 100,000 goroutines.
	for i := 0; i < 100_000; i++ {
		go func() {
			for {
				fn()
			}
		}()
	}

	select {}
}

Perhaps the problem is not in this library but somewhere else. In my real case, however, the mutex protected counting the number of elements in a slice and appending to it.

// code copied from https://github.com/jaegertracing/jaeger/blob/748bf213d1e5a03ebc4da63601473be3bbe193f4/examples/hotrod/pkg/tracing/mutex.go with my fixes
type Mutex struct {
	...
	waiters     []string
	waitersLock deadlock.Mutex // <--
}

func (sm *Mutex) Lock(ctx context.Context, name string) {
	activeSpan := opentracing.SpanFromContext(ctx) // "activeSpan == noopSpan{}" in production

	sm.waitersLock.Lock() // <-- "Previous place where the lock was grabbed"
	if waiting := len(sm.waiters); waiting > 0 && activeSpan != nil {
		activeSpan.LogFields(...) // <-- func (n noopSpan) LogFields(fields ...log.Field) {}
	}
	sm.waiters = append(sm.waiters, name)
	sm.waitersLock.Unlock() // <-- "Have been trying to lock it again for more than 5m0s"

	sm.realLock.Lock()
	sm.holder = name

	sm.waitersLock.Lock()
	behindLen := len(sm.waiters) - 1
	sm.waitersLock.Unlock()

	if activeSpan != nil {
		activeSpan.LogFields(...)
	}
}
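
(To make the real-world shape concrete, a hedged usage sketch follows. The Cache type, its fields, and the Unlock call are hypothetical, since the snippet above elides them, but it shows how a hot path that counts entries in a slice and appends to it pushes every call through waitersLock, and therefore through go-deadlock.)

// Hypothetical caller of the traced Mutex above (needs "context" imported).
type Cache struct {
	mu    Mutex    // the traced wrapper defined above
	items []string // the slice the real code counts and appends to
}

func (c *Cache) Add(ctx context.Context, item string) {
	c.mu.Lock(ctx, "Cache.Add") // takes waitersLock twice plus realLock
	defer c.mu.Unlock()         // Unlock not shown above; assumed to release realLock
	c.items = append(c.items, item)
}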
