Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Iñigo Etxaniz
Iñigo Etxaniz

Posted on

     

Go deadlock detection: delock library

Introduction

In the Go programming world, managing concurrency is a part of everyday life. Go makes concurrent programming relatively straightforward with its built-in primitives like goroutines and channels. However, when it comes to locking structures to ensure thread safety, things can get a bit tricky.

Typically, we use a pointer to a structure and consistently call its methods to maintain thread safety. This approach works well for most scenarios, but complex algorithms demand careful consideration to avoid concurrency issues, especially deadlocks.

While working on a large-scale project for a caching implementation in an authentication service, I encountered an intriguing problem. Occasionally, and quite rarely, the system would hang. Race conditions were unlikely culprits, as I diligently used mutexes within structures. This left me suspecting deadlocks. But how could I confirm this and, more importantly, identify where they were happening? That's where the idea fordelock began – a tool specifically designed to detect and pinpoint deadlocks in Go applications.

Understanding Deadlocks in Go

Deadlocks in Go, or in any programming language for that matter, are akin to a gridlock in traffic – everything comes to a standstill. A deadlock occurs when two or more goroutines are waiting on each other to release resources, leading to a situation where none of them can proceed.

Go Runtime Deadlock Detection

This example, albeit a bit contrived, effectively illustrates a fundamental deadlock scenario. In theExecute method, we acquire a mutex lock and then call the same method recursively without ever releasing the lock. This leads to an attempt to acquire the lock again while it's still held by the same goroutine, causing a deadlock.

Though this is a simple and somewhat unrealistic example, it mirrors real-world scenarios where deadlocks occur due to improper lock handling. It’s a classic case of a process indefinitely waiting on itself. The Go runtime is adept at catching such blatant deadlocks in a single-threaded context, where the entire application grinds to a halt. This example serves as a reminder of how subtle mistakes in concurrency management can lead to significant issues, and why it's crucial to code with caution when dealing with locks.

packagemainimport("log""sync""time")typeLockExamplestruct{musync.MutexValueint}func(l*LockExample)Execute(){l.mu.Lock()deferl.mu.Unlock()time.Sleep(100*time.Millisecond)l.Value++log.Printf("value: %d",l.Value)l.Execute()}funcmain(){varlLockExamplel.Execute()}
Enter fullscreen modeExit fullscreen mode

Upon running the simple deadlock example with standardsync.Mutex, the Go runtime effectively detects and reports a deadlock. Here's what happens:

  • The log statement2023/12/05 12:38:02 value: 1 indicates that theExecute function ran once and incrementedValue.
  • Immediately after, the program crashes with a fatal error:all goroutines are asleep - deadlock!. This error is thrown by the Go runtime when it detects that all goroutines are waiting on each other, with no possibility of resuming.
  • The stack trace provided points to where the deadlock occurs, highlighting the recursive call toExecute that attempts to reacquire an already held lock.

This result demonstrates the Go runtime's ability to detect simple deadlocks where the entire program is stuck waiting for locks to be released. It serves as a practical example of how easy it is to fall into a deadlock trap and the importance of careful lock management.

go run ./examples/simple_std2023/12/05 12:38:02 value: 1fatal error: all goroutines are asleep - deadlock!goroutine 1[sync.Mutex.Lock]:sync.runtime_SemacquireMutex(0x485040?, 0x0?, 0xc00007e050?)        /usr/local/go/src/runtime/sema.go:77 +0x25sync.(*Mutex).lockSlow(0xc0000120e0)        /usr/local/go/src/sync/mutex.go:171 +0x15dsync.(*Mutex).Lock(...)        /usr/local/go/src/sync/mutex.go:90main.(*LockExample).Execute(0xc0000120e0)        /home/inigo/Documents/Projects/delock/examples/simple_std/main.go:15 +0x4cmain.(*LockExample).Execute(0xc0000120e0)        /home/inigo/Documents/Projects/delock/examples/simple_std/main.go:21 +0x12amain.main()        /home/inigo/Documents/Projects/delock/examples/simple_std/main.go:26 +0x1fexitstatus 2
Enter fullscreen modeExit fullscreen mode

Limitations in Complex Scenarios

In real-world programming, especially in services, it's common to have multiple goroutines performing different tasks. In such environments, a deadlock might not stop the entire application, making it more challenging to detect and resolve.

To illustrate this, let's modify our previous example to fit into a broader context:

packagemainimport("log""sync""time")typeLockExamplestruct{musync.MutexValueint}func(l*LockExample)GetValue()int{l.mu.Lock()deferl.mu.Unlock()returnl.Value}func(l*LockExample)Execute(){l.mu.Lock()deferl.mu.Unlock()time.Sleep(100*time.Millisecond)l.Value++log.Printf("value: %d",l.Value)l.Execute()}funcmain(){varlLockExamplegofunc(){l.Execute()}()readValueChan:=make(chanint)gofunc(){readValueChan<-l.GetValue()}()for{timeoutChan:=make(chanint)gofunc(){time.Sleep(1*time.Second)timeoutChan<-0}()select{case<-timeoutChan:log.Printf("waiting...")case<-readValueChan:log.Printf("value: %d",l.GetValue())return}}}
Enter fullscreen modeExit fullscreen mode

In this version, we runExecute in a separate goroutine, simulating a background process in a service. Another goroutine attempts to read theValue field, while the main routine periodically checks if the value has been read.

Output and Analysis

Running this code produces an output like this:

2023/12/05 12:42:25 value: 12023/12/05 12:42:26 waiting......
Enter fullscreen modeExit fullscreen mode

The first log indicates thatValue incremented once. However, the subsequent "waiting..." logs reveal the main goroutine is active but cannot readValue again due to the deadlock inExecute. This setup demonstrates a subtle deadlock that does not halt the entire program, exemplifying how deadlocks in multi-threaded applications can be more elusive and challenging to detect.

This scenario highlights the need for tools likedelock, which can help identify and diagnose deadlocks in more complex, real-world applications.

Detecting Deadlocks with Delock

In our previous example, the Go runtime couldn’t detect the deadlock because other parts of the program were still running. Now, let's see howdelock changes the game.

We modify our code to usedelock.Mutex instead ofsync.Mutex. This small change allowsdelock to detect and report deadlocks even in complex scenarios where the Go runtime might not.

packagemainimport("log""time""github.com/ietxaniz/delock")typeLockExamplestruct{mudelock.MutexValueint}func(l*LockExample)GetValue()int{lockID,err:=l.mu.Lock()iferr!=nil{panic(err)}deferl.mu.Unlock(lockID)returnl.Value}func(l*LockExample)Execute(){lockID,err:=l.mu.Lock()iferr!=nil{panic(err)}deferl.mu.Unlock(lockID)time.Sleep(100*time.Millisecond)l.Value++log.Printf("value: %d",l.Value)l.Execute()}funcmain(){varlLockExamplegofunc(){l.Execute()}()readValueChan:=make(chanint)gofunc(){readValueChan<-l.GetValue()}()for{timeoutChan:=make(chanint)gofunc(){time.Sleep(1*time.Second)timeoutChan<-0}()select{case<-timeoutChan:log.Printf("waiting...")case<-readValueChan:log.Printf("value: %d",l.GetValue())return}}}
Enter fullscreen modeExit fullscreen mode

In the modified code, the standard locking mechanism:

l.mu.Lock()deferl.mu.Unlock()
Enter fullscreen modeExit fullscreen mode

is transformed into:

lockID,err:=l.mu.Lock()iferr!=nil{panic(err)}deferl.mu.Unlock(lockID)
Enter fullscreen modeExit fullscreen mode

Here's what's happening:

  • lockID: This is a unique identifier for the lock instance within thedelock library. It's used internally to track and manage stack data associated with each lock operation.
  • err: If theLock() method takes longer than the defined timeout, it returns an error. This error contains valuable stack trace information indicating where the deadlock might be occurring.

For those concerned about verbosity, usingfold-ninja, a VSCode extension I developed, can streamline the appearance in the editor. The expanded code block:

lockID,err:=l.mu.Lock()iferr!=nil{panic(err)}deferl.mu.Unlock(lockID)
Enter fullscreen modeExit fullscreen mode

is neatly collapsed to:

lockID,err:=l.mu.Lock()deferl.mu.Unlock(lockID)
Enter fullscreen modeExit fullscreen mode

This makes the code more manageable and less cluttered while maintaining the functionality and robustness provided bydelock.

Output and Analysis

When we run this code:

2023/12/05 12:52:42 value: 12023/12/05 12:52:43 waiting...panic: Deadlock detected
Enter fullscreen modeExit fullscreen mode

Unlike before,delock detects the deadlock situation and panics, providing detailed stack trace information. This shows exactly where the deadlock occurs, demonstratingdelock's ability to uncover hidden issues in multi-threaded applications.

By usingdelock, developers can gain insights into complex deadlock situations, making it a powerful tool for debugging and ensuring smooth concurrency management in Go services.

Tackling More Complex Deadlock Scenarios

Before we dive into a more intricate deadlock scenario, let's recall the primary function of thedelock tool:delock is designed to help us identify and understand deadlocks in our Go applications, especially in complex cases where traditional Go runtime detection might fall short. Its ability to provide detailed stack trace information is invaluable in diagnosing and resolving these challenging issues.

Now, let's explore a scenario involving parent-child relationships within data structures, where deadlocks can be more elusive and difficult to detect.

Not all deadlock issues are as straightforward as the ones we've explored so far. In real-world applications, especially those involving intricate parent-child relationships between data structures, deadlocks can be more elusive. Let's delve into a more challenging example, where two interconnected data structures, each with its own lock, interact in ways that could potentially lead to deadlock situations.

In this scenario, we'll examine how parent and child structures, both equipped with mutexes, can create complex locking sequences that might result in deadlocks under certain conditions. This complexity showcases the subtleties of deadlock issues in sophisticated program architectures.

The following code demonstrates the same scenario using thedelock library. Whiledelock is instrumental in detecting deadlocks, it's important to note that it's primarily designed for debugging and testing purposes rather than production use. The additional processingdelock performs when acquiring and releasing locks might introduce overhead not suitable for a production environment.

packagemainimport("fmt""os""strconv""time""github.com/ietxaniz/delock")typeChildstruct{mudelock.RWMutexvaluestringidintparent*Parent}typeParentstruct{mudelock.RWMutexitems[]*ChildlastIDintnumChangesint}func(p*Parent)Add(valuestring){lockID,err:=p.mu.Lock()iferr!=nil{panic(err)}deferp.mu.Unlock(lockID)time.Sleep(1*time.Millisecond)p.lastID++ifp.items==nil{p.items=make([]*Child,0)}p.items=append(p.items,&Child{value:value,id:p.lastID,parent:p})}func(p*Parent)GetChildByID(childIDint)*Child{lockID,err:=p.mu.RLock()iferr!=nil{panic(err)}deferp.mu.RUnlock(lockID)for_,item:=rangep.items{ifitem.GetID()==childID{returnitem}}returnnil}func(p*Parent)GetNumChanges()int{lockID,err:=p.mu.RLock()iferr!=nil{panic(err)}deferp.mu.RUnlock(lockID)returnp.numChanges}func(p*Parent)IncreaseNumChanges(){lockID,err:=p.mu.RLock()iferr!=nil{panic(err)}deferp.mu.RUnlock(lockID)time.Sleep(1*time.Millisecond)p.numChanges++}func(c*Child)GetID()int{lockID,err:=c.mu.RLock()iferr!=nil{panic(err)}deferc.mu.RUnlock(lockID)time.Sleep(1*time.Millisecond)returnc.id}func(c*Child)SetValue(valuestring){lockID,err:=c.mu.Lock()iferr!=nil{panic(err)}deferc.mu.Unlock(lockID)time.Sleep(1*time.Millisecond)c.value=valuec.parent.IncreaseNumChanges()}funcmain(){varpParentiflen(os.Args)<3{println("Usage: complex <n> <t>")return}n,err:=strconv.Atoi(os.Args[1])iferr!=nil{panic(err)}t,err:=strconv.Atoi(os.Args[2])iferr!=nil{panic(err)}gofunc(){fori:=0;i<n;i++{time.Sleep(time.Duration(t)*time.Millisecond)p.Add(fmt.Sprintf("child%d",i))gofunc(idint,sleepMillisecondsint){for{time.Sleep(time.Duration(sleepMilliseconds)*time.Millisecond)child:=p.GetChildByID(id)ifchild!=nil{child.SetValue(fmt.Sprintf("child%d",id))}}}(i,t)}}()lastReadedNumChanges:=0gofunc(){for{time.Sleep(1*time.Second)lastReadedNumChanges=p.GetNumChanges()}}()for{time.Sleep(1*time.Second)println(lastReadedNumChanges)}}
Enter fullscreen modeExit fullscreen mode

It's important to note that while this example runs smoothly under normal, unstressed conditions, it can lead to a deadlock under high-stress scenarios. This behavior can be observed by adjusting the number of runner goroutines (n) and the sleep duration in milliseconds (t) in each step. Whenn is low andt is high, the program appears to function correctly. However, increasingn while decreasingt creates conditions ripe for a deadlock, demonstrating how system load and operation timing can impact the occurrence of deadlocks in complex systems.

With the complex example in mind, we'll run the code using various parameters to simulate different stress levels. By altering the number of goroutines (n) and the sleep duration (t), we aim to observe the program's behavior under varying conditions. Sometimes, theValue might increase steadily; other times, it may get stuck, raising suspicions of a deadlock. To confirm these suspicions, we'll switch to usingdelock. This will help us determine not only if a deadlock is occurring but also under which specific conditions. Once detected, we'll delve into the code to unravel the reason behind the deadlock, enhancing our understanding of these intricate scenarios.

go run ./examples/complex_std/ 50 500105395780780780780780780go run ./examples/complex_std/ 50 1002828136300496752103514071787217125522937
Enter fullscreen modeExit fullscreen mode

50 tasks with 50ms of sleep time seems to have problems while 50 tasks with 100ms of sleep time seems to work properly. So lets execute 50 tasks with 10ms sleep time using delock version.

exportDELOCK_TIMEOUT=10000&& go run ./examples/complex 50 100...0panic: Deadlock detected***** READ LOCK********goroutine 52[running]:github.com/ietxaniz/delock.(*RWMutex).RLock(0xc0000b0120)        /home/inigo/Documents/Projects/delock/rwmutex.go:159 +0x65main.(*Parent).GetChildByID(0xc0000b0120, 0x3)        /home/inigo/Documents/Projects/delock/examples/complex/main.go:42 +0x38main.main.func1.1(0x0?, 0xa)        /home/inigo/Documents/Projects/delock/examples/complex/main.go:119 +0x55created by main.main.func1ingoroutine 18        /home/inigo/Documents/Projects/delock/examples/complex/main.go:116 +0x54***** READ LOCK********goroutine 14[running]:github.com/ietxaniz/delock.(*RWMutex).RLock(0xc0000b0120)        /home/inigo/Documents/Projects/delock/rwmutex.go:159 +0x65main.(*Parent).IncreaseNumChanges(0xc0000b0120)        /home/inigo/Documents/Projects/delock/examples/complex/main.go:68 +0x2amain.(*Child).SetValue(0xc00007c300,{0xc000220020, 0x6})        /home/inigo/Documents/Projects/delock/examples/complex/main.go:95 +0xb1main.main.func1.1(0x0?, 0xa)        /home/inigo/Documents/Projects/delock/examples/complex/main.go:124 +0x15acreated by main.main.func1ingoroutine 18        /home/inigo/Documents/Projects/delock/examples/complex/main.go:116 +0x54...exitstatus 2
Enter fullscreen modeExit fullscreen mode

"I've omitted several lines from the output because numerous locks were acquired simultaneously in this scenario. However, I retained two critical outputs for analysis. The first one demonstrates typical behavior, but the second raises a red flag, particularly atexamples/complex/main.go:95. This line is our key to understanding the deadlock. Let's examine the related code segment, with line numbers for reference:"

87func(c*Child)SetValue(valuestring){88lockID,err:=c.mu.Lock()89iferr!=nil{90panic(err)91}92deferc.mu.Unlock(lockID)9394c.value=value95c.parent.IncreaseNumChanges()96}
Enter fullscreen modeExit fullscreen mode

Line 95,c.parent.IncreaseNumChanges() is called while holding a lock. This is a very common mistake. The problem is thatIncreaseNumChanges also acquires a lock. So, ifIncreaseNumChanges is called while holding a lock, it will never be able to acquire the lock and will wait forever. This is a deadlock.

So, this demonstrates how this tool can effectively help us find deadlocks in our code.

Solution to the Found Deadlock Issue

Upon discovering the deadlock, two potential solutions come to mind. The first is straightforward, while the second aligns more with my preferred approach:

Straightforward Solution

func(c*Child)SetValue(valuestring){lockID,err:=c.mu.Lock()iferr!=nil{panic(err)}c.value=valueparent:=c.parentc.mu.Unlock(lockID)parent.IncreaseNumChanges()}
Enter fullscreen modeExit fullscreen mode

This solution works by explicitly unlocking before exiting the function. However, it risks future errors if the code is modified and the lock is not properly released in all paths.

Preferred Solution

func(c*Child)setValueSafe(valuestring)*Parent{lockID,err:=c.mu.Lock()iferr!=nil{panic(err)}deferc.mu.Unlock(lockID)c.value=valuereturnc.parent}func(c*Child)SetValue(valuestring){parent:=c.setValueSafe(value)parent.IncreaseNumChanges()}
Enter fullscreen modeExit fullscreen mode

This approach achieves the same goal but retains thedefer statement for unlocking. This makes the code more robust against future changes and ensures that the lock is always properly released.

With these corrections applied, the program's output now aligns with our expectations, as evidenced in the terminal:

go run ./examples/complex_solved 100 1040484612991763220344407563106701375016838199262301226104292073230635417
Enter fullscreen modeExit fullscreen mode

Repository and Examples

Thedelock library is hosted on GitHub and can be found athttps://github.com/ietxaniz/delock. The repository includes anexamples folder, showcasing various scenarios:

  • simple andsimple_std: Demonstrating deadlock detection withdelock and the Go standard library, respectively.
  • simple2 andsimple2_std: Highlighting a scenario where Go's standard deadlock detection might fail, butdelock succeeds.
  • complex,complex_std, andcomplex_solved: Exploring more intricate deadlock situations with varying outcomes depending on the detection method used and whether the deadlock issue is resolved.

To run these examples, clone the repository and execute the desired example from the root folder using the command:go run pathToExample. For instance:go run ./examples/complex.

Open Invitation for Collaboration

I welcome any suggestions, feedback, or contributions to thedelock project. Whether it's improving existing features, proposing new ones, or enhancing documentation and examples, your input is valuable. Feel free to raise issues, submit pull requests, or reach out for discussions on potential enhancements.

Top comments(0)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

Programmer
  • Location
    Eibar
  • Joined

More fromIñigo Etxaniz

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp