Discovering goroutine leaks with Semgrep

By Alex Useche, Security Engineer
Originally published May 10, 2021

While learning to write multithreaded code in Java or C++ can make computer science students reconsider their career choices, calling a function asynchronously in Go is just a matter of prefixing the call with the go keyword. However, writing concurrent Go code is still risky, as vicious concurrency bugs can slowly sneak into your application. Before you know it, there could be thousands of hanging goroutines slowing down your application and ultimately causing it to crash. This blog post provides a Semgrep rule you can use in your own bug-hunting quests, links to a repository of specialized Semgrep rules that we use in our audits, and explains how to use one of those rules to find a particularly pesky type of bug in Go: goroutine leaks.

The technique described in this post is inspired by GCatch, a tool that uses interprocedural analysis and the Z3 solver to detect misuse-of-channel bugs that may lead to hanging goroutines. The technique and development of the tool are particularly exciting because of the lack of research on concurrency bugs caused by the incorrect use of Go-specific structures such as channels.

Although the process of setting up this sort of tool, running it, and using it in a practical context is inherently complex, it is worthwhile. When we closely analyzed confirmed bugs reported by GCatch, we noticed patterns in their origins. We were then able to use those patterns to discover alternative ways of identifying instances of these bugs. Semgrep, as we will see, is a good tool for this job, given its speed and the ability to easily tweak Semgrep rules.

Goroutine leaks explained

Perhaps the best-known concurrency bugs in Go are race conditions, which often result from improper memory aliasing when working with goroutines inside of loops. Goroutine leaks are also common concurrency bugs, but they are seldom discussed. This is partly because the consequences of a goroutine leak become apparent only after many leaked goroutines have accumulated and begun to noticeably affect performance and reliability.

Goroutine leaks typically result from the incorrect use of channels to synchronize messages passed between goroutines. They often occur when an unbuffered channel is used where a buffered channel is needed. This type of bug may cause goroutines to hang in memory and eventually exhaust a system’s resources, resulting in a system crash or a denial-of-service condition.

Let’s look at a practical example:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    requestData(1) // a 1ns timeout, which expires long before the slow server responds
    time.Sleep(time.Second * 1)
    fmt.Printf("Number of hanging goroutines: %d", runtime.NumGoroutine() - 1)
}

func requestData(timeout time.Duration) string {
    dataChan := make(chan string)

    go func() {
        newData := requestFromSlowServer()
        dataChan <- newData // blocks until another goroutine receives from dataChan
    }()
    select {
    case result := <-dataChan:
        fmt.Printf("[+] request returned: %s", result)
        return result
    case <-time.After(timeout):
        fmt.Println("[!] request timeout!")
        return ""
    }
}

func requestFromSlowServer() string {
    time.Sleep(time.Second * 1)
    return "very important data"
}

In the above code, the channel send dataChan <- newData blocks the anonymous goroutine that encloses it. That goroutine remains blocked until a receive occurs on dataChan. This is because sends and receives on unbuffered channels block, and every send must have a corresponding receive.

There are two scenarios that cause anonymous goroutine leaks:

  • If the second case, case <-time.After(timeout), occurs before the receive in the first case (case result := <-dataChan), the requestData function will exit, and the anonymous goroutine inside of it will be leaked.
  • If both cases are triggered at the same time, the runtime will select one of the two cases at random. If the second case is selected, the anonymous goroutine will be leaked.

When running the code, you’ll get the following output:

[!] request timeout!
Number of hanging goroutines: 1
Program exited.

The hanging goroutine is the anonymous goroutine started inside requestData.

Using buffered channels would fix the above issue. While reading or writing to an unbuffered channel results in a goroutine block, executing a send (a write) to a buffered channel results in a block only when the channel buffer is full. Similarly, a receive operation will cause a block only when the channel buffer is empty.
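To make the difference concrete, here is a minimal sketch (separate from the example above, and not part of the original code) that uses a non-blocking send to show when each kind of channel would block:

package main

import "fmt"

func main() {
    unbuffered := make(chan int)  // capacity 0: a send blocks until a receiver is ready
    buffered := make(chan int, 1) // capacity 1: one send can complete without a receiver

    trySend := func(name string, ch chan int) {
        select {
        case ch <- 42:
            fmt.Printf("send on %s channel completed without a receiver\n", name)
        default:
            fmt.Printf("send on %s channel would block\n", name)
        }
    }

    trySend("unbuffered", unbuffered) // prints "would block"
    trySend("buffered", buffered)     // prints "completed without a receiver"
}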

To prevent the goroutine leak, all we need to do is give dataChan a capacity of 1 when creating it, which gives us the following:

func requestData(timeout time.Duration) string {
    dataChan := make(chan string, 1)

    go func() {
        newData := requestFromSlowServer()
        dataChan <- newData // no longer blocks: the buffer has room for one value
    }()

After running the updated program, we can confirm that there are no more hanging goroutines.

[!] request timeout!
Number of hanging goroutines: 0
Program exited.

This bug may seem minor, but in a long-running program, leaked goroutines accumulate and can have serious consequences. For a real-world example, see this PR in the Kubernetes repository: the author of the patch experienced an API server crash caused by a goroutine leak while the server was running 1,496 goroutines.

Finding the bug

The process of debugging concurrency issues is so complex that a tool like Semgrep may seem ill-equipped for it. However, when we closely examined common Go concurrency bugs found in the wild, we identified patterns that we could easily leverage to create Semgrep rules. Those rules enabled us to find even complex bugs of this kind, largely because Go concurrency bugs can often be described by a few sets of simple patterns.

Before using Semgrep, it is important to recognize the limitations on the types of issues that it can solve. When searching for concurrency bugs, the most significant limitation is Semgrep’s inability to conduct interprocedural analysis. This means that we’ll need to target bugs that are contained within individual functions. This is a manageable problem when working in Go and won’t prevent us from using Semgrep, since Go programmers often rely on anonymous goroutines defined within individual functions.

Now we can begin to construct our Semgrep rule, basing it on the following typical manifestation of a goroutine leak:

  1. An unbuffered channel, C, of type T is declared.
  2. A write/send operation to channel C is executed in an anonymous goroutine, G.
  3. C is read/received in a select block (or another location outside of G).
  4. The program follows an execution path in which the read operation of C does not occur before the enclosing function is terminated.

It is the last step that generally causes a goroutine leak.

Bugs that result from the above conditions tend to follow recognizable patterns in the code, which we can detect using Semgrep. Regardless of the form that these patterns take, there will be an unbuffered channel declared somewhere in the program, which we’ll want to analyze:

- pattern-inside: |
       $CHANNEL := make(...)
       ...

We’ll also need to exclude instances in which the channel is declared as a buffered channel:

- pattern-not-inside: |
       $CHANNEL := make(..., $T)
       ...

To detect the goroutine leak from our example, we can use the following pattern:

- pattern: |
         go func(...){
           ...
           $CHANNEL <- $X
           ...
         }(...)
         ...
         select {
         case ...
         case $Y := <- $CHANNEL:
         ...
         }

This pattern tells Semgrep to look for a send to the unbuffered channel $CHANNEL executed inside an anonymous goroutine, followed by a receive from that channel inside a select block. Slight variations of this pattern may occur, so we will need to account for them in our Semgrep rule.

For instance, the receive expression could use the assignment (=) operator rather than the declaration (:=) operator, which would require a new pattern. We won't go over every possible variation here, but you can skip ahead and view the completed rule if you’d like. The finished rule also includes cases in which there could be more send operations than receive operations for an unbuffered channel, $CHANNEL.
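To make that first variation concrete, here is a hypothetical rewrite of requestData (reusing requestFromSlowServer from the earlier example); because the select case assigns into an existing variable, the := pattern above would not match it, even though the leak is the same:

// Hypothetical variant: the receive assigns into an existing variable with =
// instead of declaring a new one with :=, so a separate pattern is needed.
func requestDataAssign(timeout time.Duration) string {
    dataChan := make(chan string)

    go func() {
        dataChan <- requestFromSlowServer() // can still leave this goroutine hanging
    }()

    var result string
    select {
    case result = <-dataChan: // assignment (=) rather than declaration (:=)
        fmt.Printf("[+] request returned: %s", result)
    case <-time.After(timeout):
        fmt.Println("[!] request timeout!")
    }
    return result
}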

We should also exclude false positives, such as this one from the Moby repository, in which the function receives from the blocking channel before returning, so the sending goroutine is never left hanging:

- pattern-not-inside: |
       ...
       select {
       case ...
       case ...:
         ...
         ... =<- $CHANNEL
         ...
       }
- pattern-not-inside: |
       ...
       select {
       case ...
       case ...:
         ...
         <-$CHANNEL
         ...
       }
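For illustration, here is a minimal, hypothetical sketch (not the actual Moby code) of the shape these clauses exclude: the timeout case receives from the channel before returning, so the sending goroutine can always finish:

// Hypothetical sketch of the excluded false-positive shape: the timeout case
// drains dataChan before returning, so the sender is never left blocked.
func requestDataDrained(timeout time.Duration) string {
    dataChan := make(chan string)

    go func() {
        dataChan <- requestFromSlowServer()
    }()

    select {
    case result := <-dataChan:
        return result
    case <-time.After(timeout):
        fmt.Println("[!] request timeout!")
        <-dataChan // receive the pending send so the goroutine can exit
        return ""
    }
}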

Once we have completed our pattern, we can run it against the code. (Try it out using the Semgrep playground.) Running the pattern from the command line returns the following output:

$ semgrep --config ./hanging-goroutine.yml

running 1 rules…
test.go
severity:warning rule:semgrep.go.hanging-goroutine: Potential goroutine leak due to unbuffered channel send inside loop or unbuffered channel read in select block.

18: go func() {
19: newData := requestFromSlowServer()
20: dataChan <- newData // block
21: }()
22: select {
23: case result := <-dataChan:
24: fmt.Printf("[+] request returned: %s", result)
25: return result
26: case <-time.After(timeout):
27: fmt.Println("[!] request timeout!")
28: return ""

We ran this pattern against an unpatched release of the Docker codebase and compared the matches with those reported by GCatch and documented on its repository. Our Semgrep rule missed only 5 of the goroutine leak bugs found by GCatch and reported to the Docker team via PRs. We also used this rule to find bugs in the Kubernetes and Minikube repositories, in amazon-ecs-agent (the Amazon Elastic Container Service agent), and in two open-source Microsoft projects (azure-container-networking and hcsshim), and we submitted patches for them as PRs.

GCatch uses a technique that is smarter and more sophisticated than the one above. As a result, it can analyze multiple execution paths to find more complex instances of this bug. However, there are advantages to using Semgrep instead of a complex tool:

  • Semgrep can analyze code more quickly, because it focuses on discovering pattern matches rather than conducting taint and data flow analysis.
  • Semgrep rules are very easy to understand and update.
  • The setup is more straightforward and reliable.

Of course, the drawback is that we miss complex issues, such as cases in which the send operation occurs inside a separate named function (as opposed to an anonymous function). However, Semgrep has experimental support for data flow analysis, taint tracking, and basic cross-function analysis, and we look forward to testing and developing more complicated rules as support for those features continues to mature.
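As a sketch of such a missed case (hypothetical code, following the same shape as the earlier example), consider a version in which the send happens in a separate named function; the rule only matches sends inside anonymous goroutines, so it would not flag this leak:

// Hypothetical missed case: the send happens in a named function, so matching
// it would require the cross-function analysis that our rule does not perform.
func sendResult(ch chan string) {
    ch <- requestFromSlowServer() // blocks forever if nobody receives
}

func requestDataIndirect(timeout time.Duration) string {
    dataChan := make(chan string)
    go sendResult(dataChan)

    select {
    case result := <-dataChan:
        return result
    case <-time.After(timeout):
        return "" // sendResult's goroutine is leaked, but the pattern cannot see it
    }
}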

Finding other types of concurrency bugs

Our new semgrep-rules repository contains the Semgrep rule for the above bug, as well as other rules we developed and use in our code audits to find Go concurrency bugs. These include Semgrep rules used to catch race conditions in anonymous goroutines and unsafe uses of sleep functions for synchronization.

Like the goroutine leak we examined, these types of bugs manifest in repeatable patterns, making them detectable by Semgrep. Keep an eye on the repository, as we will be adding and refining Semgrep rules for Go and other languages. We will also continue researching ways to debug concurrency bugs in Go and look forward to sharing more of our findings in the future.
