eG Monitoring
 

Measures reported by KuberGarbageTest

The Kubernetes project is written in the Go programming language (also known as Golang). Go is a statically typed, compiled programming language designed at Google. Go is syntactically similar to C, but with memory safety, garbage collection, structural typing, and communicating sequential processes (CSP)-style concurrency.

Garbage collectors have the responsibility of tracking heap memory allocations, freeing up allocations that are no longer needed, and keeping allocations that are still in-use. The Go programming language uses a non-generational concurrent tri-color mark and sweep collector.

When a collection starts, the collector runs through four phases of work:

  • Mark Setup

  • Marking

  • Mark Termination

  • Sweeping

The Mark Setup phase is where the Write Barrier is turned on. The purpose of the Write Barrier is to allow the collector to maintain data integrity on the heap during a collection since both the collector and application goroutines will be running concurrently. In order to turn the Write Barrier on, every application goroutine running must be stopped. The only way to do that is for the collector to watch and wait for each goroutine to make a function call. Function calls guarantee the goroutines are at a safe point to be stopped.

Once the Write Barrier is turned on, the collector commences with the Marking phase. The Marking phase consists of marking values in heap memory that are still in-use. This work starts by inspecting the stacks for all existing goroutines to find root pointers to heap memory. Then the collector must traverse the heap memory graph from those root pointers. The first thing the collector does in this phase is take 25% of the available CPU capacity for itself. For example, if an application is using 4 CPUs, the collector will dedicate one entire CPU to itself for the duration of this phase. Typically, the collector performs the marking work using only this reserved 25% of CPU capacity, allowing normal application work to continue on the remaining 75%.

Once the Marking work is done, the next phase is Mark Termination. This is when the Write Barrier is turned off, various clean up tasks are performed, and the next collection goal is calculated.

Once the collection is finished, the full CPU capacity is released for the use of the application Goroutines again, thus bringing the application back to full throttle.

Sweeping typically happens after the collection is finished. Sweeping is when the memory associated with heap values that were not marked as in-use is reclaimed. This activity occurs when application Goroutines attempt to allocate new values in heap memory.

In summary, by performing garbage collection, Golang ensures that applications make optimal use of available heap memory. While this improves application performance at one end, at the other, every collection also inflicts certain latencies on the running application that may slow down application work. For instance, at the Mark Setup phase, the garbage collector stops all application Goroutines so that it can turn on the Write Barrier. This imposes a Stop the World (STW) latency on the running application. Likewise, the application Goroutines are stopped at the Mark Termination phase as well, once again inflicting an STW latency on the application.

Also, garbage collection sometimes steals CPU capacity from the application in order to keep up, degrading application performance in the bargain. For instance, in the Marking phase, if the Goroutines dedicated to the collector are unable to finish the marking work before the heap memory in-use reaches its limit, the collector will recruit the application Goroutines to assist with the Marking work. This is called a Mark Assist. When this happens, the application is forced to compete with the collector for the available CPU resources. This contention can occasionally choke application performance!

To optimize garbage collection and eliminate its ill effects, administrators must ensure that the collector does more work, while consuming minimum time and resources. For this purpose, administrators must first study the garbage collection activity closely, and figure out how much time and resources the collector typically invests in this process. This is where the Kube Garbage Collection test helps!

This test monitors the garbage collection activity of Golang, and reports the time the Golang collector spends collecting garbage. Administrators will be alerted if too much time is being spent in garbage collection. The test also reveals the number of threads and Goroutines presently engaged in garbage collection, thus revealing how resource-intensive the garbage collection is. This way, the test enables administrators to periodically review the garbage collection activity, assess its impact on application performance, and figure out if it needs to be fine-tuned to reduce application latencies.

Outputs of the test: One set of results for the Kubernetes cluster being monitored.

The measures made by this test are as follows:

Measurement: Gc_duration

Description: Indicates the average time spent in garbage collection.

Measurement Unit: Seconds

Interpretation: A low value is desired for this measure. A very high value or a consistent increase in the value of this measure is a cause for concern, as it indicates that the garbage collector is probably taking too long to complete collections.

Since garbage collection often triggers stop-the-world latencies in applications, prolonged garbage collection activities can adversely impact application availability and performance. In short, the longer GC runs, the poorer application performance will be.

One way to reduce GC time is to fine-tune the configuration option called GC Percentage at runtime. This is set to 100 by default. This value represents a ratio of how much new heap memory can be allocated before the next collection has to start. Setting the GC Percentage to 100 means that, based on the amount of heap memory marked as live after a collection finishes, the next collection has to start at or before 100% more new allocations are added to heap memory. You could decide to change the GC Percentage value to something larger than 100. This will increase the amount of heap memory that has to be allocated before the next collection can start, thus delaying the start of the next collection.

On the flip side though, increasing the GC percentage will slow down the pace of the collector. The collector has a pacing algorithm which is used to determine when a collection is to start. The algorithm depends on a feedback loop that the collector uses to gather information about the running application and the stress the application is putting on the heap. Stress can be defined as how fast the application is allocating heap memory within a given amount of time. It’s that stress that determines the pace at which the collector needs to run.

One misconception is thinking that slowing down the pace of the collector is a way to improve performance. In reality though, application performance truly improves only when more work is getting done between collections or during a collection. This can be achieved only by reducing the amount or the number of allocations any piece of work is adding to heap memory.

Increasing the GC percentage, in fact, increases the workload of collections by adding more to the heap memory after every collection. In the long run, this may degrade application performance rather than improve it.

Measurement: Thread_count

Description: Indicates the number of threads spawned by the garbage collection process.

Measurement Unit: Number

Interpretation: A large value for this measure is indicative of resource-intensive garbage collections.
Measurement: Go_routines

Description: Indicates the number of Goroutines used for garbage collection.

Measurement Unit: Number

Interpretation: An unusually high value for this measure could indicate that the garbage collector is probably recruiting application Goroutines as well to do the Marking work on the collections. This in turn could be because of Marking workloads that the collector is unable to complete using just its dedicated Goroutines. Such workloads are usually imposed by applications that consume heap memory significantly.