This is Part 3 of the series "The Wonderland of Dynamic Tracing" which consists of 7 parts. This series will consist of updates on the state of art of the dynamic tracing world.
The previous one, Part 2, introduced DTrace and SystemTap, two famous dynamic tracing frameworks. This part will continue looking at real world applications of SystemTap.
The huge impact of DTrace today wouldn’t be possible without the contributions of the renowned leading expert on DTrace, Brendan Gregg. I already mentioned him in Part 2 before. He previously worked on the Solaris file system in Sun Microsystems, being one of the earliest users of DTrace. He authored several books on DTrace and systems performance optimization, as well as many high quality blog posts concerning dynamic tracing in general.
After leaving Taobao in 2011, I went to Fuzhou and led an “idyllic life” there for a whole year. During the last few months there, I dived into Brendan’s public blog and obsessively studied DTrace and dynamic tracing. Before that, I had never heard of DTrace until one of my Sina Weibo followers mentioned it very briefly in a comment. I was immediately intrigued and did my own research to learn more about it. Well, I would never have imagined that my exploration would lead me to a totally new world and completely change my views about the entire computing world. So I devoted a lot of time to thoroughly reading each personal blog of Brendan’s. Ultimately, my efforts paid off. Fully enlightened, I felt I could finally take the subtleties of dynamic tracing in.
Then in 2012, my “idyllic life” in Fuzhou came to an end and I left for the US to join a CDN service provider and network security company. I immediately started to apply SystemTap and the whole set of dynamic tracing methods I had acquired to the company’s global network, to solve those very weird, very strange online problems. I found that my colleagues at the time would always add additional event tracking code into the software system on their own when troubleshooting online problems. They did so by directly editing the source code and adding various counters, or event tracking code to emit log data, primarily to the applications’ Lua code, and sometimes even to the code base of systems software like Nginx. In this way, a large number of logs would be collected online in real time, before entering the special database and going through offline analysis. However, their practice clearly brought colossal costs. This not only raised the cost related to hacking and maintaining the business system sharply, but also the online costs resulting from full-volume data collection and storage of enormous amounts of log data. Moreover, the following situation is not uncommon: Engineer A adds an event tracking code in the business code and Engineer B does the same later. However, they may end up being forgotten and left in the code base, without being noticed again. The final result would only be that these endlessly increasing events mess up the code base. And the invasive revisions would make corresponding software, whether system software or business code, more and more difficult to maintain.
Two serious problems exist in the way metrics and event tracking code is done. The first one is “too many” event tracking counters and logging statements are added. Out of a desire to cover all, we tend to gather some totally useless information, leading to unnecessary collection and storage costs. Even if sampling is already enough to analyze problems in many cases, the habitual response is still carrying out whole-network and full-volume data collection, which is clearly very expensive in the long run. The second problem is when “too few” counters and logging are added. It is often very difficult to plan all necessary information collection points in the first place, as no one can predict future problems needing troubleshooting. Consequently, whenever a new problem emerges, existing information collected is almost always insufficient. What follows is to revise the software system and conduct online operations frequently, causing much heavier workload to development and maintenance engineers, and higher risk of more severe online incidents.
Another brute force debugging method some maintenance engineers often use is to drop the servers offline, and then set a series of temporary firewall rules to block or screen user traffic or their own monitoring traffic, before fiddling with the production machine. This cumbersome process has a huge impact; firstly, as the machine is unable to continue its services, the overall throughput of the entire online system is impaired; secondly, problems which are reproduced only when real traffic exists will no longer occur. You can imagine how frustrating it will be.
Fortunately, SystemTap dynamic tracing offers an ideal solution to such problems while avoiding those headaches. You don’t have to change the software stack itself, be it systems software or business-level applications. I often write some dedicated tools that place dynamic probes on the “key spots” of the relevant code paths. These probes collect information separately, which will be combined and transmitted by the debugging tools to the terminal. This way of doing things enables me to quickly get key information I need through sampling on one or more machines, and obtain quick answers to some very basic questions to navigate subsequent (deeper) debugging work.
We earlier mentioned manually adding metrics & event tracking/logging code into the production systems to record logs and putting them in a database. The manual work is a far inferior strategy compared to seeing the whole production system as a directly accessible “database” from which we can obtain the needed information in a safe and quick manner, without leaving any trace. Following this train of thought, I wrote a number of debugging tools, most being open-sourced on GitHub. Many of these tools were targeted at systems software such as Nginx, LuaJIT and the operating system kernel, and some focused on higher-level Web frameworks like OpenResty. GitHub offers access to the following code repositories: nginx-systemtap-toolkit, perl-systemtap-toolkit and stappxx.
These tools helped me identify a lot of online problems, some even unexpectedly. We will walk through five examples below.
The first example is an accidental discovery when I analyzed the online Nginx process using the SystemTap-based Flame Graph. I noticed a big portion of CPU time was spent on a very strange code path. The code path turned out to be some temporary debugging statements left by one of my former colleagues when debugging an ancient problem. It’s like the “event tracking code” mentioned above. Although the problem had long been fixed, the debugging statements were left there forgotten, both online and in the company’s code repository. The existence of that piece of code came at a high price —— ongoing performance overhead was going unnoticed. The approach I used was sampling so that the tool can automatically draw a Flame Graph (the specifics of which we will cover in detail in Part 4.) From the graph, I can understand the problem and take measures accordingly. This process is much more efficient.
Long delays may be seen only in a very small portion of all the online requests, or “request latency outliers”. Though small in numbers, they may have delays on the level of whole seconds. I used to run into such things a lot. For example, one former colleague just took a wild guess that my OpenResty had a bug. Unconvinced, I immediately wrote a SystemTap tool for online sampling to analyze those requests delayed by over one second. The tool can directly test the internal time distribution of problematic requests, including delay of each typical I/O operation and pure CPU computing delay in the course of request handling. It soon found the delay appeared when OpenResty accessed the DNS server written in Go. Then the tool output detailed about those long-tailed DNS queries, which were all related to CNAME. Well, mystery solved! The delay had nothing to do with OpenResty. And the finding paved the way for further investigation and optimization.
Our third example is very interesting. It’s about shifting from network problems to hard disk issues in debugging. My former colleagues and I once noticed machines in a computer room showed a higher ratio of network timeout errors than the other colocations or data centers, albeit at a mere 1 percent. At first, we naturally paid attention to the network protocol stack. However, a series of dedicated SystemTap tools focusing directly on those outlier requests later led me to a hard disk configuration issue. First-hand data steered us to the correct track very quickly. A presumed network issue turned out to be a hard disk problem.
The fourth example turns to the Flame Graphs again. In the CPU flame graph for online nginx processes, we observed a phenomenon in the Nginx process: file opening and closing operations took a significant portion of the total CPU time. Our natural response was to initiate the file-handle caching of Nginx itself, without yielding any noticeable optimization results. With a new flame graph sampled, however, we found the “spin lock” used in the cache meta data now took a lot of CPU time. All became clear via the flame graph. Although we initiated the caching, it had been set at so large a size that its benefits were voided by the overhead of the meta data spin lock. Imagine that if we had no flame graphs and just performed black-box benchmarks, we would have reached the wrong conclusion that the file-handle cache of Nginx was useless instead of tuning cache parameters.
Now comes our last example for this section. After one online release operation, I remembered, the latest online flame graph revealed that the operation of compiling regular expressions consumed a lot of CPU time, but the caching of the compiled regular expression had already been enabled online. Apparently the number of regular expressions used in our business system had exceeded the maximum cache size. Accordingly, the next thing that came to my mind was simply to increase the cache size for online regular expressions. As expected, the bottleneck then immediately disappeared from our online flame graphs after the cache size growth.
These examples demonstrate that new problems will always occur and vary depending on the data centers, the servers, and even the time period of the day on the same machine. Whatever the problem, the solution is to analyze the root cause of the problem directly and take online samples from the first scene of events, instead of jumping into trials and errors with wild guesses. With the help of powerful observability tools, troubleshooting can actually yield much more with much less effort.
After we founded OpenResty Inc., we developed OpenResty XRay as a brand new dynamic tracing platform, putting an end to manual uses of open-source solutions like SystemTap.
This part of the series covered various use cases of one of the most mature dynamic tracing frameworks, SystemTap. In the next part, Part 4, we will talk about a very powerful visualization method to analyze resource usage across all software code paths, Flame Graphs. Stay tuned!
OpenResty XRay is a commercial dynamic tracing product offered by our OpenResty Inc. company. We use this product in our articles like this one to demonstrate implementation details, as well as provide statistics about real world applications and open source software. In general, OpenResty XRay can help users to gain deep insight into their online and offline software systems without any modifications or any other collaborations, and efficiently troubleshoot difficult problems for performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.
We welcome you to contact us to try out this product for free.
Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He contributed a dozen open source Nginx 3rd-party modules, many Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.
We provide a Chinese translation for this article on blog.openresty.com.cn We also welcome interested readers to contribute translations in other languages as long as the full article is translated without any omissions. We thank anyone willing to do so in advance.