In the demanding landscape of high-concurrency web services, developers and operations teams frequently grapple with a perplexing challenge: their applications’ real-world performance consistently falls short of expectations. Consider this all-too-common scenario: a single Nginx worker process struggles to handle even 300 requests per second (RPS), yet the server’s system load is already approaching the number of logical CPU cores. Delving deeper, you might uncover severe, multi-millisecond blockages within Nginx’s event loop. These are the “invisible performance killers,” silently eroding your system’s overall throughput and efficiency.

Traditional monitoring tools only surface symptoms and rarely pinpoint code-level blockages. OpenResty XRay, by contrast, offers a powerful, non-intrusive dynamic tracing solution purpose-built for analyzing and resolving complex off-CPU issues, as this real-world case study demonstrates.

Unraveling Performance Puzzles: CPU Resource Contention Comes to Light

Our initial step in the analysis involved using OpenResty XRay’s C-level off-CPU flame graph to investigate process wait events.

Screenshot

Our analysis revealed that, beyond the expected epoll_wait network I/O waits, a significant portion of off-CPU time was attributed to functions that perform no I/O at all, such as the pure-computation routine mpi_mul_hlp and the memory-management function free.

This clearly indicated that the process was ready to run but failed to acquire CPU time slices promptly, a classic symptom of CPU resource contention.

Utilizing a specialized analyzer, we pinpointed the root cause of the issue: the worker_cpu_affinity directive was missing from the Nginx configuration file.

no cpu affinity set.
no cpu affinity set.

The absence of this configuration meant that the Linux kernel frequently migrated the nginx worker processes across different CPU cores. This caused unnecessary context-switching overhead, ultimately reducing effective CPU utilization.
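As a sketch of the fix (the four-worker layout matches the four worker pids seen later in this case; the exact masks are illustrative), pinning each worker to its own core in nginx.conf looks like this:

```nginx
# nginx.conf -- pin each of the 4 worker processes to its own logical core
worker_processes  4;

# Explicit bitmasks: worker N may only run on CPU N
worker_cpu_affinity 0001 0010 0100 1000;

# On Nginx 1.9.10+ you can instead let Nginx compute the masks itself:
# worker_cpu_affinity auto;
```

With affinity set, the kernel stops migrating workers between cores, removing the context-switching overhead observed above.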

The Real Culprit Revealed: Blocking Lua I/O Operations

Having addressed the CPU contention issue, we continued our analysis of off-CPU time.

Screenshot

Upon further analysis, we discovered that the vast majority of off-CPU blocking time traced back to a single source: line 15 of a user-defined file, customize.lua.

This particular line of code invokes a function named file_list. Inside this function, we found calls to io.popen and read from the Lua standard library: io.popen spawns a shell to execute an external command, and read blocks while collecting its output.

Crucially, these are synchronous, blocking I/O operations: they halt the entire Nginx event loop until the external command finishes and returns its results. This, right here, is a core reason behind the significantly reduced system throughput.
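As an illustration (the body of file_list and its shell command are hypothetical reconstructions, since the user's source is not shown here), the blocking pattern looks like the first function below, while OpenResty's ngx.pipe API (available since OpenResty 1.15.8.1) runs the same command cooperatively:

```lua
-- Hypothetical sketch of the blocking pattern found in customize.lua.
-- io.popen + read() block the whole Nginx worker until the shell
-- command finishes, freezing every request served by that worker.
local function file_list(dir)
    local fh = io.popen("ls " .. dir)   -- synchronous: spawns a shell
    local out = fh:read("*a")           -- blocks until EOF
    fh:close()
    return out
end

-- Non-blocking alternative using OpenResty's ngx.pipe (lua-resty-core):
-- the current request yields while the command runs, so the worker
-- keeps serving other requests.
local ngx_pipe = require "ngx.pipe"

local function file_list_nonblocking(dir)
    local proc, err = ngx_pipe.spawn({"ls", dir})
    if not proc then
        return nil, err
    end
    local out = proc:stdout_read_all()  -- yields instead of blocking
    proc:wait()
    return out
end
```

The key design difference is that ngx.pipe integrates with the Nginx event loop via Lua coroutines, so waiting on the subprocess costs the worker nothing.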

Quantitative Analysis: The Impact of File I/O Performance

Understanding Blocking File I/O Operations

Sample One

Beyond io.popen, our analysis using flame graphs that visualize C-level virtual file system read/write counts revealed two additional potential performance bottlenecks:

Screenshot

Looking at the flame graph above, it’s clear that the apr_generate_random_bytes function alone consumed a significant 59.8% of file read operations.

Screenshot

Meanwhile, the second flame graph highlights that the apr_sdbm_fetch function was responsible for 24% of file read operations.

Sample Two

Screenshot

This graph clearly illustrates that the Apache Portable Runtime (APR) function apr_generate_random_bytes alone accounted for 51.8% of file read operations.

Screenshot

And the second graph further reveals that the apr_sdbm_fetch function accounted for 20.7% of file read operations.

Performance Assessment

To precisely evaluate the impact of these file I/O operations, we utilized specialized latency tools to meticulously measure the latency distribution of apr_generate_random_bytes:

3110 samples' latency: min=10, avg=17, max=1494 (us)
value |-------------------------------------------------- count
    2 |                                                      0
    4 |                                                      0
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1540
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1499
   32 |@@                                                   64
   64 |                                                      4
  128 |                                                      1
  256 |                                                      1
  512 |                                                      0
 1024 |                                                      1
 2048 |                                                      0
 4096 |                                                      0

We observed that this function can reach a maximum latency of 1494 microseconds, or nearly 1.5 milliseconds, which is quite substantial. This can block the Nginx event loop, affecting the request latency for all current concurrent connections.

Similarly, for apr_sdbm_fetch, we’ve observed latencies occasionally approaching 1 millisecond:

1570 samples' latency: min=5, avg=10, max=953 (us)
value |-------------------------------------------------- count
    1 |                                                      0
    2 |                                                      0
    4 |@@@@@@@@@@@@@@@@@                                   383
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1080
   16 |@@@@                                                100
   32 |                                                      4
   64 |                                                      2
  128 |                                                      0
  256 |                                                      0
  512 |                                                      1
 1024 |                                                      0
 2048 |                                                      0

For Nginx, a platform renowned for its high concurrency and low latency capabilities, even a single millisecond of synchronous blocking is a critical concern that simply cannot be overlooked.

Deep Dive: Understanding Event Loop Blocking in Nginx

To wrap things up, we conducted a thorough evaluation of the overall blocking behavior within the Nginx event loop.

Over a 20-second sampling period, we captured a staggering 43,952 blocking samples. What’s truly striking is that the maximum single blocking duration soared to 75,165 microseconds – that’s over 75 milliseconds.

distribution of epoll loop blocking latencies (us): 43952 samples: min/avg/max: 0/754/75165
 value |-------------------------------------------------- count
     0 |                                                      31
     1 |                                                     149
     2 |                                                      33
     4 |                                                     166
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        6094
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  10839
    32 |@@@@@@@@@@@@@@@                                     3278
    64 |@@@@@@@@@                                           2060
   128 |@@@@@@@@@@@                                         2530
   256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     10082
   512 |@@@@@@@@@@@                                         2544
  1024 |@@@@@                                               1162
  2048 |@@@@@@@@@@                                          2343
  4096 |@@@@@@@@@@                                          2377
  8192 |                                                     199
 16384 |                                                      55
 32768 |                                                       8
 65536 |                                                       2
131072 |                                                       0
262144 |                                                       0

This data clearly illuminates why the application’s performance was so severely impacted. When the event loop is blocked for 75 milliseconds, it effectively means the worker process is completely incapacitated, unable to handle any new requests during that critical period.

Let’s put this into perspective by looking at a specific sample analysis:

found 3237 in reqs and 3238 done reqs in 3.103 sec (1043.07 r/s and 1043.39 r/s).
in reqs:
  pid 3101: 255.21 r/s
  pid 3102: 285.82 r/s
  pid 3103: 227.82 r/s
  pid 3106: 274.22 r/s
done reqs:
  pid 3102: 286.47 r/s
  pid 3103: 227.50 r/s
  pid 3101: 254.89 r/s
  pid 3106: 274.54 r/s

The requests per second (RPS) for a single worker process hovered between a mere 227 and 286, well under 300. Meanwhile, the machine’s load was already nearing 4, matching its number of logical CPU cores. This observation aligns perfectly with the severe blocking we identified.

Summary

In our in-depth analysis, we uncovered several critical issues significantly impacting system performance:

  • CPU resource contention among nginx worker processes led to extensive context switching, severely hindering effective CPU utilization.
  • Blocking Lua I/O operations, such as io.popen and read, were frequently invoked within the customize.lua file, becoming the primary bottleneck for system performance.
  • File I/O operations from functions like apr_generate_random_bytes and apr_sdbm_fetch also caused significant blocking effects on the event loop.
  • A single Nginx event loop blockage lasted up to 75 milliseconds, directly resulting in a mere 227-286 RPS (Requests Per Second) per worker process, far below expectations.

These issues were precisely the root cause of the system’s underperformance. Through OpenResty XRay’s precise diagnostics, we not only identified these bottlenecks but also provided clear directions for subsequent optimization.

Say Goodbye to Nginx Performance Nightmares with OpenResty XRay

OpenResty XRay is your ultimate weapon for solving complex performance challenges. With its unparalleled dynamic tracing capabilities, we don’t just pinpoint performance bottlenecks; we deliver profound value to your enterprise:

  • Breakthrough Diagnostic Efficiency: Bid farewell to lengthy, traditional troubleshooting processes. OpenResty XRay slashes problem identification time from days to mere hours, ensuring your team is always a step ahead.
  • Ultimate Hardware Resource Optimization: By eliminating performance bottlenecks, we can boost single-machine processing capacity severalfold, empowering you to handle larger business loads with fewer server resources.
  • Seamless Business Continuity Assurance: Proactively identify and eliminate potential performance risks, ensuring stable system operation and preventing service interruptions caused by overload.
  • Comprehensive Technical Debt Cleanup: Systematically uncover and resolve legacy performance issues, significantly enhancing code quality and the overall health of your system architecture.

In today’s digital economy, where success is often measured in milliseconds, application performance directly impacts user experience and business success. OpenResty XRay, with its advanced dynamic tracing technology, completely revolutionizes traditional performance optimization methods, providing a solid, scientific foundation for your technical decisions.

If your applications are grappling with performance challenges, OpenResty XRay will provide you with unparalleled precise insights, transforming performance into a powerful engine for your business growth.

What is OpenResty XRay

OpenResty XRay is a dynamic-tracing product that automatically analyzes your running applications to troubleshoot performance problems, behavioral issues, and security vulnerabilities with actionable suggestions. Under the hood, OpenResty XRay is powered by our Y language targeting various runtimes like Stap+, eBPF+, GDB, and ODB, depending on the contexts.

If you like this tutorial, please subscribe to this blog site and/or our YouTube channel. Thank you!

About The Author

Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.

Yichun is one of the earliest advocates and leaders of “open-source technology”. He has worked at internationally renowned tech companies such as Cloudflare and Yahoo!. He is a pioneer of “edge computing”, “dynamic tracing”, and “machine coding”, with over 22 years of programming and 16 years of open-source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, adopted by more than 40 million global website domains.

OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. And its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.

As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, Perl, etc. He has also authored more than 60 open-source software libraries.