Every engineer’s nightmare: all production servers simultaneously pegged at 100% CPU, handling zero requests. Rebooting doesn’t help. Rolling back doesn’t help. The system is alive but completely paralyzed.

This is what Bilibili’s engineering team faced in July 2021. Bilibili — often described as China’s YouTube, with over 267 million monthly active users at the time of the incident — had built its internal gateway system on top of the open-source OpenResty framework. When the incident struck, every OpenResty process across all online servers entered the same state simultaneously: full CPU utilization, no request processing.

After exhausting all internal resources, Bilibili’s team reached out to OpenResty Inc. for assistance.

What followed took only minutes to diagnose — and the root cause, once found, was the kind of thing that makes experienced engineers pause: a single string-type zero value ("0") passed where a numeric weight was expected, triggering infinite recursion across the entire system.

This post explains how OpenResty XRay — a non-invasive dynamic tracing tool — identified the exact root cause without modifying a single line of code or injecting anything into Bilibili’s production processes.

Bilibili: The Platform Behind Hundreds of Millions of Streams

Bilibili is a Shanghai-based video-sharing platform and one of the most popular streaming services in Asia. Originally established as the leading anime streaming destination in China, it has since expanded into a broad content platform comparable in scale and cultural significance to YouTube in its home market.

Bilibili is a commercial customer of OpenResty XRay. Its engineering team had built the platform’s internal gateway system on the open-source OpenResty development framework — a decision that would become directly relevant to how the incident was ultimately diagnosed.

When Rebooting Doesn’t Help

On July 13, 2021, Bilibili’s production environment entered a state that standard incident response playbooks are not designed for.

The OpenResty processes on all online servers simultaneously reached 100% CPU utilization — and stayed there. No requests were being processed. The system was not crashed in the traditional sense; it was running at full capacity while doing nothing useful.

CPU All 100%

The engineering team executed the standard responses: rebooting servers, rolling back recent changes. Neither worked. Every server returned to the same state. With internal resources exhausted and the incident ongoing in production, Bilibili’s team contacted OpenResty Inc. for assistance.

A subsequent post-mortem published by Bilibili, titled “2021.07.13 This is how we crashed”, generated significant discussion and raised questions about how exactly the incident was diagnosed and resolved. This post addresses those questions directly and introduces the tools involved to a broader technical audience.

Finding the Root Cause Without Touching a Single Line of Code

Bilibili’s team gave OpenResty Inc. access to the affected production environment — and OpenResty XRay, a non-invasive dynamic tracing tool, got to work.

No code changes. No instrumentation. No injection into Bilibili’s processes. No special compilation options or plug-ins required.

Step 1: C-level CPU flame graph

Using a C-level CPU flame graph sampled automatically by OpenResty XRay, the OpenResty team quickly identified that Bilibili’s OpenResty Nginx process was spending almost all of its CPU time executing Lua code.

C-Land CPU Flame Graph for Bilibili

For privacy and data security reasons, this graph shows only function frames from OpenResty’s open-source software. Bilibili’s Lua code and related information are excluded.

Step 2: Lua-level CPU flame graph

The team then used a Lua-level CPU flame graph, also sampled automatically by OpenResty XRay, to confirm that almost all CPU time was concentrated on a single Lua code path, one that was executing in an infinite loop. OpenResty XRay’s Lua CPU flame graph operates at the granularity of individual Lua source code lines.

Lua-Land CPU Flame Graph for Bilibili

Again, only function frames from non-customer-related open-source code are shown.

The Lua flame graph referenced in Bilibili’s original post-mortem article was obtained by sampling the problematic OpenResty service process on Bilibili’s production server using OpenResty XRay.

It took OpenResty XRay a few minutes to generate both flame graphs. The graphs indicated the exact code paths containing the root cause. The entire troubleshooting process required no code modifications or injection into the system process or application.
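
For readers unfamiliar with flame graphs, the core idea is simple: call stacks are sampled at intervals, and identical stacks are counted, so the widest stack is where the CPU time goes. The Python sketch below illustrates only that counting step. It is not how OpenResty XRay is implemented, and the frame names and sample counts are invented for illustration.

```python
from collections import Counter

def fold_stacks(samples):
    """Count identical call stacks, keyed in 'root;...;leaf' folded form."""
    return Counter(";".join(stack) for stack in samples)

def hottest_stack(samples):
    """Return the most frequently sampled stack and its share of all samples."""
    folded = fold_stacks(samples)
    stack, count = folded.most_common(1)[0]
    return stack, count / len(samples)

# Hypothetical samples: 97 of 100 land in the same Lua code path.
samples = (
    [["worker", "lua_vm", "balancer_update", "pick_upstream"]] * 97
    + [["worker", "event_loop", "epoll_wait"]] * 3
)

stack, share = hottest_stack(samples)
print(f"{share:.0%} of samples in {stack}")
```

When one stack accounts for nearly every sample, as in this made-up data, the graph points at a single code path, which matches the shape of the incident described above.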

The Root Cause: One Type Mismatch, Entire System Down

The Lua flame graph located the culprit.

A string-type zero value ("0") for a server weight had been inserted into the business logic’s configuration metadata. The Lua API of OpenResty’s lua-resty-balancer library expected a numerical weight value. The type mismatch caused infinite recursion, which manifested as infinite loops — simultaneously, across every server in the fleet.
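
To see how a string "0" can defeat a termination check, consider the simplified Python model below. It rests on assumptions: it supposes the weights pass through a recursive greatest-common-divisor helper, and it imitates two Lua behaviors (equality never coerces strings to numbers, while arithmetic does, and a modulo by zero yields NaN rather than an error). The helper names are hypothetical, and this is not the library's actual code.

```python
import math

def is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)

def lua_eq(a, b):
    """Model of Lua `==`: a string is never equal to a number."""
    if is_number(a) and is_number(b):
        return a == b
    return type(a) is type(b) and a == b

def lua_mod(a, b):
    """Model of Lua 5.1-style `%`: strings are coerced, x % 0 gives NaN."""
    a, b = float(a), float(b)
    if b == 0 or math.isnan(a) or math.isnan(b):
        return math.nan
    return a - math.floor(a / b) * b

def gcd_steps(a, b, max_steps=1000):
    """Euclid's algorithm with a step budget.

    Returns (gcd, steps), or (None, max_steps) if it never terminates.
    A loop stands in for the recursion: Lua tail calls do not grow the
    stack, so runaway recursion spins like an infinite loop.
    """
    for step in range(max_steps):
        if lua_eq(b, 0):          # the only exit condition
            return a, step
        a, b = b, lua_mod(a, b)
    return None, max_steps

print(gcd_steps(10, 4))    # numeric weights: terminates
print(gcd_steps(10, "0"))  # string zero: exit condition never matches
```

In this model the string "0" fails the `== 0` check, the modulo then produces NaN, and NaN never equals zero either, so the exit condition is unreachable from that point on.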

It is worth noting that LuaJIT’s just-in-time (JIT) compiler was not at fault here. The JIT compiler was initially suspected because the corresponding Lua code path appeared correct at first glance, and a separate business team within Bilibili had made an unannounced change to the production environment that further obscured the picture. The distinction between a string zero ("0") and a numeric zero (0) is subtle enough to evade immediate scrutiny. After ruling out a JIT compiler bug, we were able to confirm the string-type zero value as the definitive root cause.

This is the kind of root cause that is nearly impossible to find through log analysis or traditional APM tooling: no error was thrown, no crash occurred, and the system appeared to be running normally by every surface-level metric except one — it was doing no useful work.

What Changed After the Incident

The resolution involved changes at three levels.

At Bilibili: The engineering team implemented safeguards to ensure that no string-typed weight values for upstream servers can be written to configuration data in the business logic code going forward.
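
A safeguard of this kind can be as simple as a type check at the point where configuration is written. The Python sketch below is an illustration under assumptions, not Bilibili's implementation: the function names and the dictionary-based store are hypothetical.

```python
def validate_weight(value):
    """Accept only non-negative ints or floats; reject everything else."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(
            f"weight must be a number, got {type(value).__name__}: {value!r}"
        )
    if value != value or value < 0:  # value != value is true only for NaN
        raise ValueError(f"weight must be a non-negative number, got {value!r}")
    return value

def write_upstream_config(store, servers):
    """Validate every weight first, so a bad entry blocks the whole write."""
    checked = {addr: validate_weight(w) for addr, w in servers.items()}
    store.update(checked)
    return store

store = {}
write_upstream_config(store, {"10.0.0.1:80": 5, "10.0.0.2:80": 0})

try:
    write_upstream_config(store, {"10.0.0.3:80": "0"})
except TypeError as err:
    print("rejected:", err)
```

Validating the full batch before calling `store.update` means a rejected entry leaves the existing configuration untouched.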

In OpenResty XRay: The latest version introduces a new feature that prints the values of all local variables on the Lua call stack. For incidents of this type, this capability allows the root cause to be located more quickly and directly — without requiring the additional inference step that was necessary in this case.
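
Why seeing local variable values helps can be shown with a loose Python analogy. The sketch below walks its own call stack from inside the process, which is unlike OpenResty XRay's non-invasive approach, and the function names are hypothetical. The point is only that a printed value makes the difference between '0' and 0 visible at a glance.

```python
import inspect

def dump_locals(max_depth=3):
    """Walk up the call stack; return (function name, {local: repr}) pairs."""
    result = []
    frame = inspect.currentframe().f_back  # skip dump_locals itself
    while frame is not None and len(result) < max_depth:
        values = {name: repr(value) for name, value in frame.f_locals.items()}
        result.append((frame.f_code.co_name, values))
        frame = frame.f_back
    return result

def set_weight(server, weight):
    # Hypothetical call site: the snapshot is taken while `weight` is live.
    return dump_locals(max_depth=1)

snapshot = set_weight("10.0.0.1:80", "0")
print(snapshot)
```

Because `repr` keeps the quotes, the snapshot shows the weight as '0' rather than 0, which is exactly the distinction that was easy to miss when reading the code path alone.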

In the open-source library: OpenResty Inc. has explicitly hardened the open-source lua-resty-balancer library against this class of API misuse. Weight values of the wrong data type are now always converted to numeric types, preventing this failure mode at the library level.
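
The general technique here is to normalize input at the API boundary so that later comparisons operate on real numbers. The Python sketch below shows the idea only. It is not the library's Lua code, the function name is hypothetical, and the fallback used for values that cannot be converted is an assumption, since the behavior in that case is not described here.

```python
import math

def to_number(value, default=0):
    """Coerce a weight to a float; use `default` if conversion is impossible.

    The fallback policy is an assumption made for this sketch.
    """
    if isinstance(value, bool):
        return default
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    return default if math.isnan(number) else number

for raw in ("0", 0, "3", 2.5, None, "abc"):
    weight = to_number(raw)
    print(repr(raw), "->", weight, "| is zero:", weight == 0)
```

After normalization, a string "0" and a numeric 0 both compare equal to zero, so a check such as `weight == 0` behaves the same for either input.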

How OpenResty XRay Works: Non-Invasive, Code-Level Diagnostics

OpenResty XRay is non-invasive troubleshooting and analysis software built on enhanced proprietary dynamic tracing technology.

With OpenResty XRay, companies can automatically perform in-depth analysis and monitoring on software built with open-source programming languages and runtimes, including OpenResty, Nginx, LuaJIT, PHP, Python, Perl, Go, PostgreSQL, Redis, and more. Support for additional technology stacks — including Java and Ruby — is actively being added.

OpenResty XRay’s code-level troubleshooting requires:

  • Zero modifications to users’ applications
  • No special plug-ins, modules, or compilation options
  • No code injection into customer processes

It can also penetrate existing unmodified Docker or Kubernetes (K8s) containers and precisely analyze the application running inside them.

With OpenResty XRay, users can quickly identify and pinpoint performance, functional, and security issues — from high CPU spikes and memory leaks to abnormal request latency and high disk I/O — to ensure system stability across all environments.

In addition to Bilibili, OpenResty XRay has successfully helped Zoom, Microsoft, Qunar.com, and many other companies optimize performance and pinpoint issues in production.

OpenResty Console Screenshot

Closing

The incident was fully resolved. We appreciate Bilibili trusting our team and products throughout the process.

This post is part of a series introducing OpenResty XRay to a wider technical audience. As a company that has focused primarily on engineering innovation, we are now making a deliberate effort to document how our tools perform in real production environments — so that more engineering teams can benefit from them.

If you are interested in learning more about OpenResty XRay or would like to receive a free consultation report from our domain experts, please visit the OpenResty XRay product page and request a product trial.

For those interested in the dynamic tracing technology and the closed-source enhancements we have built on top of it, see: Ylang: Universal Language for eBPF, Stap+, GDB, and More.

About The Author

Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.

Yichun is one of the earliest advocates and leaders of “open-source technology”. He has worked at internationally renowned tech companies such as Yahoo! and Cloudflare. He is a pioneer of “edge computing”, “dynamic tracing” and “machine coding”, with over 22 years of programming and 16 years of open-source experience. Yichun is well known in the open-source space as the project leader of OpenResty®, which has been adopted by more than 40 million website domains worldwide.

OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, counts some of the biggest companies in the world among its customers. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that builds on and significantly enhances dynamic tracing technology. Its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.

As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, and Perl. He has also authored more than 60 open-source software libraries.