From OOM to O(1): The Implementation Path of a Streaming JSON Parser
In big data processing, efficiency and resource utilization remain core challenges for engineers. Recently, our team discovered that an internal data format conversion tool was consuming excessive memory when processing multi-gigabyte JSON input files, frequently triggering OOM (Out Of Memory) errors. These large JSON files were not in JSONL format. Each one was a single, extremely large object or array, which traditional parsing methods handle poorly.
Differences between JSONL format and traditional JSON
JSONL (JSON Lines) format is a common solution for large datasets. It places each JSON value on its own line, and each line is independent, so a large dataset can be read and processed line by line without loading the entire file into memory. However, real-world data does not always arrive in this ideal format. We were facing exactly such ultra-large JSON files in non-JSONL format, each a single, indivisible data structure. Traditional parsers must load the entire structure into memory before processing it, which is the root cause of the OOM problems.
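The difference can be illustrated with a short Python sketch. It is not taken from our conversion tool; the helper name `iter_jsonl` and the sample records are assumptions made for illustration. The JSONL path decodes and releases one record at a time, while the single-array path has to build the whole structure before any element is available.

```python
import io
import json

def iter_jsonl(stream):
    """Yield one decoded record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# JSONL: each record is decoded on its own and can be discarded right away.
jsonl_input = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
ids = [record["id"] for record in iter_jsonl(jsonl_input)]

# Single JSON array: json.load() must materialize the entire array in memory
# before the first element can be touched.
array_input = io.StringIO('[{"id": 1}, {"id": 2}, {"id": 3}]')
whole = json.load(array_input)
```

With three records the difference is invisible, but with a multi-gigabyte array the second form is the one that runs out of memory.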
Solution
Facing this technical challenge, we adopted a lightweight but efficient solution. With just over 200 lines of code, we implemented a brand new streaming JSON parser from scratch.
Although mature streaming JSON parsing tools like YAJL are available on the market, independent development gave us greater flexibility and room for targeted optimization. This “from scratch” approach may seem simple, but it precisely solves our specific problem.
The key technical breakthrough is that the new parser reduces space complexity to nearly O(1) with respect to input size: no matter how large the input JSON file is, memory usage stays at a roughly constant level. This is a qualitative leap for processing ultra-large-scale data.
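To make the idea concrete, here is a minimal Python sketch of an incremental scanner. It is not our actual parser: the class name `StreamingScanner`, the event names, and the callback interface are all hypothetical, and the sketch does not validate JSON grammar. It only shows how input can arrive in arbitrary chunks while the retained state stays limited to a couple of flags, a depth counter, and the characters of the scalar currently being read.

```python
import json

class StreamingScanner:
    """Incremental JSON event scanner (illustrative sketch, not a validator)."""

    def __init__(self, on_event):
        self.on_event = on_event   # callback receiving (kind, value) tuples
        self.buf = []              # characters of the scalar in progress
        self.in_string = False
        self.escaped = False
        self.depth = 0             # current nesting depth

    def feed(self, chunk):
        """Consume one chunk of text; chunks may split tokens anywhere."""
        for ch in chunk:
            if self.in_string:
                self.buf.append(ch)
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
                    self._flush("string")
            elif ch == '"':
                self._flush("literal")
                self.in_string = True
                self.buf.append(ch)
            elif ch in "{[":
                self.depth += 1
                self.on_event(("start_object" if ch == "{" else "start_array", None))
            elif ch in "}]":
                self._flush("literal")
                self.depth -= 1
                self.on_event(("end_object" if ch == "}" else "end_array", None))
            elif ch in ",: \t\r\n":
                self._flush("literal")
            else:
                self.buf.append(ch)  # part of a number, true, false, or null

    def close(self):
        """Emit a literal that ends exactly at end of input."""
        self._flush("literal")

    def _flush(self, kind):
        if self.buf:
            self.on_event((kind, json.loads("".join(self.buf))))
            self.buf.clear()

# Usage: the chunk boundaries deliberately cut through a number,
# an escape sequence, and the literal null.
events = []
scanner = StreamingScanner(events.append)
for chunk in ('{"a": [1, 2', '.5, "x\\', 'n"], "b": nu', 'll}'):
    scanner.feed(chunk)
scanner.close()
```

In this sketch the only buffer that grows is the one holding the current scalar, so memory is bounded by the longest single string or number rather than by the file size, which is one reason to say "nearly" O(1).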
Technical verification
To ensure the stability and correctness of the parser, we designed a rigorous automated testing process. In our tests, we deliberately used extremely small in-memory buffers (one byte, two bytes, and three bytes respectively) to stress-test the parser on real JSON inputs of various sizes. With buffers this small, almost every token is split across buffer boundaries, which exercises the parser's state handling far harder than normal buffer sizes would.
The test results were satisfactory: in all cases, the parser processed the input data correctly, and the JSON produced by converting the output back was identical to the original JSON. This gives us confidence that our implementation not only saves memory but also preserves data integrity.
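The shape of such a test can be sketched in Python as follows. This is not our test suite: `minify_stream` is a hypothetical stand-in for the real converter (it merely drops whitespace outside strings), and `roundtrip_ok` and the sample document are likewise assumptions. What the sketch demonstrates is the method, namely running the same input through the streaming code with buffer sizes of 1, 2, and 3 and comparing the result against the original.

```python
import io
import json

def minify_stream(reader, writer, bufsize):
    """Stand-in streaming transform: copy JSON while dropping whitespace
    outside strings, keeping only two flags of state between reads."""
    in_string = escaped = False
    while True:
        chunk = reader.read(bufsize)
        if not chunk:
            break
        for ch in chunk:
            if in_string:
                writer.write(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
                writer.write(ch)
            elif ch not in " \t\r\n":
                writer.write(ch)

def roundtrip_ok(original, bufsize):
    """Run the transform with a tiny buffer and compare decoded results."""
    out = io.StringIO()
    minify_stream(io.StringIO(original), out, bufsize)
    return json.loads(out.getvalue()) == json.loads(original)

# Sample input with escaped quotes and an escaped backslash before a
# closing quote, two cases that are easy to get wrong at buffer boundaries.
sample = ('{"name": "a \\"quoted\\" value", '
          '"items": [1, 2.5, true, null], "nested": {"k": "v\\\\"}}')
results = {size: roundtrip_ok(sample, size) for size in (1, 2, 3)}
```

Using several tiny sizes rather than just one matters because each size shifts where the boundaries fall, so escape sequences and multi-character literals get cut at different offsets.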
Continuous optimization
Although the current implementation has solved the memory usage problem, it is written in a scripting language and still leaves room for improvement in CPU efficiency. To this end, we have asked engineers on our Canadian team to port the existing implementation to C++ to further improve processing speed.
In the future, if new performance bottlenecks emerge, we can rely on the OpenResty XRay dynamic tracing platform for in-depth analysis. This platform can precisely locate performance hotspots in the system, providing data support for continuous optimization.
Technical insights
This optimization practice once again proves that sometimes simple and elegant solutions are more effective than complex architectures. Just over 200 lines of code solved the memory bottleneck for GB-level data processing, embodying the engineering philosophy that OpenResty has always advocated: concise, efficient, and precisely solving problems.
In addition to OpenResty XRay, OpenResty Inc. also provides comprehensive private library services covering technical needs across various industries. These private libraries have significant advantages in performance optimization, security protection, and data processing, helping enterprises quickly build high-performance, high-reliability application systems. Whether in finance, e-commerce, or media industries, OpenResty Inc.’s private libraries can provide tailored solutions to meet specific needs in different scenarios.
Our technical team will continue to be dedicated to developing and improving the XRay toolkit and private library services, helping developers and enterprises discover and solve various performance bottlenecks. We believe that through precise performance analysis and targeted optimization, many seemingly unimprovable performance issues can achieve breakthrough solutions.
What is OpenResty XRay
OpenResty XRay is a dynamic-tracing product that automatically analyzes your running applications to troubleshoot performance problems, behavioral issues, and security vulnerabilities with actionable suggestions. Under the hood, OpenResty XRay is powered by our Y language targeting various runtimes like Stap+, eBPF+, GDB, and ODB, depending on the context.
About The Author
Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.
Yichun is one of the earliest advocates and leaders of “open-source technology”. He has worked at many internationally renowned tech companies, such as Cloudflare and Yahoo!. He is a pioneer of “edge computing”, “dynamic tracing” and “machine coding”, with over 22 years of programming and 16 years of open source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, adopted by more than 40 million global website domains.
OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. Its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.
As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, and Perl. He has also authored more than 60 open-source software libraries.