OpenResty® uses LuaJIT as its main computing engine, and users
mainly use the Lua programming language to write applications atop OpenResty,
sometimes very complex ones. For years, there has been a notorious hard
limit on the maximum memory size managed by LuaJIT’s garbage collector
(GC). This limit was 2 GB on 64-bit systems (including x86_64). But
since 2016, the official LuaJIT has offered a new build mode
called “GC64”, which raises this limit to 128 TB (or the full low 47-bit address
space). Effectively, this means no limit at all for almost all the PCs
and servers on the
market nowadays. Over the past two years, the GC64 mode has matured enough, and
starting from the OpenResty 1.15.8.1 release,
we enable it
by default on the
x86_64 architecture, just as on the ARM64 (or AArch64)
architecture. This article will provide an overview of the old memory limit
as well as an explanation of the new GC64 mode.
By default, the official LuaJIT uses the so-called “X64” build mode on
x86_64 systems. This “X64” build mode was also used by default in OpenResty
releases before 1.15.8.1 on
x86_64. With this mode, LuaJIT can only use
memory address values in the low 31-bit address space for the memory
managed by the garbage collector (GC). This effectively limits such memory
to a total of 2^31 bytes, or 2 GB.
What is it like when hitting the 2 GB limit? It is easy to use a simple Lua script to find out.
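The following is a minimal sketch reconstructed from the description below (the top-level table name tb and the ~1 MB string size match the surrounding text; the exact original code may differ):

```lua
-- File grow.lua
local tb = {}
local i = 0
while true do
    i = i + 1
    -- create a new ~1 MB Lua string and keep a reference to it in the
    -- table so that the GC cannot collect it
    tb[i] = string.rep("A", 1024 * 1024)
    -- collectgarbage("count") reports the total GC-managed memory in KB
    print(string.format("%.1f MB", collectgarbage("count") / 1024))
end
```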
This script has an infinite
while loop which simply keeps allocating
new Lua strings and inserting them into a Lua table (in order to prevent
the GC from collecting them). Each loop iteration creates a new Lua string
of approximately 1 MB and outputs the total size of GC-managed memory via the
Lua API function collectgarbage.
One thing to note here is that the Lua table associated with the top-level
variable tb will also keep growing, thus taking more and more memory
itself, albeit at a much slower pace than the memory occupied by the newly
created Lua strings.
To run this Lua script, we can simply invoke the resty command-line utility shipped with OpenResty, like below:
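For example (assuming the script is saved as grow.lua in the current directory):

```shell
resty grow.lua
```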
In this run, we used the X64 build of OpenResty. Apparently, the process
quits after the GC-managed memory size approaches 2 GB. Running the script
directly with the luajit command-line utility can give us more details about
the error. This is proof that we are indeed hitting the memory limit.
OpenResty inherits NGINX’s multiple process model to utilize multiple CPU
cores in a single operating system. So each NGINX worker process has its
own address space. Thus, the 2 GB memory limit on
x86_64 when using the
default X64 build of LuaJIT only applies to each individual NGINX worker
process. In the case of 12 workers in a single OpenResty or NGINX server,
the total memory limit across all the worker processes would be
2 * 12 = 24 GB. This is why the 2 GB memory limit has not caused too much trouble
over the years for large OpenResty applications running on powerful machines.
Most OpenResty users are not even aware of this limitation.
The memory limit is not per LuaJIT virtual machine (VM) instance. For example, ngx_stream_lua_module and ngx_http_lua_module both create their own LuaJIT VM instances, even when sharing the same NGINX server instance. But the 2 GB memory limit applies to the whole process, no matter how many LuaJIT VM instances are created inside it. This is because the limit is really a restriction on the address space: the memory addresses have to be in the low 31-bit space.
Most of the standard Lua-land value objects (e.g., strings, tables, functions, userdata, cdata, threads, traces, upvalues, and protos) are managed by the GC. Upvalues and protos are associated with functions. These composite objects are also called “GC objects”.
Primitive values like numbers, booleans, and light userdata values are not managed by the GC. They are simply encoded as literal values, which are called “TValue” (or tagged values) in the LuaJIT internals. TValues are always 64-bit wide in LuaJIT, including double-precision floating-point numbers (LuaJIT uses the “NaN tagging” trick to achieve such efficiency). This is also one of the reasons that a Lua application usually uses significantly less memory with LuaJIT than with the standard Lua 5.1 interpreter.
LuaJIT’s cdata data type is a bit special. If the memory associated with
a cdata object is allocated by the standard LuaJIT Lua API function ffi.new,
then it is still managed by the GC. On the other hand, if the memory is
allocated by C-land routines like
malloc(), mmap(), or other external
C library functions, then such memory blocks are not managed by the GC,
and are not subject to the memory limit. For instance, consider the following
simple Lua script:
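A minimal sketch matching the description below (allocating a 5 GB block via malloc() through the FFI; the exact original code may differ):

```lua
-- File big-malloc.lua
local ffi = require "ffi"

ffi.cdef[[
void *malloc(size_t size);
]]

-- allocate a 5 GB memory block directly from the system allocator;
-- this memory is NOT managed by LuaJIT's GC
local ptr = ffi.C.malloc(5LL * 1024 * 1024 * 1024)
assert(ptr ~= nil, "malloc failed")

-- report the size of the GC-managed memory (in KB), which stays tiny
print(string.format("%.1f KB", collectgarbage("count")))
```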
Here, we call the standard C library function
malloc() to allocate a 5 GB memory block via the standard LuaJIT
ffi library. Running this script with
an X64 build of OpenResty or LuaJIT does not produce any crashes:
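For example (assuming the script is saved as big-malloc.lua):

```shell
resty big-malloc.lua
```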
The GC-managed memory size is merely 73 KB, excluding the 5 GB memory block we allocated using the system allocator.
However, non-GC-managed memory may still adversely affect LuaJIT's memory limit. Why? Because it matters a lot whether the externally allocated memory blocks fall inside or outside the low 31-bit address space.
When using the
mmap() system call on Linux
x86_64 systems without specifying
address hints (or any other flags that can affect
the memory block locations), it is rare for the returned memory blocks
to fall in the low 31-bit address space. By contrast, memory blocks allocated by
sbrk() calls and the like on Linux
x86_64 will almost always have
addresses in the low address space, thus squeezing the available address space
for new LuaJIT GC-initiated allocations. This is due to how the "heap"
grows according to the Linux memory layout: the “program break” moves from low
to high addresses, which is how the “data segment” grows on Linux
and on quite a few other operating systems. Similarly, huge static
values in the data segment (such as constant C string values)
will also squeeze the available low address space, since the data segment
sits near the beginning of the low address space on Linux and similar systems.
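We can peek at where the allocator places blocks with a quick FFI experiment (a hedged sketch; the actual addresses depend on the OS, the libc version, and allocator settings):

```lua
local ffi = require "ffi"

ffi.cdef[[
void *malloc(size_t size);
]]

-- glibc typically serves small blocks from the brk heap (low addresses)
-- and switches to mmap() for large blocks (usually high addresses)
local small = ffi.C.malloc(64)
local huge  = ffi.C.malloc(64 * 1024 * 1024)

print(string.format("small block at 0x%x", tonumber(ffi.cast("uintptr_t", small))))
print(string.format("huge block  at 0x%x", tonumber(ffi.cast("uintptr_t", huge))))
```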
For all the reasons mentioned above, the actual memory limit could be way
smaller than 2 GB on
x86_64, depending on how much memory the rest
of the application allocates, and where. We’ve seen reports from community
users that on FreeBSD, the shared memory zones allocated by NGINX (which
were essentially allocated by the libc allocator) squeezed the available
memory space for LuaJIT. There are also reports that using memory-intensive
NGINX modules like ngx_http_slice_module
more easily triggers the memory limit panic.
The theoretical memory limit for the existing X64 mode of LuaJIT is actually
4 GB (or 32 bit) instead of 2 GB. The same LuaJIT VM can utilize the full
low 4 GB address space on i386 systems anyway. The practical limit, however,
is lowered to 2 GB, because
the hand-crafted assembly code in the LuaJIT VM has yet to take into
account the sign-extension semantics (from 32-bit pointer values to 64-bit
ones) of x86_64 CPUs (it is not a problem on i386 since the word size
is 32 bits anyway).
While 4 GB is already 2 times better than 2 GB, it still suffers limitations and all the pitfalls mentioned above. The LuaJIT developers decided that it would be much more beneficial to introduce a new VM which supports way bigger address spaces, hence the GC64 build mode. This is also the only choice on certain CPU architectures like ARM64 where the low address space cannot be (easily) preserved.
The development work of the new GC64 build mode of LuaJIT started in 2016.
It was pioneered by Peter Cawley and consolidated by Mike Pall.
Over the past two years or so, a lot of bugs have been fixed in the GC64 mode,
and our recent extensive testing shows that it is already mature enough
for our own production use. It is thus natural to move to the new GC64
mode for OpenResty on the
x86_64 architecture (it is already
mandatory on ARM64).
The primitive Lua value representation (called “TValue”, as mentioned above)
is still 64 bits wide in the GC64 mode, just like in the old X64 mode. So we
wouldn’t expect a noticeable increase in memory usage when switching over to the
new mode. However, some data types do get larger
(from 32 bits to 64 bits), like the GC object references (GCRef)
used inside the LuaJIT internals. Therefore, the memory footprint may
get a bit larger for the same Lua application, though not by much.
In the GC64 mode, the GC-managed memory addresses can now extend to the low 47 bit space, which is 128 TB, way more than the total physical memory available on most (if not all) of the high end machines nowadays (mainstream consumer PC motherboard still maxes out at 64 GB as of today, and the largest "high memory" AWS EC2 instance only gets 12 TB of RAM). It is therefore safe to say that there is realistically no GC-managed memory limit in the real world with the GC64 mode.
To enable the GC64 mode in LuaJIT, one should build LuaJIT from source like this:
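For the official LuaJIT 2.1 sources, GC64 can be enabled with the documented LUAJIT_ENABLE_GC64 build flag, for example:

```shell
make XCFLAGS=-DLUAJIT_ENABLE_GC64
```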
When building older OpenResty releases (which do not yet enable GC64 by default)
from source, we can add the following option to the
./configure script of OpenResty:
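For example, passing the GC64 flag through to the bundled LuaJIT build (assuming the --with-luajit-xcflags option of OpenResty's ./configure script):

```shell
./configure --with-luajit-xcflags='-DLUAJIT_ENABLE_GC64'
```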
Later OpenResty releases enable this option on
x86_64 systems by
default, including the pre-built binary packages.
To see how large an impact the new GC64 mode makes in the wild, let’s do some simple experiments with some of our large Lua programs.
Let’s try our Edge language (or "edgelang") compiler to compile some large input for a web application firewall (WAF) module. For the X64 mode:
PATH=/opt/openresty-x64/bin:$PATH /bin/time ./bin/edgelang waf.edge >
To compile the
waf.edge input file into Lua code with the edgelang
compiler, it takes 0.73 seconds of userland CPU time, and the maximum resident
memory size used by this run is 119660 KB, or about 116.9 MB. Now let’s
try the GC64 mode of LuaJIT with the same command:
PATH=/opt/openresty-plus-gc64/bin:$PATH /bin/time ./bin/edgelang waf.edge
This time the max resident memory size is 133748 KB, or about 130.6 MB, only about 11.8% larger. The CPU time is almost the same; the difference is within the measurement error range.
The Edge language compiler is in pure Lua targeting the OpenResty platform. It is a large program of 83,315 lines of code, including comments and empty lines. The corresponding LuaJIT byte code file is 1.8 MB for both GC64 and X64 modes of LuaJIT, though the bytecode is incompatible between these 2 different build modes.
We then try the Y language (or “ylang”) compiler which is also a big Lua command-line program targeting the OpenResty platform.
The ylang compiler is even bigger than the edgelang compiler we discussed
earlier. The LuaJIT byte code is 2.1 MB in size (also for both the X64
and GC64 build modes of LuaJIT). For the X64 build mode, we experiment
with compiling the ljftrace.y tool into the systemtap scripting language:
PATH=/opt/openresty-x64/bin:$PATH /bin/time ./bin/ylang --stap --symtab
It takes 1.30 seconds of userland CPU time and 401184 KB of maximum resident memory. Now for the GC64 mode:
PATH=/opt/openresty-gc64/bin:$PATH /bin/time ./bin/ylang --stap --symtab
This time, it still takes 1.30 seconds of user CPU time, and 433948 KB of maximum resident memory. The CPU time difference is zero, and the memory footprint is merely 8.2% larger.
The open source dynamic tracing and debugging tools in the openresty-systemtap-toolkit, stap++, and openresty-gdb-toolkit have little to no support for the new GC64 mode. We are currently relying on community support to update these tools for the GC64 mode (though we still want to preserve the X64 mode support).
Our focus has been on the proprietary ylang compiler which can compile tools written in a superset of the standard C language (called ylang) down to both gdb tools in Python and systemtap tools in systemtap’s scripting language (more backends are also coming). With ylang, we get almost immediate GC64 support once we write the various dynamic tracing tools, thanks to the intelligent debuginfo and C source code level support in ylang.
Below is a Lua-land CPU flame graph we obtained using our ylang tools via systemtap, with a GC64 build of OpenResty:
And the gdb scripts generated from these ylang tools are also usable in gdb when analyzing core dump files, as in:
(gdb) lbt 0 full
The frames in this flame graph include both Lua function frames and C function frames.
We provide the ylang compiler as well as various standard tracing and profiling tools as part of the OpenResty Trace platform.
Starting with version 2.1, the official LuaJIT comes with a built-in profiler implemented inside the virtual machine (VM). This will certainly keep working in the GC64 mode. This profiler should not be used for online profiling, however, because unlike profiling based on system-level tools like systemtap, it has to wipe out all the existing compiled Lua code (or “traces” in LuaJIT’s terminology) and re-compile everything from scratch in a profiling-specific way. This must happen both when the profiler is turned on and when it is turned off. Doing this definitely changes a lot of the state in the target process (vulnerable to unexpected side effects and corner-case bugs), and adds quite some extra overhead during the sampling window. Besides, the target Lua application has to provide special APIs or hooks to trigger such sampling actions. Application-side collaboration is always required for this built-in profiler to work. On the other hand, profiling based on dynamic tracing tools does not need any collaboration from the Lua applications’ side, not even special build options.
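For completeness, here is roughly how LuaJIT 2.1’s built-in profiler (the jit.p module) is driven from the application side (a sketch; the mode string and output handling are configurable):

```lua
-- Requires LuaJIT 2.1+; jit.p is the built-in sampling profiler module.
local prof = require "jit.p"

prof.start("f")   -- sample by Lua function ("f" mode)

-- ... run the workload to be profiled ...

prof.stop()       -- stop sampling and dump the report
```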