OpenResty® uses LuaJIT as its main compute engine, and users mainly use the Lua programming language to write applications atop OpenResty, sometimes very complex ones. For years, there has been a notorious hard limit on the maximum memory size managed by LuaJIT’s garbage collector (GC): 2 GB on 64-bit systems (including x86_64). Fortunately, in 2016 the official LuaJIT[1] introduced a new build mode called “GC64”, which raises this limit to 128 TB (the low 47-bit address space). Effectively, this translates to no limit at all for almost all the PCs and servers on the market today. Over the past two years, the GC64 mode has matured, and starting from the OpenResty 1.15.8.1 release, we enable it by default on the x86_64 architecture, just as on the ARM64 (or AArch64) architecture. This article provides an overview of the old memory limit as well as an explanation of the new GC64 mode.

# The old memory limit

By default, the official LuaJIT uses the so-called “X64” build mode on x86_64 systems. This “X64” build mode was also the default in OpenResty releases up to and including 1.13.6.2 on x86_64. In this mode, LuaJIT can only use memory addresses in the low 31-bit address space for the memory managed by the garbage collector (GC). This effectively limits such memory to a total of 2^31 bytes, or 2 GB.

## When hitting the memory limit

What is it like when hitting the 2 GB limit? It is easy to use a simple Lua script to find out.
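The original script is not reproduced here; below is a minimal reconstruction based on the description that follows (the table name `tb` comes from the text, while details such as the numeric-prefix trick are assumptions):

```lua
-- Keep allocating ~1 MB Lua strings and anchor them in a table so the
-- GC cannot collect them. The numeric prefix makes each string unique,
-- since LuaJIT interns strings with identical contents.
local tb = {}
local i = 0
while true do
    i = i + 1
    tb[i] = i .. string.rep("a", 1024 * 1024)
    -- collectgarbage("count") returns the GC-managed memory size in KB.
    print("GC-managed memory: ", collectgarbage("count"), " KB")
end
```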

This script has an infinite while loop which simply keeps allocating new Lua strings and inserting them into a Lua table (in order to prevent the GC from collecting them). Each loop iteration creates a new Lua string of approximately 1 MB and outputs the total size of GC-managed memory via the standard Lua API function collectgarbage. One thing to note here is that the Lua table associated with the top-level Lua variable tb will also keep growing, thus taking more and more memory itself, albeit at a much slower pace than the memory occupied by the newly allocated Lua strings.

To run this Lua script, we can simply invoke the resty command-line utility shipped with OpenResty, like below:
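A sketch of the invocation, assuming the script is saved in a file named `test.lua` (the file name is an assumption):

```shell
resty test.lua
```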

In this run, we used the X64 build of OpenResty. Apparently, the resty utility quits after the GC-managed memory size approaches 2 GB. The process actually crashed:
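The exact crash output is not reproduced here, but LuaJIT's out-of-memory panic typically looks like the following (illustrative):

```
PANIC: unprotected error in call to Lua API (not enough memory)
```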

Using the luajit command-line utility can give us more details about the crash:
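For example, running the same script directly (the file name is an assumption), which fails with a “not enough memory” error once the limit is reached:

```shell
luajit test.lua
```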

This proves that we are indeed hitting the memory limit.

## The memory limit is per process

OpenResty inherits NGINX’s multi-process model to utilize multiple CPU cores in a single operating system, so each NGINX worker process has its own address space. Thus, the 2 GB memory limit on x86_64 when using the default X64 build of LuaJIT applies to each individual NGINX worker process separately. With 12 workers in a single OpenResty or NGINX server instance, the total memory limit across all the worker processes would be 2 * 12 = 24 GB. This is why the 2 GB memory limit has not caused much trouble over the years for large OpenResty applications running on powerful machines; most OpenResty users are not even aware of this limitation.

The memory limit is per process, not per LuaJIT virtual machine (VM) instance. For example, ngx_stream_lua_module and ngx_http_lua_module each create their own LuaJIT VM instances, even when sharing the same NGINX server instance. But the 2 GB memory limit applies to the whole process, no matter how many LuaJIT VM instances are created inside it. This is because the limit is really a restriction on the address space: all the GC-managed memory addresses have to fall in the low 31-bit space.

## GC-managed memory

Most of the standard Lua-land value objects (e.g., strings, tables, functions, userdata, cdata, threads, traces, upvalues, and protos) are managed by the GC. Upvalues and protos are associated with functions. These composite objects are also called “GC objects”.

Primitive values like numbers, booleans, and light userdata values are not managed by the GC. They are simply encoded as literal values, which are called “TValue” (or tagged values) in the LuaJIT internals. TValues are always 64-bit wide in LuaJIT, including double-precision floating-point numbers (LuaJIT uses the “NaN tagging” trick to achieve such efficiency). This is also one of the reasons that a Lua application usually uses significantly less memory with LuaJIT than with the standard Lua 5.1 interpreter.

## Memory allocated outside GC

LuaJIT’s cdata data type is a bit special. If the memory associated with a cdata object is allocated by the standard LuaJIT FFI API function ffi.new(), then it is still managed by the GC. On the other hand, if the memory is allocated by C-land routines like malloc() and mmap(), or by other external C library functions, then such memory blocks are not managed by the GC and are not subject to the memory limit. For instance, consider the following simple Lua script:
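The original script is not shown here; the following is a reconstruction matching the description below (a sketch):

```lua
local ffi = require "ffi"

ffi.cdef[[
void *malloc(size_t size);
void free(void *p);
]]

-- Allocate a 5 GB block directly from the system allocator.
-- This memory is NOT managed by LuaJIT's GC.
local block = ffi.C.malloc(5 * 1024 * 1024 * 1024)
assert(block ~= nil, "malloc failed")

-- Only the GC's own (tiny) heap is reported here:
print("GC-managed memory: ", collectgarbage("count"), " KB")

ffi.C.free(block)
```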

Here, we call the standard C library function malloc() to allocate a 5 GB memory block via the standard LuaJIT ffi library. Running this script with an X64 build of OpenResty or LuaJIT does not produce any crashes:
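The run would look roughly like this (the file name and the exact number are illustrative):

```shell
$ resty test2.lua
GC-managed memory:  73.5  KB
```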

The GC-managed memory size is merely 73 KB, excluding the 5 GB memory block we allocated using the system allocator.

However, non-GC-managed memory may still affect the memory limit of LuaJIT adversely. Why? Because it matters a lot whether the externally allocated memory blocks fall inside or outside the low 31 bit address space.

When using the mmap() system call on Linux x86_64 systems without specifying address hints (or any other flags that can affect memory block locations), the returned memory blocks seldom fall in the low 31-bit address space. In contrast, memory blocks allocated by sbrk() calls and the like on Linux x86_64 almost always get addresses in the low address space, thus squeezing the memory available for new LuaJIT GC-initiated allocations. This is due to how the “heap” grows in the Linux memory layout: the “program break” moves from low to high addresses, which is how the “data segment” grows on Linux and quite a few other operating systems. Similarly, huge static values in the data segment (such as constant C string values) also squeeze the available low address space, since the data segment sits near the beginning of the low address space on Linux and similar systems.
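One way to see where a given allocation lands is to print its numeric address (a sketch, not from the original article):

```lua
local ffi = require "ffi"

ffi.cdef[[
void *malloc(size_t size);
void free(void *p);
]]

local p = ffi.C.malloc(1024 * 1024)
local addr = tonumber(ffi.cast("uintptr_t", p))

-- Blocks landing below 2^31 squeeze the space left for the
-- X64-mode GC's own allocations.
print(string.format("address: 0x%x, in low 31-bit space: %s",
                    addr, tostring(addr < 2^31)))

ffi.C.free(p)
```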

For all the reasons mentioned above, the actual memory limit could be way smaller than 2 GB on x86_64, depending on how much and where the rest of the application is allocating memory. We’ve seen reports from community users that on FreeBSD, the shared memory zones allocated by NGINX (which were essentially allocated by the libc allocator) squeezed the available memory space for LuaJIT. There are also reports that using memory-intensive NGINX modules like ngx_http_slice_module more easily triggers the memory limit panic.

## Extending the X64 mode to the 4 GB limit

The theoretical memory limit for the existing X64 mode of LuaJIT is actually 4 GB (or 32 bit) instead of 2 GB. The same LuaJIT VM can utilize the full low 4 GB address space on i386 systems anyway. The practical limit, however, is lowered to 2 GB, because the hand-crafted assembly code in the LuaJIT VM has yet to take into account the sign extension semantics (from 32-bit pointer values to 64-bit ones) on x86_64 CPUs (it is not a problem on i386 since the word size is 32-bit anyway).

While 4 GB is already twice as much as 2 GB, it still suffers from all the limitations and pitfalls mentioned above. The LuaJIT developers decided that it would be much more beneficial to introduce a new VM mode that supports a far bigger address space, hence the GC64 build mode. This is also the only choice on certain CPU architectures like ARM64, where the low address space cannot be (easily) preserved.

# The new GC64 mode

The development work on the new GC64 build mode of LuaJIT started in 2016. It was pioneered by Peter Cawley and consolidated by Mike Pall. Over the past two years or so, a lot of bugs have been fixed in the GC64 mode, and our recent extensive testing shows that it is already mature enough for our own production use. It is thus natural to move to the new GC64 mode for OpenResty on the x86_64 architecture (it is already mandatory on ARM64).

The primitive Lua value representation (called “TValue”, as mentioned above) is still 64 bits wide in the GC64 mode, just as in the old X64 mode, so we would not expect a noticeable increase in memory usage when switching over to the new mode. However, some data types commonly used inside the LuaJIT internals, like the C types MRef and GCRef, do grow from 32 bits to 64 bits. Therefore, the memory footprint may get a bit larger for the same Lua application, though not by much.

In the GC64 mode, the GC-managed memory addresses can now extend to the low 47-bit space, or 128 TB, which is way more than the total physical memory available on most (if not all) high-end machines nowadays (mainstream consumer PC motherboards still max out at 64 GB as of today, and the largest "high memory" AWS EC2 instance only gets 12 TB of RAM). It is therefore safe to say that there is realistically no GC-managed memory limit in the real world with the GC64 mode.

## How to enable the GC64 mode

To enable the GC64 mode in LuaJIT, one should build LuaJIT from source like this:
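For LuaJIT 2.1, this means defining the `LUAJIT_ENABLE_GC64` macro at build time:

```shell
make XCFLAGS=-DLUAJIT_ENABLE_GC64
```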

When building OpenResty from source up to and including the 1.13.6.2 release, we can add the following option to the ./configure script of OpenResty:
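The option passes the GC64 macro down to the bundled LuaJIT build:

```shell
./configure --with-luajit-xcflags='-DLUAJIT_ENABLE_GC64'
```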

Later OpenResty releases enable this option on x86_64 systems by default, including in the pre-built binary packages for OpenResty.

## Performance Impact

To see how much impact the new GC64 mode makes in the wild, let’s do some simple experiments with some of our large Lua programs.

### The Edge language compiler

Let’s try our Edge language (or "edgelang") compiler, compiling some large input for a web application firewall (WAF) module. First, the X64 mode:
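The original command line is not shown; the measurement can be reproduced with GNU time, roughly like this (the `edgelang` invocation and file names are assumptions, since the compiler is proprietary):

```shell
# "User time" and "Maximum resident set size (kbytes)" are the numbers quoted
/usr/bin/time -v edgelang waf.edge > waf.lua
```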

To compile the waf.edge input file into Lua code with the edgelang compiler, it takes 0.73 seconds of userland CPU time and the maximum resident memory size used by this run is 119660 KB, or about 116.9 MB. Now let’s try the GC64 mode of LuaJIT with the same command:

This time the max resident memory size is 133748 KB, or about 130.6 MB, only about 11.8% larger. The CPU time is almost the same; the difference is within the measurement error range.

The Edge language compiler is written in pure Lua and targets the OpenResty platform. It is a large program of 83,315 lines of code, including comments and empty lines. The corresponding LuaJIT bytecode file is 1.8 MB for both the GC64 and X64 modes of LuaJIT, though the bytecode is incompatible between these two build modes.

### The Y language compiler

We then try the Y language (or “ylang”) compiler, which is also a big Lua command-line program targeting the OpenResty platform.

The ylang compiler is even bigger than the edgelang compiler discussed earlier. Its LuaJIT bytecode is 2.1 MB in size (again for both the X64 and GC64 build modes). For the X64 build mode, we try compiling the ljftrace.y tool into a systemtap script:
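Again the measurement can be done with GNU time, roughly like this (the exact `ylang` command-line flags and output file are assumptions):

```shell
/usr/bin/time -v ylang --stap ljftrace.y > ljftrace.stp
```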

It takes 1.30 seconds of userland CPU time and 401184 KB of maximum resident memory. Now for the GC64 mode:

This time, it still takes 1.30 seconds of user CPU time, with 433948 KB of maximum resident memory. The CPU time difference is zero, and the memory footprint is merely 8.2% larger.

## Debugging and profiling tool chains

The open source dynamic tracing and debugging tools in the openresty-systemtap-toolkit, stap++, and openresty-gdb-toolkit projects have little to no support for the new GC64 mode. We are currently relying on community support to update these tools for the GC64 mode (though we still want to preserve the X64 mode support).

Our focus has been on the proprietary ylang compiler, which can compile tools written in a superset of the standard C language (called ylang) down to both gdb tools in Python and systemtap tools in systemtap’s scripting language (more backends are coming). With ylang, we get almost immediate GC64 support once we write the various dynamic tracing tools, thanks to ylang’s intelligent debuginfo and C-source-level support.

Below is a Lua-land CPU flame graph we obtained using our ylang tools via systemtap, with a GC64 build of OpenResty:

And the gdb scripts generated from these ylang tools are also usable in gdb when analyzing core dump files.

The frames in this flame graph include both Lua function frames and C function frames.

We provide the ylang compiler as well as various standard tracing and profiling tools as part of the OpenResty Trace platform.

### LuaJIT’s built-in profiler

Starting with version 2.1, the official LuaJIT comes with a built-in profiler implemented inside the virtual machine (VM). This will certainly keep working in the GC64 mode. This profiler should not be used for online profiling, however, because unlike profiling based on system-level tools like systemtap, it has to wipe out all the existing compiled Lua code (or “traces” in LuaJIT’s terminology) and re-compile everything from scratch in a profiling-specific way. This must happen both when the profiler is turned on and when it is turned off. Doing so changes a lot of state in the target process (and is vulnerable to unexpected side effects and corner-case bugs), and adds quite some extra overhead during the sampling window. Besides, the target Lua application has to provide special APIs or hooks to trigger such sampling actions; application-side collaboration is always required for this built-in profiler to work. On the other hand, profiling based on dynamic tracing tools does not need any collaboration from the Lua application’s side, not even special build options.
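For reference, the built-in profiler is driven either from the luajit command line or via the bundled `jit.p` module (a usage sketch; the output file name is an assumption):

```lua
-- From the command line:
--   luajit -jp=vf script.lua
--
-- Or programmatically from inside the application:
local profiler = require "jit.p"
profiler.start("vf", "profile.out")  -- sample VM states ("v") at function granularity ("f")
-- ... run the workload to be profiled ...
profiler.stop()
```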

1. OpenResty bundles its own branch of LuaJIT with more advanced features and special optimizations for OpenResty. This branch still periodically synchronizes the latest changes from the official LuaJIT.