llama.cpp and LLaMA 2 are projects that make large language models (LLMs) more accessible and efficient for everyone. llama.cpp is a port of Meta’s LLaMA model in C/C++. LLaMA 2 is a family of pretrained and fine-tuned generative text models; the fine-tuned chat variants are optimized for dialogue, and the larger models use grouped-query attention. However, running these models on the CPU consumes a lot of CPU resources.

Today I will show you another step-by-step guide on how to use OpenResty XRay to analyze the llama.cpp application running LLaMA 2 models. We’ll quickly pinpoint the most CPU-intensive C++ code paths in this application. These are the code paths that consume the most CPU time and may limit llama.cpp’s performance.

Problem: high CPU usage

Go to the directory of llama.cpp.

Screenshot

Let’s compile the C++ project first. We pass the -g option to the compiler to enable debug symbols.

Screenshot

Run the make command.

Screenshot

The compilation completes successfully.

Screenshot

Run the main program of llama.cpp. Llama 2 is the latest large model released by Meta at the time of writing. Here we use a quantized 7B model that llama.cpp can run.

Screenshot

Specify the number of tokens to generate.

Screenshot

Use “Linux” as the prompt to generate the content.

Screenshot

Run the command. You can see that the text content is constantly being generated below.

Screenshot

Now open another terminal and run the top command to check the CPU usage.

Screenshot

As shown, the main command of llama.cpp is using almost 400% CPU, that is, it keeps nearly four CPU cores fully busy.

Screenshot
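To make the 400% figure concrete, here is a minimal sketch of how a per-process CPU percentage can be derived from CPU tick counts over a sampling interval. The helper name cpu_percent and its parameters are hypothetical and are not part of top or llama.cpp; the sketch only assumes that a tool samples the process's accumulated user and system ticks twice and divides by the ticks that one core provides in that interval.

```c
/* Hypothetical helper (an illustration, not top's actual code).
 * busy_ticks:    CPU ticks (user + system) the process consumed in the interval
 * ticks_per_sec: clock ticks per second, e.g. the value of sysconf(_SC_CLK_TCK)
 * interval_sec:  length of the sampling interval in seconds
 * Returns the usage as a percentage of ONE core, so a process that keeps
 * four cores busy reports about 400. */
double cpu_percent(unsigned long busy_ticks, long ticks_per_sec, double interval_sec)
{
    if (ticks_per_sec <= 0 || interval_sec <= 0.0)
        return 0.0;  /* avoid division by zero on bad input */
    return 100.0 * (double)busy_ticks / ((double)ticks_per_sec * interval_sec);
}
```

For example, a process that accumulates 1200 ticks in 3 seconds at 100 ticks per second has used four cores' worth of CPU time, which this helper reports as 400.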

Use the guided analysis feature of OpenResty XRay to spot the hottest C++ code paths

Let’s use OpenResty XRay to check out this unmodified process. Open the OpenResty XRay web console in the web browser.

Screenshot

Here we can analyze it in real time and find out which parts use the most CPU time.

Screenshot

Make sure it is the right machine you are watching.

Screenshot

You can choose the right machine from the list below if the current one is not correct.

Screenshot

Go to the “Guided Analysis” page.

Screenshot

Here you can see different types of problems that you can diagnose.

Screenshot

Let’s select “High CPU usage”.

Screenshot

Click on “Next”.

Screenshot

Choose “By Processes” to select the target you want to analyze.

Screenshot

Select the process of the main program of llama.cpp.

Screenshot

Make sure that the application type is right. Usually the default should be correct.

Screenshot

The language level is just C/C++.

Screenshot

We can set the maximum analysis time. We’ll leave it at 300 seconds, which is the default value.

Screenshot

Let’s start analyzing.

Screenshot

The system will keep performing different rounds of analysis. Now it’s executing the first round.

Screenshot

The first round is done and it’s on to the second one already. That’s enough for this case.

Screenshot

Let’s stop analyzing now.

Screenshot

It shows that the system is generating a report for the current analysis.

Screenshot

We can see it automatically creates an analysis report.

Screenshot

This is the type of problem we are going to diagnose. It’s CPU.

Screenshot

This is the #1 hottest C++ code path for the CPU time.

Screenshot

The first function ggml_compute_forward_mul_mat is a general matrix multiplication function in the ggml library.

Screenshot
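To see why matrix multiplication tends to dominate CPU time, it helps to look at the shape of the computation. The following is a minimal, unoptimized sketch of a row-major matrix multiplication in C. It is not ggml's implementation, and the function name matmul_naive is our own; it only illustrates that every output element requires a full dot product over the inner dimension.

```c
#include <stddef.h>

/* Illustrative only: C = A * B, with all matrices stored row-major.
 * A is m x k, B is k x n, C is m x n.
 * The triple loop performs m * n * k multiply-add operations. */
void matmul_naive(const float *a, const float *b, float *c,
                  size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            /* dot product of row i of A with column j of B */
            for (size_t p = 0; p < k; p++)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

With model dimensions in the thousands, the m * n * k multiply-adds add up quickly, so it is not surprising that this kind of function shows up at the top of a CPU flame graph.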

Its caller function ggml_graph_compute_thread is the worker function that each compute thread runs, so the computation is multi-threaded.

Screenshot
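Since the process is using close to 400% CPU, the work is evidently spread across several threads. One common way to divide a matrix operation among worker threads is to give each thread a contiguous range of rows. The sketch below shows that idea; the helper name thread_row_range is hypothetical, and ggml's actual work distribution may differ in its details.

```c
/* Hypothetical helper: compute the half-open row range [*first, *last)
 * that worker thread `ith` (0-based) out of `nth` threads should process
 * when `nrows` rows are split as evenly as possible. */
void thread_row_range(int nrows, int ith, int nth, int *first, int *last)
{
    int per_thread = (nrows + nth - 1) / nth;  /* ceiling division */
    int start = per_thread * ith;
    int end = start + per_thread;

    /* clamp so that trailing threads never run past the last row */
    if (start > nrows) start = nrows;
    if (end > nrows) end = nrows;

    *first = start;
    *last = end;
}
```

With 10 rows and 4 threads, for instance, the threads get rows 0 to 2, 3 to 5, 6 to 8, and 9. Each thread can then work on its own rows without locking, which is how several cores stay busy at once.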

Click “More” to see details about this code path.

Screenshot

The code path was automatically derived from this CPU flame graph.

Screenshot

Below are more detailed explanations and suggestions regarding the current issue.

Screenshot

It mentions the function ggml_compute_forward_mul_mat.

Screenshot

And it performs forward multiplication of matrices.

Screenshot

Let’s check the #2 hottest C++ code path.

Screenshot

The first function ggml_vec_dot_q4_K_q8_K is a function in the ggml library for calculating the dot product of two vectors.

Screenshot

Click “More” to see details about this code path.

Screenshot

It mentions the function ggml_vec_dot_q4_K_q8_K.

Screenshot

It also mentions that it computes the dot product of two vectors.

Screenshot
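The function name suggests a dot product between two quantized vectors. As a rough illustration of that idea, here is a simplified sketch of a dot product over blocks of 8-bit integers, where each block carries its own floating-point scale. The qblock type, the block size of 32, and the function name dot_qblocks are assumptions made for this example; the real q4_K and q8_K formats in ggml are more elaborate.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32  /* assumed block size, for illustration only */

/* Simplified quantized block: real value i is approximately scale * q[i]. */
typedef struct {
    float  scale;
    int8_t q[BLOCK_SIZE];
} qblock;

/* Dot product of two quantized vectors made of `nblocks` blocks each.
 * The inner loop runs entirely in integer arithmetic; the scales are
 * applied only once per block. */
float dot_qblocks(const qblock *x, const qblock *y, size_t nblocks)
{
    float total = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        int32_t isum = 0;
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            isum += (int32_t)x[b].q[i] * (int32_t)y[b].q[i];
        total += x[b].scale * y[b].scale * (float)isum;
    }
    return total;
}
```

Keeping the inner loop in integers is what makes quantized models cheaper to run than their floating-point originals, but the loop still executes once per element for every dot product, which is why it appears as a hot code path.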

Let’s go back to the second code path. Hover the mouse over the green box for the first function.

Screenshot

The tooltip shows the source file of this function, k_quants.c, along with its full path.

Screenshot

The source line number is 2608.

Screenshot

Click the icon to copy the full C source file path for this function.

Screenshot

Use the vim editor to open the source file and look at the C code in it. You can use any editor you like.

Screenshot

Go to line 2608 as shown in the report tooltip.

Screenshot

We can see that this line of code uses bitwise operations in C to process the elements of an array.

Screenshot
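Bitwise operations are typical in code that handles 4-bit quantized data, because two 4-bit values are usually packed into a single byte. The following sketch shows the general technique of unpacking such values with a mask and a shift. It is not the code at line 2608, and the function name unpack_nibbles is our own.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: unpack `n` bytes, each holding two 4-bit values,
 * into 2 * n separate bytes. The low nibble comes first. */
void unpack_nibbles(const uint8_t *packed, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = packed[i] & 0x0F;  /* mask keeps the low 4 bits */
        out[2 * i + 1] = packed[i] >> 4;    /* shift exposes the high 4 bits */
    }
}
```

For example, the byte 0xAB unpacks to the two values 0x0B and 0x0A. Each such operation is cheap, but it runs for every weight in every dot product, so it adds up in the profile.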

We see in the status bar that this line is in the function ggml_vec_dot_q4_K_q8_K as shown in the report.

Screenshot

Automatic analysis and reports

OpenResty XRay can also monitor online processes automatically and show analysis reports.

Screenshot

Go to the “Insights” page.

Screenshot

You can find the daily and weekly reports on the “Insights” page.

Screenshot

For this reason, you don’t have to use the “Guided Analysis” feature. Guided analysis is useful for application development and demonstration purposes.

Screenshot

What is OpenResty XRay

OpenResty XRay is a dynamic-tracing product that automatically analyzes your running applications to troubleshoot performance problems, behavioral issues, and security vulnerabilities with actionable suggestions. Under the hood, OpenResty XRay is powered by our Y language targeting various runtimes like Stap+, eBPF+, GDB, and ODB, depending on the contexts.

If you like this tutorial, please subscribe to this blog site and/or our YouTube channel. Thank you!

About The Author

Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.

Yichun is one of the earliest advocates and leaders of “open-source technology”. He worked at many internationally renowned tech companies, such as Cloudflare and Yahoo!. He is a pioneer of “edge computing”, “dynamic tracing” and “machine coding”, with over 22 years of programming and 16 years of open source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, adopted by more than 40 million global website domains.

OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. And its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.

As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, and Perl. He has also authored more than 60 open-source software libraries.