How CPU time is spent inside llama.cpp + LLaMA2 (using OpenResty XRay)
llama.cpp and LLaMA 2 are projects that make large language models (LLMs) more accessible and efficient for everyone. llama.cpp is a C/C++ port of the inference code for Meta’s LLaMA models. LLaMA 2 is a family of pretrained and fine-tuned generative text models from Meta; its chat variants are tuned for dialogue, and the larger variants use grouped-query attention. However, running these models takes a lot of CPU resources.
Today I will walk you through, step by step, how to use OpenResty XRay to analyze the llama.cpp application running a LLaMA 2 model. We’ll quickly pinpoint the most CPU-intensive C/C++ code paths in this application. These are the code paths that consume the most CPU time and therefore have the biggest effect on llama.cpp’s performance.
Problem: high CPU usage
Go to the directory of llama.cpp.
Let’s compile the C++ project first. We pass the -g option to the compiler to enable debug symbols.
Run the make command.
The compilation completes successfully.
Run the main program of llama.cpp. Llama 2 is a large language model released by Meta in partnership with Microsoft. Here we use a quantized 7B model that llama.cpp can run.
Specify the number of tokens to generate.
Use “Linux” as the prompt to generate the content.
Run the command. You can see that the text content is constantly being generated below.
Now open another terminal and run the top command to check the CPU usage.
As shown, the main command of llama.cpp is using almost 400% CPU, which is roughly four fully loaded CPU cores.
Use the guided analysis feature of OpenResty XRay to spot the hottest C++ code paths
Let’s use OpenResty XRay to check out this unmodified process. Open the OpenResty XRay web console in the web browser.
Here we can analyze it in real-time and find out which parts use the most CPU time.
Make sure you are watching the right machine.
You can choose the right machine from the list below if the current one is not correct.
Go to the “Guided Analysis” page.
Here you can see different types of problems that you can diagnose.
Let’s select “High CPU usage”.
Click on “Next”.
Choose “By Processes” as the way to select the target you want to analyze.
Select the process of the main program of llama.cpp.
Make sure that the application type is right. Usually the default should be correct.
The language level is just C/C++.
We can set the maximum analyzing time. We’ll leave it as 300 seconds, which is the default value.
Let’s start analyzing.
The system will keep performing different rounds of analysis. Now it’s executing the first round.
The first round is done and it’s on to the second one already. That’s enough for this case.
Let’s stop analyzing now.
It shows that the system is generating a report for the current analysis.
We can see it automatically creates an analysis report.
This is the type of problem we diagnosed: CPU.
This is the #1 hottest C++ code path for the CPU time.
The first function ggml_compute_forward_mul_mat is a general matrix multiplication function in the ggml library.
Its caller function ggml_graph_compute_thread is the routine that each worker thread runs to compute its share of the computation graph.
Click “More” to see details about this code path.
The code path was automatically derived from this CPU flame graph.
Below are more detailed explanations and suggestions regarding the current issue.
It mentions the function ggml_compute_forward_mul_mat.
It also explains that this function performs the matrix multiplication for the forward pass.
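To make it concrete why matrix multiplication dominates the CPU time, here is a minimal sketch of a naive matrix multiplication in C. The helper name naive_mul_mat and the row-major layout are assumptions made for illustration; this is not the ggml implementation, which works on quantized blocks and splits the work across threads.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of ggml: computes C = A * B for
 * row-major float matrices. A is m x k, B is k x n, C is m x n. */
void naive_mul_mat(const float *a, const float *b, float *c,
                   size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            /* Each output element is the dot product of a row of A
             * and a column of B. */
            for (size_t p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The three nested loops perform m * n * k multiply-add operations, so with the large weight matrices of a 7B model this kind of routine naturally shows up at the top of a CPU flame graph.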
Let’s check the #2 hottest C++ code path.
The first function ggml_vec_dot_q4_K_q8_K is a function in the ggml library that calculates the dot product of two quantized vectors, one in the Q4_K format and the other in the Q8_K format.
Click “More” to see details about this code path.
It mentions the function ggml_vec_dot_q4_K_q8_K.
It also mentions that it computes the dot product of two vectors.
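As a rough illustration of what a dot product over quantized vectors involves, the sketch below uses a deliberately simplified block format: one float scale plus 32 signed 8-bit values per block. The type toy_block_q8, the function toy_vec_dot_q8, and the block size are all assumptions for this example; the real Q4_K and Q8_K layouts in ggml are more elaborate.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32  /* illustrative block length, an assumption */

/* Hypothetical simplified block, NOT the ggml layout:
 * each real value is approximated as scale * q[i]. */
typedef struct {
    float  scale;
    int8_t q[BLOCK_SIZE];
} toy_block_q8;

/* Dot product of two vectors stored as nblocks quantized blocks each. */
float toy_vec_dot_q8(const toy_block_q8 *x, const toy_block_q8 *y,
                     size_t nblocks)
{
    float total = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        /* Accumulate in integers first, which is cheap on the CPU. */
        int32_t isum = 0;
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            isum += (int32_t)x[b].q[i] * (int32_t)y[b].q[i];
        }
        /* Apply both block scales once per block. */
        total += x[b].scale * y[b].scale * (float)isum;
    }
    return total;
}
```

Doing the inner accumulation in integer arithmetic and applying the scales only once per block is the general idea that makes quantized inference cheaper than working with full floats.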
Let’s go back to the second code path. Hover the mouse over the green box for the first function.
The tooltip shows the source file of this function, k_quants.c, along with its full path.
The source line number is 2608.
Click the icon to copy the full C source file path for this function.
Use the vim editor to open the source file and look at the C code in it. You can use any editor you like.
Go to line 2608 as shown in the report tooltip.
We can see that this line of code applies C bitwise operations to the elements of an array.
We see in the status bar that this line is in the function ggml_vec_dot_q4_K_q8_K as shown in the report.
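The exact statement on that source line is not reproduced here. As a general illustration of the kind of bitwise work that 4-bit quantized formats need, the following sketch unpacks two 4-bit values from each byte of an array. The helper unpack_nibbles and the low-nibble-first ordering are assumptions for this example and do not necessarily match the Q4_K layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT taken from k_quants.c: expands n_bytes packed
 * bytes into 2 * n_bytes values in the range 0..15. The low nibble of
 * each byte is written first, then the high nibble. */
void unpack_nibbles(const uint8_t *packed, uint8_t *out, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes; i++) {
        out[2 * i]     = packed[i] & 0x0F;  /* mask keeps the low 4 bits */
        out[2 * i + 1] = packed[i] >> 4;    /* shift exposes the high 4 bits */
    }
}
```

For example, the byte 0xA3 unpacks to the two values 3 and 10. Masks and shifts like these run once per weight, which is why such a line can stand out in a CPU profile.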
Automatic analysis and reports
OpenResty XRay can also monitor online processes automatically and show analysis reports.
Go to the “Insights” page.
You can find the daily and weekly reports on the “Insights” page.
Because of this, you don’t have to use the “Guided Analysis” feature for routine monitoring. Guided analysis is most useful for application development and demonstration purposes.
What is OpenResty XRay
OpenResty XRay is a dynamic-tracing product that automatically analyzes your running applications to troubleshoot performance problems, behavioral issues, and security vulnerabilities with actionable suggestions. Under the hood, OpenResty XRay is powered by our Y language targeting various runtimes like Stap+, eBPF+, GDB, and ODB, depending on the contexts.
If you like this tutorial, please subscribe to this blog site and/or our YouTube channel. Thank you!
About The Author
Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.
Yichun is one of the earliest advocates and leaders of “open-source technology”. He has worked at many internationally renowned tech companies, such as Cloudflare and Yahoo!. He is a pioneer of “edge computing”, “dynamic tracing” and “machine coding”, with over 22 years of programming and 16 years of open source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, adopted by more than 40 million global website domains.
OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. And its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.
As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, and Perl. He has also authored more than 60 open-source software libraries.