llama.cpp b8575 (LLM inference engine)⚓︎

Homepage github.com/ggml-org/llama.cpp
Changelog github.com/ggml-org/llama.cpp/releases
Repository github.com/ggml-org/llama.cpp
Package make/pkgs/llama-cpp/
Steward Ircama

Overview⚓︎

llama.cpp is a high-performance, pure-C/C++ LLM inference engine that runs
quantized GGUF language models entirely on the CPU — no GPU, CUDA, or driver
stack required.

The Freetz-EVO package cross-compiles llama.cpp to MIPS32 for the FRITZ!Box.
It uses the CMake build system with shared libraries (libllama.so, libggml*.so)
so that multiple tools share a single copy of the inference code, keeping total
storage usage manageable on a 100 MB device.

Practical device constraints⚓︎

| Resource | Value | Impact |
|----------|-------|--------|
| Storage | ~100 MB | All binaries + libs must fit; models go to USB/NAS |
| RAM | 512 MB | Practical model limit: Q4_K_M ≤ ~400 MB |
| CPU cores | 4 × MIPS34Kc | Use --threads 4 for maximum throughput |
| GPU | none | --n-gpu-layers 0 (always) |

Example models and their approximate footprint:

| Model | GGUF quantisation | Size | RAM usage |
|-------|-------------------|------|-----------|
| SmolLM2-360M | Q4_K_M | ~220 MB | ~280 MB |
| Qwen2.5-0.5B-Instruct | Q4_K_M | ~350 MB | ~420 MB |
| SmolLM2-1.7B | Q4_K_M | ~1.1 GB | too large |

Inference speed on MIPS34Kc (single-threaded) is ~1–5 tokens/s depending on model
size and quantisation. Enable threading (-t 4) for better throughput.
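
As a quick sanity check before pointing the server at a model, compare the model
file size against free RAM on the device. The paths below follow the example
layout used elsewhere on this page.

# On the FRITZ!Box: free RAM and model sizes at a glance
free
ls -lh /var/media/ftp/llama-cpp/*.gguf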


Installation⚓︎

Enable the package in make menuconfig under:

Packages → L → llama.cpp

Optional tools and server features can be selected within the llama.cpp submenu:

[*] llama.cpp b8575 (LLM inference engine)
    --- Optional tools
    [ ] llama-bench          — performance benchmark
    [ ] llama-perplexity     — model quality measurement
    [ ] llama-tokenize       — tokenizer diagnostic
    [ ] llama-imatrix        — importance matrix for smart quantisation
    [ ] llama-gguf-split     — split/merge multi-shard GGUF files
    [ ] llama-batched-bench  — batched inference benchmark
    [ ] llama-tts            — text-to-speech (OuteTTS)
    [ ] llama-mtmd-cli       — multimodal (vision) inference
    --- llama-server options
    [ ] Enable HTTPS (OpenSSL) support in llama-server
    [ ] Embed built-in web UI in llama-server (~2 MB extra)

What gets installed⚓︎

| Item | Location |
|------|----------|
| llama-cli | /usr/bin/llama-cli |
| llama-server | /usr/bin/llama-server |
| llama-quantize | /usr/bin/llama-quantize |
| Optional tools | /usr/bin/llama-* |
| libllama.so | /usr/lib/libllama.so |
| libggml.so, libggml-cpu.so, … | /usr/lib/libggml*.so* |
| Init script | /etc/init.d/rc.llama-cpp |
| Default config | /mod/etc/default.llama-cpp/llama-cpp.cfg |

Externalising all binaries and shared libraries is strongly recommended (the
stripped total can reach 40–80 MB). The externalisation submenu is pre-enabled,
with a default of y for both binaries and libraries.
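
A rough way to verify the on-device footprint after flashing is to total up the
installed files directly; the exact file list depends on which optional tools were
selected, and the du options assume a reasonably complete busybox.

# Approximate storage usage of the installed package
du -ch /usr/bin/llama-* /usr/lib/libllama.so* /usr/lib/libggml*.so* 2>/dev/null | tail -n 1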


Quick start⚓︎

1. Place a GGUF model on USB storage⚓︎

# On the FRITZ!Box, after USB drive mounts
mkdir -p /var/media/ftp/llama-cpp
# Copy a model file there (e.g. via SCP)
scp qwen2.5-0.5b-instruct-q4_k_m.gguf root@fritz.box:/var/media/ftp/llama-cpp/

2. Configure via the Freetz web UI⚓︎

Go to http://fritz.box:81/ → Packages → llama.cpp and set:

| Field | Example |
|-------|---------|
| Base directory | /var/media/ftp/llama-cpp |
| Model path | /var/media/ftp/llama-cpp/qwen2.5-0.5b-instruct-q4_k_m.gguf |
| Host | 0.0.0.0 |
| Port | 8080 |
| Threads | 4 |
| Context size | 512 (reduce if RAM is tight) |
| Enabled | yes |
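
These fields map onto the LLAMA_CPP_* variables described under Runtime
configuration below. For orientation only, a configuration matching the example
above would look roughly like this in /mod/etc/conf/llama-cpp.cfg (an illustrative
sketch, not a literal dump of the generated file):

export LLAMA_CPP_ENABLED='yes'
export LLAMA_CPP_BASEDIR='/var/media/ftp/llama-cpp'
export LLAMA_CPP_MODEL='/var/media/ftp/llama-cpp/qwen2.5-0.5b-instruct-q4_k_m.gguf'
export LLAMA_CPP_HOST='0.0.0.0'
export LLAMA_CPP_PORT='8080'
export LLAMA_CPP_THREADS='4'
export LLAMA_CPP_CTX_SIZE='512'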

3. Start the server⚓︎

/etc/init.d/rc.llama-cpp start
# or leave it enabled in the web UI and let it start automatically at the next boot
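
The init script also supports status (see Relevant files), which is the quickest
way to confirm the server actually came up:

/etc/init.d/rc.llama-cpp status
# or check the process table directly
ps | grep llama-server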

4. Run inference⚓︎

# Interactive CLI (on the device itself)
llama-cli -m /var/media/ftp/llama-cpp/model.gguf \
          -p "Explain quantum computing in one paragraph" \
          -n 200 -t 4 --no-display-prompt

# REST API from a client on the same LAN
curl http://fritz.box:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Check server health
curl http://fritz.box:8080/health
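
Besides the OpenAI-compatible endpoints, llama-server exposes its native completion
endpoint; the example below uses the field names of recent llama.cpp releases and is
worth double-checking against the b8575 server documentation.

# Native llama-server completion endpoint
curl http://fritz.box:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is a router?", "n_predict": 64}'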

Runtime configuration⚓︎

The init script and Freetz web UI expose these settings:

| Variable | Default | Description |
|----------|---------|-------------|
| LLAMA_CPP_ENABLED | no | Auto-start at boot |
| LLAMA_CPP_BASEDIR | /var/media/ftp/llama-cpp | Base directory for models |
| LLAMA_CPP_MODEL | (empty) | Path to GGUF model (optional at startup) |
| LLAMA_CPP_HOST | 0.0.0.0 | Bind address |
| LLAMA_CPP_PORT | 8080 | HTTP port |
| LLAMA_CPP_THREADS | 4 | Number of inference threads |
| LLAMA_CPP_CTX_SIZE | 2048 | Context size in tokens |
| LLAMA_CPP_NGL | 0 | GPU layers (must be 0, no GPU) |
| LLAMA_CPP_PARALLEL | 1 | Simultaneous client slots |
| LLAMA_CPP_EXTRA_ARGS | (empty) | Extra llama-server flags |
| LLAMA_CPP_CONFIG_WAIT | 120 | Boot wait time (0 = synchronous start) |
| LLAMA_CPP_NICE | 5 | Nice priority |

All variables are persisted in /mod/etc/conf/llama-cpp.cfg.
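
The same settings can also be changed from a shell session. The sketch below assumes
the configuration file uses the usual Freetz export VAR='value' layout; changes made
this way bypass the web UI save hooks, so persisting them across reboots still follows
the normal Freetz configuration-saving procedure.

# Example: shrink the context size to free RAM, then restart the server
sed -i "s/^export LLAMA_CPP_CTX_SIZE=.*/export LLAMA_CPP_CTX_SIZE='512'/" /mod/etc/conf/llama-cpp.cfg
/etc/init.d/rc.llama-cpp stop
/etc/init.d/rc.llama-cpp start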


Using llama-quantize⚓︎

llama-quantize converts full-precision or half-precision GGUF files to smaller
quantised formats:

# Q4_K_M is the best quality/size tradeoff for most models
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Q4_K_M with importance matrix for higher quality
llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4_k_m.gguf Q4_K_M

Quantisation is CPU-intensive and takes significant time on MIPS. It is usually
more practical to quantise on a host PC and copy the resulting GGUF to the device.
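
A typical host-side workflow, assuming a native llama.cpp build on the PC (the build
path and model file names are illustrative):

# On a host PC with a native llama.cpp build
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
scp model-q4_k_m.gguf root@fritz.box:/var/media/ftp/llama-cpp/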


Build details⚓︎

CMake cross-compilation⚓︎

The package uses $(PKG_CONFIGURED_CMAKE) with the following key flags:

| CMake flag | Value | Reason |
|------------|-------|--------|
| CMAKE_SYSTEM_NAME | Linux | Explicit cross-compile target |
| CMAKE_SYSTEM_PROCESSOR | mips | Target architecture |
| CMAKE_C_COMPILER | $(TARGET_CC) | MIPS cross-compiler |
| CMAKE_CXX_COMPILER | $(TARGET_CXX) | MIPS C++ cross-compiler |
| GGML_NATIVE | OFF | Critical: prevents -march=native from using host CPU |
| BUILD_SHARED_LIBS | ON | Shared libs save storage vs. static-linked copies per tool |
| LLAMA_BUILD_EXAMPLES | OFF | Reduces build time; tools cover all needed functionality |
| LLAMA_BUILD_TESTS | OFF | No test framework on device |
| All GPU backends | OFF | FRITZ!Box has no GPU |
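
For orientation, the configure step driven by $(PKG_CONFIGURED_CMAKE) corresponds
roughly to the manual invocation below; $TARGET_CC and $TARGET_CXX stand in for the
cross-toolchain paths, and the exact set of GPU-backend switches (GGML_CUDA,
GGML_VULKAN, …) should be checked against the CMake options available at b8575.

cmake -S . -B build \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=mips \
  -DCMAKE_C_COMPILER="$TARGET_CC" \
  -DCMAKE_CXX_COMPILER="$TARGET_CXX" \
  -DGGML_NATIVE=OFF \
  -DBUILD_SHARED_LIBS=ON \
  -DLLAMA_BUILD_EXAMPLES=OFF \
  -DLLAMA_BUILD_TESTS=OFF \
  -DGGML_CUDA=OFF -DGGML_VULKAN=OFF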

No submodule issue⚓︎

At tag b8575, llama.cpp has an empty .gitmodules — all ggml source code
is inlined directly in the ggml/ directory of the main repository. The GitHub
archive tarball at the commit hash therefore contains everything needed; no
secondary downloads are required.

Shared libraries and storage⚓︎

The CMake build produces:
- libggml.so, libggml-cpu.so, and related ggml libraries
- libllama.so (llama API layer)

Each tool binary links against these shared libraries, so the per-binary overhead
is small (~1–5 MB stripped each for most tools). Total storage including all shared
libraries is typically 40–80 MB stripped. Externalisation to the external filesystem
(SquashFS or ext2 partition) is strongly recommended.
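
To confirm that a tool really links against the shared libraries instead of carrying
its own copy of the code, the dynamic dependencies can be listed with any binutils
readelf on the build host (the binary path below is a placeholder for wherever the
staged or installed llama-cli ends up):

# List NEEDED entries of the cross-compiled binary (placeholder path)
readelf -d /path/to/staged/llama-cli | grep NEEDED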

Model storage⚓︎

Models are NOT part of the firmware image. They must be on writable external
storage (USB hard drive or USB flash drive, preferably formatted as ext4).
The LLAMA_CPP_BASEDIR configuration variable points to the model directory.
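
Before copying multi-hundred-megabyte models, it is worth confirming that the USB
volume is mounted where LLAMA_CPP_BASEDIR expects it and has enough free space:

# Check the mount and free space, then create the model directory
df -h /var/media/ftp
mkdir -p /var/media/ftp/llama-cpp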


Relevant files⚓︎

| File | Purpose |
|------|---------|
| make/pkgs/llama-cpp/llama-cpp.mk | Package makefile: download, CMake configure, build, install |
| make/pkgs/llama-cpp/Config.in | Kconfig: main package + optional tools + server options |
| make/pkgs/llama-cpp/external.in | Externalisation config (binaries and shared libraries) |
| make/pkgs/llama-cpp/external.files | List of files to externalise |
| make/pkgs/llama-cpp/files/root/etc/init.d/rc.llama-cpp | Init script: start/stop/status for llama-server |
| make/pkgs/llama-cpp/files/root/mod/etc/default.llama-cpp/llama-cpp.cfg | Default configuration (all exported variables) |
| make/pkgs/llama-cpp/files/root/mod/etc/default.llama-cpp/llama-cpp.save | Save hooks (pre/apply) for the modconf web UI |