# llama.cpp b8575 (LLM inference engine)
- Homepage: https://github.com/ggml-org/llama.cpp
- Changelog: https://github.com/ggml-org/llama.cpp/releases
- Repository: https://github.com/ggml-org/llama.cpp
- Package: make/pkgs/llama-cpp/
- Steward: @Ircama
## Overview
llama.cpp is a high-performance, pure-C/C++ LLM inference engine that runs
quantized GGUF language models entirely on the CPU — no GPU, CUDA, or driver
stack required.
The Freetz-EVO package cross-compiles llama.cpp to MIPS32 for the FRITZ!Box.
It uses the CMake build system with shared libraries (libllama.so, libggml*.so)
so that multiple tools share a single copy of the inference code, keeping total
storage usage manageable on a 100 MB device.
## Practical device constraints
| Resource | Value | Impact |
|---|---|---|
| Storage | ~100 MB | All binaries + libs must fit; models go to USB/NAS |
| RAM | 512 MB | Practical model limit: Q4_K_M ≤ ~400 MB |
| CPU cores | 4 × MIPS34Kc | Use --threads 4 for maximum throughput |
| GPU | none | --n-gpu-layers 0 (always) |
## Recommended models for this device
| Model | GGUF quantisation | Size | RAM usage |
|---|---|---|---|
| SmolLM2-360M | Q4_K_M | ~220 MB | ~280 MB |
| Qwen2.5-0.5B-Instruct | Q4_K_M | ~350 MB | ~420 MB |
| SmolLM2-1.7B | Q4_K_M | ~1.1 GB | too large |
Inference speed on MIPS34Kc (single-threaded) is ~1–5 tokens/s depending on model
size and quantisation. Enable threading (-t 4) for better throughput.
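To measure actual throughput on your own model, the optional llama-bench tool (selectable in menuconfig) can be run on the device. A sketch; the model filename is the example from the table above:

```shell
# Benchmark prompt processing (-p, tokens) and generation (-n, tokens)
# rates using all four cores. Requires the optional llama-bench tool.
llama-bench -m /var/media/ftp/llama-cpp/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    -t 4 -p 64 -n 32
```

Keep `-p`/`-n` small on MIPS; the benchmark repeats each measurement several times and can otherwise run for a long while.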
## Installation

Enable the package in make menuconfig; optional tools and server features can be selected within the llama.cpp submenu:
```
[*] llama.cpp b8575 (LLM inference engine)
    --- Optional tools
    [ ] llama-bench — performance benchmark
    [ ] llama-perplexity — model quality measurement
    [ ] llama-tokenize — tokenizer diagnostic
    [ ] llama-imatrix — importance matrix for smart quantisation
    [ ] llama-gguf-split — split/merge multi-shard GGUF files
    [ ] llama-batched-bench — batched inference benchmark
    [ ] llama-tts — text-to-speech (OuteTTS)
    [ ] llama-mtmd-cli — multimodal (vision) inference
    --- llama-server options
    [ ] Enable HTTPS (OpenSSL) support in llama-server
    [ ] Embed built-in web UI in llama-server (~2 MB extra)
```
### What gets installed
| Item | Location |
|---|---|
| `llama-cli` | /usr/bin/llama-cli |
| `llama-server` | /usr/bin/llama-server |
| `llama-quantize` | /usr/bin/llama-quantize |
| Optional tools | /usr/bin/llama-* |
| `libllama.so` | /usr/lib/libllama.so |
| `libggml.so`, `libggml-cpu.so`, … | /usr/lib/libggml*.so* |
| Init script | /etc/init.d/rc.llama-cpp |
| Default config | /mod/etc/default.llama-cpp/llama-cpp.cfg |
All binaries and shared libraries are strongly recommended for externalisation
(total size can be 40–80 MB stripped). The externalisation submenu is pre-enabled
with a default of y for both binaries and libraries.
## Quick start
### 1. Place a GGUF model on USB storage
```shell
# On the FRITZ!Box, after the USB drive mounts
mkdir -p /var/media/ftp/llama-cpp

# Copy a model file there (e.g. via SCP from a PC)
scp qwen2.5-0.5b-instruct-q4_k_m.gguf root@fritz.box:/var/media/ftp/llama-cpp/
```
### 2. Configure via the Freetz web UI
Go to http://fritz.box:81/ → Packages → llama.cpp and set:
| Field | Example |
|---|---|
| Base directory | /var/media/ftp/llama-cpp |
| Model path | /var/media/ftp/llama-cpp/qwen2.5-0.5b-instruct-q4_k_m.gguf |
| Host | 0.0.0.0 |
| Port | 8080 |
| Threads | 4 |
| Context size | 512 (reduce if RAM is tight) |
| Enabled | yes |
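The context size dominates the KV-cache footprint, which is why shrinking it frees noticeable RAM on a 512 MB device. A back-of-the-envelope estimate for an f16 cache (a sketch; the layer count and embedding width below are illustrative assumptions for a ~0.5B model, and grouped-query attention shrinks the real cache further):

```shell
# f16 KV cache ≈ 2 tensors (K and V) x layers x context x embedding x 2 bytes
layers=24 ctx=512 embd=896
echo $(( 2 * layers * ctx * embd * 2 / 1024 / 1024 )) MB   # prints: 42 MB
```

Doubling the context roughly doubles this figure, on top of the model's own working set.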
### 3. Start the server
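The package's init script (see /etc/init.d/rc.llama-cpp above) starts and supervises llama-server. A typical session might look like this (a sketch; loading the model can take a while on MIPS before the health endpoint answers):

```shell
# Start llama-server via the package init script
/etc/init.d/rc.llama-cpp start

# Confirm it is running
/etc/init.d/rc.llama-cpp status

# Once the model has loaded, the health endpoint should respond
curl -s http://fritz.box:8080/health
```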
### 4. Run inference
```shell
# Interactive CLI (on the device itself)
llama-cli -m /var/media/ftp/llama-cpp/model.gguf \
    -p "Explain quantum computing in one paragraph" \
    -n 200 -t 4 --no-display-prompt

# REST API from a client on the same LAN
curl http://fritz.box:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }'

# Check server health
curl http://fritz.box:8080/health
```
## Runtime configuration
The init script and Freetz web UI expose these settings:
| Variable | Default | Description |
|---|---|---|
| `LLAMA_CPP_ENABLED` | `no` | Auto-start at boot |
| `LLAMA_CPP_BASEDIR` | `/var/media/ftp/llama-cpp` | Base directory for models |
| `LLAMA_CPP_MODEL` | (empty) | Path to GGUF model (optional at startup) |
| `LLAMA_CPP_HOST` | `0.0.0.0` | Bind address |
| `LLAMA_CPP_PORT` | `8080` | HTTP port |
| `LLAMA_CPP_THREADS` | `4` | Number of inference threads |
| `LLAMA_CPP_CTX_SIZE` | `2048` | Context size in tokens |
| `LLAMA_CPP_NGL` | `0` | GPU layers (must be 0, no GPU) |
| `LLAMA_CPP_PARALLEL` | `1` | Simultaneous client slots |
| `LLAMA_CPP_EXTRA_ARGS` | (empty) | Extra llama-server flags |
| `LLAMA_CPP_CONFIG_WAIT` | `120` | Boot wait time (0 = synchronous start) |
| `LLAMA_CPP_NICE` | `5` | Nice priority |
All variables are persisted in /mod/etc/conf/llama-cpp.cfg.
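Settings can also be changed from a shell. The sketch below assumes an `export VAR=value` layout in llama-cpp.cfg (an assumption; check your file) and therefore demonstrates the edit on a scratch copy:

```shell
# On the device the live file is /mod/etc/conf/llama-cpp.cfg; after editing it,
# apply the change with: /etc/init.d/rc.llama-cpp stop && /etc/init.d/rc.llama-cpp start
cfg=/tmp/llama-cpp.cfg.demo
printf 'export LLAMA_CPP_THREADS=2\nexport LLAMA_CPP_CTX_SIZE=2048\n' > "$cfg"

# Raise the thread count to use all four MIPS cores
sed -i 's/^export LLAMA_CPP_THREADS=.*/export LLAMA_CPP_THREADS=4/' "$cfg"
grep LLAMA_CPP_THREADS "$cfg"   # prints: export LLAMA_CPP_THREADS=4
```

For one-off experiments, `LLAMA_CPP_EXTRA_ARGS` is the place for additional llama-server flags without touching the managed variables.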
## Using llama-quantize
llama-quantize converts full-precision or half-precision GGUF files to smaller
quantised formats:
```shell
# Q4_K_M is the best quality/size tradeoff for most models
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Q4_K_M with importance matrix for higher quality
llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
Quantisation is CPU-intensive and takes significant time on MIPS. It is usually
more practical to quantise on a host PC and copy the resulting GGUF to the device.
## Build details

### CMake cross-compilation
The package uses $(PKG_CONFIGURED_CMAKE) with the following key flags:
| CMake flag | Value | Reason |
|---|---|---|
| `CMAKE_SYSTEM_NAME` | `Linux` | Explicit cross-compile target |
| `CMAKE_SYSTEM_PROCESSOR` | `mips` | Target architecture |
| `CMAKE_C_COMPILER` | `$(TARGET_CC)` | MIPS cross-compiler |
| `CMAKE_CXX_COMPILER` | `$(TARGET_CXX)` | MIPS C++ cross-compiler |
| `GGML_NATIVE` | `OFF` | Critical: prevents `-march=native` from using host CPU |
| `BUILD_SHARED_LIBS` | `ON` | Shared libs save storage vs. static-linked copies per tool |
| `LLAMA_BUILD_EXAMPLES` | `OFF` | Reduces build time; tools cover all needed functionality |
| `LLAMA_BUILD_TESTS` | `OFF` | No test framework on device |
| All GPU backends | `OFF` | FRITZ!Box has no GPU |
### No submodule issue
At tag b8575, llama.cpp has an empty .gitmodules — all ggml source code
is inlined directly in the ggml/ directory of the main repository. The GitHub
archive tarball at the commit hash therefore contains everything needed; no
secondary downloads are required.
### Shared libraries and storage
The CMake build produces:
- libggml.so, libggml-cpu.so, and related ggml libraries
- libllama.so (llama API layer)
Each tool binary links against these shared libraries, so the per-binary overhead
is small (~1–5 MB stripped each for most tools). Total storage including all shared
libraries is typically 40–80 MB stripped. Externalisation to the external filesystem
(SquashFS or ext2 partition) is strongly recommended.
### Model storage
Models are NOT part of the firmware image. They must be on writable external
storage (USB hard drive or USB flash drive, preferably formatted as ext4).
The LLAMA_CPP_BASEDIR configuration variable points to the model directory.
## Relevant files
| File | Purpose |
|---|---|
| `make/pkgs/llama-cpp/llama-cpp.mk` | Package makefile: download, CMake configure, build, install |
| `make/pkgs/llama-cpp/Config.in` | Kconfig: main package + optional tools + server options |
| `make/pkgs/llama-cpp/external.in` | Externalisation config (binaries and shared libraries) |
| `make/pkgs/llama-cpp/external.files` | List of files to externalise |
| `make/pkgs/llama-cpp/files/root/etc/init.d/rc.llama-cpp` | Init script: start/stop/status for llama-server |
| `make/pkgs/llama-cpp/files/root/mod/etc/default.llama-cpp/llama-cpp.cfg` | Default configuration (all exported variables) |
| `make/pkgs/llama-cpp/files/root/mod/etc/default.llama-cpp/llama-cpp.save` | Save hooks (pre/apply) for the modconf web UI |