llama.cpp b8575 (LLM inference engine)⚓︎
- Homepage: https://github.com/ggml-org/llama.cpp
- Changelog: https://github.com/ggml-org/llama.cpp/releases
- Repository: https://github.com/ggml-org/llama.cpp
- Package: master/make/pkgs/llama-cpp/
- Steward: Ircama
Overview⚓︎
llama.cpp is a high-performance, pure-C/C++ LLM inference engine that runs
quantized GGUF language models entirely on the CPU — no GPU, CUDA, or driver
stack required.
The Freetz-EVO package cross-compiles llama.cpp to MIPS32 for the FRITZ!Box.
It uses the CMake build system with shared libraries (libllama.so, libggml*.so)
so that multiple tools share a single copy of the inference code, keeping total
storage usage manageable on a 100 MB device.
Practical device constraints⚓︎
| Resource | Value | Impact |
|---|---|---|
| Storage | ~100 MB | All binaries + libs must fit; models go to USB/NAS |
| RAM | 512 MB | Practical model limit: Q4_K_M ≤ ~400 MB |
| CPU cores | 4 × MIPS34Kc | Use --threads 4 for maximum throughput |
| GPU | none | --n-gpu-layers 0 (always) |
Recommended models for this device⚓︎
| Model | GGUF quantisation | Size | RAM usage |
|---|---|---|---|
| SmolLM2-360M | Q4_K_M | ~220 MB | ~280 MB |
| Qwen2.5-0.5B-Instruct | Q4_K_M | ~350 MB | ~420 MB |
| SmolLM2-1.7B | Q4_K_M | ~1.1 GB | too large |
Inference speed on MIPS34Kc (single-threaded) is ~1–5 tokens/s depending on model
size and quantisation. Enable threading (-t 4) for better throughput.
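If the optional llama-bench tool is selected (see the Installation menu below), actual throughput can be measured on the device itself; the model path is illustrative and the flags are standard llama-bench options:
# Measure prompt processing (-p) and token generation (-n) speed with 4 threads
llama-bench -m /var/media/ftp/llama-cpp/model.gguf -t 4 -p 64 -n 32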
Installation⚓︎
Enable the package in make menuconfig. Optional tools and server features can be selected within the llama.cpp submenu:
[*] llama.cpp b8575 (LLM inference engine)
--- Optional tools
[ ] llama-bench — performance benchmark
[ ] llama-perplexity — model quality measurement
[ ] llama-tokenize — tokenizer diagnostic
[ ] llama-imatrix — importance matrix for smart quantisation
[ ] llama-gguf-split — split/merge multi-shard GGUF files
[ ] llama-batched-bench — batched inference benchmark
[ ] llama-tts — text-to-speech (OuteTTS)
[ ] llama-mtmd-cli — multimodal (vision) inference
--- llama-server options
[ ] Enable HTTPS (OpenSSL) support in llama-server
[ ] Embed built-in web UI in llama-server (~2 MB extra)
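After selecting the package and any optional tools, rebuild the firmware image in the usual Freetz way:
# Select the package, then rebuild the image with it included
make menuconfig
make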
What gets installed⚓︎
| Item | Location |
|---|---|
| llama-cli | /usr/bin/llama-cli |
| llama-server | /usr/bin/llama-server |
| llama-quantize | /usr/bin/llama-quantize |
| Optional tools | /usr/bin/llama-* |
| libllama.so | /usr/lib/libllama.so |
| libggml.so, libggml-cpu.so, … | /usr/lib/libggml*.so* |
| Init script | /etc/init.d/rc.llama-cpp |
| Default config | /mod/etc/default.llama-cpp/llama-cpp.cfg |
All binaries and shared libraries are strongly recommended for externalisation
(total size can be 40–80 MB stripped). The externalisation submenu is pre-enabled,
defaulting to y for both binaries and libraries.
Quick start⚓︎
1. Place a GGUF model on USB storage⚓︎
# On the FRITZ!Box, after USB drive mounts
mkdir -p /var/media/ftp/llama-cpp
# Copy a model file there (e.g. via SCP)
scp qwen2.5-0.5b-instruct-q4_k_m.gguf root@fritz.box:/var/media/ftp/llama-cpp/
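If no GGUF file is at hand yet, one can be fetched on the host PC first; the Hugging Face repository and filename below are only examples of where Q4_K_M builds of the recommended models are published:
# On the host PC: download a quantised GGUF (repository/filename are examples)
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir .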
2. Configure via the Freetz web UI⚓︎
Go to http://fritz.box:81/ → Packages → llama.cpp and set:
| Field | Example |
|---|---|
| Base directory | /var/media/ftp/llama-cpp |
| Model path | /var/media/ftp/llama-cpp/qwen2.5-0.5b-instruct-q4_k_m.gguf |
| Host | 0.0.0.0 |
| Port | 8080 |
| Threads | 4 |
| Context size | 512 (reduce if RAM is tight) |
| Enabled | yes |
3. Start the server⚓︎
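Start llama-server either by rebooting with Enabled set to yes, or manually through the package init script (listed under "What gets installed" above):
# Start, query and stop the server via the Freetz init script
/etc/init.d/rc.llama-cpp start
/etc/init.d/rc.llama-cpp status
/etc/init.d/rc.llama-cpp stop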
4. Run inference⚓︎
# Interactive CLI (on the device itself)
llama-cli -m /var/media/ftp/llama-cpp/model.gguf \
-p "Explain quantum computing in one paragraph" \
-n 200 -t 4 --no-display-prompt
# REST API from a client on the same LAN
curl http://fritz.box:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
# Check server health
curl http://fritz.box:8080/health
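For reference, responses follow the OpenAI-compatible chat-completions schema that llama-server implements; the abridged output below is illustrative and exact fields vary by version:
# Pretty-print the JSON response (if jq is available on the client)
curl -s http://fritz.box:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "messages": [{"role": "user", "content": "Hello!"}]}' | jq .
# Abridged example response (values illustrative):
# {"choices":[{"message":{"role":"assistant","content":"Hello! ..."}}],
#  "usage":{"prompt_tokens":9,"completion_tokens":10,"total_tokens":19}}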
Runtime configuration⚓︎
The init script and Freetz web UI expose these settings:
| Variable | Default | Description |
|---|---|---|
| LLAMA_CPP_ENABLED | no | Auto-start at boot |
| LLAMA_CPP_BASEDIR | /var/media/ftp/llama-cpp | Base directory for models |
| LLAMA_CPP_MODEL | (empty) | Path to GGUF model (optional at startup) |
| LLAMA_CPP_HOST | 0.0.0.0 | Bind address |
| LLAMA_CPP_PORT | 8080 | HTTP port |
| LLAMA_CPP_THREADS | 4 | Number of inference threads |
| LLAMA_CPP_CTX_SIZE | 2048 | Context size in tokens |
| LLAMA_CPP_NGL | 0 | GPU layers (must be 0, no GPU) |
| LLAMA_CPP_PARALLEL | 1 | Simultaneous client slots |
| LLAMA_CPP_EXTRA_ARGS | (empty) | Extra llama-server flags |
| LLAMA_CPP_CONFIG_WAIT | 120 | Boot wait time (0 = sync start) |
| LLAMA_CPP_NICE | 5 | Nice priority |
All variables are persisted in /mod/etc/conf/llama-cpp.cfg.
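For illustration, a persisted file with the server enabled looks roughly like this; the format follows the usual Freetz exported-shell-variable convention and the values are examples only:
export LLAMA_CPP_ENABLED='yes'
export LLAMA_CPP_BASEDIR='/var/media/ftp/llama-cpp'
export LLAMA_CPP_MODEL='/var/media/ftp/llama-cpp/qwen2.5-0.5b-instruct-q4_k_m.gguf'
export LLAMA_CPP_HOST='0.0.0.0'
export LLAMA_CPP_PORT='8080'
export LLAMA_CPP_THREADS='4'
export LLAMA_CPP_CTX_SIZE='512'
export LLAMA_CPP_NGL='0'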
Using llama-quantize⚓︎
llama-quantize converts full-precision or half-precision GGUF files to smaller
quantised formats:
# Q4_K_M is the best quality/size tradeoff for most models
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Q4_K_M with importance matrix for higher quality
llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4_k_m.gguf Q4_K_M
Quantisation is CPU-intensive and takes significant time on MIPS. It is usually
more practical to quantise on a host PC and copy the resulting GGUF to the device.
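A typical host-side workflow is therefore (using a native llama.cpp build on the PC; filenames are illustrative):
# On the host PC: quantise, then copy the result to the box
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
scp model-q4_k_m.gguf root@fritz.box:/var/media/ftp/llama-cpp/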
Build details⚓︎
CMake cross-compilation⚓︎
The package uses $(PKG_CONFIGURED_CMAKE) with the following key flags:
| CMake flag | Value | Reason |
|---|---|---|
| CMAKE_SYSTEM_NAME | Linux | Explicit cross-compile target |
| CMAKE_SYSTEM_PROCESSOR | mips | Target architecture |
| CMAKE_C_COMPILER | $(TARGET_CC) | MIPS cross-compiler |
| CMAKE_CXX_COMPILER | $(TARGET_CXX) | MIPS C++ cross-compiler |
| GGML_NATIVE | OFF | Critical: prevents -march=native from using host CPU |
| BUILD_SHARED_LIBS | ON | Shared libs save storage vs. static-linked copies per tool |
| LLAMA_BUILD_EXAMPLES | OFF | Reduces build time; tools cover all needed functionality |
| LLAMA_BUILD_TESTS | OFF | No test framework on device |
| All GPU backends | OFF | FRITZ!Box has no GPU |
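As a sketch of what these flags amount to (not the literal makefile contents; the build directory name and environment variables are illustrative), the configure and build steps are equivalent to:
# Cross-compile with the Freetz MIPS toolchain exported as TARGET_CC/TARGET_CXX
cmake -S . -B build \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=mips \
    -DCMAKE_C_COMPILER="$TARGET_CC" \
    -DCMAKE_CXX_COMPILER="$TARGET_CXX" \
    -DGGML_NATIVE=OFF \
    -DBUILD_SHARED_LIBS=ON \
    -DLLAMA_BUILD_EXAMPLES=OFF \
    -DLLAMA_BUILD_TESTS=OFF
cmake --build build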
No submodule issue⚓︎
At tag b8575, llama.cpp has an empty .gitmodules — all ggml source code
is inlined directly in the ggml/ directory of the main repository. The GitHub
archive tarball at the commit hash therefore contains everything needed; no
secondary downloads are required.
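For reference, the pinned sources follow the standard GitHub archive URL scheme (a commit hash can be substituted for the tag):
# Single tarball, no submodule checkout needed
wget https://github.com/ggml-org/llama.cpp/archive/refs/tags/b8575.tar.gz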
Shared libraries and storage⚓︎
The CMake build produces:
- libggml.so, libggml-cpu.so, and related ggml libraries
- libllama.so (llama API layer)
Each tool binary links against these shared libraries, so the per-binary overhead
is small (~1–5 MB stripped each for most tools). Total storage including all shared
libraries is typically 40–80 MB stripped. Externalisation to the external filesystem
(SquashFS or ext2 partition) is strongly recommended.
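To confirm on the device that the tools link against the shared libraries rather than carrying private copies (assuming ldd is included in the BusyBox build):
# Each tool should list libllama.so and the libggml*.so libraries from /usr/lib
ldd /usr/bin/llama-cli
ls -lh /usr/lib/libllama.so /usr/lib/libggml*.so*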
Model storage⚓︎
Models are NOT part of the firmware image. They must be on writable external
storage (USB hard drive or USB flash drive, preferably formatted as ext4).
The LLAMA_CPP_BASEDIR configuration variable points to the model directory.
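Before copying a model, check that the USB volume is mounted and has enough free space (paths as configured above):
# Free space on the USB volume and contents of the model directory
df -h /var/media/ftp
ls -lh /var/media/ftp/llama-cpp/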
Relevant files⚓︎
| File | Purpose |
|---|---|
| make/pkgs/llama-cpp/llama-cpp.mk | Package makefile: download, CMake configure, build, install |
| make/pkgs/llama-cpp/Config.in | Kconfig: main package + optional tools + server options |
| make/pkgs/llama-cpp/external.in | Externalisation config (binaries and shared libraries) |
| make/pkgs/llama-cpp/external.files | List of files to externalise |
| make/pkgs/llama-cpp/files/root/etc/init.d/rc.llama-cpp | Init script: start/stop/status for llama-server |
| make/pkgs/llama-cpp/files/root/mod/etc/default.llama-cpp/llama-cpp.cfg | Default configuration (all exported variables) |
| make/pkgs/llama-cpp/files/root/mod/etc/default.llama-cpp/llama-cpp.save | Save hooks (pre/apply) for the modconf web UI |