Update
```shell
sudo apt update
sudo apt upgrade -y
sudo apt dist-upgrade
sudo fwupdmgr refresh
sudo fwupdmgr upgrade
```
Useful commands
The following commands and tools are handy for checking the status and usage of the GPU.
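A good starting point is `nvidia-smi`, which ships with the driver (the query fields below are standard `nvidia-smi` options; adjust them to taste):

```shell
# One-off snapshot of GPU state
nvidia-smi

# Refresh the snapshot every two seconds
nvidia-smi -l 2

# Query selected fields in CSV form: utilization, memory, temperature, power
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv
```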
uv
You need a lot of Python for all this AI stuff, so you should install uv to manage your Python environments and dependencies.
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env
```
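Once installed, a quick sanity check and some typical usage (the package name here is just an example):

```shell
# Verify the installation
uv --version

# Create a virtual environment with a pinned Python version
uv venv --python 3.12 .venv
source .venv/bin/activate

# Install packages into the active environment
uv pip install numpy

# Run a tool in an ephemeral environment without installing it
uvx ruff --version
```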
nvtop
nvtop needs to be built from source, but it is a very useful tool for monitoring GPU usage in real time.
```shell
git clone https://github.com/Syllo/nvtop.git && cd nvtop
mkdir -p build && cd build
sudo apt-get update
sudo apt-get install libncurses5-dev libncursesw5-dev
cmake .. -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=OFF -DINTEL_SUPPORT=OFF -DV3D_SUPPORT=OFF -DMSM_SUPPORT=OFF -DPANFROST_SUPPORT=OFF -DPANTHOR_SUPPORT=OFF -DMETAX_SUPPORT=OFF
make
sudo make install
```
After installation, you can run nvtop in the terminal to see GPU activity in real time. It's basically htop for the GPU: it shows GPU utilization, memory usage, temperature, and power draw.
Downloading LLMs
To run LLMs on the DGX Spark, you first need to download the models, for example from Hugging Face.
```shell
HF_HUB_DISABLE_XET=1 uvx --from huggingface_hub hf download \
    unsloth/Qwen3-Coder-Next-GGUF \
    --include "Qwen3-Coder-Next-Q4_K_M.gguf" \
    --local-dir ~/models
```
Some model files are split into multiple parts, so you need to download all the parts and then merge them together.
```shell
mkdir -p ~/models/Q4_K_M
HF_HUB_DISABLE_XET=1 uvx --from huggingface_hub hf download \
    unsloth/gpt-oss-120b-GGUF \
    --include "Q4_K_M/*.gguf" \
    --local-dir ~/models
```
Now the parts can be merged using the llama-gguf-split tool from llama.cpp.
```shell
~/llama.cpp/build/bin/llama-gguf-split --merge \
    ~/models/Q4_K_M/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
    ~/models/gpt-oss-120b-Q4_K_M.gguf
```
You only need to reference the first part, because llama-gguf-split finds the remaining parts automatically. llama.cpp can also load split GGUF files directly when you point it at the first part, so merging is optional.
llama.cpp
llama.cpp is a C/C++ inference engine for LLMs that runs on both CPU and GPU. It is optimized for performance and can be used for inference and quantization. To compile a Spark-optimized build of llama.cpp:
```shell
git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j 20
```
Some flags that can be used for better inference on the DGX Spark:
```shell
--jinja \
-ub 2048 \
-b 2048 \
-ngl 999 \
-fa 1 \
--no-mmap \
--reasoning-format auto \
--kv-unified
```
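Put together, a full server invocation might look like this (the model path, context size, and port are placeholders; adjust them to your setup):

```shell
~/llama.cpp/build/bin/llama-server \
    -m ~/models/Qwen3-Coder-Next-Q4_K_M.gguf \
    --host 0.0.0.0 --port 8080 \
    -c 32768 \
    --jinja \
    -ub 2048 \
    -b 2048 \
    -ngl 999 \
    -fa 1 \
    --no-mmap \
    --reasoning-format auto \
    --kv-unified
```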
Quantizing
MXFP4_MOE is a quantization format that works well on the DGX Spark, whose Blackwell GPU has native FP4 support. It is a microscaling format that stores weights as 4-bit floats with a shared 8-bit scale for each block of 32 values; in llama.cpp, the MXFP4_MOE type applies this to the MoE expert tensors. This allows for a significant reduction in memory usage while still maintaining good quality.
```shell
mkdir -p ~/models/Qwen3-Coder-Next-HF
HF_HUB_DISABLE_XET=1 uvx --from huggingface_hub hf download \
    Qwen/Qwen3-Coder-Next \
    --local-dir ~/models/Qwen3-Coder-Next-HF
```
Convert the model to a BF16 GGUF first:
```shell
cd ~/llama.cpp
uv run python convert_hf_to_gguf.py \
    ~/models/Qwen3-Coder-Next-HF \
    --outfile ~/models/Qwen3-Coder-Next-BF16.gguf \
    --outtype bf16
```
Then quantize to MXFP4_MOE:
```shell
./build/bin/llama-quantize \
    ~/models/Qwen3-Coder-Next-BF16.gguf \
    ~/models/Qwen3-Coder-Next-MXFP4_MOE.gguf \
    MXFP4_MOE
```
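To sanity-check the quantized file, run a short generation with llama-cli (the prompt and token count here are arbitrary):

```shell
~/llama.cpp/build/bin/llama-cli \
    -m ~/models/Qwen3-Coder-Next-MXFP4_MOE.gguf \
    -ngl 999 \
    -p "Write a haiku about GPUs." \
    -n 64
```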
unsloth
Unsloth is a library for fast LLM fine-tuning. Set it up in its own environment:
```shell
mkdir -p unsloth && cd unsloth
uv venv .venv --python=3.12 --seed
source .venv/bin/activate
uv pip install -U vllm --torch-backend=cu128
uv pip install unsloth unsloth_zoo bitsandbytes
uv pip install -qqq "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" "unsloth[base] @ git+https://github.com/unslothai/unsloth"
# First uninstall xformers installed by the previous libraries
pip uninstall xformers -y
# Clone and build xformers from source
pip install ninja
export TORCH_CUDA_ARCH_LIST="12.0"
git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
cd xformers && python setup.py install && cd ..
uv pip install -U transformers
```
Run the unsloth server:
```shell
uv run unsloth server -H 0.0.0.0 -p 8888
```
unsloth studio
Getting unsloth studio up and running on the DGX Spark is tricky, because the install script tries to install pip packages for which there are no arm64 versions yet. But you can install the dependencies manually and then run the unsloth studio server.
```shell
mkdir -p ~/.unsloth/studio
cd ~/.unsloth/studio
uv venv --python 3.13 unsloth_studio
source unsloth_studio/bin/activate
uv pip install --index-url https://download.pytorch.org/whl/cu130 torch torchvision torchaudio
uv pip install --index-url https://download.pytorch.org/whl/cu130 torchcodec==0.10.0
```
```shell
curl -fsSL https://unsloth.ai/install.sh | sh
```
This will fail because of the missing arm64 versions of the dependencies, but it creates some needed directories and files.
```shell
uv pip install structlog fastapi starlette uvicorn rich typer pydantic diceware
```
Run a final update:
```shell
uv run --active unsloth studio update
```
Now you can run the unsloth studio server using the following command:
```shell
uv run unsloth studio -H 0.0.0.0 -p 8888
```