DGX Spark

Update

# update system packages
sudo apt update
sudo apt upgrade -y
sudo apt dist-upgrade
# update firmware
sudo fwupdmgr refresh
sudo fwupdmgr upgrade

Useful commands

nvidia-smi

Run nvidia-smi to check the status of the GPU and its usage.
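
For continuous monitoring you can also just poll it with plain watch (nothing DGX-specific here):

watch -n 1 nvidia-smi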

uv

You need a lot of Python for all this AI stuff, so you should install uv to manage your Python environments and dependencies.

curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env
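
As a minimal usage sketch (the Python version and package here are just examples):

uv python install 3.12
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install numpy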

nvtop

nvtop has to be built from source, but it is a very useful tool for monitoring GPU usage in real time.

# install the build dependencies first
sudo apt-get update
sudo apt-get install -y libncurses5-dev libncursesw5-dev
# clone, build, and install nvtop
git clone https://github.com/Syllo/nvtop.git && cd nvtop
mkdir -p build && cd build
cmake .. -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=OFF -DINTEL_SUPPORT=OFF -DV3D_SUPPORT=OFF -DMSM_SUPPORT=OFF -DPANFROST_SUPPORT=OFF -DPANTHOR_SUPPORT=OFF -DMETAX_SUPPORT=OFF
make
sudo make install

After installation, you can run nvtop in the terminal to see GPU usage in real time. It's basically like htop, but for the GPU: it shows GPU utilization, memory usage, temperature, and wattage.

Downloading LLMs

To run LLMs on the DGX Spark, you first need to download the model weights, which you can get from Hugging Face.

HF_HUB_DISABLE_XET=1 uvx --from huggingface_hub hf download \
  unsloth/Qwen3-Coder-Next-GGUF \
  --include "Qwen3-Coder-Next-Q4_K_M.gguf" \
  --local-dir ~/models

Some model files are split into multiple parts, so you need to download all the parts; you can then merge them into a single file.

mkdir -p ~/models/Q4_K_M
HF_HUB_DISABLE_XET=1 uvx --from huggingface_hub hf download \
  unsloth/gpt-oss-120b-GGUF \
  --include "Q4_K_M/*.gguf" \
  --local-dir ~/models

Now the parts can be merged using the llama-gguf-split tool from llama.cpp (built in the llama.cpp section below).

~/llama.cpp/build/bin/llama-gguf-split --merge \
  ~/models/Q4_K_M/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  ~/models/gpt-oss-120b-Q4_K_M.gguf

In the merge command you only reference the first part, because llama-gguf-split finds the remaining parts automatically. Merging is optional: llama.cpp can also load a split model directly if you point it at the first part.
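
For example, loading the split model directly (a quick sketch; assumes llama.cpp is already built as described in the next section):

~/llama.cpp/build/bin/llama-cli \
  -m ~/models/Q4_K_M/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  -p "Hello" -n 32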

llama.cpp

llama.cpp is a C/C++ inference engine for LLMs that runs on both CPU and GPU. It is optimized for performance and supports inference as well as quantization. To compile a Spark-optimized build of llama.cpp, use the following commands:

git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp
# CUDA arch 121a targets the GB10 (Blackwell) GPU in the DGX Spark
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j 20
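
To check that the build works (a quick sanity check; the exact output depends on the commit you built):

./build/bin/llama-cli --version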

Some flags that help inference performance on the DGX Spark; a full llama-server invocation using them is shown after the snippet:

  --jinja \
  -ub 2048 \
  -b 2048 \
  -ngl 999 \
  -fa 1 \
  --no-mmap \
  --reasoning-format auto \
  --kv-unified
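
Put together, a llama-server invocation might look like this (the model path and port are placeholders; -ngl 999 offloads all layers to the GPU, and --no-mmap loads the model fully into memory):

~/llama.cpp/build/bin/llama-server \
  -m ~/models/gpt-oss-120b-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --jinja \
  -ub 2048 -b 2048 \
  -ngl 999 \
  -fa 1 \
  --no-mmap \
  --reasoning-format auto \
  --kv-unified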

quantizing

MXFP4_MOE is a quantization format that is a good fit for the DGX Spark, whose Blackwell GPU has native FP4 support. It is a microscaling format: weights are stored as 4-bit floats with a shared 8-bit scale per block of 32 values, which works out to about 4.25 bits per weight. This gives a significant reduction in memory usage while maintaining good quality.

mkdir -p ~/models/Qwen3-Coder-Next-HF
HF_HUB_DISABLE_XET=1 uvx --from huggingface_hub hf download \
  Qwen/Qwen3-Coder-Next \
  --local-dir ~/models/Qwen3-Coder-Next-HF

Convert the model to a BF16 GGUF first:

cd ~/llama.cpp
# the conversion script needs llama.cpp's Python dependencies
# (see requirements/requirements-convert_hf_to_gguf.txt)
uv run python convert_hf_to_gguf.py \
  ~/models/Qwen3-Coder-Next-HF \
  --outfile ~/models/Qwen3-Coder-Next-BF16.gguf \
  --outtype bf16

Quantize to MXFP4_MOE:

./build/bin/llama-quantize \
  ~/models/Qwen3-Coder-Next-BF16.gguf \
  ~/models/Qwen3-Coder-Next-MXFP4_MOE.gguf \
  MXFP4_MOE
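
You can sanity-check the quantized model with llama-bench (throughput numbers will vary):

./build/bin/llama-bench -m ~/models/Qwen3-Coder-Next-MXFP4_MOE.gguf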

unsloth
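
unsloth is a library for fast LLM fine-tuning. Set it up in its own virtual environment; the steps below also rebuild xformers from source: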

mkdir -p unsloth && cd unsloth
uv venv .venv --python=3.12 --seed
source .venv/bin/activate
uv pip install -U vllm --torch-backend=cu128
uv pip install unsloth unsloth_zoo bitsandbytes
uv pip install -qqq "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" "unsloth[base] @ git+https://github.com/unslothai/unsloth"
# First uninstall xformers installed by previous libraries
pip uninstall xformers -y
# Clone and build xformers from source
pip install ninja
export TORCH_CUDA_ARCH_LIST="12.0"
git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
cd xformers && python setup.py install && cd ..
uv pip install -U transformers
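
A quick sanity check that the stack imports cleanly (run inside the activated venv; the imports are slow the first time):

python -c "import unsloth, torch; print(torch.cuda.is_available())"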

Run the unsloth server:

uv run unsloth server -H 0.0.0.0 -p 8888

unsloth studio

Getting unsloth studio up and running on the DGX Spark is tricky, because the install script tries to install pip packages for which there are no arm64 versions yet. But you can install the dependencies manually and then run the unsloth studio server.

mkdir -p ~/.unsloth/studio
cd ~/.unsloth/studio
uv venv --python 3.13 unsloth_studio
source unsloth_studio/bin/activate
uv pip install --index-url https://download.pytorch.org/whl/cu130 torch torchvision torchaudio
uv pip install --index-url https://download.pytorch.org/whl/cu130 torchcodec==0.10.0

Then run the unsloth install script:

curl -fsSL https://unsloth.ai/install.sh | sh

This will fail because of the missing arm64 versions of some dependencies, but it creates some needed directories and files. You can then install the missing dependencies manually:

uv pip install structlog fastapi starlette uvicorn rich typer pydantic diceware

Run a final update:

uv run --active unsloth studio update

Now you can run the unsloth studio server using the following command:

uv run unsloth studio -H 0.0.0.0 -p 8888

This post is licensed under CC BY 4.0 by the author.