A tutorial posted on Hugging Face walks through running Gemma 4 as a vision-language-action (VLA) agent on an NVIDIA Jetson Orin Nano Super with 8 GB of RAM. The pipeline is: speech input via Parakeet STT, Gemma 4 for reasoning and tool decisions, optional webcam capture when the model determines it needs visual context, and Kokoro TTS for spoken output. The model decides autonomously whether to activate the camera — there is no keyword matching, no hardcoded trigger logic.
The demo is built around a single Python script (Gemma4_vla.py) available at github.com/asierarranz/Google_Gemma, in the Gemma4 subdirectory alongside earlier Gemma 2 demos. On first run the script downloads the Parakeet STT and Kokoro TTS models from Hugging Face and generates the voice prompt WAVs; subsequent runs skip the download step.
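A minimal way to get started, assuming the repository layout described above:

git clone https://github.com/asierarranz/Google_Gemma
cd Google_Gemma/Gemma4
python3 Gemma4_vla.py   # first run fetches the STT/TTS models and generates voice WAVs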
Hardware and software requirements
The hardware used in the demo is a Jetson Orin Nano Super at 8 GB, a Logitech C920 webcam with built-in mic, a USB speaker, and a USB keyboard for triggering recording. The tutorial notes these are not strict requirements — any webcam, USB mic, and USB speaker visible to Linux should work.
The software setup starts with standard system packages (build tools, Python dev headers, audio utilities including alsa-utils and pulseaudio-utils, v4l-utils for webcam access, ffmpeg, and libsndfile1). A Python virtual environment gets six packages: opencv-python-headless, onnx_asr, kokoro-onnx, soundfile, huggingface-hub, and numpy. The build-essential and cmake packages are listed as required only for the native llama.cpp route.
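Condensed into commands, the setup would look roughly like this; the package names are the tutorial's, while the virtual environment name is an arbitrary choice:

sudo apt update
sudo apt install -y build-essential cmake python3-dev \
  alsa-utils pulseaudio-utils v4l-utils ffmpeg libsndfile1   # build-essential/cmake only for the native llama.cpp route

python3 -m venv vla-env && source vla-env/bin/activate
pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface-hub numpy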
Memory management is treated as a first-class concern throughout. The tutorial adds 8 GB of swap as a safety net against OOM kills during model loading, stops Docker, containerd, and background processes such as tracker-miner-fs-3 and gnome-software, and asks the user to close browser tabs and IDE windows. The recommended quantization for this board is Q4_K_M served from a native llama.cpp build, described as the sweet spot between quality and memory use. If memory stays tight, dropping to Q3_K_M is offered as a fallback.
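A sketch of the memory-freeing steps; the swap file path and the way the background processes are stopped are assumptions, not the tutorial's exact commands:

# 8 GB swap file as an OOM safety net during model loading
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Stop memory-hungry services and background indexers
sudo systemctl stop docker containerd
pkill -f tracker-miner-fs-3 || true
pkill -f gnome-software || true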
Serving Gemma 4 with llama.cpp
The tutorial builds llama.cpp natively on the Jetson rather than using Docker, citing better performance and full access to the vision projector required for the VLA demo. The build command enables CUDA with architecture 87 (targeting the Jetson’s GPU), native CPU tuning, and Release mode compiled with four threads.
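Translated into build commands, and assuming a fresh checkout of llama.cpp, the flags described above map onto CMake like this:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 \
      -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 4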
Two files are downloaded: gemma-4-E2B-it-Q4_K_M.gguf from the unsloth organization and mmproj-gemma4-e2b-f16.gguf from ggml-org. The post notes that the mmproj file is the vision projector and must not be skipped: without it Gemma cannot process images. The server launches with all model layers pushed to the GPU (-ngl 99), flash attention enabled, and the --jinja flag, which the tutorial identifies as the switch that activates Gemma's native tool-calling support.
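A plausible download-and-launch sequence; the Hugging Face repo IDs below are hypothetical, assembled from the organization and file names above (only the filenames are confirmed by the post), the port is an assumption, and flag spellings can vary across llama.cpp versions:

# Repo IDs are assumed for illustration
huggingface-cli download unsloth/gemma-4-E2B-it-GGUF \
  gemma-4-E2B-it-Q4_K_M.gguf --local-dir models
huggingface-cli download ggml-org/mmproj-gemma4 \
  mmproj-gemma4-e2b-f16.gguf --local-dir models

./build/bin/llama-server \
  -m models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj models/mmproj-gemma4-e2b-f16.gguf \
  -ngl 99 -fa --jinja --port 8080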
The tutorial exposes exactly one tool to the model:
{
  "name": "look_and_answer",
  "description": "Take a photo with the webcam and analyze what is visible."
}
When a question is posed, speech is transcribed locally by Parakeet STT, the transcript plus the tool definition go to Gemma, and the model decides whether to call look_and_answer. If it does, the script grabs a webcam frame and sends it back; Gemma answers, and Kokoro speaks the response. If the question does not need vision, the camera is never activated.
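The reasoning-plus-tool step can be exercised against llama-server's OpenAI-compatible endpoint directly; the sketch below assumes the default /v1/chat/completions route on port 8080 and pads the tool schema with an empty parameters object:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What color is the mug on my desk?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "look_and_answer",
        "description": "Take a photo with the webcam and analyze what is visible.",
        "parameters": {"type": "object", "properties": {}}
      }
    }]
  }'

A vision question like this one should come back as a tool_call rather than a plain answer, while something like "what is 2 + 2" should skip the tool entirely.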
Configuring audio and video devices
Device discovery requires three commands. arecord -l lists recording devices to find the USB mic (the C920 showed up as plughw:3,0 in the demo). pactl list short sinks lists PulseAudio output sinks to identify the speaker. v4l2-ctl --list-devices lists webcam devices, typically at index 0. The three values are passed as environment variables: MIC_DEVICE, SPK_DEVICE, and WEBCAM. Voice selection uses a fourth variable, VOICE, with options including af_jessica, af_nova, am_puck, bf_emma, and am_onyx from Kokoro’s built-in voice set.
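Collected into one sequence; the device values shown are the demo's and will differ per machine:

arecord -l                # recording devices; the C920 mic appeared as card 3
pactl list short sinks    # PulseAudio output sinks, to identify the speaker
v4l2-ctl --list-devices   # webcams, typically index 0

export MIC_DEVICE=plughw:3,0
export SPK_DEVICE=<sink-name-from-pactl>
export WEBCAM=0
export VOICE=af_nova
python3 Gemma4_vla.py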
A text-only mode (python3 Gemma4_vla.py --text) skips audio setup and exercises the LLM path directly, which is useful for testing the reasoning and tool-calling logic without configuring microphone or speaker hardware.
The demo is a concrete existence proof that multimodal, tool-calling, speech-in-speech-out agents can run on consumer-grade edge hardware at the 8 GB tier. The architecture — local STT, local LLM, single-tool decision loop, local TTS — is simple enough to adapt to other edge devices or other tool definitions without significant rework.