brend-2b-260602
A GRPO fine-tune of Qwen3-VL-2B-Instruct for GUI element grounding, trained with a click-in-bbox reward. Targeted at high-resolution professional software: IDEs, CAD, DAWs, scientific tools, office suites, and OS chrome.
Scores
Evaluated on the full 1,581-sample ScreenSpot-Pro test set, using single-pass inference with no zoom-in.
| Setup | Score |
|---|---|
| brend-2b-260602 (single-pass) | 48.64% |
| Base Qwen3-VL-2B-Instruct (same harness) | 43.26% |
| Δ from GRPO | +5.38 pp |
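The percentages above can be turned back into approximate hit counts as a sanity check. This is a minimal sketch: the counts are inferred from the reported percentages and the 1,581-sample total, not reported by the authors, and `implied_correct` is a hypothetical helper name.

```python
TOTAL_SAMPLES = 1581  # ScreenSpot-Pro test-set size stated above


def implied_correct(score_pct: float, total: int = TOTAL_SAMPLES) -> int:
    """Nearest whole number of correct clicks consistent with a percentage."""
    return round(score_pct / 100 * total)


tuned = implied_correct(48.64)   # fine-tuned model
base = implied_correct(43.26)    # base model, same harness
delta_pp = round(48.64 - 43.26, 2)                        # absolute gain, percentage points
relative_gain = round((48.64 - 43.26) / 43.26 * 100, 1)   # gain relative to the base score, %

print(tuned, base, tuned - base, delta_pp, relative_gain)
```

Under these assumptions the +5.38 pp gain corresponds to roughly 85 additional correct clicks out of 1,581.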
Runtime requirements
The evaluation that produced 48.64% used vLLM 0.17.0 with specific flags. Two things will silently give you wrong answers if you skip them.
- Pin `vllm==0.17.0` exactly. Newer vLLM releases handle Qwen3-VL image preprocessing differently and return coordinates in the wrong space.
- Pass `--mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'`. Without it, vLLM downsamples the 4K–6K screenshots to ~1280-wide and the tiny widget targets become invisible to the model.
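To see why the downsampling matters, the arithmetic below estimates how large a small widget remains after a screenshot is shrunk to about 1280 pixels wide. It is only an illustration: the 16-pixel icon size and the three source widths are assumed examples, and `size_after_resize` is a hypothetical helper, not part of vLLM.

```python
def size_after_resize(widget_px: float, src_width: int, dst_width: int = 1280) -> float:
    """Width of a widget (in pixels) after the whole image is scaled to dst_width."""
    return widget_px * dst_width / src_width


ICON_PX = 16  # assumed toolbar-icon width in the original screenshot

for src_width in (3840, 5120, 6016):  # assumed 4K, 5K and 6K screenshot widths
    shrunk = size_after_resize(ICON_PX, src_width)
    print(f"{src_width}px wide: {ICON_PX}px icon becomes {shrunk:.1f}px")
```

At these sizes the icon occupies only a few pixels, which is consistent with the observation that small targets effectively disappear.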
Install
Create a fresh conda environment and install the pinned stack.
```bash
# fresh env
conda create -n vllm011 python=3.11 -y
conda activate vllm011

# uv is not preinstalled in a fresh env
python -m pip install uv

# pinned vLLM; do not upgrade
python -m uv pip install vllm==0.17.0

# client deps
python -m uv pip install transformers==4.57.6 pillow requests
```
GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell; works on H100 / A100 / RTX 4090 unchanged.
Serve
```bash
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'
```

A 2B BF16 model fits in ~5 GB; the rest of the 0.4 budget is KV cache for batched serving. Bump `--gpu-memory-utilization` and `--max-num-seqs` if you have headroom.
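The memory split described above can be estimated with simple arithmetic. This is a rough sketch that assumes the ~5 GB weight footprint quoted above and treats `--gpu-memory-utilization` as a plain fraction of total VRAM; it ignores activation memory and other runtime overhead, and `kv_cache_budget_gb` is a hypothetical helper.

```python
def kv_cache_budget_gb(total_vram_gb: float, utilization: float,
                       weights_gb: float = 5.0) -> float:
    """Approximate memory left for KV cache after the weights are loaded."""
    return total_vram_gb * utilization - weights_gb


for vram in (10, 24, 80, 96):  # example card sizes in GB
    budget = kv_cache_budget_gb(vram, 0.4)
    print(f"{vram:>3} GB card at 0.4: {budget:+.1f} GB for KV cache")
```

On a 96 GB card the 0.4 setting leaves roughly 33 GB for KV cache. On a card near the 10 GB minimum, 0.4 × 10 GB is below the ~5 GB weight footprint, so the fraction would presumably need to be raised there.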
Use it
The server is OpenAI-compatible. Send a screenshot and an instruction; the model replies with a computer_use tool call containing a coordinate.
```python
import base64
import re
from io import BytesIO

import requests
from PIL import Image

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL = "brend-2b"

# Full SYSTEM_PROMPT (computer_use tool spec) is on the model card.
SYSTEM_PROMPT = "..."  # placeholder; paste the full prompt here before running


def ground(image_path, instruction):
    """Return the predicted click point in pixels, or None if nothing was parsed."""
    img = Image.open(image_path).convert("RGB")
    buf = BytesIO()
    img.save(buf, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
    payload = {
        "model": MODEL,
        "temperature": 0.0,
        "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]
    # coordinates come back in [0, 1000] relative space
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m:
        return None
    x, y = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    return (x * img.width, y * img.height)


print(ground("screenshot.png", "the save button in the top toolbar"))
```
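Before sending screenshots, it can help to confirm that the server is up and exposes the expected model name. The sketch below queries the OpenAI-compatible `/v1/models` endpoint using only the standard library. The helper names are hypothetical, and the base URL assumes the `--port 8003` setting from the serve command.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8003"  # assumes the port used in the serve command


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def is_served(base_url: str, name: str, timeout: float = 5.0) -> bool:
    """True if the server answers and lists `name`; False on any connection or parse error."""
    try:
        with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return name in served_model_ids(json.load(resp))
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    print("brend-2b ready:", is_served(BASE_URL, "brend-2b"))
```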
Coordinate convention
The model emits (x, y) in [0, 1000] relative space — the computer_use tool prompt declares a fake 1000×1000 screen, and Qwen3-VL is trained to honor that. Divide by 1000 to get normalized [0, 1] coordinates, then multiply by the original image's width and height to get pixels.
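The conversion described above is a two-step scaling, sketched here with a worked example. The function names and the 3840×2160 screenshot size are illustrative assumptions.

```python
def to_pixels(x_rel: float, y_rel: float, width: int, height: int) -> tuple[float, float]:
    """Map a point from [0, 1000] relative space to pixel coordinates."""
    return (x_rel / 1000.0 * width, y_rel / 1000.0 * height)


def to_relative(x_px: float, y_px: float, width: int, height: int) -> tuple[float, float]:
    """Inverse mapping: pixel coordinates back to [0, 1000] relative space."""
    return (x_px / width * 1000.0, y_px / height * 1000.0)


# A model output of (500, 250) on a 3840x2160 screenshot
print(to_pixels(500, 250, 3840, 2160))  # (1920.0, 540.0)
```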
Send the screenshot at its original resolution and leave resizing to the `--mm-processor-kwargs` flags above; client-side resizing throws off the model.
Eval breakdown
ScreenSpot-Pro, full test set, single-pass inference, by software category.
| Section | Avg | Text | Icon |
|---|---|---|---|
| Development | 48.49 | 70.13 | 25.52 |
| Creative | 45.45 | 61.62 | 23.08 |
| CAD | 32.95 | 38.07 | 17.19 |
| Scientific | 49.21 | 65.28 | 28.18 |
| Office | 70.00 | 80.23 | 35.85 |
| Operating Systems | 47.45 | 62.62 | 29.21 |
| Overall | 48.39 | 62.23 | 25.99 |
Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders. The 48.39% micro-average and the 48.64% model-index figure differ by a known Creative-group accounting discrepancy in the eval harness.
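The text-versus-icon gap can be read directly off the table. The short check below recomputes it per category from the published figures; the numbers are copied from the table and the variable names are arbitrary.

```python
# (text accuracy, icon accuracy) per category, copied from the table above
ROWS = {
    "Development": (70.13, 25.52),
    "Creative": (61.62, 23.08),
    "CAD": (38.07, 17.19),
    "Scientific": (65.28, 28.18),
    "Office": (80.23, 35.85),
    "Operating Systems": (62.62, 29.21),
}

# Gap in percentage points between text and icon grounding
gaps = {name: round(text - icon, 2) for name, (text, icon) in ROWS.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {gap:5.2f} pp")
```

Every gap is positive. CAD shows the smallest one, mostly because its text score is also low.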
Comparison to other 2B models
| Model | Inference | Avg |
|---|---|---|
| MAI-UI-2B | Zoom in | 62.81 |
| UI-Venus-1-5-2B | Single-pass | 57.75 |
| brend-2b-260602 | Single-pass | 48.64 |
| Qwen3-VL-2B-Instruct (base) | Single-pass | 43.26 |
MAI-UI-2B uses inference-time crop/re-query and isn't apples-to-apples. UI-Venus-1-5-2B is the legitimate single-pass 2B comparison.
Training details
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Method: GRPO with click-in-bbox reward, scored in the `[0, 1000]` relative space the model natively emits
- Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
- Precision: BF16, single GPU, `sdpa` attention
- Effective batch size: 64 (per-device 2 × grad-accum 32), 2 completions per prompt
- Max completion length: 32 tokens
- Wall clock: ~17 hours for 2 epochs (~1,875 steps)
- Checkpoint published: step 1350 (peak)
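The training code is not included here, so the following is only a sketch of what a click-in-bbox reward and a GRPO-style group-relative advantage could look like. The binary reward, the inclusive box edges, the use of population standard deviation, and the `eps` term are all assumptions; actual GRPO implementations differ in these details.

```python
from statistics import mean, pstdev


def click_in_bbox_reward(point, bbox) -> float:
    """1.0 if the predicted point lies inside the box, else 0.0.

    Both point (x, y) and bbox (x1, y1, x2, y2) are in [0, 1000] relative space.
    """
    x, y = point
    x1, y1, x2, y2 = bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def group_advantages(rewards, eps: float = 1e-6):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions for one prompt, matching "2 completions per prompt"
bbox = (400, 400, 600, 600)
rewards = [click_in_bbox_reward(p, bbox) for p in [(500, 500), (700, 500)]]
print(rewards, group_advantages(rewards))
```

With only two completions per prompt, a group yields a learning signal only when exactly one of the two clicks lands in the box; if both hit or both miss, the advantages are zero.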
Eval harness: ScreenSpot-Pro test set, all 1,581 English instruction samples, single-pass with no zoom-in, agentic loop, refiner, or consistency router, using the official Qwen-team prompt via vLLM.
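For readers who want to reproduce the style of scoring, here is a hedged sketch of a single-pass accuracy computation split by target type. It is not the official ScreenSpot-Pro scorer: the sample field names (`size`, `bbox`, `pred`, `group`) are hypothetical, boxes are assumed to be in pixels, and an unparsed prediction is counted as a miss.

```python
from collections import defaultdict


def score(samples):
    """Return accuracy (%) per group and overall for click-in-bbox evaluation."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in samples:
        hit = 0
        if s["pred"] is not None:  # a parse failure counts as a miss
            width, height = s["size"]
            px = s["pred"][0] / 1000.0 * width   # [0, 1000] space to pixels
            py = s["pred"][1] / 1000.0 * height
            x1, y1, x2, y2 = s["bbox"]
            hit = int(x1 <= px <= x2 and y1 <= py <= y2)
        for key in (s["group"], "overall"):
            totals[key] += 1
            hits[key] += hit
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


toy = [
    {"size": (3840, 2160), "bbox": (1900, 520, 1940, 560), "pred": (500, 250), "group": "text"},
    {"size": (3840, 2160), "bbox": (100, 100, 120, 120), "pred": (500, 250), "group": "icon"},
    {"size": (3840, 2160), "bbox": (100, 100, 120, 120), "pred": None, "group": "icon"},
]
print(score(toy))
```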
Citation
```bibtex
@misc{chen2026brend2b260602,
  title        = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author       = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year         = {2026},
  howpublished = {https://huggingface.co/Datawall/brend-2b-260602},
}
```

Licensed Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.
Run it. Fine-tune it.
The weights are open.
