Model card · Open weights · Apache-2.0

brend-2b-260602

A GRPO fine-tune of Qwen3-VL-2B-Instruct for GUI element grounding, trained with a click-in-bbox reward. Targeted at high-resolution professional software: IDEs, CAD, DAWs, scientific tools, office suites, and OS chrome.


Results

Scores

Evaluated on the full 1,581-sample ScreenSpot-Pro test set, using single-pass inference with no zoom-in.

Setup	Score
brend-2b-260602 (single-pass)	48.64%
Base Qwen3-VL-2B-Instruct (same harness)	43.26%
Δ from GRPO	+5.38 pp

Base model: Qwen3-VL-2B-Instruct
Method: GRPO · click-in-bbox
Parameters: 2B · BF16
Task: GUI grounding
License: Apache-2.0
Weights: Open · Hugging Face

Read first

Runtime requirements

The evaluation that produced 48.64% used vLLM 0.17.0 with specific flags. Two things will silently give you wrong answers if you skip them.

Pin vllm==0.17.0 exactly. Newer vLLM releases preprocess Qwen3-VL images differently, and the returned coordinates end up in the wrong space.
Pass --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without it, vLLM downsamples the 4K–6K screenshots to ~1280 pixels wide, and the tiny widget targets become invisible to the model.

Both are environmental, not model issues, but they're load-bearing for reproducing the published number.
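Because a wrong vLLM version fails silently, it may help to check the pin before running anything. The following is a minimal sketch, not part of the published harness; the helper names pin_matches and installed_vllm_version are hypothetical, and the exact string comparison assumes the installed build reports a plain "0.17.0" with no local suffix.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_VLLM = "0.17.0"  # the pin stated in this card


def pin_matches(installed, required=REQUIRED_VLLM):
    """Return True only for an exact version-string match."""
    return installed == required


def installed_vllm_version():
    """Return the installed vLLM version string, or None if vLLM is absent."""
    try:
        return version("vllm")
    except PackageNotFoundError:
        return None


# Typical use at the top of a client or eval script:
#   found = installed_vllm_version()
#   if not pin_matches(found):
#       raise SystemExit(f"expected vllm=={REQUIRED_VLLM}, found {found!r}")
```

An exact match is deliberately strict: a near-miss such as 0.17.1 is rejected rather than tolerated.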

Setup

Install

Create a fresh conda environment and install the pinned stack.

# fresh env
conda create -n vllm011 python=3.11 -y
conda activate vllm011

# uv is not present in a fresh env; install it first
python -m pip install uv

# pinned vLLM: do not upgrade
python -m uv pip install vllm==0.17.0

# client deps
python -m uv pip install transformers==4.57.6 pillow requests

GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell; works on H100 / A100 / RTX 4090 unchanged.

Serve

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'

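The 2B BF16 weights take ~5 GB; the rest of the 0.4 budget is KV cache for batched serving. --gpu-memory-utilization is a fraction of total VRAM, so 0.4 suits a large card such as the 96 GB test GPU. On a 10 GB card, 0.4 is only 4 GB, which is less than the weights, so raise it there. Raise --gpu-memory-utilization and --max-num-seqs if you have headroom.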
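The memory arithmetic can be sketched in a few lines. This is a rough estimate only: the helper kv_cache_budget_gb is hypothetical, it uses the ~5 GB weight figure quoted in this card, and it ignores activation and CUDA-graph overhead, so the real KV cache budget will be somewhat smaller.

```python
def kv_cache_budget_gb(total_vram_gb, gpu_memory_utilization, weights_gb=5.0):
    """Rough memory left for KV cache after loading weights, in GB.

    weights_gb defaults to the ~5 GB figure quoted in this card. Other
    overheads are ignored, so treat the result as an upper bound.
    """
    budget = total_vram_gb * gpu_memory_utilization
    return budget - weights_gb


# 96 GB card (the test GPU) at the 0.4 setting from the serve command
print(round(kv_cache_budget_gb(96, 0.4), 1))  # 33.4
# 24 GB card at the same setting
print(round(kv_cache_budget_gb(24, 0.4), 1))  # 4.6
# 10 GB card at 0.4: negative, so the fraction has to be raised
print(round(kv_cache_budget_gb(10, 0.4), 1))  # -1.0
```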

Usage

Use it

The server is OpenAI-compatible. Send a screenshot and an instruction; the model replies with a computer_use tool call containing a coordinate.

import base64, re
from io import BytesIO
from PIL import Image
import requests

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL    = "brend-2b"

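# The full SYSTEM_PROMPT (computer_use tool spec) is on the model card.
# Paste it here; this placeholder only keeps the script from raising NameError.
SYSTEM_PROMPT = "<computer_use tool spec from the model card>"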

def ground(image_path, instruction):
    img = Image.open(image_path).convert("RGB")
    buf = BytesIO(); img.save(buf, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
    payload = {
        "model": MODEL, "temperature": 0.0, "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60); r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]

    # coordinates come back in [0, 1000] relative space
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m: return None
    x, y = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    return (x * img.width, y * img.height)

print(ground("screenshot.png", "the save button in the top toolbar"))

Coordinate convention

The model emits (x, y) in [0, 1000] relative space — the computer_use tool prompt declares a fake 1000×1000 screen, and Qwen3-VL is trained to honor that. Divide by 1000 to get normalized [0, 1] coordinates, then multiply by the original image's width and height to get pixels.
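The conversion described above can be isolated into a small, testable helper. This is a sketch under assumptions: the function name to_pixels is hypothetical, the range check that rejects values outside [0, 1000] is an addition of this sketch, and the reply string in the example is invented for illustration rather than copied from real model output.

```python
import re

# Same pattern the usage script applies to the model's reply
COORD_RE = re.compile(
    r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]'
)


def to_pixels(text, width, height):
    """Map the first "coordinate": [x, y] pair in text to pixel space.

    Returns None when no pair is found, or when either value falls outside
    [0, 1000] (the range check is an assumption of this sketch).
    """
    m = COORD_RE.search(text)
    if m is None:
        return None
    x_rel, y_rel = float(m.group(1)), float(m.group(2))
    if not (0 <= x_rel <= 1000 and 0 <= y_rel <= 1000):
        return None
    # [0, 1000] -> [0, 1] -> pixels of the original image
    return (x_rel / 1000.0 * width, y_rel / 1000.0 * height)


# Hypothetical reply fragment for a 3840x2160 screenshot
reply = '{"action": "left_click", "coordinate": [500, 250]}'
print(to_pixels(reply, 3840, 2160))  # (1920.0, 540.0)
```

Always scale by the dimensions of the image that was actually sent, not by the size of any thumbnail or preview.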

Do not pre-resize the image client-side. vLLM's preprocessor handles smart-resize internally given the mm-processor-kwargs flags above; client-side resizing throws off the model.
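To see why the pixel budget matters, the effect of a max_pixels cap can be approximated as a uniform downscale. The sketch below is a simplification, not vLLM's or Qwen's actual code: linear_scale_under_cap is a hypothetical helper, it skips the snapping of image sides to patch multiples that the real preprocessor performs, and the 1280*720 cap is an illustrative value rather than a documented default.

```python
import math


def linear_scale_under_cap(width, height, max_pixels):
    """Linear scale factor that brings width*height under max_pixels.

    Simplified: the real preprocessor also snaps sides to patch multiples.
    """
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    return math.sqrt(max_pixels / pixels)


w, h = 3840, 2160  # a 4K screenshot
# With the max_pixels value from this card, the image keeps native resolution.
print(linear_scale_under_cap(w, h, 99_999_999))  # 1.0
# With a hypothetical smaller cap of 1280*720 pixels, each side shrinks to 1/3.
s = linear_scale_under_cap(w, h, 1280 * 720)
print(round(w * s), round(h * s))  # 1280 720
```

At one third of the linear resolution, a 24-pixel toolbar icon covers about 8 pixels, which is consistent with small targets becoming hard to ground.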

Evaluation

Eval breakdown

ScreenSpot-Pro, full test set, single-pass inference, by software category.

Section	Avg	Text	Icon
Development	48.49	70.13	25.52
Creative	45.45	61.62	23.08
CAD	32.95	38.07	17.19
Scientific	49.21	65.28	28.18
Office	70.00	80.23	35.85
Operating Systems	47.45	62.62	29.21
Overall	48.39	62.23	25.99

Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders. The 48.39% micro-average and the 48.64% model-index figure differ by a known Creative-group accounting discrepancy in the eval harness.
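Two quick calculations help put these averages in context, using only numbers from this card. The sketch assumes both overall percentages are taken over the same 1,581 samples; the helper names are hypothetical, and the result says nothing about which samples are affected.

```python
# Per-category "Avg" column from the table above
category_avg = {
    "Development": 48.49, "Creative": 45.45, "CAD": 32.95,
    "Scientific": 49.21, "Office": 70.00, "Operating Systems": 47.45,
}


def macro_average(scores):
    """Unweighted mean over categories (each category counts equally)."""
    return sum(scores.values()) / len(scores)


def gap_in_samples(pct_a, pct_b, n_samples):
    """Approximate number of samples separating two accuracy percentages."""
    return round(abs(pct_a - pct_b) / 100.0 * n_samples)


# The macro-average differs from the 48.39 micro-average because
# categories contain different numbers of samples.
print(round(macro_average(category_avg), 3))  # 48.925
# The 48.64 vs 48.39 gap is about 4 samples out of 1,581.
print(gap_in_samples(48.64, 48.39, 1581))     # 4
```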

Comparison to other 2B models

Model	Inference	Avg
MAI-UI-2B	Zoom in	62.81
UI-Venus-1-5-2B	Single-pass	57.75
brend-2b-260602	Single-pass	48.64
Qwen3-VL-2B-Instruct (base)	Single-pass	43.26

MAI-UI-2B uses inference-time zoom-in (crop and re-query), so it is not a like-for-like comparison. UI-Venus-1-5-2B is the comparable single-pass 2B model.

Method

Training details

Base model: Qwen/Qwen3-VL-2B-Instruct
Method: GRPO with click-in-bbox reward, scored in the [0, 1000] relative space the model natively emits
Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
Precision: BF16, single GPU, sdpa attention
Effective batch size: 64 (per-device 2 × grad-accum 32), 2 completions per prompt
Max completion length: 32 tokens
Wall clock: ~17 hours for 2 epochs (~1,875 steps)
Checkpoint published: step 1350 (peak)
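A click-in-bbox reward and the group-relative advantage used by GRPO can be sketched as follows. This is an illustration under assumptions, not the training code: the card does not say whether the reward is strictly binary or includes format terms, so a binary reward is assumed here, and the population-standard-deviation normalization and eps value follow the generic GRPO recipe rather than anything stated in this card. The function names and the sample completions are hypothetical.

```python
import re
from statistics import mean, pstdev

COORD_RE = re.compile(
    r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]'
)


def click_in_bbox_reward(completion, bbox):
    """Return 1.0 if the predicted click lies inside bbox, else 0.0.

    bbox is (x1, y1, x2, y2) in the same [0, 1000] relative space the
    model emits. Unparseable completions score 0.0 (an assumption).
    """
    m = COORD_RE.search(completion)
    if m is None:
        return 0.0
    x, y = float(m.group(1)), float(m.group(2))
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions for one prompt, matching "2 completions per prompt"
bbox = (480, 230, 520, 270)
rewards = [
    click_in_bbox_reward('{"coordinate": [500, 250]}', bbox),  # inside
    click_in_bbox_reward('{"coordinate": [700, 250]}', bbox),  # outside
]
print(rewards)  # [1.0, 0.0]
```

With only two completions per prompt, a group contributes a learning signal only when exactly one of them lands inside the box; if both hit or both miss, the normalized advantages are zero.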

Eval harness: ScreenSpot-Pro test set, all 1,581 English instruction samples, single-pass with no zoom-in, agentic loop, refiner, or consistency router, using the official Qwen-team prompt via vLLM.

Reference

Citation

@misc{chen2026brend2b260602,
  title  = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year   = {2026},
  howpublished = {https://huggingface.co/Datawall/brend-2b-260602},
}

Licensed Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.
