Research
Model card · Open weights · Apache-2.0

brend-2b-260602

A GRPO fine-tune of Qwen3-VL-2B-Instruct for GUI element grounding, trained with a click-in-bbox reward. Targeted at high-resolution professional software, IDEs, CAD, DAWs, scientific tools, office suites, and OS chrome.

Results

Scores

Evaluated on ScreenSpot-Pro, the full 1,581-sample test set, single-pass inference with no zoom-in.

SetupScore
brend-2b-260602 (single-pass)48.64%
Base Qwen3-VL-2B-Instruct (same harness)43.26%
Δ from GRPO+5.38 pp
Base modelQwen3-VL-2B-Instruct
MethodGRPO · click-in-bbox
Parameters2B · BF16
TaskGUI grounding
LicenseApache-2.0
WeightsOpen · Hugging Face
Read first

Runtime requirements

The evaluation that produced 48.64% used vLLM 0.17.0 with specific flags. Two things will silently give you wrong answers if you skip them.

  1. Pin vllm==0.17.0 exactly. Newer vLLM releases process the Qwen3-VL image preprocessor differently and return coordinates in the wrong space.
  2. Pass --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without it, vLLM downsamples the 4K–6K screenshots to ~1280-wide and the tiny widget targets become invisible to the model.
Both are environmental, not model issues, but they're load-bearing for reproducing the published number.
Setup

Install

Create a fresh conda environment and install the pinned stack.

# fresh env
conda create -n vllm011 python=3.11 -y
conda activate vllm011

# pinned vLLM — do not upgrade
python -m uv pip install vllm==0.17.0

# client deps
python -m uv pip install transformers==4.57.6 pillow requests

GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell; works on H100 / A100 / RTX 4090 unchanged.

Setup

Serve

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'

A 2B BF16 model fits in ~5 GB; the rest of the 0.4 budget is KV cache for batched serving. Bump --gpu-memory-utilization and --max-num-seqs if you have headroom.

Usage

Use it

The server is OpenAI-compatible. Send a screenshot and an instruction; the model replies with a computer_use tool call containing a coordinate.

import base64, re
from io import BytesIO
from PIL import Image
import requests

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL    = "brend-2b"

# Full SYSTEM_PROMPT (computer_use tool spec) is on the model card.

def ground(image_path, instruction):
    img = Image.open(image_path).convert("RGB")
    buf = BytesIO(); img.save(buf, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
    payload = {
        "model": MODEL, "temperature": 0.0, "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60); r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]

    # coordinates come back in [0, 1000] relative space
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m: return None
    x, y = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    return (x * img.width, y * img.height)

print(ground("screenshot.png", "the save button in the top toolbar"))
Usage

Coordinate convention

The model emits (x, y) in [0, 1000] relative space — the computer_use tool prompt declares a fake 1000×1000 screen, and Qwen3-VL is trained to honor that. Divide by 1000 to get normalized [0, 1] coordinates, then multiply by the original image's width and height to get pixels.

Do not pre-resize the image client-side. vLLM's preprocessor handles smart-resize internally given the mm-processor-kwargs flags above; client-side resizing throws off the model.
Evaluation

Eval breakdown

ScreenSpot-Pro, full test set, single-pass inference, by software category.

SectionAvgTextIcon
Development48.4970.1325.52
Creative45.4561.6223.08
CAD32.9538.0717.19
Scientific49.2165.2828.18
Office70.0080.2335.85
Operating Systems47.4562.6229.21
Overall48.3962.2325.99

Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders. The 48.39% micro-average and the 48.64% model-index figure differ by a known Creative-group accounting discrepancy in the eval harness.

Comparison to other 2B models

ModelInferenceAvg
MAI-UI-2BZoom in62.81
UI-Venus-1-5-2BSingle-pass57.75
brend-2b-260602Single-pass48.64
Qwen3-VL-2B-Instruct (base)Single-pass43.26

MAI-UI uses inference-time crop/re-query and isn't apples-to-apples. UI-Venus-2B is the legitimate single-pass 2B comparison.

Method

Training details

  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Method: GRPO with click-in-bbox reward, scored in the [0, 1000] relative space the model natively emits
  • Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
  • Precision: BF16, single GPU, sdpa attention
  • Effective batch size: 64 (per-device 2 × grad-accum 32), 2 completions per prompt
  • Max completion length: 32 tokens
  • Wall clock: ~17 hours for 2 epochs (~1,875 steps)
  • Checkpoint published: step 1350 (peak)

Eval harness: ScreenSpot-Pro test set, all 1,581 English instruction samples, single-pass with no zoom-in, agentic loop, refiner, or consistency router, using the official Qwen-team prompt via vLLM.

Reference

Citation

@misc{chen2026brend2b260602,
  title  = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author = {Kenneth Chen, Sheldon Zhu, Jiabao Zhang},
  year   = {2026},
  howpublished = {https://huggingface.co/Datawall/brend-2b-260602},
}

Licensed Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.

Run it. Fine-tune it.
The weights are open.

Modelbrend-2b-260602
Hubhuggingface.co/Datawall