Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

What is Vision2Web

Vision2Web is a benchmark for evaluating whether multimodal coding agents can build real websites from visual prototypes and structured requirements. It goes beyond small code edits and static UI generation to measure end-to-end web development ability in realistic settings.

Each task provides multimodal inputs such as UI prototype images, requirement descriptions, and development assets. Agents are expected to generate executable websites that satisfy both functional behavior and visual fidelity.

To support reliable evaluation, Vision2Web introduces an automated verification framework that combines workflow-driven GUI testing with VLM-based visual judging.

Why Vision2Web

Existing coding benchmarks mainly focus on localized code edits, while most multimodal website benchmarks are limited to static webpage reproduction. These settings do not fully capture the complexity of modern web development, where agents must reason over visual layouts, interaction flows, application state, and system behavior across multiple pages.

Vision2Web closes this gap by evaluating the full spectrum of visual website development, from responsive UI implementation to interactive frontend engineering and complete full-stack applications.

Benchmark Highlights

  • 193 Tasks
  • 918 Prototype Images
  • 1,255 Test Cases
  • 16 Categories

Vision2Web spans 16 subcategories across 4 major domains and covers progressively harder development settings, from static responsive webpages to interaction-heavy frontends and requirement-driven full-stack systems.

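Leaderboard

Each leaderboard entry lists the model + framework, its parameter count, the submission date, and an overall score, followed by per-level results. Level 1 (Static Webpage) reports Desktop, Tablet, and Mobile scores and an average; Level 2 (Interactive) and Level 3 (Full-Stack) each report VS, FS, and an average. Results can be filtered by season, period, VLM judge, and GUI agent.
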
VS = Visual Score  |  FS = Functional Score

We welcome submissions to the Vision2Web leaderboard. Submit your inference outputs and our team will run the official evaluation.

1. Run Inference

Run your agent on Vision2Web benchmark tasks to generate website code. You can use the official evaluation pipeline locally to test your results before submitting.

git clone https://github.com/zai-org/Vision2Web.git
cd Vision2Web
pip install -e .
bash scripts/run_inference.sh

2. Fork the Leaderboard Repository

Fork the Vision2Web Leaderboard dataset repository on Hugging Face.

3. Organize Inference Outputs

Structure your submission using the required directory layout:

<agent>+<model>/
    submission.json
    <level>/              # webpage, frontend, or website
        <task-name>/      # generated code for each task

Example top-level directory names: OpenHands+GPT-4o/, ClaudeCode+Claude-3.5-Sonnet/
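Before opening a pull request, a quick local check of the layout can catch structural mistakes. The sketch below is not part of the official Vision2Web tooling: `check_layout` and `LEVELS` are hypothetical names, and the rules it applies (an `<agent>+<model>` folder name, a `submission.json` file, and level folders named `webpage`, `frontend`, or `website` that each contain at least one task folder) are read directly off the layout above. The maintainers' own format verification may be stricter.

```python
from pathlib import Path

# Hypothetical helper; level names are taken from the directory layout above.
LEVELS = {"webpage", "frontend", "website"}


def check_layout(root: Path) -> list[str]:
    """Return layout problems for a submission folder (empty list = looks OK)."""
    problems = []
    # Top-level folder is expected to be named <agent>+<model>.
    if "+" not in root.name:
        problems.append(f"top-level name {root.name!r} should look like <agent>+<model>")
    if not (root / "submission.json").is_file():
        problems.append("missing submission.json")
    for child in sorted(root.iterdir()):
        if not child.is_dir():
            continue  # plain files such as submission.json are not level folders
        if child.name not in LEVELS:
            problems.append(f"unexpected directory {child.name!r}")
        elif not any(p.is_dir() for p in child.iterdir()):
            problems.append(f"{child.name}/ contains no task directories")
    return problems
```

For example, `check_layout(Path("OpenHands+GPT-4o"))` returns an empty list when the folder matches the expected shape, and a list of human-readable warnings otherwise.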

Required submission.json:

{
  "name": "model-name + agent-framework",
  "org": "organization-name",
  "date": "YYYY-MM-DD",
  "season": "S1-2026"
}
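A similar helper can sanity-check `submission.json` before submission. This is an illustrative sketch, not an official validator: `validate_submission` is an assumed name, it only checks that the four fields shown above are present as non-empty strings and that `date` is a real calendar date written as `YYYY-MM-DD`, and it deliberately does not enforce any `season` pattern beyond what the `S1-2026` example suggests.

```python
import json
import re
from datetime import date

# Fields listed in the required submission.json above (hypothetical checker).
REQUIRED_KEYS = ("name", "org", "date", "season")


def validate_submission(text: str) -> list[str]:
    """Return problems found in a submission.json string (empty list = looks OK)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    problems = []
    for key in REQUIRED_KEYS:
        if key not in data:
            problems.append(f"missing key {key!r}")
        elif not isinstance(data[key], str) or not data[key].strip():
            problems.append(f"{key!r} must be a non-empty string")

    # Enforce the literal YYYY-MM-DD shape, then check it is a real calendar date.
    raw = data.get("date")
    if isinstance(raw, str) and raw.strip():
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw):
            problems.append("'date' must use YYYY-MM-DD")
        else:
            try:
                date.fromisoformat(raw)
            except ValueError:
                problems.append("'date' is not a valid calendar date")
    return problems
```

The regex runs before `date.fromisoformat` because newer Python versions accept additional ISO 8601 forms (such as `20260115`) that do not match the `YYYY-MM-DD` shape requested above.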

Do not include:

  • node_modules/, __pycache__/, .venv/, or any dependency directories
  • Lock files (package-lock.json, yarn.lock, pnpm-lock.yaml)
  • Build artifacts (dist/, .next/, build/)
  • Model weights or external datasets

Submissions containing excessive dependency files will be rejected.
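To catch the excluded content listed above, one option is to scan the submission tree for those directory and file names before opening the pull request. This is a sketch under the assumption that matching by name is sufficient: `find_excluded` is a hypothetical helper, it only knows the names in the list above, and it cannot detect model weights or external datasets, which still need a manual check.

```python
from pathlib import Path

# Names from the "Do not include" list above (hypothetical helper).
EXCLUDED_DIRS = {"node_modules", "__pycache__", ".venv", "dist", ".next", "build"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml"}


def find_excluded(root: Path) -> list[Path]:
    """List paths (relative to root) that should be removed before submitting."""
    hits = []
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in sorted(current.iterdir()):
            if entry.is_dir():
                if entry.name in EXCLUDED_DIRS:
                    # Report the directory itself and do not descend into it.
                    hits.append(entry.relative_to(root))
                else:
                    stack.append(entry)
            elif entry.name in LOCK_FILES:
                hits.append(entry.relative_to(root))
    return sorted(hits)
```

Each excluded directory is reported once without its contents, so the output stays short even when a task folder carries a large `node_modules/`.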

4. Open a Pull Request

Submit a Pull Request to the leaderboard repository with your inference outputs.

5. Evaluation & Results

Our team will verify your submission format, run the official evaluation pipeline using the current season's VLM Judge and GUI Agent, and publish the results on the leaderboard.

Questions?

If you have any questions about the submission process, please open an issue on our GitHub Issues page.

If you use Vision2Web in your research, please cite our paper:

@misc{he2026vision2webhierarchicalbenchmarkvisual,
      title={Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification},
      author={Zehai He and Wenyi Hong and Zhen Yang and Ziyang Pan and Mingdao Liu and Xiaotao Gu and Jie Tang},
      year={2026},
      eprint={2603.26648},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.26648},
}

License

Vision2Web is released under the CC BY-SA 4.0 license.