Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Vision2Web Team

What is Vision2Web

Vision2Web is a benchmark for evaluating whether multimodal coding agents can build real websites from visual prototypes and structured requirements. It goes beyond small code edits and static UI generation to measure end-to-end web development ability in realistic settings.

Each task provides multimodal inputs such as UI prototype images, requirement descriptions, and development assets. Agents are expected to generate executable websites that behave as the requirements specify and match the prototypes visually.

To support reliable evaluation, Vision2Web introduces an automated verification framework that combines workflow-driven GUI testing with VLM-based visual judging.
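
As a rough illustration of the inputs and the two verification signals described above, the following Python sketch models a task and its result. All class names, field names, and the score range are assumptions made for illustration. They are not the benchmark's actual schema or scoring method, and the two signals are reported side by side rather than merged because the text does not state how they are aggregated.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    prototype_images: list[str]   # paths to UI prototype images
    requirements: str             # requirement description text
    assets: list[str] = field(default_factory=list)  # development assets


@dataclass
class Verdict:
    """Hypothetical per-task result from the two verification components."""
    workflow_steps_passed: int    # GUI workflow checks that succeeded
    workflow_steps_total: int     # GUI workflow checks that were run
    visual_score: float           # assumed to lie in [0, 1], from a VLM judge

    def functional_rate(self) -> float:
        """Fraction of workflow checks passed; 0.0 when no checks were run."""
        if self.workflow_steps_total == 0:
            return 0.0
        return self.workflow_steps_passed / self.workflow_steps_total


def summarize(verdict: Verdict) -> dict[str, float]:
    """Report the functional and visual signals separately."""
    return {
        "functional": verdict.functional_rate(),
        "visual": verdict.visual_score,
    }


if __name__ == "__main__":
    task = Task("demo-001", ["home.png"], "Landing page with a signup form")
    print(task.task_id, summarize(Verdict(3, 4, 0.8)))
```

Keeping the two signals separate in this sketch avoids committing to any particular weighting between functional behavior and visual fidelity.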

Why Vision2Web

Existing coding benchmarks mainly focus on localized code edits, while most multimodal website benchmarks are limited to static webpage reproduction. These settings do not fully capture the complexity of modern web development, where agents must reason over visual layouts, interaction flows, application state, and system behavior across multiple pages.

Vision2Web closes this gap by evaluating the full spectrum of visual website development, from responsive UI implementation to interactive frontend engineering and complete full-stack applications.

Benchmark Highlights

193

918

1,255

16

Vision2Web Leaderboard

Vision2Web Submission Guide

Run Inference

Fork the Leaderboard Repository

Organize Inference Outputs

Open a Pull Request

Evaluation & Results

Vision2Web Citation

License