The file structure is the same as the original repo.
The only difference is that run_browser_use.py is modified to add the browser-use evaluation. We also changed the prompts to suit the browser-use evaluation (VERY minimal changes - evaluate multiple images, not just one) and switched to langchain.
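For illustration, here is a minimal sketch of how multiple screenshots can be passed to a langchain chat model for the evaluation. The model name, prompt wording, and function name are assumptions for this sketch, not the exact code in run_browser_use.py:

```python
import base64
from pathlib import Path

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI


def evaluate_with_screenshots(task: str, answer: str, screenshots: list[str]) -> str:
    """Ask the eval model to judge one task given several screenshots."""
    content = [{
        "type": "text",
        "text": f"Task: {task}\nAgent answer: {answer}\n"
                "Did the agent complete the task? Reply SUCCESS, FAILED or UNKNOWN.",
    }]
    for path in screenshots:
        b64 = base64.b64encode(Path(path).read_bytes()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    llm = ChatOpenAI(model="gpt-4o")  # eval model; the model name is an assumption
    return llm.invoke([HumanMessage(content=content)]).content
```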
We also keep a list of impossible tasks - tasks that can no longer be completed because they are completely outdated and cannot be fixed by changing dates.
We changed some tasks that included dates to point to the future instead of the past, since the outdated data would otherwise make the task impossible (e.g. "Please find me a hotel on 2023-12-01 on booking.com" - you cannot search for a hotel in the past).
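A sketch of the kind of rewrite this amounts to - the fixed target year and the ISO date format are assumptions for the sketch; in practice the affected tasks were adjusted individually:

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_past_dates(text: str, target_year: int = 2025) -> str:
    """Rewrite past ISO dates in a task prompt to the same month/day in a future year."""
    def repl(m: re.Match) -> str:
        year, month, day = map(int, m.groups())
        if date(year, month, day) < date.today():
            return f"{target_year}-{month:02d}-{day:02d}"
        return m.group(0)
    return DATE_RE.sub(repl, text)


# "Please find me a hotel on 2023-12-01 on booking.com"
#   -> "Please find me a hotel on 2025-12-01 on booking.com"
```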
We ran the evaluation on 16 GB of RAM with 15 concurrent tasks.
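The concurrency cap is easy to reproduce with an asyncio semaphore; a rough sketch (run_task here is a placeholder, not the real entry point in run_browser_use.py):

```python
import asyncio

MAX_CONCURRENT_TASKS = 15


async def run_task(task: dict) -> str:
    """Placeholder for driving one browser-use agent through a single task."""
    await asyncio.sleep(0)  # the real code would launch a browser session here
    return f"done: {task.get('id', '?')}"


async def run_all(tasks: list[dict]) -> list[str]:
    """Run every task while keeping at most MAX_CONCURRENT_TASKS in flight."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_TASKS)

    async def limited(task: dict) -> str:
        async with sem:
            return await run_task(task)

    return await asyncio.gather(*(limited(t) for t in tasks))
```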
requirements.txt is intentionally missing browser-use, since we install it by building the package locally.
The eval model is not good. That is why we added another success criterion - unknown - for cases where the eval model is not sure. Most of the automatic assessments are indeed correct, but some tasks were judged wrongly, and the unknown cases ultimately ended up as either success or failed.
We manually reviewed the evaluations for the tasks marked "unknown" or "failed" and corrected them, because the default WebVoyager evaluator is not good.
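For reference, a minimal sketch of how the third verdict can be handled when parsing the eval model's reply - the labels and matching heuristics here are assumptions, not the actual implementation:

```python
from enum import Enum


class Verdict(str, Enum):
    SUCCESS = "success"
    FAILED = "failed"
    UNKNOWN = "unknown"  # eval model was not sure; queued for manual review


def parse_verdict(eval_reply: str) -> Verdict:
    """Map the eval model's free-text judgement onto the three outcomes."""
    reply = eval_reply.lower()
    # Check for uncertainty first so a reply like "unknown whether it succeeded"
    # is not misread as a success.
    if "unknown" in reply or "not sure" in reply or "cannot determine" in reply:
        return Verdict.UNKNOWN
    if "success" in reply:
        return Verdict.SUCCESS
    return Verdict.FAILED
```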
The total cost of running the dataset once is around 250 USD with GPT-4o.
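Assuming the full set of roughly 640 WebVoyager tasks, that works out to about 250 / 640 ≈ 0.40 USD per task.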
- WebVoyager is a terrible dataset.
- a lot of tasks are straight up impossible (usually because they are outdated)
- many prompts are VERY ambiguous and can be interpreted in several ways, which confuses both the agent and the eval model
- why do we have to rely on third-party websites? That is not scalable.
- the dataset does not test actual understanding of websites, mostly planning and reasoning - NOT what you want from web agent evaluations on COMPLEX sites
- make manually labeled items more transparent
- add proxies
- test different models (Claude, GPT-4o, Llama 3, etc.)
- test different setups (single vs multiple images, single vs multiple tasks, etc.)