Vision-Based Agent

Important

This repository is meant to be used for educational purposes only!

This repository contains a demo of an automatic LLM/VLM-based agent that visits Google News (specifically, the Technology > AI page), scrolls through it, and looks for interesting articles. It then clicks the links it selected and extracts the full article from each opened page as plain text. Afterwards it returns to Google News, scrolls down, and repeats this routine.

The agent uses only small local models for this:

Quantized llama-3.2-vision (11b, via Ollama)
Quantized llama-3.1 (3b, via Ollama)
Florence-2-base
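
The routine described above (look at the page, pick links, open each one, extract its text, go back, scroll) can be sketched as a small loop. This is only an illustrative outline, not the repository's actual code: the `Browser` interface, the `pick_targets` callback (standing in for the model-driven link selection), and the `collect_articles` function are hypothetical names, and the choice to inspect the first view before any scrolling is an assumption.

```python
from typing import Callable, Protocol


class Browser(Protocol):
    """Hypothetical browser interface; not the repository's actual API."""

    def open(self, url: str) -> None: ...
    def screenshot(self) -> bytes: ...
    def scroll_down(self) -> None: ...
    def click(self, target: str) -> None: ...
    def page_text(self) -> str: ...
    def go_back(self) -> None: ...


def collect_articles(
    browser: Browser,
    start_url: str,
    pick_targets: Callable[[bytes], list[str]],
    scrolls: int = 1,
) -> list[dict[str, str]]:
    """Visit start_url, then alternate between reading articles and scrolling.

    pick_targets receives a screenshot and returns identifiers of the links
    worth opening (assumed here to be produced by the vision/language models).
    """
    articles: list[dict[str, str]] = []
    seen: set[str] = set()
    browser.open(start_url)
    # Assumption: the initial view is inspected too, so there are scrolls + 1 views.
    for view in range(scrolls + 1):
        for target in pick_targets(browser.screenshot()):
            if target in seen:
                continue  # skip links already opened in an earlier view
            seen.add(target)
            browser.click(target)
            articles.append({"target": target, "text": browser.page_text()})
            browser.go_back()  # return to the news page before the next link
        if view < scrolls:
            browser.scroll_down()
    return articles
```

Keeping the browser and the link picker behind plain interfaces like this makes the loop easy to exercise with stand-ins, without launching a real browser or loading any model.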

Installation

Install Ollama and pull the required models:

ollama pull llama-3.1
ollama pull llama-3.2-vision

Create a new virtual environment (recommended) and clone this repo into it.
From the root directory of this repo, install using:

pip install -e .
playwright install

Running

Run the agent using the vba command:

usage: vba [-h] [-s SCROLLS] [-o OUTPUT_FILE] [--debug]

options:
  -h, --help            show this help message and exit
  -s SCROLLS, --scrolls SCROLLS
                        Number of mouse-scrolls to perform (non-negative integer, default = 1)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output filename (defaults to `output_[run-time].json`)
  --debug               Turn on debug mode
