Skip to content

Latest commit

 

History

History
97 lines (72 loc) · 3.37 KB

README.md

File metadata and controls

97 lines (72 loc) · 3.37 KB

LLProxy - Large Language Proxy

LLProxy

Release Notes GitHub Repo stars Open Issues GitHub Go version License: Apache 2.0 Twitter Discord

Summary

LLProxy was designed for the task of effectively managing rate limits and scheduling of workload across multiple different LLM based applications. The rate limits for these services are complex, beyond what can easily be configured with the simplest of reverse proxies. LLProxy addresses this by creating a scheduler that deeply understandings the core LLM providers rate limiting behavior.

Features

  • The following providers are currently supported: [openai]
  • The following scheduling is currently supported: [FIFO]

Usage

  1. Setup your configuration file:

    cp config-example.json config.json

    Each provider can be defined as a specific route.

    config.json

    {
        "routes": {
            "openai": {
                "forward": "https://api.openai.com",
                "provider": "openai",
                "models": {
                    "gpt-4": {
                        "maxQueueSize": 10,
                        "maxQueueWait": 30,
                        "rpm": 200,
                        "tpm": 40000
                    },
                    ...
                }
            }
            ...
        }
    }
    

    The above creates a route http://proxyhost:8080/openai/... that routes all traffic sent to that route to https://api.openai.com/...

    It further defines a scheduler for the gpt-4 model that sets:

    • maxQueueSize defines how many requests are allowed to sit in the queue prior to being scheduled
    • maxQueueWait defines how long, in seconds, it will allow a request to wait before it starts rejecting additional requests with RateLimit errors.
    • rpm the maximum requests per minute
    • tpm the maximum tokens per minute

    Requests and tokens per minute are consumed as requests come in and recover over time. If a request cannot be immediately processed then it will sit in the queue for up to maxQueueWait seconds, and up to maxQueueSize items can be outstanding in the queue.

    Set a config for every model you want to support.

  2. [Optional] Run tests

    ./test.sh
  3. [Optional] Look at code coverage

    go tool cover -html=coverage.out -o coverage.html
    
  4. Build the application

    ./build.sh
  5. Run the application

    ./llproxy
  6. Direct traffic to your proxy server

    import openai
    openai.api_base = 'http://<your-proxy-address>:8080/openai/v1'
    ...