[Feature]: How to have >100 e.g. 10K or more seeds in a "list of pages". #2312

Open
tuehlarsen opened this issue Jan 15, 2025 · 2 comments
Labels: enhancement (New feature or request)

Comments

@tuehlarsen

What change would you like to see?

We would like the option to increase the maximum number of seeds (URLs) in a "list of pages" crawl. Today the limit is hardwired to 100, as shown in this screenshot:
[screenshot showing the 100-URL limit in the seed list editor]

Context

see above

@ikreymer
Member

Yes, this is something we'd like to support, including an unlimited number of page URLs.
We can raise this limit a bit, but we need to update how we store the large list on the backend.
Perhaps we'd add support for uploading a text file instead of entering URLs into the textbox here, and just store the file in the S3 bucket.
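
For illustration only, here is a minimal sketch of that approach, assuming boto3 and a plain-text object with one URL per line; the bucket name, object key, and helper functions are hypothetical and not part of the Browsertrix backend:

# Illustrative sketch only -- not the Browsertrix implementation.
# Stores a large seed list as a plain-text S3 object and streams it back
# line by line instead of keeping every URL in the crawl config itself.
import boto3

s3 = boto3.client("s3")
SEED_BUCKET = "btrix-example-bucket"              # hypothetical bucket
SEED_KEY = "seed-lists/crawlconfig-example.txt"   # hypothetical object key


def upload_seed_list(path: str) -> None:
    """Upload a local text file (one URL per line) to S3."""
    s3.upload_file(path, SEED_BUCKET, SEED_KEY)


def iter_seed_urls():
    """Stream the stored seed list back without loading it all into memory."""
    obj = s3.get_object(Bucket=SEED_BUCKET, Key=SEED_KEY)
    for raw_line in obj["Body"].iter_lines():
        url = raw_line.decode("utf-8").strip()
        if url:
            yield url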

@pirate

pirate commented Jan 27, 2025

Yup, +1 for this; I routinely have to run crawls with 10k+ URLs. My current workaround is to split the URL list into batches of 25 URLs per crawl using the API (script below). It works well for the actual crawling, but unfortunately viewing a collection with too many crawls in it seems to crash the frontend container and shows "no pages found" when trying to replay.

# expects a file urls.txt in the current working directory containing all URLs

import requests

# --------------------------------------------------------
# Configuration
# --------------------------------------------------------
BROWSERTRIX_API_BASE = "http://browsertrix.example.com"
ORG_ID = "your-browsertrix-org-id"  # Replace with your actual org UUID
AUTH_TOKEN = "your-browsertrix-auth-token"  # Replace with your actual auth token
PROFILE_ID = "your-browsertrix-profile-id"  # Replace with your actual browser profile UUID
COLLECTION_ID = "your-browsertrix-collection-id"  # Replace with your actual collection UUID

# The endpoint to create a new crawl config is:
# POST /api/orgs/{oid}/crawlconfigs/
# --------------------------------------------------------


def chunker(seq, size):
    """
    Generator to yield successive chunks of a given list (seq) of the given size.
    """
    for pos in range(0, len(seq), size):
        yield seq[pos : pos + size]


def main():
    # 1) Read in all non-empty URLs from urls.txt (one URL per line)
    with open("urls.txt", "r", encoding="utf-8") as f:
        all_urls = [line.strip() for line in f if line.strip()]

    # 2) Split URLs into chunks of 25
    chunk_size = 25
    chunks = list(chunker(all_urls, chunk_size))

    # 3) For each chunk, create a new crawl config and set "runNow": True
    #    This will instruct Browsertrix to start the crawl immediately.
    headers = {
        "Authorization": f"Bearer {AUTH_TOKEN}",
        "Content-Type": "application/json",
    }

    for i, urls_subset in enumerate(chunks, start=1):
        # Skips the first batch; drop this check to submit every batch.
        if i == 1:
            continue
        # Prepare the seeds block
        seeds = [{"url": u} for u in urls_subset]

        # Body matches the CrawlConfigIn schema
        payload = {
            "name": f"BulkCrawl #{i}",
            "runNow": True,
            "jobType": "url-list",
            "profileid": PROFILE_ID,
            "tags": [ "labelstudio" ],
            "autoAddCollections": [ COLLECTION_ID ],
            "config": {
                "seeds": seeds,
                "scopeType": "page",
                "workers": 4,
                "postLoadDelay": 5,
                # You can set other config fields here if needed:
                # "blockAds": True,
                # "useSitemap": True,
                # etc.
            },
        }

        url = f"{BROWSERTIX_API_BASE}/api/orgs/{ORG_ID}/crawlconfigs/"
        resp = requests.post(url, json=payload, headers=headers)

        if resp.status_code == 200:
            data = resp.json()
            print(f"Created crawl config {i} successfully. ID={data.get('id')}")
        else:
            print(
                f"Error creating crawl config {i}. "
                f"HTTP {resp.status_code}. Response: {resp.text}"
            )


if __name__ == "__main__":
    main()
