[Feature]: How to have >100 e.g. 10K or more seeds in a "list of pages". #2312

Open
tuehlarsen opened this issue Jan 15, 2025 · 2 comments
Labels: enhancement (New feature or request)

Comments

@tuehlarsen

What change would you like to see?

We would like the option to increase the maximum number of seeds (URLs) in a "list of pages" crawl. Today the limit is hardwired to 100, as shown in this screenshot:
[screenshot showing the 100-URL limit in the seed list editor]

Context

see above

@ikreymer
Member

Yes, this is something we'd like to support, including an unlimited number of page URLs.
We can raise this limit a bit, but we need to update how we store the large list on the backend.
Perhaps we'd add support for uploading a text file instead of entering URLs into the textbox here, and just store the file in the S3 bucket.
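
For illustration only, here is a minimal sketch of that approach, assuming boto3 and a plain-text object with one URL per line; the bucket name, object key, and helper functions are hypothetical and not part of the Browsertrix backend:

# Illustrative sketch only -- not the Browsertrix implementation.
# Stores a large seed list as a plain-text S3 object and streams it back
# line by line instead of keeping every URL in the crawl config itself.
import boto3

s3 = boto3.client("s3")
SEED_BUCKET = "btrix-example-bucket"              # hypothetical bucket
SEED_KEY = "seed-lists/crawlconfig-example.txt"   # hypothetical object key


def upload_seed_list(path: str) -> None:
    """Upload a local text file (one URL per line) to S3."""
    s3.upload_file(path, SEED_BUCKET, SEED_KEY)


def iter_seed_urls():
    """Stream the stored seed list back without loading it all into memory."""
    obj = s3.get_object(Bucket=SEED_BUCKET, Key=SEED_KEY)
    for raw_line in obj["Body"].iter_lines():
        url = raw_line.decode("utf-8").strip()
        if url:
            yield url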

@pirate

pirate commented Jan 27, 2025

Yup, +1 for this; I routinely have to run crawls with 10k+ URLs. My current workaround is to split the URL list into batches of 25 URLs per crawl using the API (script below). It works well for the actual crawling, but unfortunately viewing a collection with too many crawls in it seems to crash the frontend container and shows "no pages found" when trying to replay.

# expects a file urls.txt in the current working directory containing all URLs

import requests

# --------------------------------------------------------
# Configuration
# --------------------------------------------------------
BROWSERTRIX_API_BASE = "http://browsertrix.example.com"
ORG_ID = "your-browsertrix-org-id"  # Replace with your actual org UUID
AUTH_TOKEN = "your-browsertrix-auth-token"  # Replace with your actual auth token
PROFILE_ID = "your-browsertrix-profile-id"  # Replace with your actual browser profile UUID
COLLECTION_ID = "your-browsertrix-collection-id"  # Replace with your actual collection UUID

# The endpoint to create a new crawl config is:
# POST /api/orgs/{oid}/crawlconfigs/
# --------------------------------------------------------


def chunker(seq, size):
    """
    Generator to yield successive chunks of a given list (seq) of the given size.
    """
    for pos in range(0, len(seq), size):
        yield seq[pos : pos + size]


def main():
    # 1) Read in all non-empty URLs from urls.txt (one URL per line)
    with open("urls.txt", "r", encoding="utf-8") as f:
        all_urls = [line.strip() for line in f if line.strip()]

    # 2) Split URLs into chunks of 25
    chunk_size = 25
    chunks = list(chunker(all_urls, chunk_size))

    # 3) For each chunk, create a new crawl config and set "runNow": True
    #    This will instruct Browsertrix to start the crawl immediately.
    headers = {
        "Authorization": f"Bearer {AUTH_TOKEN}",
        "Content-Type": "application/json",
    }

    for i, urls_subset in enumerate(chunks, start=1):
        # Skips the first batch; drop this check to submit every batch.
        if i == 1:
            continue
        # Prepare the seeds block
        seeds = [{"url": u} for u in urls_subset]

        # Body matches the CrawlConfigIn schema
        payload = {
            "name": f"BulkCrawl #{i}",
            "runNow": True,
            "jobType": "url-list",
            "profileid": PROFILE_ID,
            "tags": [ "labelstudio" ],
            "autoAddCollections": [ COLLECTION_ID ],
            "config": {
                "seeds": seeds,
                "scopeType": "page",
                "workers": 4,
                "postLoadDelay": 5,
                # You can set other config fields here if needed:
                # "blockAds": True,
                # "useSitemap": True,
                # etc.
            },
        }

        url = f"{BROWSERTIX_API_BASE}/api/orgs/{ORG_ID}/crawlconfigs/"
        resp = requests.post(url, json=payload, headers=headers)

        if resp.status_code == 200:
            data = resp.json()
            print(f"Created crawl config {i} successfully. ID={data.get('id')}")
        else:
            print(
                f"Error creating crawl config {i}. "
                f"HTTP {resp.status_code}. Response: {resp.text}"
            )


if __name__ == "__main__":
    main()
