Catch all exceptions; save full responses and errors #1

Merged (1 commit, Jan 15, 2025)

Changes from all commits
40 changes: 29 additions & 11 deletions crawl.py
@@ -1,10 +1,13 @@
 #!/usr/bin/env python3

+import json
+from json.decoder import JSONDecodeError
 import requests
 import csv
 import argparse
 import signal
 import sys
+import traceback
 import os
parser = argparse.ArgumentParser(description='Make requests to Cookiemonster API')
@@ -57,15 +60,11 @@ def post_request(url, location):

     try:
         response = requests.post(endpoint, json=payload, headers=headers, timeout=120)
-        status_code = response.status_code
-        response_json = response.json()
-        identified = response_json.get("identified", False)
-        error = response_json.get("error")
-        return url, status_code, identified, error
+        return response.status_code, response.text
     except requests.Timeout:
-        return url, None, False, "Request timed out"
+        return None, "Request error: timeout"
     except requests.RequestException as e:
-        return url, None, False, f"Request error: {str(e)}"
+        return None, f"Request error: {str(e)}"
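The reworked post_request now has a two-outcome contract: any HTTP response yields (status_code, raw body text), and any transport-level failure yields (None, reason). A hedged sketch of that contract follows; the injectable `post` parameter is an addition for testability and is not part of the PR, which calls requests.post directly:

```python
import requests

def post_request(endpoint, payload, headers=None, post=requests.post):
    # Same contract as the PR's version: an HTTP response of any status
    # yields (status_code, body text); a transport failure yields
    # (None, "Request error: ..."). `post` is injectable here purely so
    # the function can be exercised without a network.
    try:
        response = post(endpoint, json=payload, headers=headers, timeout=120)
        return response.status_code, response.text
    except requests.Timeout:
        return None, "Request error: timeout"
    except requests.RequestException as e:
        return None, f"Request error: {str(e)}"
```

Note that requests.Timeout subclasses requests.RequestException, so the ordering of the except clauses matters: the more specific Timeout arm must come first, as it does in the PR.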

# Read input CSV with list of domains
# Read line by line, skipping rows if necessary
@@ -91,11 +90,24 @@ def read_csv_make_requests(skip):

# Write to both stdout and output file
def crawl_url(url, location):
url, status_code, identified, error = post_request(url, location)
region = "San Francisco" if location == "" else "Europe"
Collaborator: What gets printed now if no location is specified?

Collaborator (author): It just prints the proxy URL, so for the direct case it's an empty string.

Collaborator: Given that we'll be putting this in EC2, I think we should print out what the vantage point is.

Collaborator (author): (I figured the primary interface for analyzing the results will be running scripts on the output file anyway; the console logging is just a convenience.)

Collaborator: Can you just quickly show what the output looks like for a sample website?

Collaborator (author): I just ran it for a bit; log output:

failed for https://www.nehnutelnosti.sk (bg.stealthtunnel.net): Navigation timeout of 30000 ms exceeded
failed for https://spserv.microsoft.com (): net::ERR_CERT_COMMON_NAME_INVALID at https://spserv.microsoft.com
failed for https://spserv.microsoft.com (bg.stealthtunnel.net): net::ERR_CERT_COMMON_NAME_INVALID at https://spserv.microsoft.com
failed for https://informaticacloud.com (): net::ERR_NAME_NOT_RESOLVED at https://informaticacloud.com
failed for https://informaticacloud.com (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://informaticacloud.com
failed for https://ninebot.com (): Navigation timeout of 30000 ms exceeded
failed for https://ninebot.com (bg.stealthtunnel.net): Navigation timeout of 30000 ms exceeded
failed for https://first-ns.de (): net::ERR_SSL_PROTOCOL_ERROR at https://first-ns.de
failed for https://first-ns.de (bg.stealthtunnel.net): net::ERR_SSL_PROTOCOL_ERROR at https://first-ns.de
failed for https://cdnhwc8.com (): net::ERR_NAME_NOT_RESOLVED at https://cdnhwc8.com
failed for https://cdnhwc8.com (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://cdnhwc8.com
failed for https://audiencenet.ru (): net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru
failed for https://audiencenet.ru (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://audiencenet.ru
identified for https://www.viabcp.com ()!
identified for https://www.viabcp.com (bg.stealthtunnel.net)!

That only shows failures and identifications, which is the most relevant information to see when checking on the status of a crawl.

The full results from the file look more like this (truncated for brevity, since there are a lot more uninteresting crawl results that did not get logged to the console above):

[200, "", "{\"url\":\"https://www.tanishq.co.in\",\"timestamp\":1736886632362,\"scriptSources\":[\"www.tanishq.co.in\",\"code.jquery.com\",\"ajax.googleapis.com\",\"accounts.tatadigital.com\",\"cdn-api.syteapi.com\",\"asset.fwcdn3.com\",\"cdn.cquotient.com\",\"cdn.syteapi.com\",\"fireworkapi1.com\",\"cdn.mirrar.com\",\"e.cquotient.com\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.tanishq.co.in\",\"timestamp\":1736886640770,\"scriptSources\":[\"www.tanishq.co.in\"],\"classifiersUsed\":[],\"scrollBlocked\":false}"]
[200, "", "{\"url\":\"https://audiencenet.ru\",\"timestamp\":1736886647729,\"scriptSources\":[],\"error\":\"net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru\"}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://audiencenet.ru\",\"timestamp\":1736886649361,\"scriptSources\":[],\"error\":\"net::ERR_TUNNEL_CONNECTION_FAILED at https://audiencenet.ru\"}"]
[200, "", "{\"url\":\"https://www.novibet.gr\",\"timestamp\":1736886652017,\"scriptSources\":[\"www.novibet.gr\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.novibet.gr\",\"timestamp\":1736886658173,\"scriptSources\":[\"www.novibet.gr\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "", "{\"url\":\"https://www.viabcp.com\",\"timestamp\":1736886666177,\"scriptSources\":[\"www.viabcp.com\",\"assets.adobedtm.com\",\"unruffled-shannon-1a7413.netlify.app\",\"apis.google.com\",\"www.google.com\",\"www.gstatic.com\",\"bcpr42sh.staticmon.com\"],\"identified\":true,\"markup\":\"<div class=\\\"bcp_contenedor_aviso container\\\">\\n  <div class=\\\"bcp_grupo_texto\\\">\\n    <div data-translate=\\\"true\\\" class=\\\"bcp_titulo\\\" tabindex=\\\"0\\\" id=\\\"dialogTitleModalConsentimiento\\\">Pol\u00edtica de Cookies</div>\\n    <div class=\\\"bcp_mensaje\\\" tabindex=\\\"0\\\" id=\\\"dialogDescriptionModalConsentimiento\\\">\\n      <span data-translate=\\\"true\\\">Esta web utiliza cookies necesarias y, con tu consentimiento, utilizaremos cookies de personalizaci\u00f3n y marketing.</span>\\n      <span data-translate=\\\"true\\\">Para m\u00e1s informaci\u00f3n revisa nuestra </span><a data-translate=\\\"true\\\" href=\\\"/transparencia/#protecciondedatos\\\" rel=\\\"noopener noreferrer\\\" target=\\\"_blank\\\" title=\\\"\\\">Pol\u00edtica de Privacidad y Pol\u00edtica de Cookies.</a>\\n    </div>\\n  </div>\\n  <div class=\\\"bcp_grupo_botones\\\">\\n    <button class=\\\"bcp_btn_configurar bcp_boton_blanco\\\" data-translate=\\\"true\\\">Configuraci\u00f3n\\n</button>\\n    <button class=\\\"bcp_btn_aceptar bcp_boton_naranja\\\" data-translate=\\\"true\\\">Aceptar todo</button>\\n  </div>\\n</div>\",\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.viabcp.com\",\"timestamp\":1736886673032,\"scriptSources\":[\"www.viabcp.com\",\"assets.adobedtm.com\",\"unruffled-shannon-1a7413.netlify.app\",\"apis.google.com\",\"bcpr42sh.staticmon.com\",\"www.google.com\",\"www.gstatic.com\"],\"identified\":true,\"markup\":\"<div class=\\\"bcp_contenedor_aviso container\\\">\\n  <div class=\\\"bcp_grupo_texto\\\">\\n    <div data-translate=\\\"true\\\" class=\\\"bcp_titulo\\\" tabindex=\\\"0\\\" id=\\\"dialogTitleModalConsentimiento\\\">Pol\u00edtica de Cookies</div>\\n    <div class=\\\"bcp_mensaje\\\" tabindex=\\\"0\\\" id=\\\"dialogDescriptionModalConsentimiento\\\">\\n      <span data-translate=\\\"true\\\">Esta web utiliza cookies necesarias y, con tu consentimiento, utilizaremos cookies de personalizaci\u00f3n y marketing.</span>\\n      <span data-translate=\\\"true\\\">Para m\u00e1s informaci\u00f3n revisa nuestra </span><a data-translate=\\\"true\\\" href=\\\"/transparencia/#protecciondedatos\\\" rel=\\\"noopener noreferrer\\\" target=\\\"_blank\\\" title=\\\"\\\">Pol\u00edtica de Privacidad y Pol\u00edtica de Cookies.</a>\\n    </div>\\n  </div>\\n  <div class=\\\"bcp_grupo_botones\\\">\\n    <button class=\\\"bcp_btn_configurar bcp_boton_blanco\\\" data-translate=\\\"true\\\">Configuraci\u00f3n\\n</button>\\n    <button class=\\\"bcp_btn_aceptar bcp_boton_naranja\\\" data-translate=\\\"true\\\">Aceptar todo</button>\\n  </div>\\n</div>\",\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]

Each line contains the full response from cookiemonster, so there's no need to decide up front what data to capture; it can all be handled at analysis time.
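That analysis-time workflow can be sketched as follows (the function name and sample line here are illustrative, not from the PR): each output line takes two json.loads passes, one for the [status_code, location, body] array and one for the body string itself.

```python
import json

def load_results(lines):
    # Each crawl output line is a JSON array [status_code, location, body],
    # where body is itself a JSON string produced by cookiemonster.
    for line in lines:
        status_code, location, body = json.loads(line)
        yield status_code, location, json.loads(body)

sample = '[200, "bg.stealthtunnel.net", "{\\"url\\":\\"https://example.com\\",\\"identified\\":true}"]'
for status, location, result in load_results([sample]):
    print(status, location, result["identified"])  # prints: 200 bg.stealthtunnel.net True
```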

-    log_entry = f"URL: {url}, Identified: {identified}, Status: {status_code}, Location: {region}, Error: {error}\n"
-    print(log_entry, end='', flush=True) # Print to console
+    status_code, response_body = post_request(url, location)
+    if status_code is None:
+        print(f"failed for {url} ({location}): {response_body}")
+    else:
+        try:
+            response_json = json.loads(response_body)
+            error = response_json.get("error")
+            identified = response_json.get("identified", False)
+        except JSONDecodeError as e:
+            error = e
+        finally:
+            if error is not None:
+                print(f"failed for {url} ({location}): {error}")
+            elif identified:
+                print(f"identified for {url} ({location})!")
+    log_entry = json.dumps([status_code, location, response_body])
+    output_file.write(log_entry) # Write to output file
+    output_file.write('\n')
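The branching in the new crawl_url distinguishes four cases: a body that fails to parse as JSON, a body whose JSON carries an error, an identified banner, and an ordinary result. As an illustrative sketch (this helper is not code from the PR), the same checks can be isolated into a single function:

```python
import json
from json.decoder import JSONDecodeError

def classify_body(body):
    # Mirrors crawl_url's checks: non-JSON bodies and bodies whose JSON
    # carries an "error" field count as failures; "identified" wins next;
    # anything else is an ordinary, unlogged crawl result.
    try:
        data = json.loads(body)
    except JSONDecodeError as e:
        return f"failed: {e}"
    if data.get("error") is not None:
        return f"failed: {data['error']}"
    if data.get("identified", False):
        return "identified"
    return "ok"
```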

 if __name__ == "__main__":
     # Check if the input file exists
@@ -106,6 +118,12 @@ def crawl_url(url, location):
     output_file = open(args.output, 'a')
     try:
         read_csv_make_requests(args.skip)
+    except Exception as e:
+        print(e)
+        tb = traceback.format_exc()
+    else:
+        tb = "No error"
     finally:
+        print(tb)
         cleanup() # Ensure cleanup is always called
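The pattern in this last hunk (capture the full traceback if anything was raised, a placeholder otherwise, and report it while guaranteeing cleanup) can be sketched in isolation; run_with_cleanup and its parameters are illustrative names, not part of the PR:

```python
import traceback

def run_with_cleanup(task, cleanup):
    # try/except/else/finally: the except arm captures the full traceback
    # of any exception, the else arm records that nothing went wrong, and
    # finally guarantees cleanup runs either way.
    try:
        task()
    except Exception:
        tb = traceback.format_exc()
    else:
        tb = "No error"
    finally:
        cleanup()
    return tb
```

Returning the traceback string instead of printing it keeps the sketch testable; the PR prints it just before calling cleanup().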
