Refactor auto-archiver to use a modular structure for feeders/extractors/enrichers etc. #185

Open · wants to merge 83 commits into base: main

Commits (83)
c41d93a Use already implemented helper to get version (pjrobertson, Jan 21, 2025)
bdfc855 Ignore pylint statements for manifest files (pjrobertson, Jan 21, 2025)
03f3770 Add __manifest__.py for generic_extractor (pjrobertson, Jan 21, 2025)
241b350 Initial changes to move to '__manifest__' format (pjrobertson, Jan 21, 2025)
4830f99 Get parsing of manifest and combining with config file working (pjrobertson, Jan 21, 2025)
7b3a146 Create manifest files for archiver modules. (erinhmclark, Jan 21, 2025)
54995ad Further tweaks based on __manifest__.py files (pjrobertson, Jan 22, 2025)
b6b0858 Switch back to using yaml with dot notation (pjrobertson, Jan 22, 2025)
ade5ea0 Tidy up imports + start on loading modules - program now starts much … (pjrobertson, Jan 22, 2025)
99c8c69 Manifests for databases (erinhmclark, Jan 22, 2025)
c517d35 Merge branch 'load_modules' into more_mainifests (erinhmclark, Jan 22, 2025)
550097a Get module loading working properly (pjrobertson, Jan 22, 2025)
65ef46d Fix loading already loaded modules - don't load them twice (pjrobertson, Jan 22, 2025)
79684f8 Set up feeder manifests (not merged by source yet) (erinhmclark, Jan 23, 2025)
9db26cd Merge branch 'load_modules' into more_mainifests (erinhmclark, Jan 23, 2025)
1274a1b More manifests, base modules and rename from archiver to extractor. (erinhmclark, Jan 23, 2025)
c3403ce Rename storages for clarity (erinhmclark, Jan 23, 2025)
50f4ebc Move storage configs into individual manifests, assert format on useage. (erinhmclark, Jan 23, 2025)
b27bf8f Fix up loading/storing configs + unit tests (pjrobertson, Jan 23, 2025)
06f6e34 Revert changes to orchestrator to avoid merge conflicts (pjrobertson, Jan 23, 2025)
9befb97 Fix loading modules when entry_point isn't set (pjrobertson, Jan 23, 2025)
cbafbfa Revert Dockerfile changes (erinhmclark, Jan 24, 2025)
ba4b330 Merge remote-tracking branch 'origin/more_mainifests' into more_maini… (erinhmclark, Jan 24, 2025)
aa7ca93 Update manifests and modules (erinhmclark, Jan 24, 2025)
0453d95 fix config parsing in manifests (erinhmclark, Jan 24, 2025)
024fe58 fix config parsing in manifests, remove module level configs (erinhmclark, Jan 24, 2025)
1942e8b Gsheets utility revert (erinhmclark, Jan 24, 2025)
f1e9ab6 Merge branch 'main' into load_modules (pjrobertson, Jan 24, 2025)
3fc6ddf Tweaks to logging strings (pjrobertson, Jan 24, 2025)
dd402b4 Fix and add types to manifest (erinhmclark, Jan 24, 2025)
96b35a2 Rm gsheet references in utils (erinhmclark, Jan 24, 2025)
21a7ff0 Fix types in manifests (erinhmclark, Jan 27, 2025)
ebebd27 Fix archiver to extractor naming (erinhmclark, Jan 27, 2025)
0b03f54 Fix up config validation, and allow for custom 'validators' (pjrobertson, Jan 27, 2025)
14e2479 Merge branch 'more_mainifests' into load_modules (pjrobertson, Jan 27, 2025)
7fd9586 Further fixes/changes to loading 'types' for config + manifest edits (pjrobertson, Jan 27, 2025)
f68e272 Refactor loader + step into module, use LazyBaseModule and BaseModule (pjrobertson, Jan 27, 2025)
e307401 Fix loading/saving to orchestration file with comments (pjrobertson, Jan 27, 2025)
e1a9373 Refactoring for new config setup (erinhmclark, Jan 27, 2025)
6c67eff remove name reference in local_storage.py (erinhmclark, Jan 27, 2025)
57b3bec Google sheets feeder and database implemented. (erinhmclark, Jan 27, 2025)
1d2a1d4 Allow framework for config settings that should not be stored in conf… (pjrobertson, Jan 28, 2025)
27b25c5 Validate orchestration.yaml file inputs - so if a user enters invalid… (pjrobertson, Jan 28, 2025)
9635449 more user friendly error logging when config issues are found (pjrobertson, Jan 28, 2025)
7a4871d Fix up unit tests for new structure (pjrobertson, Jan 28, 2025)
dcd5576 set metadata enricher to requires_setup=True (requires exiftool which… (pjrobertson, Jan 28, 2025)
3d37c49 Tidy ups + unit tests: (pjrobertson, Jan 29, 2025)
00a7018 Fix up dependency checking (use 'dependencies' instead of 'external_d… (pjrobertson, Jan 29, 2025)
18ff36c Add ruamel to dependencies (replaces pyyaml) (pjrobertson, Jan 29, 2025)
cddae65 Update modules for new core structure. (erinhmclark, Jan 30, 2025)
b7d9145 Further tidyups + refactoring for new structure (pjrobertson, Jan 30, 2025)
fade68c Fix up unit tests - dataclass + subclasses not having @dataclass was … (pjrobertson, Jan 30, 2025)
5274388 Fix manifests for required configs. (erinhmclark, Jan 30, 2025)
953011f Don't make modules 'dataclasses' (pjrobertson, Jan 30, 2025)
d6b4b7a Further cleanup (pjrobertson, Jan 30, 2025)
d76063c Fix unit tests (pjrobertson, Jan 30, 2025)
c25d5ca Remove ArchivingContext completely (pjrobertson, Jan 30, 2025)
9a8c94b Fix getting/setting folder context for metadata (pjrobertson, Feb 3, 2025)
9c9e9b3 Remove lingering reference to ArchivingContext (pjrobertson, Feb 3, 2025)
7a2be5a Add cookie extraction to 'authentication' options, get generic_extrac… (pjrobertson, Feb 3, 2025)
7ec328a Remove cookie options from generic_extractor - it now uses 'authentic… (pjrobertson, Feb 3, 2025)
c574b69 Set up screenshot enricher to use authentication/cookies (pjrobertson, Feb 3, 2025)
72b5ea9 Restore headless arg (pjrobertson, Feb 3, 2025)
a873e56 Remove old csv_feeder file - now inside a module (pjrobertson, Feb 4, 2025)
b301f60 Fix using validators set in __manifest__.py (pjrobertson, Feb 4, 2025)
78e6418 Unit tests for csv feeder + fix some bugs (pjrobertson, Feb 4, 2025)
034197a Fix typos in csv feeder docs (in manifest) (pjrobertson, Feb 4, 2025)
0633e17 Close the facebook 'login' window if it's there - to allow for proper… (pjrobertson, Feb 4, 2025)
91ca325 Update yt-dlp to latest version + remove code no longer needed from b… (pjrobertson, Feb 4, 2025)
48abb5e Remove dangling screenshot_enricher file. Moved to modules/screenshot… (pjrobertson, Feb 4, 2025)
6ab8fd2 Tidy up setting modules as Orchestrator attributes on startup. (pjrobertson, Feb 5, 2025)
a506f2a Clarify that an extractor's method can also return False if no valid… (pjrobertson, Feb 6, 2025)
63aba6a Fix sphinx-autoapi imports (pjrobertson, Feb 7, 2025)
1fad37f Remove blank file (pjrobertson, Feb 7, 2025)
e9dd321 Fix setting cli_feeder as default feeder on clean install (pjrobertson, Feb 10, 2025)
74207d7 Implementation tests for auto-archiver (pjrobertson, Feb 10, 2025)
f3f6b92 Implementation test cleanup (pjrobertson, Feb 10, 2025)
7c84804 adds better info about wrong/missing modules (msramalho, Feb 10, 2025)
8fb3dc7 fixing telethon extractor to use default entrypoint (msramalho, Feb 10, 2025)
15abf68 decouples s3_storage from hash_enricher (msramalho, Feb 10, 2025)
ab6cf52 fixes bad hash initialization (msramalho, Feb 10, 2025)
12f14cc fixes gsheet feeder<->db connection via context. (msramalho, Feb 10, 2025)
ed81dcd Remove dangling 'b = ' from config.py (pjrobertson, Feb 10, 2025)

Files changed
3 changes: 3 additions & 0 deletions .pylintrc
@@ -0,0 +1,3 @@
[MAIN]

ignore-patterns=(.*tests.*.py, __manifest__.py)
2 changes: 1 addition & 1 deletion README.md
@@ -218,7 +218,7 @@ configurations:
 ## Running on Google Sheets Feeder (gsheet_feeder)
 The `--gsheet_feeder.sheet` property is the name of the Google Sheet to check for URLs.
 This sheet must have been shared with the Google Service account used by `gspread`.
-This sheet must also have specific columns (case-insensitive) in the `header` as specified in [Gsheet.configs](src/auto_archiver/utils/gsheet.py). The default names of these columns and their purpose is:
+This sheet must also have specific columns (case-insensitive) in the `header` as specified in [gsheet_feeder.__manifest__.py](src/auto_archiver/modules/gsheet_feeder/__manifest__.py). The default names of these columns and their purpose is:
 
 Inputs:
204 changes: 167 additions & 37 deletions poetry.lock

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions pyproject.toml
@@ -37,7 +37,6 @@ dependencies = [
     "pdqhash (>=0.0.0)",
     "pillow (>=0.0.0)",
     "python-slugify (>=0.0.0)",
-    "pyyaml (>=0.0.0)",
     "dateparser (>=0.0.0)",
     "python-twitter-v2 (>=0.0.0)",
     "instaloader (>=0.0.0)",
@@ -47,7 +46,7 @@ dependencies = [
     "cryptography (>=41.0.0,<42.0.0)",
     "boto3 (>=1.28.0,<2.0.0)",
     "dataclasses-json (>=0.0.0)",
-    "yt-dlp (==2025.1.12)",
+    "yt-dlp (>=2025.1.26,<2026.0.0)",
     "numpy (==2.1.3)",
     "vk-url-scraper (>=0.0.0)",
     "requests[socks] (>=0.0.0)",
@@ -57,11 +56,14 @@ dependencies = [
     "retrying (>=0.0.0)",
     "tsp-client (>=0.0.0)",
     "certvalidator (>=0.0.0)",
+    "rich-argparse (>=1.6.0,<2.0.0)",
+    "ruamel-yaml (>=0.18.10,<0.19.0)",
 ]
 
 [tool.poetry.group.dev.dependencies]
 pytest = "^8.3.4"
 autopep8 = "^2.3.1"
+pytest-loguru = "^0.4.0"
 
 [tool.poetry.group.docs.dependencies]
 sphinx = "^8.1.3"
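
The dependency change above swaps pyyaml for ruamel-yaml (commit 18ff36c), which is what lets commit e307401 load and save the orchestration file without destroying its comments. A minimal sketch of that round-trip behaviour; the steps/feeders keys here are illustrative, not the project's actual schema:

```
# Round-trip YAML editing with ruamel-yaml: comments survive a
# load/modify/dump cycle, which plain pyyaml cannot do.
from io import StringIO

from ruamel.yaml import YAML

source = """\
steps:
  feeders: [cli_feeder]  # default feeder on a clean install
"""

yaml = YAML()  # round-trip mode is the default
data = yaml.load(source)
data["steps"]["extractors"] = ["generic_extractor"]  # programmatic edit

out = StringIO()
yaml.dump(data, out)
print(out.getvalue())  # the inline comment is still present
```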
43 changes: 23 additions & 20 deletions scripts/create_update_gdrive_oauth_token.py
@@ -12,7 +12,7 @@
 # Code below from https://developers.google.com/drive/api/quickstart/python
 # Example invocation: py scripts/create_update_gdrive_oauth_token.py -c secrets/credentials.json -t secrets/gd-token.json
 
-SCOPES = ['https://www.googleapis.com/auth/drive']
+SCOPES = ["https://www.googleapis.com/auth/drive.file"]
 
 
 @click.command(
@@ -23,67 +23,70 @@
     "-c",
     type=click.Path(exists=True),
     help="path to the credentials.json file downloaded from https://console.cloud.google.com/apis/credentials",
-    required=True
+    required=True,
 )
 @click.option(
     "--token",
     "-t",
     type=click.Path(exists=False),
     default="gd-token.json",
     help="file where to place the OAuth token, defaults to gd-token.json which you must then move to where your orchestration file points to, defaults to gd-token.json",
-    required=True
+    required=True,
 )
 def main(credentials, token):
     # The file token.json stores the user's access and refresh tokens, and is
     # created automatically when the authorization flow completes for the first time.
     creds = None
     if os.path.exists(token):
-        with open(token, 'r') as stream:
+        with open(token, "r") as stream:
             creds_json = json.load(stream)
         # creds = Credentials.from_authorized_user_file(creds_json, SCOPES)
-        creds_json['refresh_token'] = creds_json.get("refresh_token", "")
+        creds_json["refresh_token"] = creds_json.get("refresh_token", "")
         creds = Credentials.from_authorized_user_info(creds_json, SCOPES)
 
     # If there are no (valid) credentials available, let the user log in.
     if not creds or not creds.valid:
         if creds and creds.expired and creds.refresh_token:
-            print('Requesting new token')
+            print("Requesting new token")
             creds.refresh(Request())
         else:
-            print('First run through so putting up login dialog')
+            print("First run through so putting up login dialog")
             # credentials.json downloaded from https://console.cloud.google.com/apis/credentials
             flow = InstalledAppFlow.from_client_secrets_file(credentials, SCOPES)
             creds = flow.run_local_server(port=55192)
         # Save the credentials for the next run
-        with open(token, 'w') as token:
-            print('Saving new token')
+        with open(token, "w") as token:
+            print("Saving new token")
             token.write(creds.to_json())
     else:
-        print('Token valid')
+        print("Token valid")
 
     try:
-        service = build('drive', 'v3', credentials=creds)
+        service = build("drive", "v3", credentials=creds)
 
         # About the user
         results = service.about().get(fields="*").execute()
-        emailAddress = results['user']['emailAddress']
+        emailAddress = results["user"]["emailAddress"]
         print(emailAddress)
 
         # Call the Drive v3 API and return some files
-        results = service.files().list(
-            pageSize=10, fields="nextPageToken, files(id, name)").execute()
-        items = results.get('files', [])
+        results = (
+            service.files()
+            .list(pageSize=10, fields="nextPageToken, files(id, name)")
+            .execute()
+        )
+        items = results.get("files", [])
 
         if not items:
-            print('No files found.')
+            print("No files found.")
             return
-        print('Files:')
+        print("Files:")
         for item in items:
-            print(u'{0} ({1})'.format(item['name'], item['id']))
+            print("{0} ({1})".format(item["name"], item["id"]))
 
     except HttpError as error:
-        print(f'An error occurred: {error}')
+        print(f"An error occurred: {error}")
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     main()
29 changes: 29 additions & 0 deletions scripts/telegram_setup.py
@@ -0,0 +1,29 @@
"""
This script is used to create a new session file for the Telegram client.
To do this you must first create a Telegram application at https://my.telegram.org/apps
and store your id and hash in the environment variables TELEGRAM_API_ID and TELEGRAM_API_HASH.
Create a .env file, or add the following to your environment:
```
export TELEGRAM_API_ID=[YOUR_ID_HERE]
export TELEGRAM_API_HASH=[YOUR_HASH_HERE]
```
Then run this script to create a new session file.

You will need to provide your phone number and a 2FA code the first time you run this script.
"""


import os
from telethon.sync import TelegramClient
from loguru import logger


# Create a Telegram client session using the API credentials from the environment
API_ID = os.getenv("TELEGRAM_API_ID")
API_HASH = os.getenv("TELEGRAM_API_HASH")
SESSION_FILE = "secrets/anon-insta"

os.makedirs("secrets", exist_ok=True)
with TelegramClient(SESSION_FILE, API_ID, API_HASH) as client:
    logger.success(f"New session file created: {SESSION_FILE}.session")
7 changes: 0 additions & 7 deletions src/auto_archiver/__init__.py

This file was deleted.

10 changes: 3 additions & 7 deletions src/auto_archiver/__main__.py
@@ -1,13 +1,9 @@
 """ Entry point for the auto_archiver package. """
-from . import Config
-from . import ArchivingOrchestrator
+from auto_archiver.core.orchestrator import ArchivingOrchestrator
+import sys
 
 def main():
-    config = Config()
-    config.parse()
-    orchestrator = ArchivingOrchestrator(config)
-    for r in orchestrator.feed(): pass
-
+    ArchivingOrchestrator().run(sys.argv[1:])
 
 if __name__ == "__main__":
     main()
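
Since the new entry point collapses to a single ArchivingOrchestrator().run(argv) call, the archiver can also be driven programmatically. A sketch under stated assumptions: the import path comes from the diff above, but the "--config" flag and the secrets/orchestration.yaml path are illustrative, not confirmed option names:

```
# Programmatic equivalent of `python -m auto_archiver`, mirroring the new
# __main__.py. run() receives an argv list and parses it exactly as the
# command line would; "--config secrets/orchestration.yaml" is an assumed
# flag for illustration only.
from auto_archiver.core.orchestrator import ArchivingOrchestrator


def archive_with_config(argv: list[str]) -> None:
    ArchivingOrchestrator().run(argv)


if __name__ == "__main__":
    archive_with_config(["--config", "secrets/orchestration.yaml"])
```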
16 changes: 0 additions & 16 deletions src/auto_archiver/archivers/__init__.py

This file was deleted.

1 change: 0 additions & 1 deletion src/auto_archiver/archivers/generic_archiver/__init__.py

This file was deleted.

2 changes: 0 additions & 2 deletions src/auto_archiver/archivers/youtubedl_archiver.py

This file was deleted.

12 changes: 9 additions & 3 deletions src/auto_archiver/core/__init__.py
@@ -3,9 +3,15 @@
 """
 from .metadata import Metadata
 from .media import Media
-from .step import Step
-from .context import ArchivingContext
+from .module import BaseModule
 
 # cannot import ArchivingOrchestrator/Config to avoid circular dep
 # from .orchestrator import ArchivingOrchestrator
-# from .config import Config
+# from .config import Config
+
+from .database import Database
+from .enricher import Enricher
+from .feeder import Feeder
+from .storage import Storage
+from .extractor import Extractor
+from .formatter import Formatter
142 changes: 142 additions & 0 deletions src/auto_archiver/core/base_module.py
@@ -0,0 +1,142 @@

from urllib.parse import urlparse
from typing import Mapping, Any
from abc import ABC
from copy import deepcopy, copy
from tempfile import TemporaryDirectory
from auto_archiver.utils import url as UrlUtil

from loguru import logger

class BaseModule(ABC):

    """
    Base module class. All modules should inherit from this class.

    The exact methods a class implements will depend on the type of module it is,
    however all modules have a .setup(config: dict) method to run any setup code
    (e.g. logging in to a site, spinning up a browser etc.)

    See BaseModule.MODULE_TYPES for the types of modules you can create, noting that
    a subclass can be of multiple types. For example, a module that extracts data from
    a website and stores it in a database would be both an 'extractor' and a 'database' module.

    Each module is a python package, and should have a __manifest__.py file in the
    same directory as the module file. The __manifest__.py specifies the module information
    like name, author, version, dependencies etc. See BaseModule._DEFAULT_MANIFEST for the
    default manifest structure.
    """

    MODULE_TYPES = [
        'feeder',
        'extractor',
        'enricher',
        'database',
        'storage',
        'formatter'
    ]

    _DEFAULT_MANIFEST = {
        'name': '',  # the display name of the module
        'author': 'Bellingcat',  # creator of the module, leave this as Bellingcat or set your own name!
        'type': [],  # the type of the module, can be one or more of BaseModule.MODULE_TYPES
        'requires_setup': True,  # whether or not this module requires additional setup such as setting API Keys or installing additional software
        'description': '',  # a description of the module
        'dependencies': {},  # external dependencies, e.g. python packages or binaries, in dictionary format
        'entry_point': '',  # the entry point for the module, in the format 'module_name::ClassName'. This can be left blank to use the default entry point of module_name::ModuleName
        'version': '1.0',  # the version of the module
        'configs': {}  # any configuration options this module has, these will be exposed to the user in the config file or via the command line
    }

    config: Mapping[str, Any]
    authentication: Mapping[str, Mapping[str, str]]
    name: str

    # this is set by the orchestrator prior to archiving
    tmp_dir: TemporaryDirectory = None

    @property
    def storages(self) -> list:
        return self.config.get('storages', [])

    def setup(self, config: dict):

        authentication = config.get('authentication', {})
        # extract out concatenated sites
        for key, val in copy(authentication).items():
            if "," in key:
                for site in key.split(","):
                    authentication[site] = val
                del authentication[key]

        # this is important. Each instance is given its own deepcopied config, so modules cannot
        # change values to affect other modules
        config = deepcopy(config)
        authentication = deepcopy(config.pop('authentication', {}))

        self.authentication = authentication
        self.config = config
        for key, val in config.get(self.name, {}).items():
            setattr(self, key, val)

    def auth_for_site(self, site: str, extract_cookies=True) -> Mapping[str, Any]:
        """
        Returns the authentication information for a given site. This is used to authenticate
        with a site before extracting data. The site should be the domain of the site, e.g. 'twitter.com'

        extract_cookies: bool - whether or not to extract cookies from the given browser and return the
        cookie jar (disabling can speed up processing if you don't actually need the cookies jar)

        Currently, the dict can have keys of the following types:
        - username: str - the username to use for login
        - password: str - the password to use for login
        - api_key: str - the API key to use for login
        - api_secret: str - the API secret to use for login
        - cookie: str - a cookie string to use for login (specific to this site)
        - cookies_jar: YoutubeDLCookieJar | http.cookiejar.MozillaCookieJar - a cookie jar compatible with requests (e.g. `requests.get(cookies=cookie_jar)`)
        """
        # TODO: think about if/how we can deal with sites that have multiple domains (main one is x.com/twitter.com)
        # for now the user must enter them both, like "x.com,twitter.com" in their config. Maybe we just hard-code?

        site = UrlUtil.domain_for_url(site)
        # add the 'www' version of the site to the list of sites to check
        authdict = {}

        for to_try in [site, f"www.{site}"]:
            if to_try in self.authentication:
                authdict.update(self.authentication[to_try])
                break

        # do a fuzzy string match just to print a warning - don't use it since it's insecure
        if not authdict:
            for key in self.authentication.keys():
                if key in site or site in key:
                    logger.debug(f"Could not find exact authentication information for site '{site}'. \
did find information for '{key}' which is close, is this what you meant? \
If so, edit your authentication settings to make sure it exactly matches.")

        def get_ytdlp_cookiejar(args):
            import yt_dlp
            from yt_dlp import parse_options
            logger.debug(f"Extracting cookies from settings: {args[1]}")
            # parse_options returns a named tuple as follows, we only need the ydl_options part
            # collections.namedtuple('ParsedOptions', ('parser', 'options', 'urls', 'ydl_opts'))
            ytdlp_opts = getattr(parse_options(args), 'ydl_opts')
            return yt_dlp.YoutubeDL(ytdlp_opts).cookiejar

        # get the cookies jar, prefer the browser cookies over the file
        if 'cookies_from_browser' in self.authentication:
            authdict['cookies_from_browser'] = self.authentication['cookies_from_browser']
            if extract_cookies:
                authdict['cookies_jar'] = get_ytdlp_cookiejar(['--cookies-from-browser', self.authentication['cookies_from_browser']])
        elif 'cookies_file' in self.authentication:
            authdict['cookies_file'] = self.authentication['cookies_file']
            if extract_cookies:
                authdict['cookies_jar'] = get_ytdlp_cookiejar(['--cookies', self.authentication['cookies_file']])

        return authdict

    def repr(self):
        return f"Module<'{self.display_name}' (config: {self.config[self.name]})>"