This project aims to ease the process of archiving files uploaded to Salesforce. Storage in Salesforce can be very expensive, and often there is no real need to keep old files there. This project lets you download files attached to Salesforce objects and store them on disk. To make sure the download process completed correctly, there is also a way to validate all downloaded files on disk against checksums from Salesforce.
I was recently tasked with cleaning up old files and archiving them in a GCS bucket. I had not worked with Salesforce
before, so I started searching for existing tools. I found a few, but only
hardisgroupcom/sfdx-hardis was free and usable from the CLI inside a VM. sfdx-hardis
has a `hardis:org:files:export` command responsible for file export. It is a good tool,
but it has a few shortcomings that can surface quite fast when using it with an existing SF org. Here are some of them:
- It is focused on objects - This approach is problematic when one object has thousands of files attached to it and something breaks or some files were not downloaded properly. A re-run skips the download if the object directory exists on disk, so you need to remove the whole directory to trigger the download again.
- Downloading big files sometimes breaks - I encountered issues when downloading big files (>1 GB) and was forced to manually download some of them using `curl`.
- No download validation - There seems to be no way to validate that all files were downloaded and that checksums match.
- It does not calculate API calls correctly - Before the actual download starts, you are presented with statistics about how many calls will be issued. The problem is that those calculations do not take the actual download calls into account. You can hit the API limit pretty quickly and lock out all other apps using the same REST API.
This project addresses those issues with a different approach:

- Focused on files - Based on configuration, it fetches a list of files to download and processes it. In case of an error, a re-run downloads only the missing files, so there are no problems when an SF object has thousands of files attached.
- Reuses already downloaded files - Because the same file can be linked to multiple objects, it will reuse an already downloaded file instead of downloading it again for new objects.
- Big file downloads - Using the Python `simple-salesforce` library, it can handle downloading big files pretty well.
- Validation command - When the download is complete, you can run the validation command to check files on disk against checksums from Salesforce. No additional API calls are made during the validation process.
- Mindful of Salesforce API limits - It will pause the download when the configured API usage limit is hit and resume automatically when API usage drops (see the sketch after this list).
- Parallel downloads and validation - It can download and validate in parallel using threads.
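To illustrate the API-limit behaviour, here is a minimal sketch (not the archivist's actual code) of how usage can be checked and waited out with `simple-salesforce`; the threshold and polling interval are arbitrary placeholders:

```python
import time

from simple_salesforce import Salesforce


def wait_until_api_usage_drops(sf: Salesforce, max_usage_percent: float, check_interval_sec: int = 300) -> None:
    """Block until daily API usage drops below the configured threshold (illustrative only)."""
    while True:
        # simple-salesforce exposes the org /limits REST endpoint via limits();
        # note that this check itself costs one API call.
        daily = sf.limits()["DailyApiRequests"]
        used = daily["Max"] - daily["Remaining"]
        if used / daily["Max"] * 100 < max_usage_percent:
            return
        time.sleep(check_interval_sec)
```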
The project is implemented in Python 3.11 and uses `poetry` as the package manager. Main libraries used:

- `simple-salesforce` - handling the Salesforce API
- `click` - working with the CLI
- `PyYAML` - config parsing
- `schema` - config validation
- `pytest` - testing
- `mypy` - static type checks
- `ruff` - linting and code style
- `poethepoet` - task automation

There are also some other smaller libraries used. You can check them inside `pyproject.toml`.
Important
By default, all commands in the production image are run by a user with `UID=1000` and `GID=1000`.
Be sure to set proper permissions on mounted volumes before running archivist; otherwise it may not have
permission to access the configuration file (`/archivist/config.yaml`) or the data directory (`/archivist/data/`) inside the container.
You can use the pre-made production Docker image like this:

```shell
docker run --interactive --tty \
  --volume /path/to/data/directory:/archivist/data \
  --volume /path/to/your.config.yaml:/archivist/config.yaml \
  ghcr.io/piotrekkr/salesforce-archivist:latest \
  --help
```
Tip
Depending on the number of files and the download size, operations can take a long time. You should probably use
a terminal multiplexer like `tmux` or `screen` when running this tool on a production VM.
- Install `python` (version `3.11` or greater)
- Install `poetry`
- Clone the project
  ```shell
  git clone git@github.com:piotrekkr/salesforce-archivist.git
  cd salesforce-archivist
  ```
- Install packages
  ```shell
  poetry install
  ```
- Copy the example configuration file and adjust it to your needs
  ```shell
  cp config.example.yaml config.yaml
  # ... edit config.yaml
  ```
- Enter the poetry shell and run archivist
  ```shell
  poetry shell
  archivist --help
  ```
Important
When building the image, Docker Compose will create the `archivist` user with `UID=1000` and `GID=1000`. By default,
all commands will be run as this user. If you have a different UID/GID, you should adjust those values by
setting `ARCHIVIST_UID` and `ARCHIVIST_GID` in the environment and then rebuilding the image.
You can build and run the project using Docker Compose like this:

- Clone the project
  ```shell
  git clone git@github.com:piotrekkr/salesforce-archivist.git
  cd salesforce-archivist
  ```
- Build the image and start the container
  ```shell
  docker compose up -d --build
  ```
- Copy the example configuration file and adjust it to your needs
  ```shell
  cp config.example.yaml config.yaml
  # ... edit config.yaml
  ```
- Run the project
  ```shell
  docker compose exec -it archivist archivist --help
  ```

You can also just enter the container shell with `docker compose exec -it archivist bash` and execute the `archivist` command there.
Important
The dev container uses the same base Docker Compose configuration. This means that the same UID and GID rules apply when building the image. Check the "Docker Compose" section above for more details.

If your IDE can handle development containers (as VSCode does), you should be prompted to rebuild and reopen the project within the dev container upon opening it. After this is done, archivist will be installed inside. Next steps:

- Open a terminal inside the dev container
- Copy the example configuration file and adjust it to your needs
  ```shell
  cp config.example.yaml config.yaml
  # ... edit config.yaml
  ```
- Run
  ```shell
  archivist --help
  ```
Currently, this project works with the JWT authorization flow. You can follow this tutorial to configure a private key, a self-signed certificate, and a connected app. Following the first and second steps should be enough to make it work.
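Under the hood the tool authenticates through `simple-salesforce`, which supports the JWT bearer flow. As a rough sketch only (all values below are placeholders, and parameter names should be checked against the `simple-salesforce` docs for your version):

```python
from simple_salesforce import Salesforce

# Placeholder values only; they come from the connected app set up in the tutorial steps.
sf = Salesforce(
    username="integration.user@example.com",  # Salesforce user the connected app acts as
    consumer_key="3MVG9...",                  # consumer key of the connected app
    privatekey_file="/path/to/server.key",    # private key matching the uploaded certificate
    domain="login",                           # "test" for sandbox orgs
)
print(sf.limits()["DailyApiRequests"])  # quick smoke test of the connection
```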
The example configuration file contains comments explaining the purpose of each configuration option. You can copy it and adjust it to your own needs.

```shell
cp config.example.yaml config.yaml
```
Important
Before you can use the private key (server.key) in `config.yaml`, you should encode it as a base64 string.
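For example, a throwaway Python snippet (or any equivalent tool) can produce the encoded value to paste into `config.yaml`:

```python
import base64

# Print server.key as a single-line base64 string suitable for config.yaml.
with open("server.key", "rb") as key_file:
    print(base64.b64encode(key_file.read()).decode("ascii"))
```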
The relation between Salesforce objects (entities) and files looks like this:

```mermaid
erDiagram
ContentDocumentLink {
reference ContentDocumentId
reference LinkedEntityId
}
ContentDocument {
string Id
datetime ContentModifiedDate
}
"Entity (User, Event,...)" {
string Id
}
ContentVersion {
string Id
reference ContentDocumentId
string Checksum
string Title
string FileExtension
string VersionNumber
}
Attachment {
string Id
reference ParentId
string Name
datetime LastModifiedDate
}
ContentDocument ||--|{ ContentVersion : ""
ContentDocumentLink }o--|| ContentDocument : ""
ContentDocumentLink }o--|| "Entity (User, Event,...)" : ""
Attachment }o--|| "Entity (User, Event,...)" : ""
```
Files (`ContentDocument` objects) can be linked (using `ContentDocumentLink` objects) to multiple entities
(SF objects like `User`, `Case`, and so on). A file can have multiple versions (`ContentVersion` objects).

The `Attachment` object is a legacy way of storing files in Salesforce. Using it is not recommended anymore;
however, it is still present in some orgs. An `Attachment` object is linked to other objects using the `ParentId` field,
and no versioning is available.

This project can handle both `ContentDocument` and `Attachment` objects.
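To make the model above more concrete, here is a hedged sketch of how the link and version lists can be queried for a single entity with `simple-salesforce` and plain SOQL. The function name and the single-entity filter are illustrative assumptions, not the exact queries the archivist issues:

```python
from simple_salesforce import Salesforce


def fetch_file_metadata(sf: Salesforce, entity_id: str) -> tuple[list[dict], list[dict]]:
    """Return (document links, content versions) for files linked to one entity (illustrative only)."""
    # Which documents are linked to the given entity (User, Case, ...).
    links = sf.query_all(
        "SELECT ContentDocumentId, LinkedEntityId "
        f"FROM ContentDocumentLink WHERE LinkedEntityId = '{entity_id}'"
    )["records"]
    if not links:
        return [], []

    # All versions (with checksums) of those documents.
    doc_ids = ", ".join(f"'{link['ContentDocumentId']}'" for link in links)
    versions = sf.query_all(
        "SELECT Id, ContentDocumentId, Checksum, Title, FileExtension, VersionNumber "
        f"FROM ContentVersion WHERE ContentDocumentId IN ({doc_ids})"
    )["records"]
    return links, versions
```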
Based on configuration, the download process works as follows:

- If it exists, load the already downloaded files list (`{data_dir}/downloaded_versions.csv`).
- For each object type defined in the configuration:
  - Load the existing content document link list (`{data_dir}/{obj_type}/document_links.csv`) or download it from Salesforce with the specified conditions.
  - If it exists, load the content version list (`{data_dir}/{obj_type}/content_versions.csv`) or download it from Salesforce (based on the document link list).
  - Based on those two lists, generate an in-memory mapping of files to download together with the objects they are linked to.
  - For each file on the list above:
    - Combine the file path (`{data_dir}/{obj_type}/files/{obj_id|custom_field}/{doc_id}_{version_num}_{id}_{title}.{ext}`).
    - Check if the file is already on disk or was downloaded for some other object, and if needed, copy the file to the new location and update the downloaded files list.
    - If the above is not the case, fetch the file from Salesforce and update the downloaded files list (see the sketch after this list).
    - Check API limits and, if needed, wait for usage to drop below the threshold.
  - Save the downloaded files list on disk.
- When the download for all objects is complete, show statistics.
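The "fetch the file from Salesforce" step can be sketched as a streamed download of the `ContentVersion` body, which is what keeps very large files manageable. The REST path is the standard `VersionData` endpoint; the function itself is an illustration, not the archivist's actual implementation:

```python
import requests
from simple_salesforce import Salesforce


def download_version_body(sf: Salesforce, version_id: str, target_path: str) -> None:
    """Stream a ContentVersion body to disk without holding it all in memory (illustrative only)."""
    url = (
        f"https://{sf.sf_instance}/services/data/v{sf.sf_version}"
        f"/sobjects/ContentVersion/{version_id}/VersionData"
    )
    headers = {"Authorization": f"Bearer {sf.session_id}"}
    with requests.get(url, headers=headers, stream=True, timeout=300) as response:
        response.raise_for_status()
        with open(target_path, "wb") as out_file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                out_file.write(chunk)
```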
Based on configuration, the validation process works as follows:

- If it exists, load the already validated files list (`{data_dir}/validated_versions.csv`).
- For each object type defined in the configuration:
  - Load the existing content document link list (`{data_dir}/{obj_type}/document_links.csv`) or download it from Salesforce with the specified conditions.
  - If it exists, load the content version list (`{data_dir}/{obj_type}/content_versions.csv`) or download it from Salesforce (based on the document link list).
  - Based on those two lists, generate an in-memory mapping of files to validate together with the objects they are linked to.
  - For each file on the list above:
    - If the file does not exist on disk, or was already validated and its checksum does not match the one from Salesforce, mark the file as invalid.
    - If the file was not validated before, calculate the checksum of the file on disk, update the validated list, compare the checksums, and if needed mark the file as invalid (see the sketch after this list).
  - Save the validated files list on disk.
- When validation is complete, show statistics.
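The checksum comparison above boils down to hashing the file on disk with MD5 (the algorithm Salesforce uses for the `ContentVersion` `Checksum` field) and comparing it with the stored value, roughly like this sketch:

```python
import hashlib


def is_file_valid(file_path: str, salesforce_checksum: str) -> bool:
    """Compare the MD5 of a file on disk with the checksum reported by Salesforce (illustrative only)."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as disk_file:
        # Hash in chunks so big files do not have to fit in memory.
        for chunk in iter(lambda: disk_file.read(1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest() == salesforce_checksum.lower()
```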
You can remove the CSV files from disk, and the next download will fetch the full lists from Salesforce again.

```shell
# for a chosen type
rm -rf {data_dir}/{object_type}/*.csv
# or for all types
rm -rf {data_dir}/*/*.csv
```
Already calculated checksums for downloaded files are kept in `{data_dir}/validated_versions.csv`.
You can remove this file, or selected lines from it, to trigger full validation again.
// TODO
- Add CLI options to force re-downloading the content version and document link lists
- Add some example Terraform and/or Ansible code for deploying to a VM in the cloud
- Use proper logging