
dasch-swiss/fileidentification


a script to identify file formats and convert them if necessary

disclaimer: it has a lot of dependencies, i.e. siegfried to identify the files, and ffmpeg, imagemagick and LibreOffice if you want to test and convert files. they are useful anyway, so you can install them on macOS using brew (inkscape is optional, but if you want it, install it before imagemagick):

brew install richardlehane/digipres/siegfried
brew install ffmpeg
brew install --cask inkscape
brew install imagemagick
brew install ghostscript
brew install --cask libreoffice

or for Linux, depending on your distribution:

siegfried: https://github.com/richardlehane/siegfried/wiki/Getting-started
ffmpeg: apt-get install ffmpeg
inkscape: https://inkscape.org/de/release/inkscape-1.2/gnulinux/ubuntu/ppa/dl/
imagemagick: https://imagemagick.org/script/download.php#linux
LibreOffice: https://www.libreoffice.org/download/download-libreoffice/

it is not optimised for speed, especially when converting files. the idea was to write a script that has some default file conversion but is at the same time highly customisable.

the script turns the output from siegfried into a SfInfo dataclass instance per file, looks up the policies defined in conf/policies.py and writes out a default policies.json. in a second iteration, it applies the policies and probes each file (whether it is corrupt, and whether its format is accepted or needs to be converted). then it converts the files flagged for conversion and verifies their output.

it writes all relevant metadata to a log.json containing a sequence of SfInfo objects that are enriched with the file (conversion) processing logs, so all file manipulation and format issues are logged.
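
To make the first step more concrete, here is a minimal sketch of how siegfried's JSON report could be mapped onto one record per file. The class SimpleSfInfo and the function parse_siegfried are hypothetical and much simpler than the script's real SfInfo dataclass; only the report shape (a "files" list with "matches" entries) follows what siegfried emits with its -json option.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleSfInfo:
    """Hypothetical, simplified stand-in for the script's SfInfo dataclass."""
    filename: str
    filesize: int
    puid: str            # PRONOM identifier, e.g. "fmt/13"
    format_name: str
    processing_logs: List[str] = field(default_factory=list)


def parse_siegfried(raw: str) -> List[SimpleSfInfo]:
    """Turn the text of a siegfried JSON report into one record per file."""
    report = json.loads(raw)
    infos = []
    for entry in report.get("files", []):
        # use the first match; fall back to an empty dict if there is none
        matches = entry.get("matches") or [{}]
        first = matches[0]
        infos.append(SimpleSfInfo(
            filename=entry["filename"],
            filesize=entry.get("filesize", 0),
            puid=first.get("id", "UNKNOWN"),
            format_name=first.get("format", ""),
        ))
    return infos


# shortened sample report (made-up values, real field names)
SAMPLE = json.dumps({
    "siegfried": "1.11.0",
    "files": [{
        "filename": "testdata/image.png",
        "filesize": 1024,
        "matches": [{"ns": "pronom", "id": "fmt/13",
                     "format": "Portable Network Graphics"}],
    }],
})

infos = parse_siegfried(SAMPLE)
print(infos[0].puid, infos[0].format_name)
```

The puid is the value that is later used as the key to look up a policy.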

installation

If you don't have uv installed, install it with

curl -LsSf https://astral.sh/uv/install.sh | sh

Then, you can use uv run to run the fileidentification script:

uv run identify.py testdata

By prepending uv run before identify.py, uv executes the script in a virtual environment that contains all necessary python dependencies. (The virtual environment can optionally be activated with source .venv/bin/activate, but this is not necessary when using uv run.)

Usage

in your terminal, switch to the root directory of fileidentification and run the script with
uv run identify.py path/to/folder or file
or activate the virtual environment and run
python3 identify.py path/to/folder or file

to get an overview of the options

uv run identify.py --help

generating policies

uv run identify.py path/to/directory

this generates a default policies file according to the settings in conf/policies.py
you get:
path/to/directory_policies.json -> the policies for that folder
path/to/directory_log.json -> the technical metadata of all the files in the folder

if you run it against a single file ( path/to/file.ext ), the json files are located in the parent folder of that file:
path/to.file_policies.json
path/to.file_log.json
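
If you want to find these files from your own code, the naming for the directory case can be sketched as below. The helper sidecar_paths is hypothetical and only mirrors the directory naming described above; the single-file case is left out.

```python
from pathlib import Path
from typing import Tuple


def sidecar_paths(directory: str) -> Tuple[Path, Path]:
    """Return (policies, log) paths that sit next to the given directory."""
    root = Path(directory)
    policies = root.with_name(root.name + "_policies.json")
    log = root.with_name(root.name + "_log.json")
    return policies, log


policies_path, log_path = sidecar_paths("path/to/directory")
print(policies_path.as_posix())
print(log_path.as_posix())
```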

file integrity tests

uv run identify.py path/to/directory -i

tests the files for their integrity and moves corrupted files to the folder path/to/directory_WORKINGDIR/_REMOVED. the affected SfInfos in the log.json are flagged as removed

you can also add the flag -v (--verbose) for a more detailed inspection. (see options below)

applying the policies

if you're happy with the policies, you can apply them with

uv run identify.py path/to/directory -a

you get the converted files in path/to/directory_WORKINGDIR (default) with the log.txt next to it.

You can set the path of the working directory either with the option -w path/to/workingdir (see options below) or change it permanently in conf/settings.py

this might be helpful if your files are on an external drive

remove temp

if you're happy with the outcome you can run

uv run identify.py path/to/directory -r

this deletes all temporary folders and moves the converted files next to their parents.

if you don't want to keep the parents of the converted files, you can add the flag -x (--remove-original). this replaces the parent files with the converted ones. see options below.

if you don't need these intermediate states and, for example, also want the script to run in verbose mode with a temporary custom working directory, you can simply combine those flags

uv run identify.py path/to/directory -ariv -w path/to/workingdirectory

which does it all at once.

log, status and output

the path/to/directory_log.json keeps track of all modifications and appends logs of what changed in the folder.
e.g. if a file got removed from the folder, the respective SfInfo object of that file in the folder's log.json gets an entry "status": {"removed": true}, so it is documented that this file was once in the folder but is not anymore. it's kind of a simple database.
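
Because the log behaves like a small database, it can be queried with a few lines of Python. The sketch below assumes the entries are available as a list of dicts that each carry a "filename" key next to the "status" entry quoted above; the exact layout of the real log.json may differ, so treat the key names other than "status" and "removed" as assumptions.

```python
from typing import Dict, List


def removed_files(entries: List[Dict]) -> List[str]:
    """Return the filenames of all entries flagged as removed."""
    return [
        entry["filename"]
        for entry in entries
        if entry.get("status", {}).get("removed")
    ]


# made-up sample entries in the assumed layout
sample_log = [
    {"filename": "a.tif", "status": {"removed": True}},
    {"filename": "b.png", "status": {}},
    {"filename": "c.wav"},
]

print(removed_files(sample_log))
```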

if you want a simpler csv output, you can add the flag --csv any time you run the script, which converts the log.json with the current status of the directory to a csv. in addition, you also get a mapping file (in case you need to replace the paths of converted files somewhere else)

moving the directory
as long as you keep the files directory_log.json and directory_log.json.sha256 (needed to verify the log) in the same path/to/ as the directory, you can move the directory anywhere between the steps.
(optionally also the directory_policies.json if it is specific to that folder. otherwise a new one gets created)

you should not move the WORKINGDIR between applying the policies (-a) and removing the tmp files (-r), though.
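
If you want to check by hand that the log still matches its checksum after moving things around, a sketch like the following may help. It assumes that the .sha256 file holds the hex digest as its first whitespace-separated token, which is not confirmed by this README; the function name log_is_unchanged is made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def log_is_unchanged(log_path: Path) -> bool:
    """Compare the SHA-256 of the log with the digest stored next to it."""
    checksum_path = log_path.with_name(log_path.name + ".sha256")
    # assumption: first token in the checksum file is the hex digest
    expected = checksum_path.read_text().split()[0].lower()
    actual = hashlib.sha256(log_path.read_bytes()).hexdigest()
    return actual == expected


# self-contained demo in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "directory_log.json"
    log.write_bytes(b"[]")
    checksum = log.with_name(log.name + ".sha256")
    checksum.write_text(hashlib.sha256(b"[]").hexdigest())

    ok_before = log_is_unchanged(log)   # log and digest match
    log.write_bytes(b"[{}]")            # simulate a manual edit of the log
    ok_after = log_is_unchanged(log)    # digest no longer matches

print(ok_before, ok_after)
```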

advanced usage

you can also create your own policies file and, with that, customise the file conversion output (and the execution steps of the script). simply edit the default file path/to/directory_policies.json before applying.
if you want to start from scratch, you can create a blank template with all the file formats encountered in the folder with uv run identify.py path/to/folder -b

policy examples:
a policy for Audio/Video Interleaved Format that needs to be transcoded to MPEG-4 Media File (Codec: AVC/H.264, Audio: AAC) looks like this

{
    "fmt/5": {
        "format_name": "Audio/Video Interleaved Format",
        "bin": "ffmpeg",
        "accepted": false,
        "remove_original": false,
        "target_container": "mp4",
        "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
        "expected": [
            "fmt/199"
        ]
    }
}

a policy for Portable Network Graphics that is accepted as it is, but gets tested

{
    "fmt/13": {
        "format_name": "Portable Network Graphics",
        "bin": "magick",
        "accepted": true
    }
}
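
To illustrate how the fields of a conversion policy fit together, here is a rough sketch that assembles an ffmpeg call from the fmt/5 policy above. How the script actually builds its commands is not described in this README, so build_ffmpeg_command is a hypothetical helper; it only relies on ffmpeg's usual "ffmpeg -i input [options] output" form.

```python
import shlex
from pathlib import Path
from typing import Dict, List


def build_ffmpeg_command(policy: Dict, source: str, out_dir: str) -> List[str]:
    """Assemble an ffmpeg argument list from a policy (illustrative only)."""
    src = Path(source)
    # the converted file keeps its stem and gets the target container as suffix
    target = Path(out_dir) / f"{src.stem}.{policy['target_container']}"
    return [
        policy["bin"],
        "-i", str(src),
        *shlex.split(policy["processing_args"]),
        str(target),
    ]


avi_policy = {
    "bin": "ffmpeg",
    "accepted": False,
    "remove_original": False,
    "target_container": "mp4",
    "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
    "expected": ["fmt/199"],
}

cmd = build_ffmpeg_command(avi_policy, "videos/clip.avi", "videos_WORKINGDIR")
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run; after the conversion, the output would be identified again and compared with the puids in expected.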



key is the puid (fmt/XXX)
format_name (optional) str
bin str: program used to convert or test the file (testing is currently only supported for image/audio/video, i.e. ffmpeg and imagemagick)
accepted bool: false if the file needs to be converted
remove_original (required if not accepted) bool: whether to remove the parent of the converted file from the directory, default is false
target_container (required if not accepted) str: the container the file needs to be converted to
processing_args (required if not accepted) str: the arguments used with bin
expected (required if not accepted) list: the expected file format for the converted file



accepted values for bin are:

"" no program used
magick use imagemagick
ffmpeg use ffmpeg
soffice use LibreOffice
inkscape use inkscape
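
Before applying a hand-edited policies file, it can be useful to check it against the rules listed above. The following is a minimal sketch that takes those rules literally; check_policies is a hypothetical helper and not part of the script, which has its own test option (-t).

```python
from typing import Dict, List

ALLOWED_BINS = {"", "magick", "ffmpeg", "soffice", "inkscape"}
REQUIRED_IF_NOT_ACCEPTED = (
    "remove_original", "target_container", "processing_args", "expected",
)


def check_policies(policies: Dict[str, Dict]) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for puid, policy in policies.items():
        if policy.get("bin") not in ALLOWED_BINS:
            problems.append(f"{puid}: unknown bin {policy.get('bin')!r}")
        if not isinstance(policy.get("accepted"), bool):
            problems.append(f"{puid}: 'accepted' must be true or false")
            continue
        if not policy["accepted"]:
            # conversion policies need the additional keys
            for key in REQUIRED_IF_NOT_ACCEPTED:
                if key not in policy:
                    problems.append(f"{puid}: missing {key!r}")
    return problems


sample_policies = {
    "fmt/13": {"bin": "magick", "accepted": True},
    "fmt/5": {
        "bin": "ffmpeg",
        "accepted": False,
        "remove_original": False,
        "target_container": "mp4",
        "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
        # "expected" is left out on purpose
    },
}

print(check_policies(sample_policies))
```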



you can test an entire policies file (given that the path is path/to/directory_policies.json, otherwise pass the path to the file with -p) with

uv run identify.py path/to/directory -t

if you just want to test a specific policy, append f to the flag and pass the puid

uv run identify.py path/to/directory -tf fmt/XXX

the test conversions are located in _WORKINGDIR/_TEST

presets

once you've finished working on a customised policies file, you might want to save it as a preset (given that the path of the policies is path/to/directory_policies.json)

uv run identify.py path/to/directory -S

you can reuse it on another folder with the flag -p:

uv run identify.py path/to/directory -p presets/yourSavedPoliciesName

if you're not sure what files to expect in that folder and you don't want formats that are missing from the preset to be skipped during processing, you can add the flag -e. it expands the policies with blank ones for formats that are not yet covered and writes a new file path/to/directory_policies.json, which you can adjust before applying.

uv run identify.py path/to/directory -ep presets/yourSavedPoliciesName
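
Conceptually, extending a preset means adding an empty entry for every puid that is found in the folder but missing from the policies. The sketch below shows that idea only; the shape of the blank entry ("bin" empty, "accepted" false) is an assumption, and extend_policies is a hypothetical name, not the script's function.

```python
from typing import Dict, Iterable


def extend_policies(policies: Dict[str, Dict], found_puids: Iterable[str]) -> Dict[str, Dict]:
    """Return a copy of the policies with blank entries for missing puids."""
    extended = dict(policies)
    for puid in found_puids:
        # existing policies are kept untouched, only gaps are filled
        extended.setdefault(puid, {"bin": "", "accepted": False})
    return extended


preset = {"fmt/13": {"bin": "magick", "accepted": True}}
extended = extend_policies(preset, ["fmt/13", "fmt/5"])
print(sorted(extended))
```

The blank entries would then be filled in by hand before applying the policies.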

default settings

the default settings for file conversion are in conf/policies.py; you can add or modify the entries there. all other settings, such as default path values or the hash algorithm, are in conf/settings.py

options

-i
[--integrity-tests] tests the files for their integrity

-v
[--verbose] catches more warnings on video and image files during the integrity tests. this can take significantly longer, depending on what files you have. in addition, it handles some warnings as errors, e.g. it moves images that have an incomplete data stream into the _REMOVED folder

-a
[--apply] applies the policies

-r
[--remove-tmp] removes all temporary items and adds the converted files next to their parents.

-x
[--remove-original] this overrides the remove_original value in the policies and sets it to true when removing the tmp files. the original files are moved to the _REMOVED folder in the WORKINGDIR.
when used in generating policies, it sets remove_original in the policies to true (default false)

-p path/to/policies.json
[--policies-path] load a custom policies json file

-w path/to/workingdir
[--working-dir] set a custom working directory. default is path/to/directory_WORKINGDIR

-s
[--strict] when run in strict mode, it moves the files that are not listed in policies.json to the folder _REMOVED (instead of throwing a warning)
when used in generating policies, it does not add blank ones for formats that are not mentioned in conf/policies.py

-b
[--blank] creates a blank policies file based on the files encountered in the given directory

-e
[--extend-policies] appends the file types found in the directory to the given policies if they are missing.

-S
[--save-policies] save the policies in presets

-q
[--quiet] just print errors and warnings

--csv
get an additional output as csv aside from the log.json

--convert
re-convert the files that failed during file conversion

iterations

as the SfInfo objects of converted files have a derived_from attribute that is again a SfInfo object of its parent, and an existing log.json is extended if a folder is run against a different policy, the log.json keeps track of all iterations.
so iterations like A -> B, B -> C, ... are logged in one log.json.

e.g. you have different types of doc and docx files in a folder, you don't allow doc (they get deleted) and you want a pdf in addition to the docx files.
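
The nesting via derived_from can be followed back to the original file. Here is a small sketch, assuming each log entry is a dict with a "filename" key and an optional nested "derived_from" entry; the key "filename" and the helper conversion_chain are assumptions made for the example.

```python
from typing import Dict, List


def conversion_chain(entry: Dict) -> List[str]:
    """List the filenames from the oldest ancestor to the given entry."""
    chain = [entry["filename"]]
    parent = entry.get("derived_from")
    while parent:
        chain.append(parent["filename"])
        parent = parent.get("derived_from")
    return list(reversed(chain))


# made-up entry for two iterations: A -> B, then B -> C
entry = {
    "filename": "C",
    "derived_from": {
        "filename": "B",
        "derived_from": {"filename": "A"},
    },
}

print(" -> ".join(conversion_chain(entry)))
```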

using it in your code

as long as you have all the dependencies installed, run python version >=3.8 and have typer installed in your project, you can copy the fileidentification folder into your project folder and import the FileHandler into your code

from fileidentification.filehandling import FileHandler


# this runs it with default parameters (flags -ivarq), but change the parameters to your needs
fh = FileHandler()
fh.run("path/to/directory")


# or if you just want to do integrity tests
fh = FileHandler()
fh.integrity_tests("path/to/directory")

# log it at some point and have an additional csv
fh.write_logs("path/where/to/log", to_csv=True)

updating signatures

uv run update.py

useful links

you'll find a good resource for looking up file formats at
https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=new

siegfried
https://www.itforarchivists.com/siegfried/

signatures
https://en.wikipedia.org/wiki/List_of_file_signatures

recommendations on which file formats to use for archiving data
kost: https://kost-ceco.ch/cms/de.html
bundesarchiv: https://www.bar.admin.ch/dam/bar/de/dokumente/konzepte_und_weisungen/archivtaugliche_dateiformate.1.pdf.download.pdf/archivtaugliche_dateiformate.pdf

NOTE: if you want to convert to pdf/A, you need LibreOffice version 7.4+
it is implemented in wrappers.wrappers.Converter and conf.models.LibreOfficePdfSettings

when you convert svg, you might run into errors, as the default svg library of imagemagick is not that good. the easiest workaround is to install inkscape ( brew install --cask inkscape ) and then reinstall imagemagick so that it uses inkscape as default for converting svg ( brew remove imagemagick , brew install imagemagick )
