Bayesian format obsolescence modeling #116
Open
Introduces a new digital preservation plot which, over a long enough period of time, may reveal a way of predicting format obsolescence across a repository.
This PR also introduces a new plotting library to AIPscan, Seaborn, which offers the type of rug plot needed to approximate @nkrabben's original work, on which this PR is based.
Seaborn plots are saved as PNG images. To render these in AIPscan we write them into memory and then encode the byte stream as Base64, which can be embedded directly in HTML. This approach might be useful for other plot types and libraries in AIPscan in the future. Seaborn also looks like a powerful visualization tool in its own right.
With only a short period of time to polish this off, it is unlikely to meet all of the code quality requirements needed to merge. Hopefully the work is close enough for whoever picks it up: some tests are included, and it follows most of the basic principles used so far in AIPscan.
For testing, the cURL for the API endpoint is:
curl -X GET "http://127.0.0.1:5000/api/report-data/bayesian-format-modeling/1" -H "accept: application/json" | python -m json.tool
Approximate time to write report code: 9 hrs + added time to learn Seaborn.
A note on distributions: The data I have available is not authentic enough to demonstrate realistic distribution patterns. It was largely downloaded from sources such as other GitHub repositories, so the dates are usually clumped on the day that something was downloaded, with a few organic outliers on the side. I am still investigating synthetic methods for simulating data, but ultimately I am looking forward to seeing charts like these generated from a real repository.