
A data linter written at the VUB AI Lab


zgornel/DataLinter


DataLinter

A data linter written in Julia at the Vrije Universiteit Brussel.


Installation

The recommended way to install DataLinter is to pull the Docker image:

$ docker pull ghcr.io/zgornel/datalinter-compiled:latest

This downloads a Docker image containing the compiled version of the data linter. For development, one can download the repository and build the Docker image separately if needed.

Note: Before running the linter, make sure that all the relevant host directories are mapped into the Docker container. Check the Dockerfile of the image to see which directories are available inside the container (they are created with the mkdir -p commands).
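As a quick way to follow the note above, the small helper below lists the directories that a Dockerfile creates with mkdir -p. It is only a sketch: the function name list_container_dirs is made up for illustration, and it assumes the Dockerfile sits at the path you pass in (by default ./Dockerfile) and that each mkdir -p call is written on a single line.

```shell
# Hypothetical helper (not part of DataLinter): print every directory that a
# Dockerfile creates with `mkdir -p`, one per line.
# Usage: list_container_dirs [path/to/Dockerfile]
list_container_dirs() {
    dockerfile="${1:-Dockerfile}"
    # Keep only the `mkdir -p ...` fragments, stopping at `&`, `;` or `\`
    grep -o 'mkdir -p [^&;\\]*' "$dockerfile" \
        | sed 's/^mkdir -p //' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}
```

Each printed path is a candidate target for a --volume mapping when starting the container.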

Running the linter

To perform a sample run on the test dataset shipped with the repository, run the following from the root of the repository:

$ time docker run -it --rm \
    --volume=./test/data:/_data \
    --volume=./config:/_config \
    ghcr.io/zgornel/datalinter-compiled:latest \
        /datalinter/bin/datalinter /_data/data.csv \
            --config-path /_config/default.toml \
            --log-level warn

The output should look something like:

┌ Warning: Could not load KB@. Returning empty Dict().
└ @ DataLinter.KnowledgeBaseNative ~/.julia/packages/DataLinter/5mybQ/src/kb.jl:22
• info  (tokenizable_string)    column: x6           the values of 'column: x6' could be tokenizable i.e. contain spaces
• info  (tokenizable_string)    column: x8           the values of 'column: x8' could be tokenizable i.e. contain spaces
• info  (large_outliers)        column: x1           the values of 'column: x1' contain large outliers
! warn  (int_as_float)          column: x4           the values of 'column: x4' are floating point but can be integers
! warn  (enum_detector)         column: x5           just a few distinct values in 'column: x5', it could be an enum
! warn  (enum_detector)         column: x8           just a few distinct values in 'column: x8', it could be an enum
! warn  (enum_detector)         column: x4           just a few distinct values in 'column: x4', it could be an enum
! warn  (empty_example)         row: 10              the example at 'row: 10' looks empty
! warn  (empty_example)         row: 11              the example at 'row: 11' looks empty
! warn  (uncommon_signs)        column: x1           uncommon signs (+/-/NaN/0) present in 'column: x1'
! warn  (long_tailed_distrib)   column: x1           the distribution for 'column: x1' has 'long tails'
11 issues found from 14 linters applied (13 OK, 1 N/A) .
docker run -it --rm --volume=./test/data:/_data --volume=./config:/_config     0.02s user 0.01s system 0% cpu 4.197 total
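For longer reports it can help to condense the findings. The sketch below counts findings per severity and linter name. It assumes the report has the layout shown above (a marker, the severity, then the linter name in parentheses) and that it was saved to a file first; the function name summarize_lint and the file name lint.log are made up for illustration.

```shell
# Hypothetical helper (not part of DataLinter): count findings per
# severity/linter pair from a linter report read on standard input.
summarize_lint() {
    # Finding lines have the severity in field 2 and "(linter_name)" in
    # field 3; log lines, the final summary, and timing lines do not match.
    awk '$3 ~ /^\(.+\)$/ { n[$2 " " $3]++ }
         END { for (k in n) print n[k], k }' \
        | sort -k2,3
}
```

With the report stored in lint.log, `summarize_lint < lint.log` prints one line per severity and linter pair, prefixed by its count.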

Using the script

The linter can be run quickly through the datalinter.sh shell script. To run it on the test dataset:

$ ./datalinter.sh ./test/data/data.csv

The script can be run from any directory and accepts a single argument: the dataset to be linted.
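To illustrate how a wrapper can work from any directory, here is a minimal sketch that resolves the dataset location to an absolute path before mapping it into the container. It is not the content of datalinter.sh: the function name lint_cmd and its second argument (the directory holding default.toml) are assumptions made for illustration, while the image name, mount points, and linter flags are copied from the docker command shown earlier. The function only prints the command so that it can be inspected before being run.

```shell
# Hypothetical sketch, NOT the repository's datalinter.sh.
# Prints a docker command that lints one dataset.
#   $1  path to the dataset, relative or absolute
#   $2  directory containing default.toml (assumed layout, like ./config)
lint_cmd() {
    image="ghcr.io/zgornel/datalinter-compiled:latest"
    # Resolve both directories to absolute paths so the caller's working
    # directory does not matter
    data_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    config_dir=$(cd "$2" && pwd) || return 1
    data_file=$(basename "$1")
    printf 'docker run --rm --volume=%s:/_data --volume=%s:/_config %s /datalinter/bin/datalinter /_data/%s --config-path /_config/default.toml --log-level warn\n' \
        "$data_dir" "$config_dir" "$image" "$data_file"
}
```

For example, `lint_cmd ./test/data/data.csv ./config` prints the full command with absolute host paths. Paths containing spaces would need extra quoting before the printed command is executed.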

License

This code has a GPL license and is therefore free as in beer.

Reporting Bugs

Please file an issue to report a bug or request a feature.

References

[1] https://en.wikipedia.org/wiki/Lint_(software)

[2] A data linter written by Google

Acknowledgements

The initial version of DataLinter was inspired by this work from Google Brain research.
