DataLinter

A data linter written in Julia at the Vrije Universiteit Brussel.

Installation

The recommended way to install DataLinter is to pull the Docker image:

$ docker pull ghcr.io/zgornel/datalinter-compiled:latest

This downloads a Docker image containing the compiled version of the data linter. For development, one can download the repository and build the Docker image separately if needed.

Note: Before running the linter, make sure that the Docker container has mapped all the relevant directories. Check out the Dockerfile of the image to see what directories are available inside the container (created with the mkdir -p commands).
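One way to list those directories is sketched below as a small shell helper. The function name list_container_dirs is hypothetical, and the Dockerfile path is passed as an argument because its location in the repository is not stated here. The sketch also assumes the directories are listed on the same line as the mkdir -p command.

```shell
# Hypothetical helper: print the directories a Dockerfile creates via `mkdir -p`.
# $1 is the path to the Dockerfile (its location is an assumption, so it is a parameter).
list_container_dirs() {
    # Keep only the `mkdir -p ...` part of each line, up to the next &&, ; or line continuation.
    grep -o 'mkdir -p [^&;\\]*' "$1" |
        sed 's/^mkdir -p //' |   # drop the command itself
        tr -s ' ' '\n' |         # one directory per line
        sed '/^$/d'              # drop empty lines
}
```

It would be called as `list_container_dirs path/to/Dockerfile`; each printed path is a candidate target for a --volume mapping.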

Running the linter

To perform a sample run on the test dataset, run the following from the root of the repository:

$ time docker run -it --rm \
    --volume=./test/data:/_data \
    --volume=./config:/_config \
    ghcr.io/zgornel/datalinter-compiled:latest \
        /datalinter/bin/datalinter /_data/data.csv \
            --config-path /_config/default.toml \
            --log-level warn

The output should look something like:

┌ Warning: Could not load KB@. Returning empty Dict().
└ @ DataLinter.KnowledgeBaseNative ~/.julia/packages/DataLinter/5mybQ/src/kb.jl:22
• info  (tokenizable_string)    column: x6           the values of 'column: x6' could be tokenizable i.e. contain spaces
• info  (tokenizable_string)    column: x8           the values of 'column: x8' could be tokenizable i.e. contain spaces
• info  (large_outliers)        column: x1           the values of 'column: x1' contain large outliers
! warn  (int_as_float)          column: x4           the values of 'column: x4' are floating point but can be integers
! warn  (enum_detector)         column: x5           just a few distinct values in 'column: x5', it could be an enum
! warn  (enum_detector)         column: x8           just a few distinct values in 'column: x8', it could be an enum
! warn  (enum_detector)         column: x4           just a few distinct values in 'column: x4', it could be an enum
! warn  (empty_example)         row: 10              the example at 'row: 10' looks empty
! warn  (empty_example)         row: 11              the example at 'row: 11' looks empty
! warn  (uncommon_signs)        column: x1           uncommon signs (+/-/NaN/0) present in 'column: x1'
! warn  (long_tailed_distrib)   column: x1           the distribution for 'column: x1' has 'long tails'
11 issues found from 14 linters applied (13 OK, 1 N/A) .
docker run -it --rm --volume=./test/data:/_data --volume=./config:/_config     0.02s user 0.01s system 0% cpu 4.197 total
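To get a quick per-severity count from output like the above, a filter along the following lines may help. It is a minimal sketch: summarise_lint is a hypothetical helper, not part of DataLinter, and it assumes each finding line carries the severity as its second field and the parenthesised linter name as its third, as in the sample output.

```shell
# Hypothetical helper: count linter findings per severity.
# Reads linter output on stdin and prints "<severity> <count>" lines, sorted by severity.
summarise_lint() {
    awk '
        # A finding line looks like: "! warn  (int_as_float)  column: x4 ..."
        # Log lines and the final summary line do not have "(name)" as third field.
        $3 ~ /^\(.*\)$/ { count[$2]++ }
        END { for (s in count) printf "%s %d\n", s, count[s] }
    ' | sort
}
```

If the linter output has been saved to a file (lint-output.txt is a placeholder name), the helper can be used as `summarise_lint < lint-output.txt`.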

Using the script

The linter can be run quickly through the datalinter.sh shell script. To run it on the test dataset, one can do:

$ ./datalinter.sh ./test/data/data.csv

The script can be run from any directory and accepts a single argument: the dataset to be linted.
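For readers who want to see how such a wrapper could work, here is a hedged sketch. It is not the contents of the real datalinter.sh: the function name lint_cmd and the CONFIG_DIR variable are assumptions made for illustration, and the function only prints the docker command instead of running it. The image name, container paths and flags are taken from the sample run above.

```shell
# Hypothetical sketch; NOT the real datalinter.sh.
# Prints the docker command that would lint the dataset given as $1.
lint_cmd() {
    if [ "$#" -ne 1 ]; then
        echo "usage: lint_cmd <dataset>" >&2
        return 1
    fi
    # Resolve the dataset directory to an absolute path so the volume
    # mapping works regardless of the current working directory.
    data_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    data_file=$(basename "$1")
    # CONFIG_DIR is an assumed variable: a directory containing default.toml.
    config_dir=${CONFIG_DIR:-$PWD/config}
    printf 'docker run -it --rm --volume=%s:/_data --volume=%s:/_config %s' \
        "$data_dir" "$config_dir" "ghcr.io/zgornel/datalinter-compiled:latest"
    printf ' /datalinter/bin/datalinter /_data/%s' "$data_file"
    printf ' --config-path /_config/default.toml --log-level warn\n'
}
```

Mounting the dataset's own directory, rather than a path relative to the repository root, is what lets a wrapper of this kind be called from anywhere.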

License

This code is released under a GPL license and is therefore free as in beer.

Reporting Bugs

Please file an issue to report a bug or request a feature.

References

[1] https://en.wikipedia.org/wiki/Lint_(software)

[2] A data linter written by Google

Acknowledgements

The initial version of DataLinter was fully inspired by the data linter written by Google Brain research [2].
