Statistical & ML Tool Kits for Noisy Data Classification Problems

KevinLiao159/klearn

Klearn: Data Science and Machine Learning Tool Kits for Kagglers

Good Job!!! I am glad that you just found Klearn.

Klearn is a Python module that speeds up data science and machine learning research workflows. It embraces best data science practices and aims to empower data scientists. It bundles several of the most commonly used data science modules, including but not limited to an EDA module, a feature engineering module, cross-validation strategies, hold-out data scoring, and model ensembling.

Klearn is compatible with: Python 2.7-3.6.


Some principles

  • User friendliness. Klearn is designed for data science beginners. Klearn follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear and actionable feedback upon user error.

  • Modularity. A data science research project is understood as a sequence of tasks including EDA, feature engineering, and model selection/benchmarking. Each module in Klearn is responsible for one task in a data scientist's routine research workflow.

  • Easy extensibility. New modules are simple to add (as new classes and functions), and existing modules provide ample examples. Being able to create new modules easily allows for total expressiveness, making Klearn suitable for advanced research.

  • Work with Python. No separate model configuration files in a declarative format. Models are described in Python code, which is compact, easier to debug, and easy to extend.


Module structure

The main modules of the Klearn API are:

  • datasets, which is responsible for dumping data in specific formats
  • eda, which is responsible for data visualization and exploratory analysis
  • ensemble, which is responsible for combining models together
  • model_selection, which holds cross-validation strategy classes and scoring functions
  • models, which holds higher-level wrappers for machine learning models
  • preprocessing, which is responsible for data cleaning and feature engineering
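
To make "dumping data in specific formats" concrete, here is a minimal sketch of writing samples in the libffm text format (`label field:feature:value ...`), which is the format the datasets module name `libffm_format.py` suggests. This is not Klearn's API: the function name `to_libffm_line`, its arguments, and the sample data are all hypothetical and chosen for illustration only.

```python
def to_libffm_line(label, row, field_index, feature_index):
    """Format one sample as 'label field:feature:value ...' (hypothetical helper).

    row maps a column name to its value, field_index maps each column to an
    integer field id, and feature_index maps (column, category) pairs to
    integer feature ids, growing as new pairs are seen.
    """
    parts = [str(label)]
    # Visit columns in field-id order so the output is deterministic.
    for column in sorted(row, key=field_index.get):
        value = row[column]
        field = field_index[column]
        if isinstance(value, str):
            # Categorical column: one feature id per category, value fixed to 1.
            feature = feature_index.setdefault((column, value), len(feature_index))
            parts.append("{}:{}:1".format(field, feature))
        else:
            # Numeric column: one feature id per column, raw value kept.
            feature = feature_index.setdefault((column, None), len(feature_index))
            parts.append("{}:{}:{}".format(field, feature, value))
    return " ".join(parts)


# Made-up sample data, for illustration only.
rows = [{"color": "red", "price": 3.5}, {"color": "blue", "price": 1.0}]
labels = [1, 0]
field_index = {"color": 0, "price": 1}
feature_index = {}

lines = [to_libffm_line(y, r, field_index, feature_index)
         for y, r in zip(labels, rows)]
for line in lines:
    print(line)
```

Sharing one `feature_index` across all rows keeps feature ids consistent between samples, which matters if the same mapping is later reused for a test set.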

The complete file-structure for the project is as follows:

klearn/
    klearn/
        datasets/
            libffm_format.py
        eda/
            eda.py
            plotly.py
            seaborn.py
        ensemble/
            dispatch.py
            ensemble.py
        model_selection/
            metrics.py
            scorers.py
            split.py
        models/
            modifiers.py
            trainers.py
            transformers.py
        preprocessing/
            cleaners.py
            features.py
            targets.py
        logger.py
        utils.py
    images/
        ...random stuff

    README.md
    LICENSE
    requirements.txt
    setup.py
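
As a rough illustration of the kind of work the model_selection module covers (cross-validation strategies and hold-out scoring), the sketch below sets aside a hold-out set and then splits the remaining indices into k folds, using only the standard library. It does not reflect what `split.py` or `scorers.py` actually contain: the function names `train_holdout_split` and `k_fold` and their parameters are assumptions made for this example.

```python
import random


def train_holdout_split(n_samples, holdout_fraction=0.2, seed=0):
    """Shuffle sample indices and reserve a fraction as a hold-out set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_samples * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def k_fold(indices, n_splits=5):
    """Yield (train, validation) index lists, one pair per fold."""
    # Striding by n_splits gives folds whose sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, holdout_idx = train_holdout_split(100)
folds = list(k_fold(train_idx, n_splits=5))
for fold_number, (train, validation) in enumerate(folds):
    print(fold_number, len(train), len(validation))
```

The hold-out indices never appear in any fold, so a model tuned with cross-validation can still be scored once on data it has not seen.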

Installation

  • Install Klearn from PyPI (NOT supported for now):
sudo pip install klearn
  • Alternatively: install Klearn from the GitHub source (recommended):

First, clone Klearn using git:

git clone https://github.com/KevinLiao159/klearn.git

Then, cd to the Klearn folder and run the install command:

cd klearn
sudo python setup.py install
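
After installing, a quick way to check that the package can be imported is a sketch like the one below. It assumes the import name is `klearn`, matching the package folder shown in the file structure. The helper `is_installed` is hypothetical and not part of Klearn.

```python
import importlib


def is_installed(module_name):
    """Return True if the named module can be imported, else False."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Assumption: the package is importable as "klearn".
print("klearn importable:", is_installed("klearn"))
```

If this prints `False` right after a successful install, the install may have gone to a different Python interpreter than the one running the check.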
