diff --git a/README.md b/README.md
index 9c46e91..ad2b531 100644
--- a/README.md
+++ b/README.md
@@ -82,6 +82,8 @@ After installed Hermione:
 
 ![](https://cdn-images-1.medium.com/max/800/1*U3ToR5jDjQJihT9EnxeDdg.png)
 
+Do you want to create your **project from scratch**? Then click [here](tutorial_base.md) to check a tutorial.
+
 ## Documentation
 
 This is the class structure diagram that Hermione relies on:
diff --git a/images/create.png b/images/create.png
new file mode 100644
index 0000000..d5fb5d4
Binary files /dev/null and b/images/create.png differ
diff --git a/images/env.png b/images/env.png
new file mode 100644
index 0000000..8c668d4
Binary files /dev/null and b/images/env.png differ
diff --git a/images/mlflow.png b/images/mlflow.png
new file mode 100644
index 0000000..313de9a
Binary files /dev/null and b/images/mlflow.png differ
diff --git a/images/requirements.png b/images/requirements.png
new file mode 100644
index 0000000..362e56e
Binary files /dev/null and b/images/requirements.png differ
diff --git a/tutorial_base.md b/tutorial_base.md
new file mode 100644
index 0000000..84f61c4
--- /dev/null
+++ b/tutorial_base.md
@@ -0,0 +1,137 @@
+
+
+## Start your project from scratch with Hermione
+
+1. After you have installed Hermione, you need to create your project:
+
+![](images/create.png)
+
+2. Activate its environment:
+
+ ![](images/env.png)
+
+3. Now install the default libraries listed in the ``requirements.txt`` file:
+
+![](images/requirements.png)
+
+4. When you create a project with Hermione, it also creates a configuration file at ``src/config/config.json``. This file holds the project settings; if necessary, you can change it by adding new fields or editing existing ones:
+
+```json
+{
+"project_name": "project_scratch",
+"env_path": "project_scratch/project_scratch_env",
+"files_path": "../data/raw/",
+"key": "<<<>>>",
+"user": "<<<>>>"
+}
+```
+5. The first step in creating our project is to load the data. In this tutorial we will use the [Titanic dataset](https://www.kaggle.com/c/titanic/data). We load it in the ``get_data`` method of the ``Spreadsheet`` class, as in the example below:
+
+```python
+def get_data(self, path)->pd.DataFrame:
+    """
+    Returns a flat table as a DataFrame
+    Parameters
+    ----------
+    path : str
+           Path to the CSV file to be loaded
+    Returns
+    -------
+    pd.DataFrame
+        DataFrame with data
+    """
+    return pd.read_csv(path)[['Survived', 'Pclass', 'Sex', 'Age']]
+```
+6. Then we need to pre-process our data. This is done in the ``process`` method of the ``Preprocessing`` class:
+```python
+def process(self, df: pd.DataFrame):
+    """
+    Perform data cleansing.
+    Parameters
+    ----------
+    df  :   pd.DataFrame
+            DataFrame to be processed
+
+    Returns
+    -------
+    pd.DataFrame
+        Cleaned DataFrame
+    """
+    print("Cleaning data")
+    df_copy = df.copy()
+    df_copy['Pclass'] = df_copy.Pclass.astype('object')
+    df_copy = df_copy.dropna()
+    df_copy = pd.get_dummies(df_copy)
+    return df_copy
+```
+
+Here we apply three pre-processing steps: converting the ``Pclass`` column to type ``object``, removing rows with missing values, and creating dummy variables.
+
+7. The next step is to define the algorithm that we will train. If you are going to use a sklearn algorithm, the ``TrainerSklearn`` class already has the implementation and you just need to call the ``train`` method, passing some parameters. The ``train`` method supports training with cross-validation or with a simple train/test split (all parameterized).
+If you need to use another package, just implement your own class inheriting from the ``Trainer`` class, similar to what was done in ``TrainerSklearn``:
+
+```python
+class TrainerSklearn(Trainer):
+    pass
+```
+
+8. Now that we have loaded the data, implemented the pre-processing, and have a method for training, we need to join these steps, that is, set up our execution pipeline.
+In Hermione, this is done in the **script** ``train.py``. Let's go!
+
+
+    8.1. Load the project name, defined in the config file (step 4).
+    ```python
+    with open('config/config.json', 'r') as file:
+        project_name = json.load(file)['project_name']
+    ```
+    8.2. Create an experiment in mlflow:
+    ```python
+    mlflow.set_experiment(project_name)
+    ```
+    8.3. Enter the path of the dataset to be loaded by the ``Spreadsheet`` class (step 5):
+    ```python
+    df = Spreadsheet().get_data('../data/raw/train.csv')
+    ```
+    8.4. Apply the preprocessing defined in step 6:
+    ```python
+    p = Preprocessing()
+    df = p.process(df)
+    ```
+    8.5. Define features (X) and target (y):
+    ```python
+    X = df.drop(columns=["Survived"])
+    y = df["Survived"]
+    ```
+    8.6. Define the sklearn algorithms that we will apply:
+    ```python
+    algos = [RandomForestClassifier, GradientBoostingClassifier, LogisticRegression]
+    ```
+    8.7. Now we will configure the execution of the algorithms, using the ``TrainerSklearn`` class (step 7). Here we train with mlflow so that the results can be stored and analyzed later:
+    ```python
+    for algo in algos:
+        with mlflow.start_run() as run:
+            model = TrainerSklearn().train(X, y,
+                                           classification=True,
+                                           algorithm=algo,
+                                           data_split=('cv', {'cv': 8}),
+                                           preprocessing=p)
+            mlflow.log_params({'algorithm': algo})
+            mlflow.log_metrics(model.get_metrics())
+            mlflow.sklearn.log_model(model.get_model(), 'model')
+    ```
+    8.8. Once the ``train.py`` script has been built with the previous steps, you need to run it. You can do this in two ways:
+
+    - Run the entire script in the Python console
+    - Run the ``hermione train`` command in the ``src`` folder
+
+
+9. After executing step 8.8, the models, their parameters, and their metrics are logged in mlflow. To access them, run the command below at the command prompt from inside the ``src/`` folder:
+
+```sh
+mlflow ui
+```
+10. Open the URL returned by the previous command in your preferred browser to analyze the results:
+
+![](images/mlflow.png)
+
+Done! You have now built a project from scratch using Hermione.
\ No newline at end of file