1. After you have installed Hermione, you need to create your project:
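For example, using Hermione's `new` command (the project name here matches the config file shown in step 4):

```bash
hermione new project_scratch
```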
2. Activate the environment:
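A typical activation, assuming the virtual environment Hermione generates (the name follows `env_path` in the config shown in step 4):

```bash
source project_scratch/project_scratch_env/bin/activate
```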
3. Now we are going to install the default libraries listed in the `requirements.txt` file:
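With the environment active, the standard pip invocation does this:

```bash
pip install -r requirements.txt
```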
4. When you create a project with Hermione, it also creates a configuration file, which can be found at `src/config/config.json`. This file holds some project settings; if you find it necessary, you can change it, adding new fields or editing existing ones:
```json
{
    "project_name": "project_scratch",
    "env_path": "project_scratch/project_scratch_env",
    "files_path": "../data/raw/",
    "key": "<<<<key>>>>",
    "user": "<<<<user>>>>"
}
```
5. The first step in creating our project is to load the data. In this tutorial we will use the Titanic dataset. For that, we load it in the `Spreadsheet` class, in the `get_data` method, as in the example below:
```python
def get_data(self, path) -> pd.DataFrame:
    """
    Returns a flat table in a DataFrame

    Parameters
    ----------
    path : str
        path to the CSV file to be loaded

    Returns
    -------
    pd.DataFrame
        DataFrame with the data
    """
    return pd.read_csv(path)[['Survived', 'Pclass', 'Sex', 'Age']]
```
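As a quick sanity check, you can call the method directly; it should return only the four selected columns (this assumes the Kaggle Titanic `train.csv` is in `../data/raw/`):

```python
df = Spreadsheet().get_data('../data/raw/train.csv')
print(df.columns.tolist())  # ['Survived', 'Pclass', 'Sex', 'Age']
```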
6. Then we need to apply preprocessing to our data; this is done in the `Preprocessing` class, in the `process` method:
```python
def process(self, df: pd.DataFrame):
    """
    Perform data cleaning.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be processed

    Returns
    -------
    pd.DataFrame
        Cleaned DataFrame
    """
    print("Cleaning data")
    df_copy = df.copy()
    df_copy['Pclass'] = df_copy.Pclass.astype('object')
    df_copy = df_copy.dropna()
    df_copy = pd.get_dummies(df_copy)
    return df_copy
```
Here we apply three preprocessing steps: casting the `Pclass` column to the `object` type, removing rows with missing values, and creating dummy variables.
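To make the effect concrete, here is a minimal, self-contained sketch of the same three steps on a toy frame (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({'Survived': [0, 1, 1],
                    'Pclass': [3, 1, None],
                    'Sex': ['male', 'female', 'female'],
                    'Age': [22.0, 38.0, 26.0]})
toy['Pclass'] = toy.Pclass.astype('object')  # treat passenger class as categorical
toy = toy.dropna()                           # drop the row with the missing Pclass
toy = pd.get_dummies(toy)                    # one-hot encode Pclass and Sex
print(toy.columns.tolist())
# e.g. ['Survived', 'Age', 'Pclass_1.0', 'Pclass_3.0', 'Sex_female', 'Sex_male']
```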
7. The next step is to define the algorithm that we will be training. If you are going to run your model with a sklearn algorithm, the `TrainerSklearn` class already has the implementation and you just need to call its `train` method, passing some parameters. The `train` method supports training with cross-validation or with a simple train/test split (both parameterized). If you need to use another package, just implement your own class inheriting from the `Trainer` class, similar to what was implemented in `TrainerSklearn` (a sketch of such a custom trainer follows the code below):
```python
class TrainerSklearn(Trainer):

    def train(self, X, y,
              classification: bool,
              algorithm,
              preprocessing=None,
              data_split=('train_test', {'test_size': 0.2}),
              **params):
        """
        Method that builds the sklearn model

        Parameters
        ----------
        classification : bool
            if True, a classification model is trained, otherwise a regression one
        algorithm : class
            sklearn algorithm class to be instantiated
        data_split : tuple (strategy: str, params: dict)
            strategy for splitting the data to train your model.
            Strategy: ['train_test', 'cv']
            Ex: ('cv', {'cv': 9, 'agg': np.median})
        preprocessing : Preprocessing
            preprocessing object to be applied

        Returns
        -------
        Wrapper
        """
        model = algorithm(**params)  # instantiate the sklearn estimator
        columns = list(X.columns)
        if data_split[0] == 'train_test':
            X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                                **data_split[1])
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test[columns])
            y_probs = model.predict_proba(X_test[columns])[:, 1]
            if classification:
                res_metrics = Metrics.classification(y_test.values, y_pred, y_probs)
            else:
                res_metrics = Metrics.regression(y_test.values, y_pred)
        elif data_split[0] == 'cv':
            cv = data_split[1].get('cv', 5)
            agg_func = data_split[1].get('agg', np.mean)
            res_metrics = Metrics.crossvalidation(model, X, y, classification,
                                                  cv, agg_func)
            model.fit(X, y)  # refit on the full data after cross-validation
        model = Wrapper(model, preprocessing, res_metrics, columns)
        return model
```
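If sklearn does not cover your case, a custom trainer can follow the same contract. Below is a minimal, illustrative sketch only: the model class `MyLibModel` is hypothetical, and the exact `Trainer` and `Wrapper` interfaces are assumed to match the generated project:

```python
class TrainerMyLib(Trainer):

    def train(self, X, y, **params):
        """Train a model from another package and return it wrapped."""
        model = MyLibModel(**params)  # hypothetical third-party estimator
        model.fit(X, y)
        # illustrative metric only; compute whatever suits your problem
        res_metrics = {'train_score': model.score(X, y)}
        return Wrapper(model, None, res_metrics, list(X.columns))
```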
8. Now that we have loaded the data, implemented the preprocessing and already have the method to train, we need to join these steps, that is, to set up our execution pipeline. In Hermione this is done in the `train.py` script. So come on! (A complete sketch of the assembled script follows the sub-steps below.)

8.1. Load the project name, defined in the config file (step 4):
```python
with open('config/config.json', 'r') as file:
    project_name = json.load(file)['project_name']
```
8.2. Create an experiment in mlflow:

```python
mlflow.set_experiment(project_name)
```
8.3. Enter the path of the dataset to be loaded by the `Spreadsheet` class (step 5):

```python
df = Spreadsheet().get_data('../data/raw/train.csv')
```
8.4. Apply the preprocessing defined in step 6:

```python
p = Preprocessing()
df = p.process(df)
```
8.5. Define features (X) and target (y):

```python
X = df.drop(columns=["Survived"])
y = df["Survived"]
```
8.6. Define the sklearn algorithms that we will apply:

```python
algos = [RandomForestClassifier, GradientBoostingClassifier, LogisticRegression]
```
8.7. Now we will configure the execution of the algorithms, using the `TrainerSklearn` class (step 7). Here we train inside mlflow runs so that the results can be stored and analyzed later:

```python
for algo in algos:
    with mlflow.start_run() as run:
        model = TrainerSklearn().train(X, y,
                                       classification=True,
                                       algorithm=algo,
                                       data_split=('cv', {'cv': 8}),
                                       preprocessing=p)
        mlflow.log_params({'algorithm': algo})
        mlflow.log_metrics(model.get_metrics())
        mlflow.sklearn.log_model(model.get_model(), 'model')
```
8.8. After the `train.py` script has been built with the previous steps, you need to run it. You can do this in two ways:
- Run the entire script on the Python console
- Run the `hermione train` command in the `src` folder
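Putting sub-steps 8.1 to 8.7 together, the assembled `train.py` might look like the sketch below. The import paths for `Spreadsheet`, `Preprocessing` and `TrainerSklearn` are assumptions based on the project layout Hermione generates; adjust them if your tree differs:

```python
import json

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# assumed module paths from the generated project layout
from ml.data_source.spreadsheet import Spreadsheet
from ml.preprocessing.preprocessing import Preprocessing
from ml.model.trainer import TrainerSklearn

# 8.1. load the project name from the config file
with open('config/config.json', 'r') as file:
    project_name = json.load(file)['project_name']

# 8.2. create the mlflow experiment
mlflow.set_experiment(project_name)

# 8.3/8.4. load and preprocess the data
df = Spreadsheet().get_data('../data/raw/train.csv')
p = Preprocessing()
df = p.process(df)

# 8.5. features and target
X = df.drop(columns=["Survived"])
y = df["Survived"]

# 8.6/8.7. train each algorithm and log the results
algos = [RandomForestClassifier, GradientBoostingClassifier, LogisticRegression]
for algo in algos:
    with mlflow.start_run() as run:
        model = TrainerSklearn().train(X, y,
                                       classification=True,
                                       algorithm=algo,
                                       data_split=('cv', {'cv': 8}),
                                       preprocessing=p)
        mlflow.log_params({'algorithm': algo})
        mlflow.log_metrics(model.get_metrics())
        mlflow.sklearn.log_model(model.get_model(), 'model')
```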
9. After executing step 8.8, the models, their parameters and metrics are logged in mlflow. To access them, simply execute the command below at the command prompt inside the `src/` path:
```bash
mlflow ui
```
10. Open the URL that the previous command returns (by default, http://127.0.0.1:5000) in your preferred browser, so you can analyze the results.
Ready! Now you have built a project from scratch using Hermione.