Start your project from scratch with Hermione

  1. After you have installed Hermione, you need to create your project:
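A minimal sketch of this step, assuming Hermione's CLI scaffolding command and using the same project name that appears in the config file of step 4 (adjust both to your case):

hermione new project_scratch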

  2. Activate the project environment:
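For example, assuming a virtual environment was created at the env_path shown in the config file (Linux/macOS; the exact path may differ on your machine):

source project_scratch/project_scratch_env/bin/activate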

  3. Now install the default libraries listed in the requirements.txt file:
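For example, from the folder that contains requirements.txt:

pip install -r requirements.txt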

  4. When you create a project with Hermione, it already generates a configuration file, which can be found at src/config/config.json. This file holds some project settings; if you find it necessary, you can change it, adding new fields or modifying existing ones:
{
    "project_name": "project_scratch",
    "env_path": "project_scratch/project_scratch_env",
    "files_path": "../data/raw/",
    "key": "<<<<key>>>>",
    "user": "<<<<user>>>>"
}
  5. The first step in building our project is to load the data. In this tutorial we will use the Titanic dataset. We load it in the get_data method of the Spreadsheet class, as in the example below:
def get_data(self, path) -> pd.DataFrame:
    """
    Returns a flat table in a DataFrame.

    Parameters
    ----------
    path : str
        Path to the CSV file with the raw data

    Returns
    -------
    pd.DataFrame
        DataFrame with the data
    """
    return pd.read_csv(path)[['Survived', 'Pclass', 'Sex', 'Age']]
  6. Next we need to apply pre-processing to our data. This is done in the process method of the Preprocessing class:
def process(self, df: pd.DataFrame):
    """
    Perform data cleaning.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be processed

    Returns
    -------
    pd.DataFrame
        Cleaned DataFrame
    """
    print("Cleaning data")
    df_copy = df.copy()
    df_copy['Pclass'] = df_copy.Pclass.astype('object')
    df_copy = df_copy.dropna()
    df_copy = pd.get_dummies(df_copy)
    return df_copy

Here we apply three pre-processing steps: casting the Pclass column to the object type, dropping rows with missing values, and creating dummy (one-hot encoded) variables.
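
To make the effect concrete, a quick illustration of what process returns; the column names assume the Titanic values of Pclass and Sex and pandas' default get_dummies naming, and the import paths below are the defaults of a Hermione scaffold and may differ in your project:

# Illustration only: numeric columns are kept as-is, object columns
# (Pclass, Sex) are expanded into dummy indicator columns.
from ml.data_source.spreadsheet import Spreadsheet
from ml.preprocessing.preprocessing import Preprocessing

p = Preprocessing()
df = p.process(Spreadsheet().get_data('../data/raw/train.csv'))
print(list(df.columns))
# Expected: ['Survived', 'Age', 'Pclass_1', 'Pclass_2', 'Pclass_3',
#            'Sex_female', 'Sex_male']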

  7. The next step is to define the algorithm that we will train. If you are going to run your model with a scikit-learn algorithm, the TrainerSklearn class already has the implementation and you only need to call its train method, passing a few parameters. The train method supports training with cross-validation or with a simple train/test split (both parameterized). If you need to use another package, just implement your own class inheriting from the Trainer class, similar to what was implemented in TrainerSklearn (see the sketch after the code below):
class TrainerSklearn(Trainer):

    def train(self, X, y,
              classification: bool,
              algorithm,
              preprocessing=None,
              data_split=('train_test', {'test_size': 0.2}),
              **params):
        """
        Method that builds the sklearn model.

        Parameters
        ----------
        X, y              : features and target used to fit the model
        classification    : bool
                            if True, a classification model is trained, otherwise regression
        algorithm         : class
                            sklearn estimator class, instantiated with **params
        preprocessing     : Preprocessing
                            preprocessing object to be stored with the model
        data_split        : tuple (strategy: str, params: dict)
                            strategy used to split the data when training the model.
                            Strategy: ['train_test', 'cv']
                            Ex: ('cv', {'cv': 9, 'agg': np.median})

        Returns
        -------
        Wrapper
        """
        model = algorithm(**params)  # instantiate the estimator
        columns = list(X.columns)
        if data_split[0] == 'train_test':
            X_train, X_test, y_train, y_test = train_test_split(X, y, **data_split[1])
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test[columns])
            y_probs = model.predict_proba(X_test[columns])[:, 1]
            if classification:
                res_metrics = Metrics.classification(y_test.values, y_pred, y_probs)
            else:
                res_metrics = Metrics.regression(y_test.values, y_pred)
        elif data_split[0] == 'cv':
            cv = data_split[1]['cv'] if 'cv' in data_split[1] else 5
            agg_func = data_split[1]['agg'] if 'agg' in data_split[1] else np.mean
            res_metrics = Metrics.crossvalidation(model, X, y, classification, cv, agg_func)
            model.fit(X, y)
        model = Wrapper(model, preprocessing, res_metrics, columns)
        return model
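
If you do need another package, a minimal sketch of a custom trainer is shown below. It only assumes what is visible in the code above: inheriting from Trainer and returning a Wrapper built from the fitted model, the preprocessing object, the metrics and the feature columns. The XGBClassifier estimator, the fixed train/test split and the import paths are illustrative assumptions, not part of Hermione.

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # illustrative choice of an external package

# Hypothetical module paths; adjust to wherever Trainer, Wrapper and Metrics
# live in your project scaffold.
from ml.model.trainer import Trainer
from ml.model.wrapper import Wrapper
from ml.model.metrics import Metrics


class TrainerXGBoost(Trainer):
    """Sketch of a trainer that does not go through TrainerSklearn."""

    def train(self, X, y, preprocessing=None, test_size=0.2, **params):
        model = XGBClassifier(**params)
        columns = list(X.columns)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test[columns])
        y_probs = model.predict_proba(X_test[columns])[:, 1]
        res_metrics = Metrics.classification(y_test.values, y_pred, y_probs)
        # Wrap the fitted model the same way TrainerSklearn does above
        return Wrapper(model, preprocessing, res_metrics, columns)
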
  8. Now that we have loaded the data, implemented the pre-processing and have the training method, we need to join these steps, that is, set up our execution pipeline. In Hermione this is done in the script train.py. So let's go!

    8.1. Load the project name, defined in the config file (step 4).

    with open('config/config.json', 'r') as file:
        project_name = json.load(file)['project_name']

    8.2. Create an experiment in mlflow:

    mlflow.set_experiment(project_name)

    8.3. Enter the path of the dataset to be loaded by the Spreadsheet class (step 5):

    df = Spreadsheet().get_data('../data/raw/train.csv')

    8.4. Apply the preprocessing defined in step 6:

    p = Preprocessing()
    df = p.process(df)

    8.5. Define features (X) and target (y):

    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    8.6. Define the sklearn algorithms that we will apply:

    algos = [RandomForestClassifier, GradientBoostingClassifier, LogisticRegression]	
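
    These classes come from scikit-learn; if they are not already imported at the top of train.py, the imports look like this:

    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression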

    8.7. Now we will configure the execution of the algorithms, using the TrainerSklearn class (step 7). Here we train inside an mlflow run so that the results can be stored and analyzed later:

    for algo in algos:
        with mlflow.start_run() as run:
            model = TrainerSklearn().train(X, y,
                                           classification=True,
                                           algorithm=algo,
                                           data_split=('cv', {'cv': 8}),
                                           preprocessing=p)
            mlflow.log_params({'algorithm': algo})
            mlflow.log_metrics(model.get_metrics())
            mlflow.sklearn.log_model(model.get_model(), 'model')

    8.8. After the train.py script has been built with the previous steps, you need to run it. You can do this in two ways:

    • Run the entire script in a Python console
    • Run the hermione train command from the src folder
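
    For the second option, assuming you start from the project root:

    cd src
    hermione train
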
  9. After executing step 8.8, the models, their parameters and metrics are logged in mlflow. To access them, simply execute the command below at the command prompt inside the src/ folder:

mlflow ui
  10. Open the URL returned by the previous command (typically http://127.0.0.1:5000) in your preferred browser, so you can analyze the results.

Done! Now you have built a project from scratch using Hermione.