Add MemoryStatsVisualizer tool to Khiops repo
The MemoryStatsVisualizer tool visualizes the memory traces generated by the Khiops binaries.

This tool is now maintained in the Khiops repo, in the test\MemoryStatsVisualizer directory.
It consists of usage documentation (README.md), Python scripts, and usage examples.

The existing tool was taken over, with a few minimal cleanup operations:
- minimal code cleanup
- handling of (most of) PyCharm's PEP 8 warnings
- writing of the README.md
- complete test
  - production of the memory logs for the samples, with the kht_test script
  - exploitation of the memory logs with the full set of MemoryStatsVisualizer tools
marcboulle committed Apr 12, 2024
1 parent c76df9a commit c204aa6
Showing 14 changed files with 16,142 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -21,6 +21,12 @@ CMakeFiles/
test/**/results/
comparisonResults.log

# Visualization file and summaries produced by the MemoryStatsVisualizer tool
KhiopsMemoryStats.html
KhiopsMemoryStats.Summary.txt
KhiopsMemoryStats.aggregate_stats.xlsx
KhiopsMemoryStats.all_stats.xlsx

# Python
*.pyc
__pycache__/
5,610 changes: 5,610 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats.log

Large diffs are not rendered by default.

1,668 changes: 1,668 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_1.log


1,679 changes: 1,679 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_2.log


396 changes: 396 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_3.log


382 changes: 382 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_4.log


312 changes: 312 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_5.log


4,555 changes: 4,555 additions & 0 deletions test/MemoryStatsVisualizer/AdultSample/KhiopsMemoryStats.log


99 changes: 99 additions & 0 deletions test/MemoryStatsVisualizer/README.md
@@ -0,0 +1,99 @@
# Memory stats visualizer

Visualizer for the memory stats logs produced by the Khiops tool.

It takes log files as input and visualizes them in a browser, with curves per memory stat and summary bars per process.
It also builds statistical summaries of the logs as text or Excel files.

## Producing memory stats logs

Khiops binaries can produce logs that summarize resource consumption for cores, CPU, memory and I/O.

The logs are produced on demand, depending on environment variables.
These environment variables are listed by the kht_env script of the LearningTestTool:
- KhiopsMemStatsLogFileName: None, memory stats log file name
- KhiopsMemStatsLogFrequency: None, frequency of allocator stats collection (0, 100000, 1000000,...)
- KhiopsMemStatsLogToCollect: None, stats to collect (8193: only time and labels, 16383: all,...)
- KhiopsIOTraceMode: None, to collect IO trace (false, true)

Warning: KhiopsIOTraceMode produces a large volume of logs; use it only when necessary.
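
As an illustrative sketch, the same variables can also be set programmatically before launching a Khiops binary from Python (the binary name and scenario file below are hypothetical):

```python
import os
import subprocess

# Environment enabling memory stats collection (values from the list above)
env = os.environ.copy()
env["KhiopsMemStatsLogFileName"] = os.path.join(os.getcwd(), "KhiopsMemoryStats.log")
env["KhiopsMemStatsLogFrequency"] = "100000"  # collect allocator stats every 100000 allocations
env["KhiopsMemStatsLogToCollect"] = "8193"    # LogInfo: time and labels only
# env["KhiopsIOTraceMode"] = "true"           # enable only when I/O traces are really needed

# Hypothetical invocation of a Khiops binary with this environment:
# subprocess.run(["MODL", "scenario._kh"], env=env, check=True)
```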

### Detailed settings for the logs to be collected

The following constants are used to select the statistics to be collected:
- LogTime=1: Time (*time stamp of the log*)
- HeapMemory=2: Heap mem (*current heap size*)
- MaxHeapRequestedMemory=4: Max heap mem (*maximum size requested for the heap*)
- TotalHeapRequestedMemory=8: Total heap mem (*total cumulated size requested for the heap*)
- AllocNumber=16: Alloc (*current number of allocations*)
- MaxAllocNumber=32: Max alloc (*maximum number of allocations*)
- TotalAllocNumber=64: Total alloc (*total number of allocations*)
- TotalFreeNumber=128: Total free (*total number of deallocations*)
- GrantedSize=256: Granted (*current granted size*)
- MaxGrantedSize=512: Max granted (*maximum granted size*)
- TotalRequestedSize=1024: Total requested size (*total requested size*)
- TotalGrantedSize=2048: Total granted size (*total granted size*)
- TotalFreeSize=4096: Total free size (*total freed size*)
- LogLabel=8192: Label (*user label*)


To collect several statistics simultaneously, the corresponding constants must be added together:
- Special values:
  - NoStats=0 (*no statistics*)
  - AllStats=16383 (*all statistics*)
- Some potentially useful combinations:
  - LogInfo=8193 (*info per log*)
    - LogTime + LogLabel
  - HeapStats=14 (*heap stats*)
    - HeapMemory + MaxHeapRequestedMemory + TotalHeapRequestedMemory
  - AllocStats=8176 (*allocation stats*)
    - AllStats - LogInfo - HeapStats
  - AllocCurrentStats=272 (*current allocation stats*)
    - AllocNumber + GrantedSize
  - AllocMaxStats=544 (*allocation maxima stats*)
    - MaxAllocNumber + MaxGrantedSize
  - AllocNumberStats=240 (*allocation count stats*)
    - AllocNumber + MaxAllocNumber + TotalAllocNumber + TotalFreeNumber
  - AllocSizeStats=7936 (*allocation size stats*)
    - GrantedSize + MaxGrantedSize + TotalRequestedSize + TotalGrantedSize + TotalFreeSize
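
Since these selectors are bit flags (powers of two), the combined values above can be verified by simply summing the base constants; a quick Python check:

```python
# Base constants as listed above (each a distinct power of two)
LogTime, HeapMemory, MaxHeapRequestedMemory, TotalHeapRequestedMemory = 1, 2, 4, 8
AllocNumber, MaxAllocNumber, TotalAllocNumber, TotalFreeNumber = 16, 32, 64, 128
GrantedSize, MaxGrantedSize, TotalRequestedSize, TotalGrantedSize = 256, 512, 1024, 2048
TotalFreeSize, LogLabel = 4096, 8192

# For disjoint bit flags, addition and bitwise OR are equivalent
LogInfo = LogTime + LogLabel                                                # 8193
HeapStats = HeapMemory + MaxHeapRequestedMemory + TotalHeapRequestedMemory  # 14
AllStats = 16383                                                            # all 14 flags set
AllocStats = AllStats - LogInfo - HeapStats                                 # 8176
AllocCurrentStats = AllocNumber + GrantedSize                               # 272
AllocMaxStats = MaxAllocNumber + MaxGrantedSize                             # 544
AllocNumberStats = AllocNumber + MaxAllocNumber + TotalAllocNumber + TotalFreeNumber  # 240
```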


## Samples of stats logs

Log directories:
- AdultSample: logs obtained in sequential mode for the Adult dataset
- AdultParallelSample: logs obtained in parallel mode for the Adult dataset

Producing the logs using kht_test:
- have the LearningTestTool scripts in the path
- go to the MemoryStatsVisualizer directory
- set the environment variables and run the test in sequential mode
  (under Windows in the example below)
~~~~
set KhiopsMemStatsLogFileName=%cd%\AdultSample\KhiopsMemoryStats.log
set KhiopsMemStatsLogFrequency=100000
set KhiopsMemStatsLogToCollect=16383
set KhiopsIOTraceMode=true
kht_test ..\LearningTest\TestKhiops\Standard\Adult r
~~~~
- same in parallel mode
~~~~
set KhiopsMemStatsLogFileName=%cd%\AdultParallelSample\KhiopsMemoryStats.log
kht_test ..\LearningTest\TestKhiops\Standard\Adult r -p 6
~~~~
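
On Linux or macOS, the equivalent setup uses export instead of set (a sketch; the kht_test call itself is the same as in the Windows example above):

```shell
export KhiopsMemStatsLogFileName="$PWD/AdultSample/KhiopsMemoryStats.log"
export KhiopsMemStatsLogFrequency=100000
export KhiopsMemStatsLogToCollect=16383
export KhiopsIOTraceMode=true
# kht_test ../LearningTest/TestKhiops/Standard/Adult r
```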

## Exploiting memory stats logs

Once the logs are available, the memory stats visualizer Python scripts can be used to
produce synthetic visualizations and summaries of the logs:
- memory_stats_visualizer: build a synthetic visualization of the logs and open it in a browser
- collect_stats_by_task: build a summary text file with stats per task
- compute_and_write_aggregates: build a summary Excel file with I/O stats per task

See sample.py for an example of the use of each Python script.

The visualization produced by memory_stats_visualizer is interactive:
- widgets are available in the top right-hand corner, for example to zoom in on a sub-section of the display
- the legend on the right can be used to select what needs to be viewed and the statistics to be displayed
- tooltips are available, showing detailed statistics for each selected part of the display
- ...
209 changes: 209 additions & 0 deletions test/MemoryStatsVisualizer/collect_stat_io.py
@@ -0,0 +1,209 @@
import os
import re
import pandas as pd
import plotly.io as pio
import utils


"""
Aggregation of the memory stats log produced by the Khiops tool
It takes log file as input and aggregate I/O time.
"""


def extract_driver_method(label):
"""Extract driver type, method Name, and Start/End"""
driver_type = ""
method_name = ""
start_end = ""
if label.find("driver") == 0:
offset_s = label.find("[")
offset_e = label.find("]")
offset_be = label.find("Begin")
if offset_be == -1:
offset_be = label.find("End")
assert offset_be != -1
driver_type = label[offset_s + 1 : offset_e]
method_name = label[offset_e + 2 : offset_be - 1]
start_end = label[offset_be:]

# Rename driver type in a more concise way
schemes = ("hdfs", "ansi", "s3")
for scheme in schemes:
if driver_type.lower().find(scheme) != -1:
driver_type = scheme
break
return driver_type, method_name, start_end


def collect_stat_io(file_name):
"""Collect memory stats per task"""
stats = []

    # Offsets of the collected stats per task (currently unused in this function)
    (
        ID,
        SLAVES,
        START,
        TIME,
        PROCESS_NB,
        READ_TIME,
        WRITE_TIME,
        READ,
        WRITE,
        ALLOC,
        GRANTED,
        NAME,
    ) = range(12)

    # Count the number of processes, based on existing files with slave extensions
process_number = utils.get_process_number(file_name)

    # Analyse stat files for all processes to collect stats per task (and per inter-task for the master)
for process_id in range(process_number):
process_file_name = utils.build_process_filename(file_name, process_id)

# Read data file
data = pd.read_csv(process_file_name, delimiter="\t")

data_time = data["Time"]
data_label = data["Label"]
stats_size = len(data_time)
start_time = 0
for i in range(stats_size):
label = data_label[i]

if isinstance(label, str) and label.find("driver") == 0:
driver_type, method_name, start_or_end = extract_driver_method(label)
if start_or_end == "Begin":
assert start_time == 0
start_time = data_time[i]

if start_or_end == "End":
duration = data_time[i] - start_time
start_time = 0
stats.append([str(process_id), driver_type, method_name, duration])
return pd.DataFrame(stats, columns=("rank", "driver", "method", "time"))


def compute_aggregates(file_name):
    """Load data from files

    Parameters
    ----------
    file_name: str
        name of the input memory stat log file, without suffix for the master process and
        with a suffix '_<processId>' per slave process

    Returns
    ----------
    4 DataFrames:
    - all statistics
    - statistics grouped by method
    - statistics grouped by method and driver
    - statistics grouped by rank, method and driver
    """
# Loading data
df_stats = collect_stat_io(file_name)

if df_stats.empty:
raise Exception("no I/O statistics to collect")
else:
        # Aggregation by rank (process id), method and driver
df_groupby_rank = df_stats.groupby(["rank", "method", "driver"]).agg(
min=("time", "min"),
max=("time", "max"),
mean=("time", "mean"),
median=("time", "median"),
stddev=("time", "std"),
sum=("time", "sum"),
count=("time", "count"),
)

# Aggregation by method and driver
df_groupby_driver = df_stats.groupby(["method", "driver"]).agg(
min=("time", "min"),
max=("time", "max"),
mean=("time", "mean"),
median=("time", "median"),
stddev=("time", "std"),
sum=("time", "sum"),
count=("time", "count"),
)

# Aggregation by method
df_groupby_method = df_stats.groupby(["method"]).agg(
min=("time", "min"),
max=("time", "max"),
mean=("time", "mean"),
median=("time", "median"),
stddev=("time", "std"),
sum=("time", "sum"),
count=("time", "count"),
)

df_groupby_rank.sort_values(by=["driver", "method"], inplace=True)
df_groupby_driver.sort_values(by=["driver", "method"], inplace=True)
df_groupby_method.sort_values(by=["method"], inplace=True)

return df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank
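
The named-aggregation pattern used above can be seen on a tiny synthetic frame (illustrative data only, mimicking the columns built by collect_stat_io):

```python
import pandas as pd

# Synthetic I/O timings with the same columns as collect_stat_io's result
df = pd.DataFrame(
    {
        "rank": ["0", "0", "1", "1"],
        "driver": ["ansi", "ansi", "ansi", "ansi"],
        "method": ["fread", "fread", "fwrite", "fwrite"],
        "time": [0.1, 0.3, 0.2, 0.4],
    }
)

# Named aggregation: one output column per (source column, aggregation) pair
summary = df.groupby(["method"]).agg(
    mean=("time", "mean"),
    sum=("time", "sum"),
    count=("time", "count"),
)
```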


def compute_and_write_aggregates(memory_log_file_name):
    """Load data from files and write the results to Excel files

    Compute and write 4 DataFrames on disk:
    - '<log name>.all_stats.xlsx': all statistics in one file
    - '<log name>.aggregate_stats.xlsx':
      - sheet 'group_by_method': statistics grouped by method
      - sheet 'group_by_driver': statistics grouped by method and driver
      - sheet 'group_by_rank': statistics grouped by rank, method and driver

    Parameters
    ----------
    memory_log_file_name: str
        The name of the input memory stat log file, without suffix for the master process and
        with a suffix '_<processId>' per slave process
    """

# Compute all aggregates
try:
df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank = (
compute_aggregates(memory_log_file_name)
)
except Exception as error:
print("Error in IO analysis of log file " + memory_log_file_name + ":", error)
return

# Writes the global statistics in a separate excel file
dir_name = os.path.dirname(memory_log_file_name)
file_name = os.path.basename(memory_log_file_name)
stats_file_name = os.path.splitext(file_name)[0] + ".all_stats.xlsx"
with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
df_stats.to_excel(writer)

# Writes aggregate statistics in excel sheets
stats_file_name = os.path.splitext(file_name)[0] + ".aggregate_stats.xlsx"
print("Save aggregate statistics file " + stats_file_name)
with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
df_groupby_method.to_excel(writer, sheet_name="group_by_method")
df_groupby_driver.to_excel(writer, sheet_name="group_by_driver")
df_groupby_rank.to_excel(writer, sheet_name="group_by_rank")
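
The label parsing in `extract_driver_method` above expects labels of the form `driver [<driver name>] <method> Begin|End`. A standalone check of the slicing logic on a made-up label:

```python
# Made-up example label, following the format parsed by extract_driver_method
label = "driver [ANSI driver] fread Begin"

offset_s = label.find("[")
offset_e = label.find("]")
offset_be = label.find("Begin")
if offset_be == -1:
    offset_be = label.find("End")

driver_type = label[offset_s + 1 : offset_e]       # text between the brackets
method_name = label[offset_e + 2 : offset_be - 1]  # word between "] " and " Begin"
start_end = label[offset_be:]                      # "Begin" or "End"
```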