Add MemoryStatsVisualizer tool to Khiops repo (#237)

The MemoryStatsVisualizer tool visualizes the memory traces generated by the Khiops binaries. It is now maintained in the Khiops repo, in the test\MemoryStatsVisualizer directory. It consists of usage documentation (README.md), Python scripts, and usage examples.

Port of the existing tool, with a few minimal cleanup operations:
- minimal code cleanup
- PyCharm PEP8 warnings addressed (most of them)
- README.md written
- full test
  - memory logs produced for the samples, with the kht_test script
  - memory logs exploited with all the MemoryStatsVisualizer tools
KhiopsML · Apr 15, 2024 · 9591132 · 9591132
1 parent c76df9a
commit 9591132

Showing 14 changed files with 16,151 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -21,6 +21,12 @@ CMakeFiles/
 test/**/results/
 comparisonResults.log
 
+# Visualization file and summaries produced by the MemoryStatsVisualizer tool
+KhiopsMemoryStats.html
+KhiopsMemoryStats.Summary.txt
+KhiopsMemoryStats.aggregate_stats.xlsx
+KhiopsMemoryStats.all_stats.xlsx
+
 # Python
 *.pyc
 __pycache__/
diff --git a/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats.log b/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats.log
diff --git a/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_1.log b/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_1.log
diff --git a/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_2.log b/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_2.log
diff --git a/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_3.log b/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_3.log
diff --git a/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_4.log b/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_4.log
diff --git a/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_5.log b/test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_5.log
diff --git a/test/MemoryStatsVisualizer/AdultSample/KhiopsMemoryStats.log b/test/MemoryStatsVisualizer/AdultSample/KhiopsMemoryStats.log
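The sample logs listed above follow the naming described in the compute_aggregates docstring: no suffix for the master process and a `_<processId>` suffix per slave process. The helpers in utils.py are not shown in this excerpt, so the following is only a sketch of how such names could be built and counted; `process_log_name` and `count_process_logs` are hypothetical stand-ins, not the actual utils functions.

```python
import os


def process_log_name(master_log, process_id):
    """Return the log name for a process (hypothetical helper).

    Assumption: process 0 is the master and uses the name as given;
    slave N uses '<base>_N<ext>', as in KhiopsMemoryStats_1.log.
    """
    if process_id == 0:
        return master_log
    base, ext = os.path.splitext(master_log)
    return f"{base}_{process_id}{ext}"


def count_process_logs(master_log):
    """Count consecutive process logs that exist on disk, starting at the master."""
    count = 0
    while os.path.exists(process_log_name(master_log, count)):
        count += 1
    return count
```

With the AdultParallelSample directory of this commit, `count_process_logs` would be expected to return 6 under these assumptions (one master log plus five slave logs).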
diff --git a/test/MemoryStatsVisualizer/README.md b/test/MemoryStatsVisualizer/README.md
@@ -0,0 +1,108 @@
+# Memory stats visualizer
+
+Visualizer for the memory stats logs produced by the Khiops tool.
+
+It takes log files as input and visualizes them in a browser, with curves per memory stat and summary bars per process.
+It also builds statistical summaries of the logs as tabular text or Excel files.
+
+## Producing memory stats logs
+
+Khiops binaries can produce logs that summarise resource consumption for cores, CPU, memory and I/O.
+
+The logs are produced on demand, depending on environment variables.
+These environment variables are listed by the kht_env script of the LearningTestTool:
+- KhiopsMemStatsLogFileName: None, memory stats log file name
+- KhiopsMemStatsLogFrequency: None, frequency of allocator stats collection (0, 100000, 1000000,...)
+- KhiopsMemStatsLogToCollect: None, stats to collect (8193: only time and labels, 16383: all,...)
+- KhiopsIOTraceMode: None, to collect IO trace (false, true)
+
+The values usable for KhiopsMemStatsLogToCollect are detailed in the appendix at the end of this document.
+
+Warning: KhiopsIOTraceMode produces a large volume of logs; use it only when necessary.
+
+
+## Samples of stats logs
+
+Log directories:
+- AdultSample: logs obtained in sequential mode for the Adult dataset
+- AdultParallelSample: logs obtained in parallel mode for the Adult dataset
+
+Producing the logs using kht_test:
+- have the LearningTestTool script in the path
+- go to the MemoryStatsVisualizer directory
+- set environment variables and run the test in sequential mode
+  (under Windows in the example below)
+~~~~
+set KhiopsMemStatsLogFileName=%cd%\AdultSample\KhiopsMemoryStats.log
+set KhiopsMemStatsLogFrequency=100000
+set KhiopsMemStatsLogToCollect=16383
+set KhiopsIOTraceMode=true
+kht_test ..\LearningTest\TestKhiops\Standard\Adult r
+~~~~
+- do the same in parallel mode, changing only the log file name
+~~~~
+set KhiopsMemStatsLogFileName=%cd%\AdultParallelSample\KhiopsMemoryStats.log
+kht_test ..\LearningTest\TestKhiops\Standard\Adult r -p 6
+~~~~
+
+## Exploiting memory stats logs
+
+Once the logs are available, the memory stats visualizer Python scripts can be used to
+visualize and summarize the logs:
+- memory_stats_visualizer: build a summary visualization of the logs and open it in a browser
+- collect_stats_by_task: build a summary text file with stats per task
+- compute_and_write_aggregates: build summary Excel files with aggregated I/O stats
+
+See sample.py for an example of the use of each python script.
+
+The visualization is interactive:
+- widgets are available in the top right-hand corner, for example to zoom in on a sub-section of the display
+- the legend on the right can be used to select what needs to be viewed and the statistics to be displayed
+- tooltips are available, showing detailed statistics for each selected part of the display
+- ...
+
+
+#### Appendix: detailed settings for the logs to be collected
+
+In the following example, "LogTime=1 : Time (*time associated with the log*)":
+- LogTime: identifier
+- 1: value to use for environment variable KhiopsMemStatsLogToCollect
+- Time: label of the column in the file containing the logs (cf. KhiopsMemStatsLogFileName)
+- *time associated with the log*: description
+
+The following constants are used to select the statistics to be collected:
+- LogTime=1 : Time (*time associated with the log*)
+- HeapMemory=2 : Heap mem (*current size of the heap*)
+- MaxHeapRequestedMemory=4 : Max heap mem (*maximum size requested for the heap*)
+- TotalHeapRequestedMemory=8 : Total heap mem (*total cumulative size requested for the heap*)
+- AllocNumber=16 : Alloc (*current number of allocations*)
+- MaxAllocNumber=32 : Max alloc (*maximum number of allocations*)
+- TotalAllocNumber=64 : Total alloc (*total number of allocations*)
+- TotalFreeNumber=128 : Total free (*total number of memory frees*)
+- GrantedSize=256 : Granted (*current size allocated*)
+- MaxGrantedSize=512 : Max granted (*maximum size allocated*)
+- TotalRequestedSize=1024 : Total requested size
+- TotalGrantedSize=2048 : Total granted size
+- TotalFreeSize=4096 : Total free size (*total size freed*)
+- LogLabel=8192 : Label (*user label*)
+
+To collect several statistics simultaneously, the corresponding constants must be added together:
+- NoStats=0 (*no statistics*)
+- AllStats=16383 (*all statistics*)
+- Some potentially useful combinations
+  - LogInfo=8193 (*info per log*)
+    - LogTime + LogLabel
+  - HeapStats=14 (*statistics on the heap*)
+    - HeapMemory + MaxHeapRequestedMemory + TotalHeapRequestedMemory
+  - AllocStats=8176 (*statistics on allocations*)
+    - AllStats - LogInfo - HeapStats
+  - AllocCurrentStats=272 (*statistics on current allocations*)
+    - AllocNumber + GrantedSize
+  - AllocMaxStats=544 (*statistics on max allocations*)
+    - MaxAllocNumber + MaxGrantedSize
+  - AllocNumberStats=240 (*statistics on the number of allocations*)
+    - AllocNumber + MaxAllocNumber + TotalAllocNumber + TotalFreeNumber
+  - AllocSizeStats=7936 (*statistics on allocation sizes*)
+    - GrantedSize + MaxGrantedSize + TotalRequestedSize + TotalGrantedSize + TotalFreeSize
+
+Reference: src\Norm\base\MemoryStatsManager.h
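The constants in the README appendix are distinct powers of two, so adding them together is the same as combining bit flags. A minimal sketch of that arithmetic follows, with the values copied from the appendix; the `MemStat` class and the `to_collect` helper are hypothetical names used for illustration and are not part of the tool.

```python
from enum import IntFlag


class MemStat(IntFlag):
    """Bit values copied from the README appendix (class name is hypothetical)."""

    LogTime = 1
    HeapMemory = 2
    MaxHeapRequestedMemory = 4
    TotalHeapRequestedMemory = 8
    AllocNumber = 16
    MaxAllocNumber = 32
    TotalAllocNumber = 64
    TotalFreeNumber = 128
    GrantedSize = 256
    MaxGrantedSize = 512
    TotalRequestedSize = 1024
    TotalGrantedSize = 2048
    TotalFreeSize = 4096
    LogLabel = 8192


def to_collect(*stats):
    """Combine flags into the integer to put in KhiopsMemStatsLogToCollect."""
    value = 0
    for stat in stats:
        value |= int(stat)  # OR equals addition here because the bits are distinct
    return value


log_info = to_collect(MemStat.LogTime, MemStat.LogLabel)
heap_stats = to_collect(
    MemStat.HeapMemory,
    MemStat.MaxHeapRequestedMemory,
    MemStat.TotalHeapRequestedMemory,
)
all_stats = to_collect(*MemStat)
```

Under these definitions, `log_info`, `heap_stats` and `all_stats` reproduce the LogInfo, HeapStats and AllStats values listed in the appendix, and `all_stats - log_info - heap_stats` gives AllocStats.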
diff --git a/test/MemoryStatsVisualizer/collect_stat_io.py b/test/MemoryStatsVisualizer/collect_stat_io.py
@@ -0,0 +1,209 @@
+import os
+import re
+import pandas as pd
+import plotly.io as pio
+import utils
+
+
+"""
+Aggregation of the memory stats logs produced by the Khiops tool
+
+It takes a log file as input and aggregates I/O times.
+"""
+
+
+def extract_driver_method(label):
+    """Extract driver type, method Name, and Start/End"""
+    driver_type = ""
+    method_name = ""
+    start_end = ""
+    if label.find("driver") == 0:
+        offset_s = label.find("[")
+        offset_e = label.find("]")
+        offset_be = label.find("Begin")
+        if offset_be == -1:
+            offset_be = label.find("End")
+        assert offset_be != -1
+        driver_type = label[offset_s + 1 : offset_e]
+        method_name = label[offset_e + 2 : offset_be - 1]
+        start_end = label[offset_be:]
+
+        # Rename driver type in a more concise way
+        schemes = ("hdfs", "ansi", "s3")
+        for scheme in schemes:
+            if driver_type.lower().find(scheme) != -1:
+                driver_type = scheme
+                break
+    return driver_type, method_name, start_end
+
+
+def collect_stat_io(file_name):
+    """Collect the duration of each driver I/O call, per process, driver and method"""
+    stats = []
+
+    # Define offsets of the stats collected per task (currently unused in this function)
+    (
+        ID,
+        SLAVES,
+        START,
+        TIME,
+        PROCESS_NB,
+        READ_TIME,
+        WRITE_TIME,
+        READ,
+        WRITE,
+        ALLOC,
+        GRANTED,
+        NAME,
+    ) = (
+        0,
+        1,
+        2,
+        3,
+        4,
+        5,
+        6,
+        7,
+        8,
+        9,
+        10,
+        11,
+    )  # Keys for tuple fields in annotation families
+
+    # Count the number of processes, based on existing files with slave suffixes
+    process_number = utils.get_process_number(file_name)
+
+    # Analyse stat files for all processes to collect stats per task (and per inter-task for the master)
+    for process_id in range(process_number):
+        process_file_name = utils.build_process_filename(file_name, process_id)
+
+        # Read data file
+        data = pd.read_csv(process_file_name, delimiter="\t")
+
+        data_time = data["Time"]
+        data_label = data["Label"]
+        stats_size = len(data_time)
+        start_time = 0
+        for i in range(stats_size):
+            label = data_label[i]
+
+            if isinstance(label, str) and label.find("driver") == 0:
+                driver_type, method_name, start_or_end = extract_driver_method(label)
+                if start_or_end == "Begin":
+                    assert start_time == 0
+                    start_time = data_time[i]
+
+                if start_or_end == "End":
+                    duration = data_time[i] - start_time
+                    start_time = 0
+                    stats.append([str(process_id), driver_type, method_name, duration])
+    return pd.DataFrame(stats, columns=("rank", "driver", "method", "time"))
+
+
+def compute_aggregates(file_name):
+    """Load data from files
+    ----------
+
+    Parameters
+    ----------
+    file_name: str
+        name of the input memory stat log file, without suffix for the master process and
+        with a suffix '_<processId>' per slave process
+
+    Returns
+    ----------
+    4 DataFrames :
+
+        - all statistics
+        - statistics grouped by method
+        - statistics grouped by method and driver
+        - statistics grouped by rank, method and driver
+    """
+    # Loading data
+    df_stats = collect_stat_io(file_name)
+
+    if df_stats.empty:
+        raise Exception("no I/O statistics to collect")
+    else:
+        # Aggregation by ProcessId method and driver
+        df_groupby_rank = df_stats.groupby(["rank", "method", "driver"]).agg(
+            min=("time", "min"),
+            max=("time", "max"),
+            mean=("time", "mean"),
+            median=("time", "median"),
+            stddev=("time", "std"),
+            sum=("time", "sum"),
+            count=("time", "count"),
+        )
+
+        # Aggregation by method and driver
+        df_groupby_driver = df_stats.groupby(["method", "driver"]).agg(
+            min=("time", "min"),
+            max=("time", "max"),
+            mean=("time", "mean"),
+            median=("time", "median"),
+            stddev=("time", "std"),
+            sum=("time", "sum"),
+            count=("time", "count"),
+        )
+
+        # Aggregation by method
+        df_groupby_method = df_stats.groupby(["method"]).agg(
+            min=("time", "min"),
+            max=("time", "max"),
+            mean=("time", "mean"),
+            median=("time", "median"),
+            stddev=("time", "std"),
+            sum=("time", "sum"),
+            count=("time", "count"),
+        )
+
+        df_groupby_rank.sort_values(by=["driver", "method"], inplace=True)
+        df_groupby_driver.sort_values(by=["driver", "method"], inplace=True)
+        df_groupby_method.sort_values(by=["method"], inplace=True)
+
+        return df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank
+
+
+def compute_and_write_aggregates(memory_log_file_name):
+    """Load data from files and write results in Excel files
+    ----------
+
+    Compute and write 4 DataFrames on disk :
+       - 'all_stats.xlsx' : all statistics on one file
+       - 'aggregate_stats.xlsx' :
+            - sheet 'group_by_method' : statistics grouped by method
+            - sheet 'group_by_driver' : statistics grouped by method and driver
+            - sheet 'group_by_rank' : statistics grouped by rank, method and driver
+
+    Parameters
+    ----------
+    memory_log_file_name: str
+        The name of the input memory stat log file, without suffix for the master process and
+        with a suffix '_<processId>' per slave process
+
+    """
+
+    # Compute all aggregates
+    try:
+        df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank = (
+            compute_aggregates(memory_log_file_name)
+        )
+    except Exception as error:
+        print("Error in IO analysis of log file " + memory_log_file_name + ":", error)
+        return
+
+    # Writes the global statistics in a separate excel file
+    dir_name = os.path.dirname(memory_log_file_name)
+    file_name = os.path.basename(memory_log_file_name)
+    stats_file_name = os.path.splitext(file_name)[0] + ".all_stats.xlsx"
+    with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
+        df_stats.to_excel(writer)
+
+    # Writes aggregate statistics in excel sheets
+    stats_file_name = os.path.splitext(file_name)[0] + ".aggregate_stats.xlsx"
+    print("Save aggregate statistics file " + stats_file_name)
+    with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
+        df_groupby_method.to_excel(writer, sheet_name="group_by_method")
+        df_groupby_driver.to_excel(writer, sheet_name="group_by_driver")
+        df_groupby_rank.to_excel(writer, sheet_name="group_by_rank")
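collect_stat_io pairs each driver "Begin" label with the next "End" label and uses 0 as its "no call in progress" marker. The sketch below shows the same pairing with None as the marker, which keeps working if a call begins at time 0 and reports an "End" that has no matching "Begin". The function name and the sample labels in the usage note are illustrative assumptions, not taken from real Khiops logs.

```python
def pair_begin_end(events):
    """Pair 'Begin'/'End' labels into durations (hypothetical helper).

    events: iterable of (time, label) tuples for one process, in log order.
    Returns a list of (label_without_suffix, duration) tuples.
    """
    durations = []
    start_time = None  # None means no call is currently open
    for time, label in events:
        if label.endswith("Begin"):
            if start_time is not None:
                raise ValueError(f"nested Begin at time {time}")
            start_time = time
        elif label.endswith("End"):
            if start_time is None:
                raise ValueError(f"End without Begin at time {time}")
            name = label[: -len("End")].strip()
            durations.append((name, time - start_time))
            start_time = None
    return durations
```

For example, `pair_begin_end([(0.0, "driver [ansi] Read Begin"), (0.5, "driver [ansi] Read End")])` yields one entry with a duration of 0.5, where the label text is made up for the example.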