Add MemoryStatsVisualizer tool to Khiops repo
The MemoryStatsVisualizer tool visualizes the memory traces generated by the Khiops binaries.

This tool is now maintained in the Khiops repo, in the test\MemoryStatsVisualizer directory.
It consists of usage documentation (README.md), Python scripts, and usage examples.

The existing tool was taken over, with a few minimal cleanup operations:
- minimal code cleanup
- handling of (most of) PyCharm's PEP 8 warnings
- writing of the README.md
- complete test
  - production of the memory logs for the samples, with the kht_test script
  - exploitation of the memory logs with the full set of MemoryStatsVisualizer tools
marcboulle committed Apr 12, 2024
1 parent c76df9a commit c204aa6
Showing 14 changed files with 16,142 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -21,6 +21,12 @@ CMakeFiles/
test/**/results/
comparisonResults.log

# Visualization file and summaries produced by the MemoryStatsVisualizer tool
KhiopsMemoryStats.html
KhiopsMemoryStats.Summary.txt
KhiopsMemoryStats.aggregate_stats.xlsx
KhiopsMemoryStats.all_stats.xlsx

# Python
*.pyc
__pycache__/
5,610 changes: 5,610 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats.log

Large diffs are not rendered by default.

1,668 changes: 1,668 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_1.log


1,679 changes: 1,679 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_2.log


396 changes: 396 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_3.log


382 changes: 382 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_4.log


312 changes: 312 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_5.log


4,555 changes: 4,555 additions & 0 deletions test/MemoryStatsVisualizer/AdultSample/KhiopsMemoryStats.log


99 changes: 99 additions & 0 deletions test/MemoryStatsVisualizer/README.md
@@ -0,0 +1,99 @@
# Memory stats visualizer

Visualizer for the memory stats logs produced by the Khiops tool.

It takes log files as input and visualizes them in a browser, with curves per memory stat and summary bars per process.
It also builds statistical summaries of the logs as text or Excel files.

## Producing memory stats logs

Khiops binaries can produce logs that summarize resource consumption for cores, CPU, memory and I/O.

The logs are produced on demand, depending on environment variables.
These environment variables are listed by the kht_env script of the LearningTestTool:
- KhiopsMemStatsLogFileName: None, memory stats log file name
- KhiopsMemStatsLogFrequency: None, frequency of allocator stats collection (0, 100000, 1000000,...)
- KhiopsMemStatsLogToCollect: None, stats to collect (8193: only time and labels, 16383: all,...)
- KhiopsIOTraceMode: None, to collect IO trace (false, true)

Warning: KhiopsIOTraceMode produces a large volume of logs; use it only when necessary.
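
As an illustrative sketch, the same variables can also be set programmatically before launching a Khiops binary from Python (the binary name and scenario file below are hypothetical):

```python
import os
import subprocess

# Environment enabling memory stats collection (values from the list above)
env = os.environ.copy()
env["KhiopsMemStatsLogFileName"] = os.path.join(os.getcwd(), "KhiopsMemoryStats.log")
env["KhiopsMemStatsLogFrequency"] = "100000"  # collect allocator stats every 100000 allocations
env["KhiopsMemStatsLogToCollect"] = "8193"    # LogInfo: time and labels only
# env["KhiopsIOTraceMode"] = "true"           # enable only when I/O traces are really needed

# Hypothetical invocation of a Khiops binary with this environment:
# subprocess.run(["MODL", "scenario._kh"], env=env, check=True)
```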

### Detailed settings for the logs to be collected

The following constants are used to select the statistics to be collected:
- LogTime=1: Time (*time stamp of the log*)
- HeapMemory=2: Heap mem (*current heap size*)
- MaxHeapRequestedMemory=4: Max heap mem (*maximum size requested for the heap*)
- TotalHeapRequestedMemory=8: Total heap mem (*total cumulated size requested for the heap*)
- AllocNumber=16: Alloc (*current number of allocations*)
- MaxAllocNumber=32: Max alloc (*maximum number of allocations*)
- TotalAllocNumber=64: Total alloc (*total number of allocations*)
- TotalFreeNumber=128: Total free (*total number of deallocations*)
- GrantedSize=256: Granted (*current granted size*)
- MaxGrantedSize=512: Max granted (*maximum granted size*)
- TotalRequestedSize=1024: Total requested size (*total requested size*)
- TotalGrantedSize=2048: Total granted size (*total granted size*)
- TotalFreeSize=4096: Total free size (*total freed size*)
- LogLabel=8192: Label (*user label*)


To collect several statistics simultaneously, the corresponding constants must be added together:
- Special values:
  - NoStats=0 (*no statistics*)
  - AllStats=16383 (*all statistics*)
- Some potentially useful combinations:
  - LogInfo=8193 (*info per log*)
    - LogTime + LogLabel
  - HeapStats=14 (*heap stats*)
    - HeapMemory + MaxHeapRequestedMemory + TotalHeapRequestedMemory
  - AllocStats=8176 (*allocation stats*)
    - AllStats - LogInfo - HeapStats
  - AllocCurrentStats=272 (*current allocation stats*)
    - AllocNumber + GrantedSize
  - AllocMaxStats=544 (*allocation maxima stats*)
    - MaxAllocNumber + MaxGrantedSize
  - AllocNumberStats=240 (*allocation count stats*)
    - AllocNumber + MaxAllocNumber + TotalAllocNumber + TotalFreeNumber
  - AllocSizeStats=7936 (*allocation size stats*)
    - GrantedSize + MaxGrantedSize + TotalRequestedSize + TotalGrantedSize + TotalFreeSize
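
Since these selectors are bit flags (powers of two), the combined values above can be verified by simply summing the base constants; a quick Python check:

```python
# Base constants as listed above (each a distinct power of two)
LogTime, HeapMemory, MaxHeapRequestedMemory, TotalHeapRequestedMemory = 1, 2, 4, 8
AllocNumber, MaxAllocNumber, TotalAllocNumber, TotalFreeNumber = 16, 32, 64, 128
GrantedSize, MaxGrantedSize, TotalRequestedSize, TotalGrantedSize = 256, 512, 1024, 2048
TotalFreeSize, LogLabel = 4096, 8192

# For disjoint bit flags, addition and bitwise OR are equivalent
LogInfo = LogTime + LogLabel                                                # 8193
HeapStats = HeapMemory + MaxHeapRequestedMemory + TotalHeapRequestedMemory  # 14
AllStats = 16383                                                            # all 14 flags set
AllocStats = AllStats - LogInfo - HeapStats                                 # 8176
AllocCurrentStats = AllocNumber + GrantedSize                               # 272
AllocMaxStats = MaxAllocNumber + MaxGrantedSize                             # 544
AllocNumberStats = AllocNumber + MaxAllocNumber + TotalAllocNumber + TotalFreeNumber  # 240
```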


## Samples of stats logs

Log directories:
- AdultSample: logs obtained in sequential mode for the Adult dataset
- AdultParallelSample: logs obtained in parallel mode for the Adult dataset

Producing the logs using kht_test:
- have the LearningTestTool scripts in the path
- go to the MemoryStatsVisualizer directory
- set the environment variables and run the test in sequential mode
  (under Windows in the example below)
~~~~
set KhiopsMemStatsLogFileName=%cd%\AdultSample\KhiopsMemoryStats.log
set KhiopsMemStatsLogFrequency=100000
set KhiopsMemStatsLogToCollect=16383
set KhiopsIOTraceMode=true
kht_test ..\LearningTest\TestKhiops\Standard\Adult r
~~~~
- same in parallel mode
~~~~
set KhiopsMemStatsLogFileName=%cd%\AdultParallelSample\KhiopsMemoryStats.log
kht_test ..\LearningTest\TestKhiops\Standard\Adult r -p 6
~~~~
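
On Linux or macOS, the equivalent setup uses export instead of set (a sketch; the kht_test call itself is the same as in the Windows example above):

```shell
export KhiopsMemStatsLogFileName="$PWD/AdultSample/KhiopsMemoryStats.log"
export KhiopsMemStatsLogFrequency=100000
export KhiopsMemStatsLogToCollect=16383
export KhiopsIOTraceMode=true
# kht_test ../LearningTest/TestKhiops/Standard/Adult r
```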

## Exploiting memory stats logs

Once the logs are available, the memory stats visualizer Python scripts can be used to
produce synthetic visualizations and summaries of the logs:
- memory_stats_visualizer: build a synthetic visualization of the logs and open it in a browser
- collect_stats_by_task: build a summary text file with stats per task
- compute_and_write_aggregates: build a summary Excel file with I/O stats per task

See sample.py for an example of the use of each Python script.

The visualization produced by memory_stats_visualizer is interactive:
- widgets are available in the top right-hand corner, for example to zoom in on a sub-section of the display
- the legend on the right can be used to select what needs to be viewed and the statistics to be displayed
- tooltips are available, showing detailed statistics for each selected part of the display
- ...
209 changes: 209 additions & 0 deletions test/MemoryStatsVisualizer/collect_stat_io.py
@@ -0,0 +1,209 @@
import os
import re
import pandas as pd
import plotly.io as pio
import utils


"""
Aggregation of the memory stats log produced by the Khiops tool
It takes log file as input and aggregate I/O time.
"""


def extract_driver_method(label):
"""Extract driver type, method Name, and Start/End"""
driver_type = ""
method_name = ""
start_end = ""
if label.find("driver") == 0:
offset_s = label.find("[")
offset_e = label.find("]")
offset_be = label.find("Begin")
if offset_be == -1:
offset_be = label.find("End")
assert offset_be != -1
driver_type = label[offset_s + 1 : offset_e]
method_name = label[offset_e + 2 : offset_be - 1]
start_end = label[offset_be:]

# Rename driver type in a more concise way
schemes = ("hdfs", "ansi", "s3")
for scheme in schemes:
if driver_type.lower().find(scheme) != -1:
driver_type = scheme
break
return driver_type, method_name, start_end


def collect_stat_io(file_name):
"""Collect memory stats per task"""
stats = []

    # Offsets of the collected stats per task (currently unused in this function)
    (
        ID,
        SLAVES,
        START,
        TIME,
        PROCESS_NB,
        READ_TIME,
        WRITE_TIME,
        READ,
        WRITE,
        ALLOC,
        GRANTED,
        NAME,
    ) = range(12)

    # Count the number of processes, based on existing files with slave extensions
process_number = utils.get_process_number(file_name)

    # Analyse stat files for all processes to collect stats per task (and per inter-task for the master)
for process_id in range(process_number):
process_file_name = utils.build_process_filename(file_name, process_id)

# Read data file
data = pd.read_csv(process_file_name, delimiter="\t")

data_time = data["Time"]
data_label = data["Label"]
stats_size = len(data_time)
start_time = 0
for i in range(stats_size):
label = data_label[i]

if isinstance(label, str) and label.find("driver") == 0:
driver_type, method_name, start_or_end = extract_driver_method(label)
if start_or_end == "Begin":
assert start_time == 0
start_time = data_time[i]

if start_or_end == "End":
duration = data_time[i] - start_time
start_time = 0
stats.append([str(process_id), driver_type, method_name, duration])
return pd.DataFrame(stats, columns=("rank", "driver", "method", "time"))


def compute_aggregates(file_name):
    """Load data from files

    Parameters
    ----------
    file_name: str
        name of the input memory stat log file, without suffix for the master process and
        with a suffix '_<processId>' per slave process

    Returns
    ----------
    4 DataFrames:
    - all statistics
    - statistics grouped by method
    - statistics grouped by method and driver
    - statistics grouped by rank, method and driver
    """
# Loading data
df_stats = collect_stat_io(file_name)

if df_stats.empty:
raise Exception("no I/O statistics to collect")
else:
        # Aggregation by rank (process id), method and driver
df_groupby_rank = df_stats.groupby(["rank", "method", "driver"]).agg(
min=("time", "min"),
max=("time", "max"),
mean=("time", "mean"),
median=("time", "median"),
stddev=("time", "std"),
sum=("time", "sum"),
count=("time", "count"),
)

# Aggregation by method and driver
df_groupby_driver = df_stats.groupby(["method", "driver"]).agg(
min=("time", "min"),
max=("time", "max"),
mean=("time", "mean"),
median=("time", "median"),
stddev=("time", "std"),
sum=("time", "sum"),
count=("time", "count"),
)

# Aggregation by method
df_groupby_method = df_stats.groupby(["method"]).agg(
min=("time", "min"),
max=("time", "max"),
mean=("time", "mean"),
median=("time", "median"),
stddev=("time", "std"),
sum=("time", "sum"),
count=("time", "count"),
)

df_groupby_rank.sort_values(by=["driver", "method"], inplace=True)
df_groupby_driver.sort_values(by=["driver", "method"], inplace=True)
df_groupby_method.sort_values(by=["method"], inplace=True)

return df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank
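
The named-aggregation pattern used above can be seen on a tiny synthetic frame (illustrative data only, mimicking the columns built by collect_stat_io):

```python
import pandas as pd

# Synthetic I/O timings with the same columns as collect_stat_io's result
df = pd.DataFrame(
    {
        "rank": ["0", "0", "1", "1"],
        "driver": ["ansi", "ansi", "ansi", "ansi"],
        "method": ["fread", "fread", "fwrite", "fwrite"],
        "time": [0.1, 0.3, 0.2, 0.4],
    }
)

# Named aggregation: one output column per (source column, aggregation) pair
summary = df.groupby(["method"]).agg(
    mean=("time", "mean"),
    sum=("time", "sum"),
    count=("time", "count"),
)
```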


def compute_and_write_aggregates(memory_log_file_name):
    """Load data from files and write the results to Excel files

    Compute and write 4 DataFrames on disk:
    - '<log name>.all_stats.xlsx': all statistics in one file
    - '<log name>.aggregate_stats.xlsx':
      - sheet 'group_by_method': statistics grouped by method
      - sheet 'group_by_driver': statistics grouped by method and driver
      - sheet 'group_by_rank': statistics grouped by rank, method and driver

    Parameters
    ----------
    memory_log_file_name: str
        The name of the input memory stat log file, without suffix for the master process and
        with a suffix '_<processId>' per slave process
    """

# Compute all aggregates
try:
df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank = (
compute_aggregates(memory_log_file_name)
)
except Exception as error:
print("Error in IO analysis of log file " + memory_log_file_name + ":", error)
return

# Writes the global statistics in a separate excel file
dir_name = os.path.dirname(memory_log_file_name)
file_name = os.path.basename(memory_log_file_name)
stats_file_name = os.path.splitext(file_name)[0] + ".all_stats.xlsx"
with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
df_stats.to_excel(writer)

# Writes aggregate statistics in excel sheets
stats_file_name = os.path.splitext(file_name)[0] + ".aggregate_stats.xlsx"
print("Save aggregate statistics file " + stats_file_name)
with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
df_groupby_method.to_excel(writer, sheet_name="group_by_method")
df_groupby_driver.to_excel(writer, sheet_name="group_by_driver")
df_groupby_rank.to_excel(writer, sheet_name="group_by_rank")
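
The label parsing in `extract_driver_method` above expects labels of the form `driver [<driver name>] <method> Begin|End`. A standalone check of the slicing logic on a made-up label:

```python
# Made-up example label, following the format parsed by extract_driver_method
label = "driver [ANSI driver] fread Begin"

offset_s = label.find("[")
offset_e = label.find("]")
offset_be = label.find("Begin")
if offset_be == -1:
    offset_be = label.find("End")

driver_type = label[offset_s + 1 : offset_e]       # text between the brackets
method_name = label[offset_e + 2 : offset_be - 1]  # word between "] " and " Begin"
start_end = label[offset_be:]                      # "Begin" or "End"
```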