Add MemoryStatsVisualizer tool to Khiops repo (#237)
The MemoryStatsVisualizer tool visualizes the memory traces generated by the Khiops binaries.

This tool is now managed in the Khiops repo, in the test\MemoryStatsVisualizer directory.
It consists of user documentation (README.md), Python scripts, and usage examples.

Port of the existing tool, with a few minimal cleanup operations:
- minimal code cleanup
- handling of the PEP8 warnings reported by PyCharm (most of them)
- writing of README.md
- full test
  - production of the memory logs for the samples, using the kht_test script
  - exploitation of the memory logs with all the MemoryStatsVisualizer tools
marcboulle authored Apr 15, 2024
1 parent c76df9a commit 9591132
Showing 14 changed files with 16,151 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -21,6 +21,12 @@ CMakeFiles/
test/**/results/
comparisonResults.log

# Visualization file and summaries produced by the MemoryStatsVisualizer tool
KhiopsMemoryStats.html
KhiopsMemoryStats.Summary.txt
KhiopsMemoryStats.aggregate_stats.xlsx
KhiopsMemoryStats.all_stats.xlsx

# Python
*.pyc
__pycache__/
5,610 changes: 5,610 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats.log

Large diffs are not rendered by default.

1,668 changes: 1,668 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_1.log

Large diffs are not rendered by default.

1,679 changes: 1,679 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_2.log

Large diffs are not rendered by default.

396 changes: 396 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_3.log

Large diffs are not rendered by default.

382 changes: 382 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_4.log

Large diffs are not rendered by default.

312 changes: 312 additions & 0 deletions test/MemoryStatsVisualizer/AdultParallelSample/KhiopsMemoryStats_5.log

Large diffs are not rendered by default.

4,555 changes: 4,555 additions & 0 deletions test/MemoryStatsVisualizer/AdultSample/KhiopsMemoryStats.log

Large diffs are not rendered by default.

108 changes: 108 additions & 0 deletions test/MemoryStatsVisualizer/README.md
@@ -0,0 +1,108 @@
# Memory stats visualizer

Visualizer for the memory stats logs produced by the Khiops tool.

It takes log files as input and visualizes them in a browser, with one curve per memory statistic and summary bars per process.
It also builds statistical summaries of the logs as tabular text or Excel files.

## Producing memory stats logs

Khiops binaries can produce logs that summarise resource consumption for cores, CPU, memory and I/O.

The logs are produced on demand, depending on environment variables.
These environment variables can be displayed using the kht_env script of the LearningTestTool:
- KhiopsMemStatsLogFileName: None, memory stats log file name
- KhiopsMemStatsLogFrequency: None, frequency of allocator stats collection (0, 100000, 1000000,...)
- KhiopsMemStatsLogToCollect: None, stats to collect (8193: only time and labels, 16383: all,...)
- KhiopsIOTraceMode: None, to collect IO trace (false, true)

The values usable for KhiopsMemStatsLogToCollect are detailed in the appendix at the end of this document.

Warning: KhiopsIOTraceMode produces a lot of logs: use it only when necessary.


## Samples of stats logs

Log directories:
- AdultSample: logs obtained in sequential mode for the Adult dataset
- AdultParallelSample: logs obtained in parallel mode for the Adult dataset
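
In the parallel sample, the master process writes to the log file named by KhiopsMemStatsLogFileName, and each slave process writes to a file with a `_<processId>` suffix (for example `KhiopsMemoryStats_1.log`). As a minimal sketch of that naming convention, the helpers below rebuild the expected file names; `build_process_log_name` and `expected_log_names` are hypothetical names used for illustration, not functions of the tool.

```python
import os


def build_process_log_name(master_log_name, process_id):
    """Hypothetical helper: log file name for one process (0 is the master)."""
    if process_id == 0:
        # The master process uses the file name as given
        return master_log_name
    # Slave processes insert '_<processId>' before the extension
    root, ext = os.path.splitext(master_log_name)
    return f"{root}_{process_id}{ext}"


def expected_log_names(master_log_name, process_number):
    """List the log file names expected for a run with process_number processes."""
    return [build_process_log_name(master_log_name, i) for i in range(process_number)]


# A run with 6 processes: one master and five slaves
names = expected_log_names("KhiopsMemoryStats.log", 6)
print(names)
```

With 6 processes this gives the master file plus `KhiopsMemoryStats_1.log` to `KhiopsMemoryStats_5.log`, which matches the files in the AdultParallelSample directory.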

Producing the logs using kht_test:
- have the LearningTestTool scripts in the path
- go to the MemoryStatsVisualizer directory
- set the environment variables and run the test in sequential mode
(under Windows in the example below)
~~~~
set KhiopsMemStatsLogFileName=%cd%\AdultSample\KhiopsMemoryStats.log
set KhiopsMemStatsLogFrequency=100000
set KhiopsMemStatsLogToCollect=16383
set KhiopsIOTraceMode=true
kht_test ..\LearningTest\TestKhiops\Standard\Adult r
~~~~
- same in parallel mode, changing only the log file name
~~~~
set KhiopsMemStatsLogFileName=%cd%\AdultParallelSample\KhiopsMemoryStats.log
kht_test ..\LearningTest\TestKhiops\Standard\Adult r -p 6
~~~~
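
The same settings can also be prepared from Python rather than with `set` commands. The sketch below only builds the environment dictionary; the helper name `build_memory_stats_env` and its default values are assumptions made for illustration, and the launch of kht_test is left as a comment because it depends on the local installation.

```python
import os


def build_memory_stats_env(log_file_name, frequency=100000, to_collect=16383,
                           io_trace=False, base_env=None):
    """Hypothetical helper: environment with the Khiops memory stats variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # An absolute path avoids depending on the working directory of the test
    env["KhiopsMemStatsLogFileName"] = os.path.abspath(log_file_name)
    env["KhiopsMemStatsLogFrequency"] = str(frequency)
    env["KhiopsMemStatsLogToCollect"] = str(to_collect)
    env["KhiopsIOTraceMode"] = "true" if io_trace else "false"
    return env


env = build_memory_stats_env(
    os.path.join("AdultSample", "KhiopsMemoryStats.log"), io_trace=True
)
for name in sorted(k for k in env if k.startswith("Khiops")):
    print(name, "=", env[name])

# To launch the test with these settings (assumes kht_test is in the path;
# on Windows, shell=True may be needed if kht_test is a command script):
# import subprocess
# subprocess.run(["kht_test", r"..\LearningTest\TestKhiops\Standard\Adult", "r"], env=env)
```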

## Exploiting memory stats logs

Once the logs are available, the memory stats visualizer Python scripts can be used to
produce a synthetic visualization and summaries of the logs:
- memory_stats_visualizer: build a synthetic visualization of the logs and open it in a browser
- collect_stats_by_task: build a summary text file with stats per task
- compute_and_write_aggregates: build summary Excel files with aggregated I/O stats

See sample.py for an example of the use of each python script.

This is an interactive display tool:
- widgets are available in the top right-hand corner, for example to zoom in on a sub-section of the display
- the legend on the right can be used to select what needs to be viewed and the statistics to be displayed
- tooltips are available, showing detailed statistics for each selected part of the display
- ...


#### Appendix: detailed settings for the logs to be collected

In the following example, "LogTime=1 : Time (*time associated with the log*)":
- LogTime: identifier
- 1: value to use for the environment variable KhiopsMemStatsLogToCollect
- Time: label of the column in the file containing the logs (cf. KhiopsMemStatsLogFileName)
- *time associated with the log*: description of the statistic

The following constants are used to select the statistics to be collected:
- LogTime=1 : Time (*time associated with the log*)
- HeapMemory=2 : Heap mem (*current size of the heap*)
- MaxHeapRequestedMemory=4 : Max heap mem (*maximum size requested for the heap*)
- TotalHeapRequestedMemory=8 : Total heap mem (*total cumulative size requested for the heap*)
- AllocNumber=16 : Alloc (*current number of allocations*)
- MaxAllocNumber=32 : Max alloc (*maximum number of allocations*)
- TotalAllocNumber=64 : Total alloc (*total number of allocations*)
- TotalFreeNumber=128 : Total free (*total number of memory frees*)
- GrantedSize=256 : Granted (*current size allocated*)
- MaxGrantedSize=512 : Max granted (*maximum size allocated*)
- TotalRequestedSize=1024 : Total requested size
- TotalGrantedSize=2048 : Total granted size
- TotalFreeSize=4096 : Total free size (*total size freed*)
- LogLabel=8192 : Label (*user label*)

To collect several statistics simultaneously, the corresponding constants must be added together:
- NoStats=0 (*no statistics*)
- AllStats=16383 (*all statistics*)
- Some potentially useful combinations
  - LogInfo=8193 (*info per log*)
    - LogTime + LogLabel
  - HeapStats=14 (*statistics on the heap*)
    - HeapMemory + MaxHeapRequestedMemory + TotalHeapRequestedMemory
  - AllocStats=8176 (*statistics on allocations*)
    - AllStats - LogInfo - HeapStats
  - AllocCurrentStats=272 (*statistics on current allocations*)
    - AllocNumber + GrantedSize
  - AllocMaxStats=544 (*statistics on max allocations*)
    - MaxAllocNumber + MaxGrantedSize
  - AllocNumberStats=240 (*statistics on the number of allocations*)
    - AllocNumber + MaxAllocNumber + TotalAllocNumber + TotalFreeNumber
  - AllocSizeStats=7936 (*statistics on allocation sizes*)
    - GrantedSize + MaxGrantedSize + TotalRequestedSize + TotalGrantedSize +

Reference: src\Norm\base\MemoryStatsManager.h
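
Since each constant is a distinct power of two, a value of KhiopsMemStatsLogToCollect is a bit mask. As a minimal sketch, the Python enum below mirrors the constants listed in this appendix to compose a value and to decode an existing one; the class name `MemStats` and the function `selected_stats` are hypothetical and are not part of the Khiops sources.

```python
from enum import IntFlag


class MemStats(IntFlag):
    """Hypothetical Python mirror of the constants listed in the appendix."""
    LogTime = 1
    HeapMemory = 2
    MaxHeapRequestedMemory = 4
    TotalHeapRequestedMemory = 8
    AllocNumber = 16
    MaxAllocNumber = 32
    TotalAllocNumber = 64
    TotalFreeNumber = 128
    GrantedSize = 256
    MaxGrantedSize = 512
    TotalRequestedSize = 1024
    TotalGrantedSize = 2048
    TotalFreeSize = 4096
    LogLabel = 8192


def selected_stats(value):
    """Return the names of the statistics enabled in a KhiopsMemStatsLogToCollect value."""
    return [flag.name for flag in MemStats if value & flag]


# Compose a value: adding distinct powers of two is the same as a bitwise OR
heap_stats = (
    MemStats.HeapMemory
    | MemStats.MaxHeapRequestedMemory
    | MemStats.TotalHeapRequestedMemory
)
all_stats = sum(int(flag) for flag in MemStats)

print(int(heap_stats))       # value for HeapStats
print(all_stats)             # value for AllStats
print(selected_stats(8193))  # statistics enabled by LogInfo
```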
209 changes: 209 additions & 0 deletions test/MemoryStatsVisualizer/collect_stat_io.py
@@ -0,0 +1,209 @@
import os
import re
import pandas as pd
import plotly.io as pio
import utils


"""
Aggregation of the memory stats logs produced by the Khiops tool
It takes a log file as input and aggregates I/O times.
"""


def extract_driver_method(label):
    """Extract driver type, method Name, and Start/End"""
    driver_type = ""
    method_name = ""
    start_end = ""
    if label.find("driver") == 0:
        offset_s = label.find("[")
        offset_e = label.find("]")
        offset_be = label.find("Begin")
        if offset_be == -1:
            offset_be = label.find("End")
        assert offset_be != -1
        driver_type = label[offset_s + 1 : offset_e]
        method_name = label[offset_e + 2 : offset_be - 1]
        start_end = label[offset_be:]

        # Rename driver type in a more concise way
        schemes = ("hdfs", "ansi", "s3")
        for scheme in schemes:
            if driver_type.lower().find(scheme) != -1:
                driver_type = scheme
                break
    return driver_type, method_name, start_end


def collect_stat_io(file_name):
    """Collect I/O stats: duration of each driver method call, per process"""
    stats = []

    # Defines offset collected stats per task
    (
        ID,
        SLAVES,
        START,
        TIME,
        PROCESS_NB,
        READ_TIME,
        WRITE_TIME,
        READ,
        WRITE,
        ALLOC,
        GRANTED,
        NAME,
    ) = (
        0,
        1,
        2,
        3,
        4,
        5,
        6,
        7,
        8,
        9,
        10,
        11,
    )  # Keys for tuple fields in annotation families

    # Count the number of processes, based on existing files with slave extensions
    process_number = utils.get_process_number(file_name)

    # Analyse stat files for all processes to collect stats per task (and per inter-task for the master)
    for process_id in range(process_number):
        process_file_name = utils.build_process_filename(file_name, process_id)

        # Read data file
        data = pd.read_csv(process_file_name, delimiter="\t")

        data_time = data["Time"]
        data_label = data["Label"]
        stats_size = len(data_time)
        start_time = 0
        for i in range(stats_size):
            label = data_label[i]

            if isinstance(label, str) and label.find("driver") == 0:
                driver_type, method_name, start_or_end = extract_driver_method(label)
                if start_or_end == "Begin":
                    assert start_time == 0
                    start_time = data_time[i]

                if start_or_end == "End":
                    duration = data_time[i] - start_time
                    start_time = 0
                    stats.append([str(process_id), driver_type, method_name, duration])
    return pd.DataFrame(stats, columns=("rank", "driver", "method", "time"))


def compute_aggregates(file_name):
    """Load data from files
    ----------
    Parameters
    ----------
    file_name: str
        name of the input memory stat log file, without suffix for the master process and
        with a suffix '_<processId>' per slave process
    Returns
    ----------
    4 DataFrames :
        - all statistics
        - statistics grouped by method
        - statistics grouped by method and driver
        - statistics grouped by rank, method and driver
    """
    # Loading data
    df_stats = collect_stat_io(file_name)

    if df_stats.empty:
        raise Exception("no I/O statistics to collect")
    else:
        # Aggregation by rank, method and driver
        df_groupby_rank = df_stats.groupby(["rank", "method", "driver"]).agg(
            min=("time", "min"),
            max=("time", "max"),
            mean=("time", "mean"),
            median=("time", "median"),
            stddev=("time", "std"),
            sum=("time", "sum"),
            count=("time", "count"),
        )

        # Aggregation by method and driver
        df_groupby_driver = df_stats.groupby(["method", "driver"]).agg(
            min=("time", "min"),
            max=("time", "max"),
            mean=("time", "mean"),
            median=("time", "median"),
            stddev=("time", "std"),
            sum=("time", "sum"),
            count=("time", "count"),
        )

        # Aggregation by method
        df_groupby_method = df_stats.groupby(["method"]).agg(
            min=("time", "min"),
            max=("time", "max"),
            mean=("time", "mean"),
            median=("time", "median"),
            stddev=("time", "std"),
            sum=("time", "sum"),
            count=("time", "count"),
        )

        df_groupby_rank.sort_values(by=["driver", "method"], inplace=True)
        df_groupby_driver.sort_values(by=["driver", "method"], inplace=True)
        df_groupby_method.sort_values(by=["method"], inplace=True)

        return df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank


def compute_and_write_aggregates(memory_log_file_name):
    """Load data from files and write results in Excel files
    ----------
    Compute and write 4 DataFrames on disk :
        - 'all_stats.xlsx' : all statistics in one file
        - 'aggregate_stats.xlsx' :
            - sheet 'group_by_method' : statistics grouped by method
            - sheet 'group_by_driver' : statistics grouped by method and driver
            - sheet 'group_by_rank' : statistics grouped by rank, method and driver
    Parameters
    ----------
    memory_log_file_name: str
        The name of the input memory stat log file, without suffix for the master process and
        with a suffix '_<processId>' per slave process
    """

    # Compute all aggregates
    try:
        df_stats, df_groupby_method, df_groupby_driver, df_groupby_rank = (
            compute_aggregates(memory_log_file_name)
        )
    except Exception as error:
        print("Error in IO analysis of log file " + memory_log_file_name + ":", error)
        return

    # Write the global statistics in a separate Excel file
    dir_name = os.path.dirname(memory_log_file_name)
    file_name = os.path.basename(memory_log_file_name)
    stats_file_name = os.path.splitext(file_name)[0] + ".all_stats.xlsx"
    with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
        df_stats.to_excel(writer)

    # Write aggregate statistics in Excel sheets
    stats_file_name = os.path.splitext(file_name)[0] + ".aggregate_stats.xlsx"
    print("Save aggregate statistics file " + stats_file_name)
    with pd.ExcelWriter(os.path.join(dir_name, stats_file_name)) as writer:
        df_groupby_method.to_excel(writer, sheet_name="group_by_method")
        df_groupby_driver.to_excel(writer, sheet_name="group_by_driver")
        df_groupby_rank.to_excel(writer, sheet_name="group_by_rank")
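
The script above pairs each "Begin" log line with the following "End" line to get a duration, then summarizes the durations with pandas named aggregation. The sketch below reproduces that idea on a small synthetic log. The label layout `driver [<type>] <method> Begin|End` is an assumption inferred from the parsing code, and the rows, times and the function name `pair_durations` are invented for illustration.

```python
import pandas as pd

# Synthetic log rows (assumed label layout: "driver [<type>] <method> Begin|End")
log = pd.DataFrame(
    {
        "Time": [1.0, 1.5, 2.0, 2.25, 3.0, 3.5],
        "Label": [
            "driver [ansi] Open Begin",
            "driver [ansi] Open End",
            "driver [ansi] Read Begin",
            "driver [ansi] Read End",
            "driver [ansi] Read Begin",
            "driver [ansi] Read End",
        ],
    }
)


def pair_durations(log):
    """Pair each Begin line with the next End line and return one row per call."""
    durations = []
    start_time = None
    for time, label in zip(log["Time"], log["Label"]):
        if not isinstance(label, str) or not label.startswith("driver"):
            continue
        # Text after "] " is "<method> <marker>"; split on the last space
        method, _, marker = label[label.find("]") + 2 :].rpartition(" ")
        if marker == "Begin":
            start_time = time
        elif marker == "End" and start_time is not None:
            durations.append((method, time - start_time))
            start_time = None
    return pd.DataFrame(durations, columns=["method", "time"])


# Named aggregation: each keyword becomes an output column
summary = pair_durations(log).groupby("method").agg(
    sum=("time", "sum"),
    count=("time", "count"),
)
print(summary)
```

Unlike the script, this sketch uses `None` rather than `0` as the "no call in progress" marker, so that a call starting at time 0 is still handled.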