migrate runtime to modern ET libraries (pytorch#2994)
Summary:
Pull Request resolved: pytorch#2994

## Overview
Replaced our home-grown logic with methods from the ET libraries.
- The model and input flatbuffers are migrated to a bundled program flatbuffer (.bpte).
- Jarvis memory allocation in the runtime is migrated to the executorch memory manager, with memory defined by executorch Span.
- Input memory allocation is migrated to method-based data pointer assignment.
- Output and debug buffers are **partially** migrated to ETDump.
- Model output validation is **partially** migrated to method-based verification in the bundled program.

## Input flow:
- Take the edge program manager.
- Build test suites from methods. Only the forward method is used, and it is hardcoded.
- Build the bundled program.
- Serialize the bundled program and store it in the flatbuffer.

## Output flow:
- A bundled program is loaded from the serialized flat buffer
- The program is executed on a selected backend.
- The output is generated.
- Validation: compare the expected output with the actual output by (1) the original Jarvis compare method (ENABLED) and (2) the method-based VerifyResultWithBundledExpectedOutput (DISABLED).
- **Note**: the sink flow was reverted back to a series of .npy output files, unflattened by `torch.utils._pytree.tree_unflatten`, to re-enable legacy tests. ET/Bolt adopted a new flow that saves outputs as `.bin` and loads them with `np.fromfile`. ETDump gets output from the debug buffer. **These will be investigated in stage 2.**
TODO: T185104750 T185106115
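
One practical difference between the two sink formats is that a raw `.bin` dump carries no dtype or shape, so the loader has to know both. The sketch below illustrates this with the standard-library `array` module as a stand-in for `np.fromfile`; the file name, values, and shape are assumptions for illustration, not taken from the Jarvis or Bolt flows.

```python
import array
import os
import tempfile

# Hypothetical output tensor: six float32 values with logical shape (2, 3).
values = array.array("f", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
shape = (2, 3)  # must be tracked separately; the raw file does not store it

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.bin")

    # Dump the raw bytes, with no header of any kind.
    with open(path, "wb") as f:
        values.tofile(f)

    # The reader must already know the element type to compute the count.
    count = os.path.getsize(path) // values.itemsize
    loaded = array.array("f")
    with open(path, "rb") as f:
        loaded.fromfile(f, count)

# Rebuild the 2-D layout from the externally supplied shape.
rows = [
    loaded[r * shape[1] : (r + 1) * shape[1]].tolist() for r in range(shape[0])
]
```

A `.npy` file, by contrast, stores dtype and shape in its header, which is one reason the legacy tests could load it without extra metadata.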

## Memory Allocation
Re-enabled Jarvis custom memory planning and added support for running on different backends (e.g. HIFI4).
- Enabled alloc_graph_input and alloc_graph_output.
- Defined memory in torch::Span.
- **Note**: alloc_graph_output uses deprecated ET APIs: set_data(), mutable_data_ptr(). It hits a memory misalignment issue when migrated to the new flow. **This will be investigated in stage 2.**
TODO: T185104439
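
As a conceptual model only, planned memory can be pictured as a set of pre-sized pools in which each tensor is located by a pool id and a byte offset. The sketch below is a toy Python stand-in, not the ExecuTorch C++ API: `PlannedBuffers`, `view`, and `is_aligned` are hypothetical names, and the pool sizes and alignment value are assumptions.

```python
from typing import List


class PlannedBuffers:
    """Toy stand-in for a set of pre-sized memory pools (not an ExecuTorch API)."""

    def __init__(self, sizes: List[int]) -> None:
        # One zero-initialized byte buffer per planned pool.
        self._pools = [bytearray(size) for size in sizes]

    def view(self, mem_id: int, offset: int, nbytes: int) -> memoryview:
        """Resolve (mem_id, offset, nbytes) to a writable view into one pool."""
        if not 0 <= mem_id < len(self._pools):
            raise IndexError(f"unknown pool {mem_id}")
        pool = self._pools[mem_id]
        if offset < 0 or nbytes < 0 or offset + nbytes > len(pool):
            raise ValueError("planned range exceeds pool size")
        return memoryview(pool)[offset : offset + nbytes]


def is_aligned(offset: int, alignment: int) -> bool:
    """Check that a planned offset satisfies the required byte alignment."""
    return offset % alignment == 0


pools = PlannedBuffers([64, 128])

# Place a hypothetical 8-byte graph output at offset 16 of pool 1.
out = pools.view(mem_id=1, offset=16, nbytes=8)
out[:] = bytes(range(8))  # writes go straight into the pool, no copy
```

A check like `is_aligned` is the kind of guard that would flag a misaligned output offset before the runtime dereferences it.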

## Output Validation
Verify output with `torch::executor::bundled_program::VerifyResultWithBundledExpectedOutput`. This is currently a dummy validation for quantized tests, which need a high rtol, so their error threshold is set to an arbitrary large value (e.g. 1e5 or 1e7). **This will be investigated in stage 2.**
TODO: T180249993 T185104615 T185104862
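
To see why such large thresholds make the check a dummy, here is a minimal sketch of an element-wise tolerance test. It assumes the `|actual - expected| <= atol + rtol * |expected|` rule used by `numpy.allclose` and `torch.allclose`; whether `VerifyResultWithBundledExpectedOutput` applies exactly this rule is an assumption, and `within_tolerance` is a hypothetical helper.

```python
from typing import Sequence


def within_tolerance(
    actual: Sequence[float],
    expected: Sequence[float],
    rtol: float,
    atol: float,
) -> bool:
    """Return True if every element satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


expected = [1.0, -2.0, 3.0]
wrong = [100.0, 50.0, -75.0]  # clearly not the expected output

# A tight threshold rejects the wrong output.
tight = within_tolerance(wrong, expected, rtol=1e-3, atol=1e-5)

# A threshold such as 1e5 accepts it, so the check no longer proves anything.
loose = within_tolerance(wrong, expected, rtol=1e5, atol=1e5)
```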

# Design
Major design decisions (ADR).
## Method 1 [ADOPTED]
Modify executor.cpp to consume a bundled_program flatbuffer and execute on a different BUCK host.
- Pros: maximum reuse of the existing configuration for custom Jarvis ops.
- Cons: runtime performance cost from starting a new host.

## Method 2 [ABANDONED]
Use ET pybinding APIs to consume the bundled program as an input and execute it in the runtime.
- Pros: all ET APIs are encapsulated in Python, which fits well with the existing infrastructure.
- Cons: poor extensibility, as the backend is static (CPU) at start-up and cannot be switched on the fly.
- Cons: custom ops are missing in the runtime on the same BUCK host, so dependencies have to be duplicated and hardcoded.

# Progress
Program Ingestion (Input)
- [x] POC run of aten_relu_out and quantized_linear_out
- [x] Obtain Jarvis custom ops in runtime

Program Sink (Output)
- [x] Get etdump as etdp
- [x] Get Inspector object from etdump
- [x] Get program output from method
- [x] Re-enable scuba profile
- [x] Get debug buffer binary
- [x] enable dump output from etdump
- [x] get output from etdump
- [ ] migrate sink flow to etdump
- [ ] adjust memory config for dump

Verification
- [x] verify_result_with_bundled_expected_output with rtol and atol. A very large rtol and atol are set so that quantized tests pass the validation.
- [x] Compare output with expected_output by original Jarvis compare (RMS)
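
For reference, a root-mean-square comparison can be sketched as below. The exact metric and threshold used by the original Jarvis compare are not stated here, so the plain RMS error and the `rms_error` helper are assumptions for illustration.

```python
import math
from typing import Sequence


def rms_error(actual: Sequence[float], expected: Sequence[float]) -> float:
    """Root-mean-square of the element-wise difference between two outputs."""
    if len(actual) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - e) ** 2 for a, e in zip(actual, expected))
    return math.sqrt(squared / len(expected))


# Hypothetical outputs: only the last element differs.
err = rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
passed = err < 0.5  # the 0.5 threshold is an arbitrary example value
```

Unlike a per-element tolerance, a single RMS value averages the error over the whole output, so one large outlier can be masked by many exact elements.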

Memory Planning
- [x] define memory planning input: MemoryConfig
- [x] understand what ET MemoryManager actually takes
- [x] migrate to ET MemoryManager with three new arguments
- [x] Re-enable alloc_graph_input
- [x] Re-enable alloc_graph_output
- [x] update legacy of HierarchicalAllocator
- [x] Verify that the sizes of the planned buffers are correct

Misc.
- [ ] verify whether the input has been memcpy'd to a custom input buffer in the bundled program when input memory is not allocated. Use set_input
- [ ] investigate whether test suites run serially or, like buck, in parallel
- [ ] investigate output.bin workflow. Bolt as reference.
- [ ] Refactor to reuse module.h, module.cpp, data_module.cpp
- [ ] refactor based on TODO
- [x] clean legacy code

Reviewed By: tarun292, skrtskrtfb, mcremon-meta

Differential Revision: D53870154

fbshipit-source-id: 05efdd48da040f089c0cc65ee7ad5f2cb14be5bd
zonglinpeng authored and facebook-github-bot committed Apr 29, 2024
1 parent 1f7f8c9 commit c992983
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions profiler/parse_profiler_results.py
@@ -434,19 +434,19 @@ def profile_framework_tax_table(
 
 def deserialize_profile_results_files(
     profile_results_path: str,
-    model_ff_path: str,
+    bundled_program_ff_path: str,
     time_scale: TimeScale = TimeScale.TIME_IN_NS,
 ):
     with open(profile_results_path, "rb") as prof_res_file, open(
-        model_ff_path, "rb"
+        bundled_program_ff_path, "rb"
     ) as model_ff_file:
         prof_res_buf = prof_res_file.read()
-        model_ff_buf = model_ff_file.read()
+        bundled_program_ff_buf = model_ff_file.read()
 
     prof_data, mem_allocations = deserialize_profile_results(prof_res_buf, time_scale)
     framework_tax_data = profile_aggregate_framework_tax(prof_data)
 
-    prof_tables = profile_table(prof_data, model_ff_buf)
+    prof_tables = profile_table(prof_data, bundled_program_ff_buf)
     for table in prof_tables:
         print(table)
