Use extract_delegate_segments
Summary:
Before this diff, exporting the 7B Llama model with XNNPACK was failing during flatc serialization because of a too-large offset.

I noticed that the `to_executorch()` call wasn't enabling the `extract_delegate_segments` flag, which meant:
- The gigabytes of delegate data were being inlined into the intermediate .json file used for flatc serialization. In this case, that added up to 15GiB of ASCII numbers, accounting for 99.98% of the overall .json file's size. Parsing it caused `flatc` to consume an enormous amount of memory and ultimately fail because it couldn't handle such a large array.
- At runtime, when `XnnpackBackend::init()` called `processed->Free()`, the data couldn't actually be freed, dramatically increasing peak memory use during execution.

Before setting this flag, exporting the model on my devvm failed after ~43 minutes. With the flag set, the export succeeded in 19 minutes.
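For context, here is a minimal sketch of what enabling the flag looks like in an end-to-end export flow. This is illustrative only: TinyModel, the example inputs, and the output filename are stand-ins rather than the actual llama2 export code, and import paths may differ across ExecuTorch versions.

import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import ExecutorchBackendConfig, to_edge

class TinyModel(torch.nn.Module):  # stand-in for the real llama model
    def forward(self, x):
        return torch.nn.functional.relu(x)

example_inputs = (torch.randn(1, 8),)
edge_manager = to_edge(torch.export.export(TinyModel(), example_inputs))
edge_manager = edge_manager.to_backend(XnnpackPartitioner())

export_program = edge_manager.to_executorch(
    ExecutorchBackendConfig(
        extract_constant_segment=True,
        # Store delegate blobs as separate segments instead of inlining
        # them into the flatbuffer, so serialization stays small and the
        # runtime can free the data after the backend consumes it.
        extract_delegate_segments=True,
    )
)

with open("model_xnnpack.pte", "wb") as f:
    f.write(export_program.buffer)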

Reviewed By: kimishpatel

Differential Revision: D53738848

fbshipit-source-id: a47cc46f2941e74693487aad12621a62b003a3be
dbort authored and facebook-github-bot committed Feb 14, 2024
1 parent d8d60a0 commit c4e0185
Showing 1 changed file with 1 addition and 0 deletions.
examples/models/llama2/export_llama_lib.py

@@ -398,6 +398,7 @@ def _export_llama(modelname, args) -> str:  # noqa: C901
     export_program = edge_manager.to_executorch(
         ExecutorchBackendConfig(
             extract_constant_segment=True,
+            extract_delegate_segments=True,
             passes=[
                 QuantFusionPass(),
             ],
