Use Output Ports of an Operator to Write Storage #3295

Xiao-zhen-Liu · 2025-03-02T06:48:56Z

Currently, the Amber engine creates and uses sink operators to write the results of an operator's output ports. This design causes many problems, mainly because it alters the physical plan. Ideally, a physical plan should not be changed once it is compiled.

This PR updates the design to use output ports of an operator instead of sink operators to write port results to storage. The changes implemented in this PR include:

The logic to add sink operators during compilation and scheduling is not removed in this PR. However, all the execution logic in the sink operator is removed, so a sink operator will not do anything during execution. We will remove sink operators in the next PR.
GlobalPortIdentity is moved from Region to workflow-core so that it is accessible by all the modules.
The compiler no longer creates storage objects. Instead, it produces the set of GlobalPortIdentity values whose results need to be viewed. This set is passed along to the scheduler as part of WorkflowContext.workflowSettings. In the future, this information will be produced directly by the frontend instead of by the compiler.
For each region, the scheduler combines the ports whose results need to be viewed with the materialized ports that the scheduler itself needs. Ideally this information should be fixed once a region is created by the ScheduleGenerator. However, as the physical plan currently still needs to be changed (because of additional cache read operators), additional logic is implemented in the ScheduleGenerator to make sure this information is correct for all the regions.
The scheduler and resource allocator do not create storage objects directly; they only assign storage URIs to each region as part of resourceConfig.
When a region is scheduled to execute, these URIs are used to create storage objects during the region's initialization.
AssignPortRequest is used to indicate whether an output port of a worker needs storage and to pass the storage URI information to the worker. Note that this request is used for both input and output ports, but this PR only updates output ports. As a result, empty storage URIs are provided in AssignPortRequest for input ports. In the future, once input ports are also used to read storage, we will update and use these storage URIs as well.
Note that because operators with dependee inputs currently belong to multiple regions, and AssignPortRequest is only sent once for each worker, I had to implement additional logic to make sure all the regions that such an operator belongs to have the proper storage information. Specifically, the output port connecting a dependee link belongs to two regions, and both regions need to have the storageURI for this port.
Inside a worker (both Java and Python), the OutputManager is used to create a writer thread for each output port that needs storage. Writing does not block the data processor, but the operator/port is not marked as completed until its writer has finished.
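
The writer-thread behavior described in the last item can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names SketchOutputManager, PortResultWriter, and STORAGE are hypothetical, the in-memory dict stands in for the real storage layer, and an empty storage URI is assumed to mean "this port needs no storage", mirroring how AssignPortRequest is described above.

```python
import queue
import threading
from typing import Dict, List

# Stand-in for the storage layer (assumption): maps a storage URI to written tuples.
STORAGE: Dict[str, List[tuple]] = {}

_STOP = object()  # sentinel that tells a writer thread to finish


class PortResultWriter(threading.Thread):
    """Hypothetical writer thread that drains a queue into one storage URI."""

    def __init__(self, storage_uri: str) -> None:
        super().__init__(daemon=True)
        self._queue: queue.Queue = queue.Queue()  # unbounded, so put() never blocks
        self._sink = STORAGE.setdefault(storage_uri, [])

    def put(self, tuple_: tuple) -> None:
        # Called by the data processor; returns immediately.
        self._queue.put(tuple_)

    def run(self) -> None:
        while True:
            item = self._queue.get()
            if item is _STOP:
                return
            self._sink.append(item)

    def close(self) -> None:
        # Called on port completion; waits until every queued tuple is written.
        self._queue.put(_STOP)
        self.join()


class SketchOutputManager:
    """Hypothetical manager that owns one writer per output port needing storage."""

    def __init__(self) -> None:
        self._writers: Dict[str, PortResultWriter] = {}

    def assign_port(self, port_id: str, storage_uri: str) -> None:
        if storage_uri:  # an empty URI means no storage for this port
            writer = PortResultWriter(storage_uri)
            writer.start()
            self._writers[port_id] = writer

    def save_tuple(self, port_id: str, tuple_: tuple) -> None:
        writer = self._writers.get(port_id)
        if writer is not None:
            writer.put(tuple_)

    def complete_port(self, port_id: str) -> None:
        writer = self._writers.pop(port_id, None)
        if writer is not None:
            writer.close()  # completion is held back until writing finishes
```

In this sketch, save_tuple only enqueues, so the caller is never slowed by storage, while complete_port joins the writer thread, which is the one place where completion waits on writing.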

TODOs:

Completely remove sink op
Use GlobalPortIdentity for storage URIs
Remove the mention of "result" in the storage layer
Let the frontend specify view results on the port level
Remove cache read op and use input port for reading storage

Yicong-Huang

In general looks good! left code comments.

core/amber/src/main/python/core/architecture/packaging/output_manager.py

core/amber/src/main/python/core/storage/runnables/port_result_writer.py

...mber/src/main/scala/edu/uci/ics/amber/engine/architecture/scheduling/ScheduleGenerator.scala

...la/edu/uci/ics/amber/engine/architecture/scheduling/resourcePolicies/ResourceAllocator.scala

...ala/edu/uci/ics/amber/engine/architecture/worker/managers/OutputPortResultWriterThread.scala

core/amber/src/main/scala/edu/uci/ics/texera/web/service/ExecutionResultService.scala

core/amber/src/main/scala/edu/uci/ics/texera/workflow/WorkflowCompiler.scala

core/amber/src/main/scala/edu/uci/ics/amber/engine/architecture/scheduling/Region.scala

...la/edu/uci/ics/amber/engine/architecture/scheduling/resourcePolicies/ResourceAllocator.scala

Yicong-Huang

LGTM!

core/amber/src/main/python/core/architecture/packaging/output_manager.py

...mber/src/main/scala/edu/uci/ics/amber/engine/architecture/messaginglayer/OutputManager.scala

...ala/edu/uci/ics/texera/web/resource/dashboard/user/workflow/WorkflowExecutionsResource.scala

core/amber/src/main/scala/edu/uci/ics/texera/web/service/ExecutionResultService.scala

core/amber/src/main/scala/edu/uci/ics/texera/workflow/WorkflowCompiler.scala

core/workflow-core/src/main/scala/edu/uci/ics/amber/core/storage/DocumentFactory.scala

Xiao-zhen-Liu added 30 commits February 4, 2025 23:37

java works.

758096b

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

a932abf

Update python proto.

f747691

Python works.

d4f3535

remove sinkOpExec actions.

48937bf

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

f29523e

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

10e6d75

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

8b8e117

Add async result writing on Java.

83797ed

Add async result writing on Java.

b375367

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

edf211c

Merge branch 'master' into xiaozhen-use-output-port-for-storage

254e283

Add threaded port result writer in python.

104885d

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

a6302fc

Temp with materialized port storage.

77d9fc4

Modify psql scripts and joop generated code.

6b459d5

Ensure the position of new column; use layer_name instead of layer_id.

8c65ea8

Working version with a fix on sinkOp.

52994ca

Fix test.

9ea150d

Merge branch 'master' into xiaozhen-phy-op-id-for-storage

de79dd2

Fix fmt.

06eed21

Fix.

304e70c

Merge branch 'master' into xiaozhen-phy-op-id-for-storage

13dc90c

Move storage object creation to resource allocation.

b9b26fa

Remove sink operator generation.

a356938

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

5073b04

Merge branch 'master' into xiaozhen-phy-op-id-for-storage

4cef39c

Xiao-zhen-Liu added 2 commits March 1, 2025 22:49

Merge branch 'master' into xiaozhen-use-output-port-for-storage

3abe8c6

Refactor.

19b2795

Xiao-zhen-Liu self-assigned this Mar 2, 2025

Xiao-zhen-Liu added the engine and refactoring labels Mar 2, 2025

Xiao-zhen-Liu requested a review from Yicong-Huang March 2, 2025 19:46

Xiao-zhen-Liu added 5 commits March 3, 2025 12:07

java test fix

b69d687

fix ExpansionGreedyScheduleGenerator

b975b38

fix dev mode

a59cdb8

Fix circular python module dependency.

0bd4a0d

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

905d4a0

Yicong-Huang requested changes Mar 4, 2025

Xiao-zhen-Liu added 5 commits March 4, 2025 11:56

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

e7fec23

python minor comments.

0151a7f

Scala minor comments.

d2887a8

Merge branch 'refs/heads/master' into xiaozhen-use-output-port-for-storage

291dbe0

fmt.

01a570c

Yicong-Huang reviewed Mar 5, 2025

core/amber/src/main/scala/edu/uci/ics/amber/engine/architecture/scheduling/Region.scala (outdated, resolved)

Yicong-Huang reviewed Mar 5, 2025

...la/edu/uci/ics/amber/engine/architecture/scheduling/resourcePolicies/ResourceAllocator.scala (outdated, resolved)

Xiao-zhen-Liu added 5 commits March 5, 2025 11:23

Python main changes.

d69a6b0

Scala writer thread changes.

a3df014

Address comments.

d10316b

Address comments.

7bdaa3c

Fix.

c78591d

Xiao-zhen-Liu requested a review from Yicong-Huang March 6, 2025 00:04

Yicong-Huang approved these changes Mar 6, 2025

Xiao-zhen-Liu added 3 commits March 7, 2025 10:01

Minor refactoring.

aade79a

Renaming.

e395b70

Refactor ResourceConfig.

5d293c6

Xiao-zhen-Liu requested a review from shengquan-ni March 7, 2025 18:22
