Next steps for distributed deployments #535

Open · 3 tasks
phil-opp opened this issue Jun 5, 2024 · 11 comments

phil-opp commented Jun 5, 2024

This issue outlines the steps we need to take to make dora dataflows work across multiple machines:

  • update dora check to skip checking paths on remote machines
    • use the same logic when checking the dataflow in dora start
    • open question: How do we know which machine ID is local and which is remote?
    • alternative: skip path checks completely when the dataflow specifies multiple machine IDs
  • figure out a way to handle relative node paths on remote machines
    • For local dataflows, we use the folder containing the dataflow YAML file as the working directory. This does not work for remote machines since the YAML file is not available there.
    • Option 1: Use the working directory of the daemon by default (i.e., the directory in which the daemon was started)
      • This would be a breaking change.
    • Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
    • Option 3: Configure the working directory for each machine in the YAML file (see the sketch after this list).
    • Other ideas?
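
To make Option 3 a bit more concrete, here is a hypothetical sketch of the resolution logic it would imply. The type, field, and machine names below are invented for illustration and are not existing dora configuration or code: the dataflow YAML would map each machine ID to a working directory, and relative node paths would be resolved against the entry for the machine the node is deployed to.

    use std::collections::BTreeMap;
    use std::path::{Path, PathBuf};

    /// Invented type: per-machine working directories that a dataflow file
    /// could declare under Option 3.
    struct MachineWorkingDirs {
        dirs: BTreeMap<String, PathBuf>,
    }

    impl MachineWorkingDirs {
        /// Absolute paths are used as-is; relative paths are joined onto the
        /// working directory configured for the node's machine.
        fn resolve(&self, machine_id: &str, source: &Path) -> Option<PathBuf> {
            if source.is_absolute() {
                Some(source.to_path_buf())
            } else {
                self.dirs.get(machine_id).map(|dir| dir.join(source))
            }
        }
    }

    fn main() {
        let config = MachineWorkingDirs {
            dirs: BTreeMap::from([("machine-b".to_string(), PathBuf::from("/opt/dora"))]),
        };
        // Prints Some("/opt/dora/build/node").
        println!("{:?}", config.resolve("machine-b", Path::new("build/node")));
    }

One question such a sketch leaves open is what to do when a machine has no entry: fall back to the daemon's own working directory (Option 1) or reject relative paths for that machine (Option 2).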
github-actions bot added the daemon label Jun 5, 2024

phil-opp commented Jun 5, 2024

Some relevant places in the code:

  • // TODO: remove this once we figure out deploying of node/operator
    // binaries from CLI to coordinator/daemon
    local_working_dir: PathBuf,
    • The CLI sets this directory on dora start when it sends the start command to the coordinator. The coordinator forwards it to the daemon on the target machine, which uses it when spawning the dataflow nodes (see the sketch after this list):
      working_dir: PathBuf,
    • Remote machines might have a different directory structure, so it does not make sense to use the same working directory there.
  • if source_is_url(source) {
        info!("{source} is a URL."); // TODO: Implement url check.
    } else {
        resolve_path(source, working_dir)
            .wrap_err_with(|| format!("Could not find source path `{}`", source))?;
    };
    • We check whether a node exists here.
    • For nodes running on remote machines, this check doesn't make sense.
  • let resolved_path = if source_is_url(source) {
        // try to download the shared library
        let target_path = Path::new("build")
            .join(node_id.to_string())
            .with_extension(EXE_EXTENSION);
        download_file(source, &target_path)
            .await
            .wrap_err("failed to download custom node")?;
        target_path.clone()
    } else {
    • We do support URLs as node sources. This could be useful for distributed deployments.
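
To make the first bullet's flow easier to follow, here is a minimal sketch of it. The message types and function below are invented for illustration only (they are not dora's actual API); the point is that the CLI's local working directory travels unchanged to the daemon that spawns the nodes, which only works if that daemon shares the CLI's file system.

    use std::path::PathBuf;

    // Invented message types, for illustration only.
    struct StartDataflow {
        /// Set by the CLI on `dora start` from the directory containing the dataflow YAML file.
        local_working_dir: PathBuf,
    }

    struct SpawnDataflowNodes {
        /// Forwarded by the coordinator to the daemon on the target machine and
        /// used as the working directory when spawning the nodes.
        working_dir: PathBuf,
    }

    fn forward_to_daemon(start: StartDataflow) -> SpawnDataflowNodes {
        // The directory is passed through unchanged, so a remote daemon ends up
        // with a path that refers to the CLI machine's file system.
        SpawnDataflowNodes {
            working_dir: start.local_working_dir,
        }
    }

    fn main() {
        let start = StartDataflow {
            local_working_dir: PathBuf::from("/home/user/my-dataflow"),
        };
        let spawn = forward_to_daemon(start);
        // On a remote daemon, this path most likely does not exist.
        println!("spawning nodes in {}", spawn.working_dir.display());
    }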


haixuanTao commented Jun 6, 2024

> Option 1: Use the working directory of the daemon by default (i.e., the directory in which the daemon was started)

I think that if this is the specification, it should be the same specification across local and remote nodes, which would be a breaking change compared to the current implementation.

> Option 2: Only allow absolute paths for remote machines (this is probably too limiting)

I would expect this to always be available as an option, since it might not be easy to specify a file that is far from the daemon's spawning path.

Gege-Wang commented:

If we pass a working_dir parameter when we start the daemon, then we can manage working_dir just like machine_id: when a node is started, the coordinator passes the working_dir of the corresponding daemon, so we don't have to skip checks. However, I'm not sure whether that's possible.


phil-opp commented Jun 6, 2024

> If we pass a working_dir parameter when we start the daemon, then we can manage working_dir just like machine_id: when a node is started, the coordinator passes the working_dir of the corresponding daemon, so we don't have to skip checks. However, I'm not sure whether that's possible.

This works when both daemons are running on the same machine. However, if a daemon runs on a remote machine, we have no access to its file system, so we cannot check the paths.


phil-opp commented Jun 6, 2024

> > Option 1: Use the working directory of the daemon by default (i.e., the directory in which the daemon was started)
>
> I think that if this is the specification, it should be the same specification across local and remote nodes, which would be a breaking change compared to the current implementation.

Good point, I added this drawback to the list above.

> > Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
>
> I would expect this to always be available as an option, since it might not be easy to specify a file that is far from the daemon's spawning path.

Yes, it's always available as an option. What I meant is that we don't allow relative paths for remote machines.

haixuanTao commented:

In that case, can we maybe try an implementation using Option 2 before moving to Option 1?

What do you think @XxChang @Gege-Wang?


XxChang commented Jun 6, 2024

> In that case, can we maybe try an implementation using Option 2 before moving to Option 1?
>
> What do you think @XxChang @Gege-Wang?

I think it is good, let me do it.


phil-opp commented Jun 6, 2024

I opened a draft PR that implements option 2 a few days ago, maybe that's useful: #534

One challenge is the multiple-daemons test, which uses multiple machine IDs that all resolve to the same local machine. Using absolute paths in its config is not ideal because we want to commit the test to git and run it on different machines.

haixuanTao commented:

I see.

Maybe we can use an environment variable to fix CI?

Otherwise, I guess it's fine to hard-code the GitHub CI path for now.
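
To make the environment-variable idea a bit more concrete, here is a rough sketch. Neither the placeholder syntax nor the function exists in dora today; this only illustrates how the test config could reference something like ${DORA_TEST_DIR}/... and have it expanded before path resolution, so CI can set the variable instead of hard-coding a path.

    use std::env;

    /// Expand `${VAR}` placeholders in a node source path using environment
    /// variables, leaving unresolved placeholders untouched.
    fn expand_env_vars(source: &str) -> String {
        let mut result = source.to_string();
        let mut search_from = 0;
        while let Some(rel_start) = result[search_from..].find("${") {
            let start = search_from + rel_start;
            let Some(rel_end) = result[start..].find('}') else {
                break; // unterminated placeholder, leave as-is
            };
            let end = start + rel_end;
            let var_name = result[start + 2..end].to_string();
            match env::var(&var_name) {
                Ok(value) => {
                    result.replace_range(start..=end, &value);
                    search_from = start + value.len();
                }
                // variable not set: keep the placeholder and continue after it
                Err(_) => search_from = end + 1,
            }
        }
        result
    }

    fn main() {
        // With DORA_TEST_DIR=/tmp/dora-ci set by CI, this prints
        // `/tmp/dora-ci/build/node`; otherwise the placeholder stays unchanged.
        println!("{}", expand_env_vars("${DORA_TEST_DIR}/build/node"));
    }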


Gege-Wang commented Jun 13, 2024

There are some issues with making dora start and dora check skip path checks on remote machines:

  • If the CLI and coordinator are local and some daemons are on remote machines, then on dora start the CLI's check(&working_dir) goes to this branch and calls resolve_path. The resolve_path call fails, because the path-existence check for the remote daemon should be skipped here:

        } else {
            resolve_path(source, working_dir)
                .wrap_err_with(|| format!("Could not find source path `{}`", source))?;
        };

    As long as the CLI checks the dataflow, this problem will always be there, because the CLI never knows whether a daemon is local or remote.

  • If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, an absolute Windows path is classified as relative. The check below fails even though we wrote a correct absolute path:

        use std::path::PathBuf;

        let path = PathBuf::from("C:\\dora\\tmp\\test.log");
        if path.is_absolute() {
            println!("Path is absolute");
        } else {
            // this branch is taken on Linux, because `C:\...` is not an absolute Unix path
            println!("Path is relative");
        }

  • If the CLI and coordinator are local and some daemons are on remote machines, we should theoretically be able to start the dataflow like this:

        # cli
        dora coordinator
        dora daemon --machine-id A

        # remote
        dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

    However, it doesn't work, because the IP of machine A is registered as 127.0.0.1, so we must start the dataflow like this instead:

        # cli
        dora coordinator
        dora daemon --machine-id A --coordinator-addr <local-ip>:<port>

        # remote
        dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

  • I don't understand why we use one working_dir per daemon. Theoretically, shouldn't we use one working_dir per dataflow? And why do we have to check the dataflow in the CLI? It is complex in the multiple-daemon case.

phil-opp commented:

Thanks a lot for testing and reporting these issues! This is very useful!

> If the CLI and coordinator are local and some daemons are on remote machines, then on dora start the CLI's check(&working_dir) goes to this branch and calls resolve_path. The resolve_path call fails, because the path-existence check for the remote daemon should be skipped here.

I think we can fix this in the following way:

  • For dora check, only print a warning if the path doesn't exist, instead of failing.
  • For dora start, we should query the list of remote_machine_ids from the coordinator (rough sketch below).
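
A rough sketch of what that could look like; the function name, signature, and types below are invented for illustration, not dora's actual code. The idea is to skip the local existence check entirely for nodes assigned to a remote machine, and to only warn (instead of failing) when running dora check:

    use std::collections::BTreeSet;
    use std::path::Path;

    /// Invented helper: `warn_only` would be true for `dora check` and false
    /// for `dora start`; `remote_machines` would come from the coordinator.
    fn check_node_source(
        source: &str,
        working_dir: &Path,
        node_machine: Option<&str>,
        remote_machines: &BTreeSet<String>,
        warn_only: bool,
    ) -> Result<(), String> {
        if node_machine.is_some_and(|machine| remote_machines.contains(machine)) {
            // The path lives on a remote file system that we cannot inspect here.
            return Ok(());
        }
        if working_dir.join(source).exists() {
            Ok(())
        } else if warn_only {
            eprintln!("warning: could not find source path `{source}`");
            Ok(())
        } else {
            Err(format!("could not find source path `{source}`"))
        }
    }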
> If the CLI and coordinator are local and some daemons are on remote machines, we should theoretically be able to start the dataflow like this:
>
>     # cli
>     dora coordinator
>     dora daemon --machine-id A
>
>     # remote
>     dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
>
> However, it doesn't work, because the IP of machine A is registered as 127.0.0.1, so we must start the dataflow like this instead:
>
>     # cli
>     dora coordinator
>     dora daemon --machine-id A --coordinator-addr <local-ip>:<port>
>
>     # remote
>     dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

The issue is around these lines:

    let peer_ip = connection
        .peer_addr()
        .map(|addr| addr.ip())
        .map_err(|err| format!("failed to get peer addr of connection: {err}"));

If the peer_ip is the loopback address, we know that the coordinator and the daemon run on the same machine, so other daemons should be able to reach the registered daemon through the same IP address as the coordinator. A (hacky) fix could therefore be to replace the 127.0.0.1 with the coordinator's listen IP (sketch below).
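
A minimal sketch of that hacky fix, assuming the coordinator's listen IP is available at this point (function and parameter names invented):

    use std::net::IpAddr;

    /// If a daemon registered via the loopback interface, it runs on the same
    /// machine as the coordinator, so advertise the coordinator's listen IP to
    /// the other daemons instead of 127.0.0.1.
    fn advertised_daemon_ip(peer_ip: IpAddr, coordinator_listen_ip: IpAddr) -> IpAddr {
        if peer_ip.is_loopback() {
            coordinator_listen_ip
        } else {
            peer_ip
        }
    }

    fn main() {
        let peer: IpAddr = "127.0.0.1".parse().unwrap();
        let listen: IpAddr = "192.168.1.10".parse().unwrap(); // example address
        assert_eq!(advertised_daemon_ip(peer, listen), listen);
    }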

> If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, an absolute Windows path is classified as relative. The check below fails even though we wrote a correct absolute path:
>
>     use std::path::PathBuf;
>
>     let path = PathBuf::from("C:\\dora\\tmp\\test.log");
>     if path.is_absolute() {
>         println!("Path is absolute");
>     } else {
>         // this branch is taken on Linux, because `C:\...` is not an absolute Unix path
>         println!("Path is relative");
>     }

Good catch! So we need a way to check whether a path is an absolute Windows path on Linux systems (and the other way around). Maybe there are some crates that allow this? Alternatively, we could copy the libstd implementations and provide them in architecture-independent functions.
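
In case it helps, here is a sketch of what such a platform-independent check could look like using only the standard library. This is not dora code, and it deliberately simplifies Windows path rules (it ignores UNC paths like \\server\share):

    use std::path::Path;

    /// Returns true if `path` is absolute under Unix rules or (simplified)
    /// Windows rules, regardless of the platform we are running on.
    fn is_absolute_cross_platform(path: &str) -> bool {
        // Absolute on the current platform (covers the native case).
        if Path::new(path).is_absolute() {
            return true;
        }
        // Unix-style root, e.g. `/home/dora/node`.
        if path.starts_with('/') {
            return true;
        }
        // Windows drive-letter prefix, e.g. `C:\dora\tmp\test.log` or `C:/...`.
        let bytes = path.as_bytes();
        bytes.len() >= 3
            && bytes[0].is_ascii_alphabetic()
            && bytes[1] == b':'
            && (bytes[2] == b'\\' || bytes[2] == b'/')
    }

    fn main() {
        assert!(is_absolute_cross_platform("C:\\dora\\tmp\\test.log"));
        assert!(is_absolute_cross_platform("/home/dora/test.log"));
        assert!(!is_absolute_cross_platform("build/node"));
    }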
