Next steps for distributed deployments #535

Open · 3 tasks
phil-opp opened this issue Jun 5, 2024 · 11 comments

phil-opp commented Jun 5, 2024

This issue outlines the steps we need to take to make dora dataflows work across multiple machines:

  • update dora check to skip checking paths on remote machines
    • use the same logic when checking the dataflow in dora start
    • open question: How do we know which machine ID is local and which is remote?
    • alternative: skip path checks completely when the dataflow specifies multiple machine IDs
  • figure out a way to handle relative node paths on remote machines
    • For local dataflows, we use the folder containing the dataflow YAML file as the working directory. This does not work for remote machines since the YAML file is not available there.
    • Option 1: Use the working directory of the daemon by default (i.e., the directory in which the daemon was started)
      • This would be a breaking change.
    • Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
    • Option 3: Configure the working directory for each machine in the YAML file (see the sketch after this list).
    • Other ideas?
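
To make Option 3 a bit more concrete, here is a hypothetical sketch of the resolution logic it would imply. The type, field, and machine names below are invented for illustration and are not existing dora configuration or code: the dataflow YAML would map each machine ID to a working directory, and relative node paths would be resolved against the entry for the machine the node is deployed to.

    use std::collections::BTreeMap;
    use std::path::{Path, PathBuf};

    /// Invented type: per-machine working directories that a dataflow file
    /// could declare under Option 3.
    struct MachineWorkingDirs {
        dirs: BTreeMap<String, PathBuf>,
    }

    impl MachineWorkingDirs {
        /// Absolute paths are used as-is; relative paths are joined onto the
        /// working directory configured for the node's machine.
        fn resolve(&self, machine_id: &str, source: &Path) -> Option<PathBuf> {
            if source.is_absolute() {
                Some(source.to_path_buf())
            } else {
                self.dirs.get(machine_id).map(|dir| dir.join(source))
            }
        }
    }

    fn main() {
        let config = MachineWorkingDirs {
            dirs: BTreeMap::from([("machine-b".to_string(), PathBuf::from("/opt/dora"))]),
        };
        // Prints Some("/opt/dora/build/node").
        println!("{:?}", config.resolve("machine-b", Path::new("build/node")));
    }

One question such a sketch leaves open is what to do when a machine has no entry: fall back to the daemon's own working directory (Option 1) or reject relative paths for that machine (Option 2).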
github-actions bot added the daemon label Jun 5, 2024

phil-opp commented Jun 5, 2024

Some relevant places in the code:

  • // TODO: remove this once we figure out deploying of node/operator
    // binaries from CLI to coordinator/daemon
    local_working_dir: PathBuf,
    • The CLI sets this directory on dora start when it sends the start command to the coordinator. The coordinator forwards it to the daemon on the target machine, which uses it when spawning the dataflow nodes (see the sketch after this list):
      working_dir: PathBuf,
    • Remote machines might have a different directory structure, so it does not make sense to use the same working directory there.
  • if source_is_url(source) {
        info!("{source} is a URL."); // TODO: Implement url check.
    } else {
        resolve_path(source, working_dir)
            .wrap_err_with(|| format!("Could not find source path `{}`", source))?;
    };
    • We check whether a node exists here.
    • For nodes running on remote machines, this check doesn't make sense.
  • let resolved_path = if source_is_url(source) {
        // try to download the shared library
        let target_path = Path::new("build")
            .join(node_id.to_string())
            .with_extension(EXE_EXTENSION);
        download_file(source, &target_path)
            .await
            .wrap_err("failed to download custom node")?;
        target_path.clone()
    } else {
    • We do support URLs as node sources. This could be useful for distributed deployments.
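
To make the first bullet's flow easier to follow, here is a minimal sketch of it. The message types and function below are invented for illustration only (they are not dora's actual API); the point is that the CLI's local working directory travels unchanged to the daemon that spawns the nodes, which only works if that daemon shares the CLI's file system.

    use std::path::PathBuf;

    // Invented message types, for illustration only.
    struct StartDataflow {
        /// Set by the CLI on `dora start` from the directory containing the dataflow YAML file.
        local_working_dir: PathBuf,
    }

    struct SpawnDataflowNodes {
        /// Forwarded by the coordinator to the daemon on the target machine and
        /// used as the working directory when spawning the nodes.
        working_dir: PathBuf,
    }

    fn forward_to_daemon(start: StartDataflow) -> SpawnDataflowNodes {
        // The directory is passed through unchanged, so a remote daemon ends up
        // with a path that refers to the CLI machine's file system.
        SpawnDataflowNodes {
            working_dir: start.local_working_dir,
        }
    }

    fn main() {
        let start = StartDataflow {
            local_working_dir: PathBuf::from("/home/user/my-dataflow"),
        };
        let spawn = forward_to_daemon(start);
        // On a remote daemon, this path most likely does not exist.
        println!("spawning nodes in {}", spawn.working_dir.display());
    }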


haixuanTao commented Jun 6, 2024

> Option 1: Use the working directory of the daemon by default (i.e., the directory in which the daemon was started)

I think that if this is the specification, it should be the same specification across local and remote nodes, which would be a breaking change compared to the current implementation.

> Option 2: Only allow absolute paths for remote machines (this is probably too limiting)

I would expect this to always be available as an option, since it might not be easy to specify a file that is far from the daemon's spawning path.

Gege-Wang commented:

If we pass a working_dir parameter when we start the daemon, then we can manage working_dir just like machine_id: when a node is started, the coordinator passes the working_dir of the corresponding daemon, so we don't have to skip checks. However, I'm not sure whether that's possible.


phil-opp commented Jun 6, 2024

> If we pass a working_dir parameter when we start the daemon, then we can manage working_dir just like machine_id: when a node is started, the coordinator passes the working_dir of the corresponding daemon, so we don't have to skip checks. However, I'm not sure whether that's possible.

This works when both daemons are running on the same machine. However, if a daemon runs on a remote machine, we have no access to its file system, so we cannot check the paths.


phil-opp commented Jun 6, 2024

> > Option 1: Use the working directory of the daemon by default (i.e., the directory in which the daemon was started)
>
> I think that if this is the specification, it should be the same specification across local and remote nodes, which would be a breaking change compared to the current implementation.

Good point, I added this drawback to the list above.

> > Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
>
> I would expect this to always be available as an option, since it might not be easy to specify a file that is far from the daemon's spawning path.

Yes, it's always available as an option. What I meant is that we don't allow relative paths for remote machines.

haixuanTao commented:

In that case, can we maybe try an implementation using Option 2 before moving to Option 1?

What do you think @XxChang @Gege-Wang?


XxChang commented Jun 6, 2024

> In that case, can we maybe try an implementation using Option 2 before moving to Option 1?
>
> What do you think @XxChang @Gege-Wang?

I think it is good, let me do it.


phil-opp commented Jun 6, 2024

I opened a draft PR that implements option 2 a few days ago, maybe that's useful: #534

One challenge is the multiple-daemons test, which uses multiple machine IDs that all resolve to the same local machine. Using absolute paths in its config is not ideal because we want to commit the test to git and run it on different machines.

haixuanTao commented:

I see.

Maybe we can use an environment variable to fix CI?

Otherwise, I guess it's fine to hard-code the GitHub CI path for now.
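
To make the environment-variable idea a bit more concrete, here is a rough sketch. Neither the placeholder syntax nor the function exists in dora today; this only illustrates how the test config could reference something like ${DORA_TEST_DIR}/... and have it expanded before path resolution, so CI can set the variable instead of hard-coding a path.

    use std::env;

    /// Expand `${VAR}` placeholders in a node source path using environment
    /// variables, leaving unresolved placeholders untouched.
    fn expand_env_vars(source: &str) -> String {
        let mut result = source.to_string();
        let mut search_from = 0;
        while let Some(rel_start) = result[search_from..].find("${") {
            let start = search_from + rel_start;
            let Some(rel_end) = result[start..].find('}') else {
                break; // unterminated placeholder, leave as-is
            };
            let end = start + rel_end;
            let var_name = result[start + 2..end].to_string();
            match env::var(&var_name) {
                Ok(value) => {
                    result.replace_range(start..=end, &value);
                    search_from = start + value.len();
                }
                // variable not set: keep the placeholder and continue after it
                Err(_) => search_from = end + 1,
            }
        }
        result
    }

    fn main() {
        // With DORA_TEST_DIR=/tmp/dora-ci set by CI, this prints
        // `/tmp/dora-ci/build/node`; otherwise the placeholder stays unchanged.
        println!("{}", expand_env_vars("${DORA_TEST_DIR}/build/node"));
    }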


Gege-Wang commented Jun 13, 2024

There are some issues with making dora start and dora check skip path checks on remote machines:

  • If the CLI and coordinator are local and some daemons are on remote machines, then on dora start the CLI's check(&working_dir) goes to this branch and calls resolve_path. The resolve_path call fails, because the path-existence check for the remote daemon should be skipped here:

        } else {
            resolve_path(source, working_dir)
                .wrap_err_with(|| format!("Could not find source path `{}`", source))?;
        };

    As long as the CLI checks the dataflow, this problem will always be there, because the CLI never knows whether a daemon is local or remote.

  • If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, an absolute Windows path is classified as relative. The check below fails even though we wrote a correct absolute path:

        use std::path::PathBuf;

        let path = PathBuf::from("C:\\dora\\tmp\\test.log");
        if path.is_absolute() {
            println!("Path is absolute");
        } else {
            // this branch is taken on Linux, because `C:\...` is not an absolute Unix path
            println!("Path is relative");
        }

  • If the CLI and coordinator are local and some daemons are on remote machines, we should theoretically be able to start the dataflow like this:

        # cli
        dora coordinator
        dora daemon --machine-id A

        # remote
        dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

    However, it doesn't work, because the IP of machine A is registered as 127.0.0.1, so we must start the dataflow like this instead:

        # cli
        dora coordinator
        dora daemon --machine-id A --coordinator-addr <local-ip>:<port>

        # remote
        dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

  • I don't understand why we use one working_dir per daemon. Theoretically, shouldn't we use one working_dir per dataflow? And why do we have to check the dataflow in the CLI? It is complex in the multiple-daemon case.

phil-opp commented:

Thanks a lot for testing and reporting these issues! This is very useful!

> If the CLI and coordinator are local and some daemons are on remote machines, then on dora start the CLI's check(&working_dir) goes to this branch and calls resolve_path. The resolve_path call fails, because the path-existence check for the remote daemon should be skipped here.

I think we can fix this in the following way:

  • For dora check, only print a warning if the path doesn't exist, instead of failing.
  • For dora start, we should query the list of remote_machine_ids from the coordinator (rough sketch below).
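
A rough sketch of what that could look like; the function name, signature, and types below are invented for illustration, not dora's actual code. The idea is to skip the local existence check entirely for nodes assigned to a remote machine, and to only warn (instead of failing) when running dora check:

    use std::collections::BTreeSet;
    use std::path::Path;

    /// Invented helper: `warn_only` would be true for `dora check` and false
    /// for `dora start`; `remote_machines` would come from the coordinator.
    fn check_node_source(
        source: &str,
        working_dir: &Path,
        node_machine: Option<&str>,
        remote_machines: &BTreeSet<String>,
        warn_only: bool,
    ) -> Result<(), String> {
        if node_machine.is_some_and(|machine| remote_machines.contains(machine)) {
            // The path lives on a remote file system that we cannot inspect here.
            return Ok(());
        }
        if working_dir.join(source).exists() {
            Ok(())
        } else if warn_only {
            eprintln!("warning: could not find source path `{source}`");
            Ok(())
        } else {
            Err(format!("could not find source path `{source}`"))
        }
    }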
> If the CLI and coordinator are local and some daemons are on remote machines, we should theoretically be able to start the dataflow like this:
>
>     # cli
>     dora coordinator
>     dora daemon --machine-id A
>
>     # remote
>     dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
>
> However, it doesn't work, because the IP of machine A is registered as 127.0.0.1, so we must start the dataflow like this instead:
>
>     # cli
>     dora coordinator
>     dora daemon --machine-id A --coordinator-addr <local-ip>:<port>
>
>     # remote
>     dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

The issue is around these lines:

    let peer_ip = connection
        .peer_addr()
        .map(|addr| addr.ip())
        .map_err(|err| format!("failed to get peer addr of connection: {err}"));

If the peer_ip is the loopback address, we know that the coordinator and the daemon run on the same machine, so other daemons should be able to reach the registered daemon through the same IP address as the coordinator. A (hacky) fix could therefore be to replace the 127.0.0.1 with the coordinator's listen IP (sketch below).
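
A minimal sketch of that hacky fix, assuming the coordinator's listen IP is available at this point (function and parameter names invented):

    use std::net::IpAddr;

    /// If a daemon registered via the loopback interface, it runs on the same
    /// machine as the coordinator, so advertise the coordinator's listen IP to
    /// the other daemons instead of 127.0.0.1.
    fn advertised_daemon_ip(peer_ip: IpAddr, coordinator_listen_ip: IpAddr) -> IpAddr {
        if peer_ip.is_loopback() {
            coordinator_listen_ip
        } else {
            peer_ip
        }
    }

    fn main() {
        let peer: IpAddr = "127.0.0.1".parse().unwrap();
        let listen: IpAddr = "192.168.1.10".parse().unwrap(); // example address
        assert_eq!(advertised_daemon_ip(peer, listen), listen);
    }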

> If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, an absolute Windows path is classified as relative. The check below fails even though we wrote a correct absolute path:
>
>     use std::path::PathBuf;
>
>     let path = PathBuf::from("C:\\dora\\tmp\\test.log");
>     if path.is_absolute() {
>         println!("Path is absolute");
>     } else {
>         // this branch is taken on Linux, because `C:\...` is not an absolute Unix path
>         println!("Path is relative");
>     }

Good catch! So we need a way to check whether a path is an absolute Windows path on Linux systems (and the other way around). Maybe there are some crates that allow this? Alternatively, we could copy the libstd implementations and provide them in architecture-independent functions.
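
In case it helps, here is a sketch of what such a platform-independent check could look like using only the standard library. This is not dora code, and it deliberately simplifies Windows path rules (it ignores UNC paths like \\server\share):

    use std::path::Path;

    /// Returns true if `path` is absolute under Unix rules or (simplified)
    /// Windows rules, regardless of the platform we are running on.
    fn is_absolute_cross_platform(path: &str) -> bool {
        // Absolute on the current platform (covers the native case).
        if Path::new(path).is_absolute() {
            return true;
        }
        // Unix-style root, e.g. `/home/dora/node`.
        if path.starts_with('/') {
            return true;
        }
        // Windows drive-letter prefix, e.g. `C:\dora\tmp\test.log` or `C:/...`.
        let bytes = path.as_bytes();
        bytes.len() >= 3
            && bytes[0].is_ascii_alphabetic()
            && bytes[1] == b':'
            && (bytes[2] == b'\\' || bytes[2] == b'/')
    }

    fn main() {
        assert!(is_absolute_cross_platform("C:\\dora\\tmp\\test.log"));
        assert!(is_absolute_cross_platform("/home/dora/test.log"));
        assert!(!is_absolute_cross_platform("build/node"));
    }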
