-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch from mdns to using the kaboodle peer mesh #48
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I think it makes sense to move your helpers into the kaboodle repo once we're confident that they're the correct shape (I have no reason to doubt they are, but keeping them here is maximally flexible for the time being).
I don't have any substantial feedback. I have a few friendly nitpicky suggestions and conversation openers below. Shippit!
agent/src/main.rs
Outdated
@@ -171,14 +168,20 @@ async fn main() -> Result<()> { | |||
); | |||
} | |||
|
|||
let mut roles: Vec<String> = Vec::new(); | |||
if blob_path.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't really matter, but if you change this to if let Some(blob_path) = blob_path {
then you can get rid of the :?
in the format string and the output will look nicer.
// TODO: this is actually fallible | ||
impl KaboodlePeer for PeerMetadata { | ||
fn from_identity(address: SocketAddr, encoded: Vec<u8>) -> Self { | ||
// TODO: this is actually fallible; when might it fail? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This data is coming in over unreliable transport, so it's possible for a bit to get flipped and lead to a corrupted encoded payload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider shipping a CRC with it?
@@ -22,35 +23,40 @@ pub async fn relay_request( | |||
service_name: &str, | |||
source_instance_id: &Uuid, | |||
) -> Result<Response, ServalError> { | |||
let node_info = discover_service(service_name).await.map_err(|err| { | |||
log::warn!("proxy_unavailable_services failed to find a node offering the service; service={service_name}; err={err:?}"); | |||
let mesh = MESH.get().expect("Peer network not initialized!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Thinking aloud 🤔 ) Is it worth promoting this to ServalError
, or (since this happening would be a Programming Error™) is the catastrophic nature of the .expect
preferable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The catastrophic nature of the expect is desirable: this is a crash-and-restart situation. It would be programmer error, most likely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will enjoy living to regret this statement.
agent/src/api/v1/proxy.rs
Outdated
let host = info.get_addresses().iter().next().unwrap(); // unwrap is safe because discover_service will never return a service without addresses | ||
let port = info.get_port(); | ||
let target_instance_id = peer.name(); | ||
let Some(socket_addr) = peer.address() else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am making this comment in the wrong place, but why does peer.address() return an
Option`?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an option because the underlying peers() function in kaboodle returns a vec of (Option<SocketAddr>, Vec<u8>)
! I should probably filter that list to remove any peer that doesn't have an address, which would simplify the code in the agent.
cli/src/main.rs
Outdated
env_var_override_name: &str, | ||
) -> Result<String> { | ||
if let Ok(override_url) = std::env::var(env_var_override_name) { | ||
async fn maybe_find_peer(role: &ServalRole, override_var: &str) -> Result<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on initializing the mesh elsewhere and passing a reference to it into this function? (Or does the CLI only ever need to find a peer a single time?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that some commands wouldn't need to join the mesh, and joining is expensive in terms of time, and it is one-shot. But if we come across a case where it isn't, then we'd refactor. It's fine either way.
@@ -13,6 +14,8 @@ pub trait KaboodleMesh { | |||
|
|||
/// Create a new entry for a Kaboodle peer network and add ourselves to the mesh. | |||
async fn start(&mut self) -> Result<(), KaboodleError>; | |||
/// Remove this peer from the mesh. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Total nit but maybe consider Remove ourselves from the mesh
so it's more obvious that there is no way to remove forcibly an other peer from the mesh.
@@ -320,7 +319,7 @@ async fn main() -> Result<()> { | |||
SERVAL_NODE_URL.lock().unwrap().replace(baseurl); | |||
|
|||
match args.cmd { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just .await?
this match statement rather than doing it in all of the arms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like that would look cleaner - would it be functionally equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it would not be functionally equivalent if we ever have a non-async command handler. I was preserving the option.
agent/src/main.rs
Outdated
let metadata = PeerMetadata::new(format!("{}:{}", host, port), roles, None); | ||
let metadata = PeerMetadata::new( | ||
Uuid::new_v4().to_string(), | ||
format!("http://{}:{}", host, port), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a thing for this PR, but it would be ideal to start passing around SocketAddr
s rather than host
and port
, because the way you stringify host is different for IPv4 and IPv6 (which needs [brackets]
around the IP).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we track this in an issue? Sounds like a boring but potentially large footprint refactor? Or should this just be done when tackling #50 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a lot of this work in a commit to this branch today, so I think it's mostly done? I'll review.
} else { | ||
Err(anyhow!("Unable to locate a peer with the {role} role")) | ||
}; | ||
|
||
mesh.stop().await?; | ||
mesh.stop().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this finally made it through my skull and brings up a very interesting architectural consideration.
Nothing needs to change right now, but it's worth discussion:
Up until now, I had never contemplated the thought that an instance of the serval
CLI tool would have to join the mesh to do any work. That totally works, of course, but it seems a bit odd for something so ephemeral to create so much work for the mesh—everybody is going to have to update their known peers list, and it will take things a while to settle down—all for a peer who will vanish a few seconds later.
From the perspective of the mesh, a better design would be for the serval
CLI tool to connect to an instance of serval-agent
running on localhost. If none is running on localhost, joining the mesh to find a node to talk to would be the fallback. Perhaps we could store all of this in $TMP
or /var/run
or whatever?
agent/src/main.rs
Outdated
let metadata = PeerMetadata::new( | ||
Uuid::new_v4().to_string(), | ||
format!("http://{}:{}", host, port), | ||
roles, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn’t actually matter, and we certainly don't have to change anything right now, but should we encode the roles in the identity payload as bit masks rather than as strings? The identity payload is included in every single kaboodle packet, so little optimizations add up.
Surely every conceivable role we'd have would fit into a single byte, and since we have versioning (brilliant idea, by the way—I would not have thought of that) we can expand it if we ever need more than 8 roles.
If you agree, I can make a GitHub issue so we don't lose the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at how bincode is actually encoding the roles :D It does a very nice job. Well-chosen crate there, colleague!
860aa90
to
3b47d9a
Compare
agent/src/main.rs
Outdated
let metadata = PeerMetadata::new(format!("{}:{}", host, port), roles, None); | ||
let metadata = PeerMetadata::new( | ||
Uuid::new_v4().to_string(), | ||
format!("http://{}:{}", host, port), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we track this in an issue? Sounds like a boring but potentially large footprint refactor? Or should this just be done when tackling #50 ?
owo-colors = "3.5.0" | ||
prettytable = "0.10.0" | ||
reqwest = { version = "0.11.13", default-features = false, features = ["blocking", "deflate", "brotli", "json", "multipart", "stream", "rustls-tls"] } | ||
reqwest = { version = "0.11.13", default-features = false, features = ["deflate", "brotli", "json", "multipart", "stream", "rustls-tls"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I'll definitely wait til this is merged before I continue work on #43. I'm doing some blocking reqwests there that need to be refactored...
@@ -320,7 +319,7 @@ async fn main() -> Result<()> { | |||
SERVAL_NODE_URL.lock().unwrap().replace(baseurl); | |||
|
|||
match args.cmd { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like that would look cleaner - would it be functionally equivalent?
// TODO: this is actually fallible | ||
impl KaboodlePeer for PeerMetadata { | ||
fn from_identity(address: SocketAddr, encoded: Vec<u8>) -> Self { | ||
// TODO: this is actually fallible; when might it fail? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider shipping a CRC with it?
We need to encode some information about nodes in the identity payload, so I took a stab at writing a wrapper that hides the machinery from the agent. I also sketch out a way to handle versioning this payload, but I'm not happy with it yet.
Also, replaced the MDNS usage in the agent with kaboodle usage. It runs and joins a mesh of one! I haven't replaced peer searches yet, however.
oh and also probably mostly made the cli join the peer mesh and try to run jobs over it. does not yet work because of something slightly surprising involving switching to async main. Possibly. Have not yet investigated.
No blocking. Only tackling.
This allows job-running to work! SHIP IT.
Also, start generalizing how we find peers with services.
79deae8
to
30dfa0a
Compare
The agent and the cli now both attempt to join a network of peers, established via kaboodle, and use that network instead of mdns to discover relevant services.
The most interesting implementation details are in the file
utils/src/mesh.rs
. This file includes some wrapper traits that probably belong in Kaboodle itself, since they were designed to be re-usable and to encode the responsibilities of any Kaboodle implementation: KaboodleMesh and KaboodlePeer. There are implementations plus some serval-specific functions in the structs ServalMesh and PeerMetadata. (All names are negotiable. I already don't like my initial choices.)The peer metadata struct gets a little complicated because I attempted to provide some future-proofing via a way to version it. It's encoded via
binencode
, with a version envelope wrapper to allow us to make newer agents that change what gets packed into identity able to interoperate with older agents, and let older agents ignore newer ones. I welcome improvements to this somewhat ad-hoc design.Agent roles are now an enum,
ServalRoles
. Thestrum
crate provides Display and FromString implementations. This adds some safety and validity-checking to what had been some magic strings.We knew we'd stop blocking in the cli some day, and this is that day: the cli is now fully async.
Finally, and most noisily, corrected the spelling of "Wasm" everywhere. Apologies for that.