optimizing memory usage #190

Merged
merged 4 commits into from
Sep 14, 2023


mariocynicys
Collaborator

@mariocynicys mariocynicys commented Feb 18, 2023

Just by loading the minimal necessary data plus dropping the last_known_blocks we get a huge reduction (2GB -> 1.2GB for 2.2M appointments*).
We can even do better and further reduce the data stored in memory (dropping user_ids and potentially locators).

* This is on macOS. On Linux we get as low as 1.3G, but only 1.0G is actually in use: we know that because calling malloc_trim brings it down to 1.0G

Fixes #224

@sr-gi
Member

sr-gi commented Feb 20, 2023

We can even do better and further reduce the data stored in memory (dropping user_ids and potentially locators).

I thought you were going to pack all mem optimizations in the same PR. Are you planning on a follow-up for that?

@mariocynicys
Collaborator Author

I thought you were going to pack all mem optimizations in the same PR. Are you planning on a follow-up for that?

Nah, will follow-up in this PR.

Polling the least amount of data already solved the iter/non-iter issue since it only appears when a hashmap has a dynamic sized field (the encrypted_blob in our case). And not really in favor of playing with trimming and stuff since it's not a general solution.
What's left is:

  • streaming reads from the DB (shouldn't introduce any issues, but haven't looked into it yet).

Not sure if there are any trivial/non-structural opts left after that.


The next set of optimizations for watcher I think are:

  • removing user_id from appointment summary (64 byte reduction) -> will need to call the DB for user_id when needed.
  • squeezing the locator_uuid_map to only a locator set -> will need to call the DB for uuids if a locator is matched on a block.
  • removing the locator set altogether and relying on db calls in sql transactions for each block as discussed in discord + streaming the db reads with the breach handling.

I will need to review the watcher and see what roles these two hashmaps play, to understand how reducing/eliminating them would affect the tower CPU-wise in the normal case (not so many breaches).

Will also need to check similar memory optimization options for the responder and gatekeeper.

@mariocynicys
Collaborator Author

After a very very long fight with lifetimes, boxes, dyns and everything I was able to assemble this f12cb90.

It's working as intended as a way of streaming data whilst hiding the DBM specific things, but the design isn't good at all:

  • QueryIterator isn't actually an iterator, you need to call .iter() to get the real iterator: if you try to make it an iterator, you will end up reimplementing rusqlite all over, because you need to replace the Rows & Statement structs they offer.
  • You need to annotate QueryIterator's params generic type which shouldn't be relevant at all to the caller: because stmt.query_map(params, ... needs params to be of known size at compile time, thus can't be placed in a Box<dyn Params>.

The issue lies in DBM methods constructing a Statement which dies after the method ends. Since we need an iterator and don't want to collect just yet, we want to return Rows, which is bound to the lifetime of that Statement, so we have to return the statement as well, which is what the above-mentioned commit does. rusqlite/rusqlite#1265 should solve this if it were to be implemented.
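To make the ownership problem concrete, here is a minimal std-only sketch of the pattern. FakeStatement is a hypothetical stand-in for rusqlite's Statement (it is not rusqlite's API) and rows are plain i64 values; the only point is that the wrapper owns the statement, so the iterator handed out by iter() has something to borrow from.

```rust
/// Hypothetical stand-in for a prepared statement: it owns the rows it yields.
struct FakeStatement {
    rows: Vec<i64>,
}

impl FakeStatement {
    /// Stand-in for a `query_map`-style call: the returned iterator borrows `self`,
    /// so it cannot outlive the statement.
    fn query_map<T: 'static>(
        &mut self,
        f: Box<dyn Fn(&i64) -> T>,
    ) -> impl Iterator<Item = T> + '_ {
        self.rows.iter().map(move |row| f(row))
    }
}

/// Owns the statement and the row mapper; `iter` hands out the real iterator.
struct QueryIterator<T> {
    stmt: FakeStatement,
    mapper: Option<Box<dyn Fn(&i64) -> T>>,
}

impl<T: 'static> QueryIterator<T> {
    fn new(stmt: FakeStatement, f: impl Fn(&i64) -> T + 'static) -> Self {
        Self {
            stmt,
            mapper: Some(Box::new(f)),
        }
    }

    /// Returns the borrowed iterator the first time, and `None` afterwards,
    /// because the mapper is moved out on the first call.
    fn iter(&mut self) -> Option<impl Iterator<Item = T> + '_> {
        let mapper = self.mapper.take()?;
        Some(self.stmt.query_map(mapper))
    }
}
```

The caller keeps the QueryIterator alive for as long as it consumes the iterator, which is the same constraint the commit works around: the thing being iterated and its owner have to travel together.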

That said, I might be overcomplicating the issue and it could be solved in another, simpler way, so posting here to get a review on this.

Meanwhile I am moving on to other possible memory optimization options.

Member

@sr-gi sr-gi left a comment

I am only reviewing f12cb90 for now.

I don't think this approach is that bad. It'd be best if we could simply implement Iterator for QueryIterator, but as much as I've scratched my head, I haven't found a way of doing so.

With respect to changes being implemented in rusqlite, the one you've mentioned seems to depend on an issue from 2016, so I think it's pretty unlikely :(

Regarding the code itself: assuming our goal is to use this only for bootstrapping, we don't actually need the params to be part of QueryIterator, given there are none for either the collection of appointment summaries or the trackers. Therefore we could simplify it to something like the following:

diff --git a/teos/src/db_iterator.rs b/teos/src/db_iterator.rs
index 9bf61a1..feda098 100644
--- a/teos/src/db_iterator.rs
+++ b/teos/src/db_iterator.rs
@@ -1,22 +1,19 @@
-use rusqlite::{MappedRows, Params, Result, Row, Statement};
+use rusqlite::{MappedRows, Result, Row, Statement};
 use std::iter::Map;

 /// A struct that owns a [Statement] and has an `iter` method to iterate over the
 /// results of that DB query statement.
-pub struct QueryIterator<'db, P, T> {
+pub struct QueryIterator<'db, T> {
     stmt: Statement<'db>,
-    params_and_mapper: Option<(P, Box<dyn Fn(&Row) -> T>)>,
+    mapper: Option<Box<dyn Fn(&Row) -> T>>,
 }

-impl<'db, P, T> QueryIterator<'db, P, T>
-where
-    P: Params,
-{
+impl<'db, T> QueryIterator<'db, T> {
     /// Construct a new [QueryIterator].
-    pub fn new(stmt: Statement<'db>, params: P, f: impl Fn(&Row) -> T + 'static) -> Self {
+    pub fn new(stmt: Statement<'db>, f: impl Fn(&Row) -> T + 'static) -> Self {
         Self {
             stmt,
-            params_and_mapper: Some((params, Box::new(f))),
+            mapper: Some(Box::new(f)),
         }
     }

@@ -28,9 +25,9 @@ where
         &mut self,
     ) -> Option<Map<MappedRows<'_, impl FnMut(&Row) -> Result<T>>, impl FnMut(Result<T>) -> T>>
     {
-        self.params_and_mapper.take().map(move |(params, mapper)| {
+        self.mapper.take().map(move |mapper| {
             self.stmt
-                .query_map(params, move |row| Ok((mapper)(row)))
+                .query_map([], move |row| Ok((mapper)(row)))
                 .unwrap()
                 .map(|row| row.unwrap())
         })
diff --git a/teos/src/dbm.rs b/teos/src/dbm.rs
index 318d8c9..6823313 100644
--- a/teos/src/dbm.rs
+++ b/teos/src/dbm.rs
@@ -7,7 +7,7 @@ use std::path::PathBuf;
 use std::str::FromStr;

 use rusqlite::limits::Limit;
-use rusqlite::{params, params_from_iter, Connection, Error as SqliteError, Row, ParamsFromIter};
+use rusqlite::{params, params_from_iter, Connection, Error as SqliteError};

 use bitcoin::consensus;
 use bitcoin::hashes::Hash;
@@ -307,9 +307,7 @@ impl DBM {
     }

     /// Loads all [AppointmentSummary]s from that database.
-    pub(crate) fn load_appointment_summaries(
-        &self,
-    ) -> QueryIterator<ParamsFromIter<[u8; 0]>, (UUID, AppointmentSummary)> {
+    pub(crate) fn load_appointment_summaries(&self) -> QueryIterator<(UUID, AppointmentSummary)> {
         let stmt = self
                 .connection
                 .prepare(
@@ -318,7 +316,7 @@ impl DBM {
                 )
                 .unwrap();

-        let func = |row: &Row| {
+        QueryIterator::new(stmt, |row| {
             let raw_uuid: Vec<u8> = row.get(0).unwrap();
             let raw_locator: Vec<u8> = row.get(1).unwrap();
             let raw_userid: Vec<u8> = row.get(2).unwrap();
@@ -329,9 +327,7 @@ impl DBM {
                     UserId::from_slice(&raw_userid).unwrap(),
                 ),
             )
-        };
-
-        QueryIterator::new(stmt, params_from_iter([]), func)
+        })
     }

     /// Loads appointments from the database. If a locator is given, this method loads only the appointments

Also, notice how annotating the closure is not necessary as long as you add it directly to QueryIterator::new, given the compiler can infer the types in that case (if you assign it to a variable it could be used in different contexts, so it looks like the compiler doesn't really like that).
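A small std-only sketch of that inference behaviour, for reference. Row and map_row here are hypothetical stand-ins (not rusqlite types); map_row plays the role of QueryIterator::new by carrying the Fn(&Row) -> T bound.

```rust
/// Hypothetical stand-in for a database row.
struct Row {
    value: u32,
}

/// Plays the role of `QueryIterator::new`: the `Fn(&Row) -> T` bound tells the
/// compiler what the closure's argument type must be.
fn map_row<T>(row: &Row, f: impl Fn(&Row) -> T) -> T {
    f(row)
}

fn doubled_inline(row: &Row) -> u32 {
    // Passed directly: the argument type is inferred from `map_row`'s signature.
    map_row(row, |r| r.value * 2)
}

fn doubled_via_variable(row: &Row) -> u32 {
    // Bound to a variable first: the closure body is checked before the call site
    // is seen, so `r` needs the explicit `&Row` annotation for `r.value` to resolve.
    let f = |r: &Row| r.value * 2;
    map_row(row, f)
}
```

Both functions behave the same; the only difference is where the compiler learns the argument type from.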

All in all, I don't think this is too terrible. The alternative would be replacing rusqlite with something that has a design more compatible with our needs (which I'm not opposed to, tbh), or maybe being a rustlang master and finding a tricky solution 🙃

@mariocynicys
Collaborator Author

Happy you find it not that terrible 😂
Regarding Params, I wanted this to be general over any DB query. I think an iterator is also gonna be useful for processing a newly connected block's breaches without collecting all the appointments in memory at one time.

@sr-gi
Member

sr-gi commented Mar 1, 2023

Happy you find it not that terrible 😂

It was you calling it not good at all, I'm just smoothing it out a bit lol

@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch from f12cb90 to 72b384d Compare March 7, 2023 17:09
@mariocynicys
Collaborator Author

DB Iterator stuff is backed up in https://github.com/mariocynicys/rust-teos/tree/dbm-iterator

@mariocynicys
Collaborator Author

mariocynicys commented Mar 7, 2023

After this change, reached 631MiB after trimming using the 1G DB.

Some tests needed to be adapted because they were breaking the recipe for UUID = UserId + Locator. For example, Watcher::store_appointment was sometimes passed uuids that didn't link to the appointment being stored (they were generated using generate_uuid).
This was problematic because if we store locators in the gatekeeper's UserInfo and want to get the uuid back for some locator, it won't match. And I think it causes a similar problem in the absence of a uuid -> locator map (i.e. Watcher::appointments), but I don't recall how.
But these issues were in tests only anyway.

Also, some methods might need to have their signatures changed a little bit to be less confusing and less error prone.
For example, GateKeeper::add_update_appointment accepts (user id, locator, extended appointment). One can already get the user id and locator from the extended appointment, so the passed data is duplicated. Dropping the extra parameters avoids us supplying a wrong user id or locator, and avoids confusion about what these parameters should be and whether they should match the fields in the extended appointment.

Another thing that I think we should change is the excessive use of hashsets and hashmaps in functions. Hash maps and sets are more expensive to insert into and use more memory than a vector. In many cases we pass a hashmap to a function just to iterate over it (never using the mapping functionality it offers). Such hashmaps could be replaced with a Vec<(key, val)>.

I pushed that commit 72b384d to get a concept ack, but I think will need to do some refactoring & renaming.
@sr-gi

@mariocynicys
Collaborator Author

Let me also dump some notes I had written xD.


Trying to get rid of Watcher::locator_uuid_map and Watcher::appointments.

1- For locator_uuid_map:
locator_uuid_map is used to keep uuids for each locator.
On a breach for locator X, you will need to load all the uuids associated with it and load appointments for these uuids (load_appointment(uuid)) and try to decrypt them and broadcast.

This should be replaced with:

  • Get all the locators found in a newly mined block.
  • Do a DB tx looking for all the UUIDs for these locators.
  • Do all of this in an iterable way to avoid storing any intermediaries in memory for no reason.

2- For appointments:
appointments hashmap maps from a uuid to its 2 parents (user_id & locator).

We extract locator from a uuid when a client calls get_subscription_info. We map the UUIDs in the gatekeeper's UserInfo to Locators using the appointments hashmap; this can be avoided if we store locators instead of uuids in UserInfo. We can always recover the uuids from the locators since we know the user_id for that UserInfo, and we also save 20% in size (locators are 16 bytes while uuids are 20 bytes).

We extract user_id from a uuid when a new block is connected (in filtered_block_connected). user_ids are used when creating trackers for breaches (why store user_id for a tracker?), and are also used to tell the gatekeeper which user to update after broadcasting an appointment with uuid X, like returning the users' slots and so on.
As an alternative, we can get the user_id from the database: when a new block is connected and we pull the relevant penalty TXs for broadcasting, we can pull the user_signature as well, from which we can recover the user_id. And everything should be done in an iterable fashion.


That latest commit works on point 2. I will try to work on point 1 without the iterable fashion mentioned and then adapt it once we can iterate over DB queries in a nice manner.
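A rough std-only sketch of the idea behind point 2: if the UUID is a deterministic function of (locator, user_id), UserInfo can hold locators and recompute UUIDs on demand. Everything here is illustrative: derive_uuid uses std's DefaultHasher as a stand-in for whatever hash the tower really uses, Uuid is a u64 rather than 20 bytes, and the 33-byte UserId is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type Locator = [u8; 16];
/// Assumed size, for illustration only.
type UserId = [u8; 33];

/// Stand-in UUID: a u64 instead of the tower's 20-byte value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Uuid(u64);

/// Deterministic, one-way derivation: same (locator, user_id) gives the same
/// UUID, but the locator cannot be recovered from the UUID alone.
/// NOTE: `DefaultHasher` is a placeholder, not the tower's actual hash.
fn derive_uuid(locator: &Locator, user_id: &UserId) -> Uuid {
    let mut hasher = DefaultHasher::new();
    locator.hash(&mut hasher);
    user_id.hash(&mut hasher);
    Uuid(hasher.finish())
}

/// Hypothetical, trimmed-down user record that stores locators, not UUIDs.
struct UserInfo {
    user_id: UserId,
    locators: Vec<Locator>,
}

impl UserInfo {
    /// Recomputes the UUIDs lazily instead of keeping a uuid -> locator map.
    fn uuids(&self) -> impl Iterator<Item = Uuid> + '_ {
        self.locators
            .iter()
            .map(move |locator| derive_uuid(locator, &self.user_id))
    }
}
```

With this layout, get_subscription_info can return the stored locators directly, and anything that needs UUIDs pays one hash per locator instead of a map lookup.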

@sr-gi
Member

sr-gi commented Mar 8, 2023

Alright, I think you are on the right path. I'm commenting on your notes, but haven't checked 72b384d (lmk if you'd like me to at this stage).

After this change, reached 631MiB after trimming using the 1G DB.

Some tests needed to be adapted because they were breaking the recipe for UUID = UserId + Locator. For example, Watcher::store_appointment was sometimes passed uuids that didn't link to the appointment being stored (they were generated using generate_uuid). This was problematic because if we store locators in the gatekeeper's UserInfo and want to get the uuid back for some locator, it won't match. And I think it causes a similar problem in the absence of a uuid -> locator map (i.e. Watcher::appointments), but I don't recall how. But these issues were in tests only anyway.

Yeah, that should only affect tests. In the test suite the uuid recipe was not being followed because data was being generated sort of randomly, but that should be relatively easy to patch.

Also, some methods might need to have their signatures changed a little bit to be less confusing and less error prone.
For example, GateKeeper::add_update_appointment accepts (user id, locator, extended appointment). One can already get the user id and locator from the extended appointment, so the passed data is duplicated. Dropping the extra parameters avoids us supplying a wrong user id or locator, and avoids confusion about what these parameters should be and whether they should match the fields in the extended appointment.

I partially agree with this. With respect to the user_id, I think we can simplify it given, as you mentioned, one is literally part of the other (and this indeed applies to multiple methods). With respect to the UUID, I'm not that sure given this will trigger a cascade of recomputing this same data. Take Watcher::add_appointment for instance. Here, after computing the UUID, we call:

  • self.responder.has_tracker
  • log (using UUID)
  • Gatekeeper::add_update_appointment
  • Watcher::store_triggered_appointment
  • Watcher::store_appointment

The first two need the UUID but don't get the ExtendedAppointment. The last three need the UUID and also receive the ExtendedAppointment. Just for this method we would be creating the UUID four times (and I'm not counting the calls that happen inside those methods; the store_X methods also call the DBM, which also receives UUID and ExtendedAppointment).

Another thing that I think we should change is the excessive use of hashsets and hashmaps in functions. Hash maps and sets are more expensive to insert into and use more memory than a vector. In many cases we pass a hashmap to a function just to iterate over it (never using the mapping functionality it offers). Such hashmaps could be replaced with a Vec<(key, val)>.

I agree as long as collecting the map doesn't end up giving us worse performance. The same applies to sets. The reasoning behind sets instead of vectors is that the former allow item deletion by identifier, while the latter don't. In a nutshell, if we're only iterating it may be worth it, but if we need addition/deletion it may be trickier. Actually, if we only want to iterate, wouldn't an iterator be an option? Also, are we passing copies of the maps/sets or references?
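As a sketch of the iterator option (names are hypothetical, not taken from the codebase): a function that only iterates can accept anything that yields key/value references, so callers can pass a borrowed HashMap or a Vec of pairs without converting between them.

```rust
use std::collections::HashMap;

/// Hypothetical consumer that only walks the entries and never looks up by key.
/// It accepts any source of `(&key, &value)` pairs.
fn total_blob_len<'a, I>(entries: I) -> usize
where
    I: IntoIterator<Item = (&'a u32, &'a Vec<u8>)>,
{
    entries.into_iter().map(|(_, blob)| blob.len()).sum()
}
```

A `&HashMap<u32, Vec<u8>>` can be passed as is, while a `Vec<(u32, Vec<u8>)>` needs a small adapter such as `pairs.iter().map(|(k, v)| (k, v))`. Either way nothing is copied, and the callee no longer dictates the caller's container.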

We extract locator from a uuid when a client calls get_subscription_info

The other way around, or am I missing something? The user requests a given locator and we compute the UUID based on his UserId and the requested Locator.

We map the UUIDs in the gatekeeper's UserInfo to Locators using the appointments hashmap; this can be avoided if we store locators instead of uuids in UserInfo. We can always recover the uuids from the locators since we know the user_id for that UserInfo, and we also save 20% in size (locators are 16 bytes while uuids are 20 bytes).

I barely remember this now, but I think the reason why UUIDs were stored was so, on deletion, Gatekeeper data could be mapped to Watcher and Responder data. We can indeed re-compute the UUID based on the locator and the user_id (that's why the UUID is created that way, so we could serve queries without having to do a reverse lookup), so I think you may be right here.

We extract user_id from a uuid when a new block is connected (in filtered_block_connected). user_ids are used when creating trackers for breaches (why store user_id for a tracker?)

Same reasoning I think, so we can delete the corresponding data from Gatekeeper and Watcher/Responder when needed. But again, I'm talking off the top of my head; I would need to review a change like this to see if it makes sense.

@mariocynicys
Collaborator Author

mariocynicys commented Mar 10, 2023

With respect to the UUID, I'm not that sure given this will trigger a cascade of recomputing this same data.

That's true, we can keep it taking UUID but be cautious later in the tests not to provide a random non-related UUID. Or we can embed the UUID inside the extended appointment.


Actually, if we only want to iterate, wouldn't an iterator be an option? Also, are we passing copies of the maps/sets or references?

Yup, an iterator would be the best we can do, if we manage to make the calls reacting to block_connected stream-able.
We pass references mostly (everywhere?), but some of these sets and maps are never used for mapping, so we can simplify them to vecs instead (& refs to vecs).


The other way around, or am I missing something? The user requests a given locator and we compute the UUID based on his UserId and the requested Locator.

We extract locator from a uuid when a client calls get_subscription_info

When a client asks for their subscription info, they wanna know the locators they have given out and not the UUIDs (UUIDs are tower-implementation specific after all). The thing is we store a user's appointments in terms of UUIDs inside UserInfo and we want to convert them to locators for the get_subscription_info response.
There is no going back from UUID -> Locator (without a map), so we could have stored Locators in UserInfo instead.

get_subscription_info is the message that returns subscription info to the user including all the locators they have sent out to us.

common_msgs::GetSubscriptionInfoResponse {
  available_slots: subscription_info.available_slots,
  subscription_expiry: subscription_info.subscription_expiry,
  locators: locators.iter().map(|x| x.to_vec()).collect(),
}

I think you confused it with get_appointment.


but haven't checked 72b384d (lmk if you'd like me to at this stage).

Nah, ACKing these comments was enough.
Will clean this commit, do some renamings and some refactoring for the tests.

@sr-gi
Member

sr-gi commented Mar 10, 2023

The other way around, or am I missing something? The user requests a given locator and we compute the UUID based on his UserId and the requested Locator.

We extract locator from a uuid when a client calls get_subscription_info

When a client asks for their subscription info, they wanna know the locators they have given out and not the UUIDs (UUIDs are tower-implementation specific after all). The thing is we store a user's appointments in terms of UUIDs inside UserInfo and we want to convert them to locators for the get_subscription_info response.
There is no going back from UUID -> Locator (without a map), so we could have stored Locators in UserInfo instead.

get_subscription_info is the message that returns subscription info to the user including all the locators they have sent out to us.

common_msgs::GetSubscriptionInfoResponse {
  available_slots: subscription_info.available_slots,
  subscription_expiry: subscription_info.subscription_expiry,
  locators: locators.iter().map(|x| x.to_vec()).collect(),
}

I think you confused it with get_appointment.

I was indeed thinking about single appointment requests 😅

Member

@sr-gi sr-gi left a comment

This is not a thorough review, some things may still be missing, but given I wanted to start using these optimizations at teos.talaia.watch I went ahead and gave this a look.

I think this goes in the right direction, but it needs some polishing.

@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch from 72b384d to 8b357e0 Compare May 22, 2023 16:27
@mariocynicys
Collaborator Author

mariocynicys commented May 25, 2023

Current mem usage: 69M
Which is basically the locator_cache & tx_index & carrier data & in-memory user info in the gatekeeper.

The last 4 commits should have their tests adapted and squashed into one commit. Leaving them for now for an easy linear review.

Do they make sense in terms of how much they query the DB (bottlenecks)?
@sr-gi

Member

@sr-gi sr-gi left a comment

Overlooking that tests are not building and that some methods have unnecessary arguments for now, just focusing on the big changes.

I reviewed up to the Gatekeeper (so Watcher and Responder are missing), but I don't trust GH to not delete all my pending comments, so I'll add the rest in a followup review.

Member

@sr-gi sr-gi left a comment

Looks like GH posted the comment twice, so I'm reserving this spot for later 😆

Member

@sr-gi sr-gi left a comment

Some comments on the Watcher and Responder.

I would need to do another general pass once the code is cleaned up, it's otherwise hard to follow.

Take a look at the comment regarding the disconnections, I think we should preserve that logic instead of trying to react to things on block_disconnected.

@mariocynicys
Collaborator Author

mariocynicys commented Jun 22, 2023

Current mem usage: 69M
Which is basically the locator_cache & tx_index & carrier data & in-memory user info in the gatekeeper.

Also need to mention the IO implications of this refactor. Most of the IO is reads triggered on each block by batch_check_locators_exist. Other methods' IO will arise when breaches are found (like pulling encrypted penalties from the DB), but those aren't new.

  • 111 calls to batch_check_locators_exist made ~27.6GiB of IO reads (~255MiB per block).
  • No extra writes.

Other affected reads:

  • get_subscription_info: Pulls all the user's locators from the DB, reads depend on how big the user is.

The 111 calls to batch_check_locators_exist I based those measurements on, all had no breaches found. Thus the 255MiB is solely the search operation and not that some data is read and returned. I think the search will be more efficient if we have an index over the locators in the appointments table.

@mariocynicys
Collaborator Author

The 111 calls to batch_check_locators_exist I based those measurements on, all had no breaches found. Thus the 255MiB is solely the search operation and not that some data is read and returned. I think the search will be more efficient if we have an index over the locators in the appointments table.

After applying CREATE INDEX IF NOT EXISTS locators_index ON appointments (locator): 266 calls to batch_check_locators_exist made 786MiB of IO reads (~3MiB per block).
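For reference, the per-block figures and the resulting reduction can be re-derived from the totals quoted in this thread. This is plain arithmetic, assuming one batch_check_locators_exist call per block and 1 GiB = 1024 MiB; the function names are made up for the example.

```rust
const MIB_PER_GIB: f64 = 1024.0;

/// Average MiB read per call, given total reads in MiB and the number of calls.
fn mib_per_call(total_mib: f64, calls: u32) -> f64 {
    total_mib / f64::from(calls)
}

/// Without the index: ~27.6 GiB over 111 calls.
fn reads_before_index() -> f64 {
    mib_per_call(27.6 * MIB_PER_GIB, 111)
}

/// With the index: 786 MiB over 266 calls.
fn reads_after_index() -> f64 {
    mib_per_call(786.0, 266)
}

/// How many times fewer MiB are read per call once the index exists.
fn reduction_factor() -> f64 {
    reads_before_index() / reads_after_index()
}
```

That works out to roughly 255 MiB versus 3 MiB per block, i.e. a bit over 80x less read IO per block for the no-breach case measured here.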

@sr-gi
Member

sr-gi commented Jun 22, 2023

The 111 calls to batch_check_locators_exist I based those measurements on, all had no breaches found. Thus the 255MiB is solely the search operation and not that some data is read and returned. I think the search will be more efficient if we have an index over the locators in the appointments table.

After applying CREATE INDEX IF NOT EXISTS locators_index ON appointments (locator): 266 calls to batch_check_locators_exist made 786MiB of IO reads (~3MiB per block).

Wow, what a reduction, that's nice :)

@mariocynicys
Collaborator Author

The current rework of the responder is actually equivalent to how things were previously before the memory opt stuff (but without ConfirmationStatus::ReorgedOut).

I believe it still needs some reworking to keep tracking reorged disputes for a longer time (+ actually track reorged disputes for non-reorged penalties), but dunno whether that should be in a follow-up since this one has grown a lot. Thoughts @sr-gi ?

Member

@sr-gi sr-gi left a comment

I think this makes sense, but it is starting to get to the point where a cleanup would be helpful, especially to be able to properly review each commit individually.

It would need a more thorough review, but see this as an Approach ACK

@mariocynicys
Collaborator Author

I think this makes sense, but it is starting to get to the point where a cleanup would be helpful, especially to be able to properly review each commit individually.

Ummm, I thought we could only do so many optimizations at first and relied on some in-memory structs being there. Incrementally, I realized we can remove these as well, and updated old commits, so it's a bit confusing not having done all of them in a single commit.

I would suggest squash reviewing all the commits (- the test fixing one & the merge one) at once, as separating each of them into a standalone commit will be a very nasty rebase (and probably will end up squashing most/all of them together).

@sr-gi
Member

sr-gi commented Jul 17, 2023

I think this makes sense, but it is starting to get to the point where a cleanup would be helpful, especially to be able to properly review each commit individually.

Ummm, I thought we could only do so many optimizations at first and relied on some in-memory structs being there. Incrementally, I realized we can remove these as well, and updated old commits, so it's a bit confusing not having done all of them in a single commit.

I would suggest squash reviewing all the commits (- the test fixing one & the merge one) at once, as separating each of them into a standalone commit will be a very nasty rebase (and probably will end up squashing most/all of them together).

Fair, as long as there is a cleanup of old comments and suggestions are addressed.

@mariocynicys mariocynicys requested a review from sr-gi July 18, 2023 11:14
@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch from db73cd1 to 9150458 Compare July 18, 2023 17:03
@mariocynicys mariocynicys requested a review from sr-gi August 7, 2023 17:29
Member

@sr-gi sr-gi left a comment

Final nits.

Can you rebase so I can review this from the new base? I think only tests are missing but just to avoid having to go through that again.

You can leave squashing for later, after the final review.

@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch 4 times, most recently from 43cb570 to c9d4c3f Compare August 8, 2023 06:03
@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch from c9d4c3f to 550305f Compare August 8, 2023 06:15
Member

@sr-gi sr-gi left a comment

This is a full review, including tests. It is looking pretty good.

We should be tracking things that are pending so we do not forget in the followups. I've added comments for some, here are more (some may be duplicated):

Comment on lines +1191 to +1195
assert_eq!(
    dbm.get_appointment_length(uuid).unwrap(),
    appointment.inner.encrypted_blob.len()
);
Member

This is a bit tricky because all appointments created using generate_random_appointment do use the exact same penalty, which is hardcoded.

I don't really think it is a big issue, given the assertion doesn't need to work with random sizes, but in prod appointments will certainly have different lengths.

Member

Ideally, a "random" transaction could be created by just modifying some of the transaction bits, such as value, prev_txid, ...

It may be worth adding something on these lines to the testing suite at some point.

Collaborator Author

prev_txid is already random (used for encryption) but I don't think this affects the length of the encrypted bytes.

pub struct Transaction {
    /// The protocol version, is currently expected to be 1 or 2 (BIP 68).
    pub version: i32,
    /// Block number before which this transaction is valid, or 0 for valid immediately.
    pub lock_time: u32,
    /// List of transaction inputs.
    pub input: Vec<TxIn>,
    /// List of transaction outputs.
    pub output: Vec<TxOut>,
}

We can add more inputs/outputs to the transaction. A variable size OP_RETURN should do the trick.
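
A minimal sketch of the variable-size OP_RETURN idea, using stand-in types instead of the `bitcoin` crate so that it stays self-contained. `SketchTxOut`, `op_return_script`, `op_return_output`, `serialized_len`, and the 80-byte `MAX_PAYLOAD` cap are hypothetical names and choices made for illustration, not part of the test suite. The only Bitcoin details assumed are the opcode values (OP_RETURN = 0x6a, OP_PUSHDATA1 = 0x4c) and that pushes of up to 75 bytes are encoded with a single length byte.

```rust
pub const OP_RETURN: u8 = 0x6a;
pub const OP_PUSHDATA1: u8 = 0x4c;
/// Cap chosen for this sketch; it keeps every script shorter than 253 bytes.
pub const MAX_PAYLOAD: usize = 80;

/// Stand-in for a transaction output (hypothetical, not `bitcoin::TxOut`).
pub struct SketchTxOut {
    pub value: u64,
    pub script_pubkey: Vec<u8>,
}

/// Builds an OP_RETURN script carrying `data`.
pub fn op_return_script(data: &[u8]) -> Vec<u8> {
    assert!(data.len() <= MAX_PAYLOAD, "payload too large for this sketch");
    let mut script = vec![OP_RETURN];
    if data.len() <= 75 {
        // Direct push: the opcode is the payload length itself.
        script.push(data.len() as u8);
    } else {
        // Longer payloads need OP_PUSHDATA1 followed by a one-byte length.
        script.push(OP_PUSHDATA1);
        script.push(data.len() as u8);
    }
    script.extend_from_slice(data);
    script
}

/// Creates a zero-value output whose size depends on `payload_len`.
pub fn op_return_output(payload_len: usize) -> SketchTxOut {
    SketchTxOut {
        value: 0,
        script_pubkey: op_return_script(&vec![0xab; payload_len]),
    }
}

/// Serialized size of the output: 8-byte value, 1-byte script length, script.
/// One length byte is enough because scripts here are shorter than 253 bytes.
pub fn serialized_len(out: &SketchTxOut) -> usize {
    8 + 1 + out.script_pubkey.len()
}
```

Picking a different `payload_len` per generated appointment would make the serialized penalty transaction, and therefore the encrypted blob, vary in length, which is what the `get_appointment_length` assertion would ideally be exercised against.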

teos/src/watcher.rs
@@ -329,6 +270,9 @@ impl Watcher {
.lock()
.unwrap()
.store_appointment(uuid, appointment)
// TODO: Don't unwrap, or better, make this insertion atomic with the
Member

Add this to the pending fixes issue

@mariocynicys
Collaborator Author

We should be tracking things that are pending so we do not forget in the followups

I have drafted the issues (one for add_appointment atomicity issues & one for DB persistence of tracker status), but I will wait until this PR is on master to add code references to them.

@mariocynicys mariocynicys requested a review from sr-gi August 23, 2023 12:24
@sr-gi
Member

sr-gi commented Aug 24, 2023

Most comments have been addressed; I left additional comments on the ones that have not.

Once those are addressed, this should be good to go. It will need both rebasing and squashing.

@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch 2 times, most recently from 83614ec to b4e6ba6 Compare August 30, 2023 14:43
@mariocynicys mariocynicys force-pushed the better-mem-cpu-usage branch 2 times, most recently from 5d32e26 to f965c40 Compare August 31, 2023 11:48
@mariocynicys mariocynicys requested a review from sr-gi September 1, 2023 13:30
mariocynicys and others added 2 commits September 1, 2023 16:31
By loading the minimal necessary data during bootstrap, we get lower
memory usage and faster bootstrapping.

Co-authored-by: Sergi Delgado Segura <[email protected]>
`last_known_blocks` was taking up ~300 MB of memory (for 100 blocks) because it was not dropped in `main`.

Co-authored-by: Sergi Delgado Segura <[email protected]>
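
The fix described in this commit message comes down to releasing the bootstrap data once nothing built from it needs it anymore. A minimal sketch, assuming a hypothetical `SketchBlock` type and `bootstrap_tip_height` helper (neither is taken from the teos codebase):

```rust
/// Stand-in for a block fetched at startup (hypothetical type).
pub struct SketchBlock {
    pub height: u32,
    pub raw: Vec<u8>,
}

/// Loads `n` blocks of `block_size` bytes, keeps only what later code needs
/// (here just the tip height), and frees the blocks before moving on.
pub fn bootstrap_tip_height(n: u32, block_size: usize) -> Option<u32> {
    let last_known_blocks: Vec<SketchBlock> = (0..n)
        .map(|height| SketchBlock {
            height,
            raw: vec![0u8; block_size],
        })
        .collect();

    // Extract the small piece of state that has to outlive the bootstrap.
    let tip = last_known_blocks.last().map(|block| block.height);

    // Without an explicit drop, the vector lives until the end of the
    // enclosing scope. In a long-running `main`, that means for the whole
    // lifetime of the process.
    drop(last_known_blocks);

    // Anything long-running would go here, with the blocks already freed.
    tip
}
```

Wrapping the bootstrap in its own block or function has the same effect as `drop`, since the vector is freed when its scope ends.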
Regarding the `Watcher`, the fields (`appointments`, `locator_uuid_map`) have
been replaced by DB calls when needed.

For `Responder`, the field `trackers` has been replaced by DB calls when
needed, and `tx_tracker_map` wasn't actually needed for the tower to
operate, so was just dropped.

For `GateKeeper`, `registered_users::appointments` which used to hold
the uuids of every appointment the user submitted was removed so that
`registered_users` only holds meta information about users.

The gatekeeper is now also the entity responsible for deleting appointments from the database. Previously, the watcher/responder asked the gatekeeper which users to update and then carried out the deletion and the update itself; now the watcher/responder hands the gatekeeper the uuids to delete, and the gatekeeper figures out which users it needs to update (i.e. refund the freed slots to).

Also, as in `Watcher::store_triggered_appointment`, if an appointment is invalid or was rejected by the network during block connections, the freed slots will not be refunded to the user.

Also, block connection now starts with the gatekeeper. This lets the
gatekeeper delete the outdated users first, so that the watcher and the
responder don't take them into account.
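
The deletion flow described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `SketchGatekeeper`, `SketchDb`, the integer `UserId`/`Uuid` aliases, and the `refund` flag are all hypothetical, and an in-memory map stands in for the database.

```rust
use std::collections::HashMap;

pub type UserId = u32;
pub type Uuid = u32;

/// Stand-in for the appointments table: uuid -> (owner, slots used).
pub type SketchDb = HashMap<Uuid, (UserId, u32)>;

pub struct SketchGatekeeper {
    /// Only per-user metadata stays in memory (here, the available slots).
    pub registered_users: HashMap<UserId, u32>,
}

impl SketchGatekeeper {
    /// Deletes `uuids` from the DB. The caller only hands over uuids; the
    /// gatekeeper works out which users they belong to. When `refund` is
    /// true, the freed slots are given back to those users.
    pub fn delete_appointments(&mut self, db: &mut SketchDb, uuids: &[Uuid], refund: bool) {
        for uuid in uuids {
            // Unknown uuids are skipped silently in this sketch.
            if let Some((user_id, slots)) = db.remove(uuid) {
                if refund {
                    if let Some(available) = self.registered_users.get_mut(&user_id) {
                        *available += slots;
                    }
                }
            }
        }
    }
}
```

In this sketch, the watcher or responder would pass `refund = false` for appointments that turned out to be invalid or were rejected by the network, matching the no-refund rule above.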
Member

@sr-gi sr-gi left a comment

LGTM

@sr-gi sr-gi merged commit 658fcca into talaia-labs:master Sep 14, 2023
Successfully merging this pull request may close these issues.

Edge case where outdated users will persist in the DB
2 participants