[docs] Add code level docs for public types #620 (#633)

* update guide
* update guide
* add reference to main links
* rename guide to efficient agdb
* Update db_search_handlers.rs
* add db docs
* wip
* update
agnesoft · Jul 9, 2023 · c5d0509
1 parent 407c168
Showing 57 changed files with 1,586 additions and 68 deletions.
diff --git a/.gitignore b/.gitignore
@@ -8,3 +8,5 @@ Cargo.lock
 
 # These are backup files generated by rustfmt
 **/*.rs.bk
+db.agdb
+.db.agdb
diff --git a/README.md b/README.md
@@ -17,10 +17,10 @@ The Agnesoft Graph Database (aka _agdb_) is persistent memory mapped graph datab
 # Key Features
 
 - Data plotted on a graph
-- Typed key-value properties of graph elements (nodes & edges)
+- Typed [key-value properties](docs/concepts.md#data-types) attached to graph elements (nodes & edges)
 - Persistent file based storage
 - ACID compliant
-- Object queries with builder pattern (no text, no query language)
+- [Object queries](docs/queries.md) with builder pattern (no text, no query language)
 - Memory mapped for fast querying
 - _No dependencies_
 
@@ -84,7 +84,7 @@ println!("{:?}", user);
 //   ] }
 ```
 
-For comprehensive overview of all queries see the [queries](docs/queries.md) reference or continue with more in-depth [efficient agdb](docs/efficient_agdb.md).
+For database concepts and **supported data** types see [concepts](docs/concepts.md). For comprehensive overview of all queries see the [queries](docs/queries.md) reference or continue with more in-depth [efficient agdb](docs/efficient_agdb.md).
 
 # Roadmap
 

diff --git a/docs/concepts.md b/docs/concepts.md
@@ -5,6 +5,7 @@
   - [Query](#query)
   - [Transaction](#transaction)
   - [Storage](#storage)
+  - [Data types](#data-types)
 
 ## Graph
 
@@ -68,7 +69,47 @@ The database durability is provided by the write ahead log (WAL) file which reco
 
 Just like the memory the main database file will get fragmented over time. Sectors of the file used for the data that was later reallocated will remain unused (fragmented) until the database file is defragmented. That operation is performed automatically on database object instance drop.
 
+The storage taken by individual elements and properties is generally as follows:
+
+- node: 32 bytes
+- edge: 32 bytes
+- single key or value (<=15 bytes): 16 bytes
+- single key or value (>15 bytes): 32 bytes (+)
+- key-value pair: 32 bytes (+)
+
+The size of the graph elements (nodes & edges) is fixed. The size of the properties (key-value pairs) is at least 32 bytes (16 per key and 16 per value) but can be greater if the value itself is greater. This creates some inefficiency for small values (e.g. integers) but it also allows application of small value optimization where values up to 15 bytes in size (e.g. strings) do not allocate or take extra space. When a value is larger than 15 bytes it will be stored separately with another 16 bytes overhead making it at least `32 + value length` bytes.
+
+The reason values take at least 16 bytes instead of 8 is that each value must also store its type information, which requires 1 byte. A 9 byte size could save some file space but it is awkward and very inefficient (as measured, 16 byte values were much faster). The next alignment is therefore 16 bytes, which also allows the aforementioned small value optimization.
+
 **Terminology:**
 
 - File storage (underlying single data file)
 - Write ahead log (WAL, shadowing file storage to provide durability)
+
+## Data types
+
+Supported types of both keys and values are:
+
+- `i64`
+- `u64`
+- `f64`
+- `String`
+- `Vec<u8>`
+- `Vec<i64>`
+- `Vec<u64>`
+- `Vec<f64>`
+- `Vec<String>`
+
+The value type is an enum with a limited number of supported types that are universal across all platforms and programming languages. They are serialized in the file as follows:
+
+| Type          | Layout                                                       | Size     |
+| ------------- | ------------------------------------------------------------ | -------- |
+| `i64`         | little endian                                                | 8 bytes  |
+| `u64`         | little endian                                                | 8 bytes  |
+| `f64`         | little endian                                                | 8 bytes  |
+| `String`      | size as `u64` followed by UTF-8 encoded string as `u8` bytes | 8+ bytes |
+| `Vec<u8>`     | size as `u64` followed by individual `u8` bytes              | 8+ bytes |
+| `Vec<i64>`    | size as `u64` followed by individual `i64` elements          | 8+ bytes |
+| `Vec<u64>`    | size as `u64` followed by individual `u64` elements          | 8+ bytes |
+| `Vec<f64>`    | size as `u64` followed by individual `f64` elements          | 8+ bytes |
+| `Vec<String>` | size as `u64` followed by individual `String` elements       | 8+ bytes |
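
A minimal Rust sketch of the layout described in the table above, for illustration only. The function names are hypothetical and not part of `agdb`. The sketch also assumes that the `u64` size prefix is little endian like the numeric types, that the size of a `String` is its length in bytes, and that the size of a `Vec<i64>` is its element count; the table does not state these details explicitly.

```rust
/// Serializes an `i64` as 8 little endian bytes.
pub fn serialize_i64(value: i64) -> Vec<u8> {
    value.to_le_bytes().to_vec()
}

/// Serializes a string as its byte length (`u64`, assumed little endian)
/// followed by the UTF-8 encoded bytes.
pub fn serialize_string(value: &str) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(8 + value.len());
    bytes.extend_from_slice(&(value.len() as u64).to_le_bytes());
    bytes.extend_from_slice(value.as_bytes());
    bytes
}

/// Serializes a list of `i64` as its element count (`u64`, an assumption)
/// followed by each element as 8 little endian bytes.
pub fn serialize_vec_i64(values: &[i64]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(8 + values.len() * 8);
    bytes.extend_from_slice(&(values.len() as u64).to_le_bytes());
    for value in values {
        bytes.extend_from_slice(&value.to_le_bytes());
    }
    bytes
}
```

Under these assumptions `serialize_string("ab")` yields 10 bytes: the 8 byte size prefix followed by the two UTF-8 bytes, matching the "8+ bytes" entry in the table.
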
diff --git a/docs/queries.md b/docs/queries.md
@@ -26,6 +26,11 @@
     - [Select all aliases](#select-all-aliases)
   - [Search](#search)
     - [Conditions](#conditions)
+    - [Truth tables](#truth-tables)
+      - [And](#and)
+      - [Or](#or)
+      - [Modifiers](#modifiers)
+      - [Results](#results)
     - [Paths](#paths)
 
 All interactions with the `agdb` are realized through queries. There are two kinds of queries:
@@ -677,9 +682,73 @@ The conditions are applied one at a time to each visited element and chained usi
 
 The condition `Distance` and the condition modifiers `Beyond` and `NotBeyond` are particularly important because they can directly influence the search. The former (`Distance`) can limit the depth of the search and can help with constructing more elaborate queries (or sequence thereof) extracting only fine grained elements (e.g. nodes whose edges have particular properties or are connected to other nodes with some properties). The latter (`Beyond` and `NotBeyond`) can limit search to only certain areas of an otherwise larger graph. Its most basic usage would be with condition `ids` to flat out stop the search at certain elements or continue only beyond certain elements.
 
+### Truth tables
+
+The following information should help with reasoning about the query conditions. Most of it should be intuitive but there are some aspects that might not be obvious especially when combining logic operators and condition modifiers. The search is using the following `enum` when evaluating conditions:
+
+```Rust
+pub enum SearchControl {
+    Continue(bool),
+    Finish(bool),
+    Stop(bool),
+}
+```
+
+The variant controls the search and the boolean value controls whether the given element is included in the search result. `Stop` prevents the search from expanding beyond the current element (stopping the search in that direction). `Finish` immediately exits the search, returning the accumulated elements (ids), and is only used internally with `offset` and `limit` (NOTE: path search and `order_by` still require a complete search regardless of `limit`).
+
+Each condition contributes to the final control result as follows with the starting/default value being always `Continue(true)`:
+
+#### And
+
+| Left           | Right           | Result                  |
+| -------------- | --------------- | ----------------------- |
+| Continue(left) | Continue(right) | Continue(left && right) |
+| Continue(left) | Stop(right)     | Stop(left && right)     |
+| Continue(left) | Finish(right)   | Finish(left && right)   |
+| Stop(left)     | Stop(right)     | Stop(left && right)     |
+| Stop(left)     | Finish(right)   | Finish(left && right)   |
+| Finish(left)   | Finish(right)   | Finish(left && right)   |
+
+#### Or
+
+| Left           | Right           | Result                    |
+| -------------- | --------------- | ------------------------- |
+| Continue(left) | Continue(right) | Continue(left \|\| right) |
+| Continue(left) | Stop(right)     | Continue(left \|\| right) |
+| Continue(left) | Finish(right)   | Continue(left \|\| right) |
+| Stop(left)     | Stop(right)     | Stop(left \|\| right)     |
+| Stop(left)     | Finish(right)   | Stop(left \|\| right)     |
+| Finish(left)   | Finish(right)   | Finish(left \|\| right)   |
+
+#### Modifiers
+
+Modifiers will change the result of a condition based on the control value (the boolean) as follows:
+
+| Modifier  | TRUE                | FALSE                  |
+| --------- | ------------------- | ---------------------- |
+| None      | -                   | -                      |
+| Beyond    | `&& Continue(true)` | `\|\| Stop(false)`     |
+| Not       | `!`                 | `!`                    |
+| NotBeyond | `&& Stop(true)`     | `\|\| Continue(false)` |
+
+#### Results
+
+Most conditions result in `Continue(bool)` except for `distance()` and nested `where()` which can also result in `Stop(bool)`:
+
+| Condition   | Continue | Stop |
+| ----------- | -------- | ---- |
+| Where       | YES      | YES  |
+| Edge        | YES      | NO   |
+| Node        | YES      | NO   |
+| Distance    | YES      | YES  |
+| EdgeCount\* | YES      | NO   |
+| Ids         | YES      | NO   |
+| Key(Value)  | YES      | NO   |
+| Keys        | YES      | NO   |
+
 ### Paths
 
-Path search (`from().to()`) uses A\* algorithm. Every element (node or edge) has a cost of `1` by default. If it passes all the conditions the cost will remain `1` and would be included in the result (if the path it is on would be selected). If it fails any of the conditions its cost will be `2`. This means that the algorithm will prefer paths where elements match the conditions rather than the absolutely shortest path (that can be achieved with no conditions). If the search is not to continue beyond certain element (through `beyond()` or `not_beyond()` conditions) its cost will be `0` and the paths it is on will no longer be considered for that search.
+Path search (`from().to()`) uses A\* algorithm. Every element (node or edge) has a cost of `1` by default. If it passes all the conditions (the `SearchControl` value `true`) the cost will remain `1` and would be included in the result (if the path it is on would be selected). If it fails any of the conditions (the `SearchControl` value `false`) its cost will be `2`. This means that the algorithm will prefer paths where elements match the conditions rather than the absolutely shortest path (that can be achieved with no conditions). If the search is not to continue beyond certain element (through `beyond()`, `not_beyond()` or `distance()` conditions) its cost will be `0` and the paths it is on will no longer be considered for that search.
 
 ---
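
The `And` and `Or` truth tables from the queries guide can be sketched in Rust as follows. This is an illustration derived only from the tables, not the actual `agdb` implementation: the methods `and`, `or`, `value`, `rank` and `from_rank` are hypothetical, and the sketch assumes the tables are symmetric (each pair is listed once, so `Stop` combined with `Continue` is treated the same as `Continue` combined with `Stop`).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SearchControl {
    Continue(bool),
    Finish(bool),
    Stop(bool),
}

impl SearchControl {
    /// The boolean deciding whether the element is included in the result.
    fn value(self) -> bool {
        match self {
            SearchControl::Continue(v) | SearchControl::Finish(v) | SearchControl::Stop(v) => v,
        }
    }

    /// Hypothetical ordering of the variants: Continue < Stop < Finish.
    fn rank(self) -> u8 {
        match self {
            SearchControl::Continue(_) => 0,
            SearchControl::Stop(_) => 1,
            SearchControl::Finish(_) => 2,
        }
    }

    fn from_rank(rank: u8, value: bool) -> Self {
        match rank {
            0 => SearchControl::Continue(value),
            1 => SearchControl::Stop(value),
            _ => SearchControl::Finish(value),
        }
    }

    /// `And` table: the more restrictive variant wins, booleans are and-ed.
    pub fn and(self, other: Self) -> Self {
        Self::from_rank(self.rank().max(other.rank()), self.value() && other.value())
    }

    /// `Or` table: the less restrictive variant wins, booleans are or-ed.
    pub fn or(self, other: Self) -> Self {
        Self::from_rank(self.rank().min(other.rank()), self.value() || other.value())
    }
}
```

For example, `SearchControl::Continue(true).and(SearchControl::Stop(false))` evaluates to `Stop(false)`, while the same operands combined with `or` evaluate to `Continue(true)`.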
 

diff --git a/src/agdb/db.rs b/src/agdb/db.rs
@@ -84,6 +84,91 @@ impl Serialize for DbStorageIndex {
     }
 }
 
+/// An instance of the `agdb` database. To create a database:
+///
+/// ```
+/// use agdb::Db;
+///
+/// let mut db = Db::new("db.agdb").unwrap();
+/// ```
+///
+/// This will try to create or load the database file path `db.agdb`.
+/// If the file does not exist a new database will be initialized creating
+/// the given file. If the file does exist the database will try to load
+/// it and memory map the data.
+///
+/// You can execute queries or transactions on the database object with
+///
+/// - exec() // immutable queries
+/// - exec_mut() // mutable queries
+/// - transaction() // immutable transactions
+/// - transaction_mut() // mutable transactions
+///
+/// # Examples
+///
+/// ```
+/// use agdb::{Db, QueryBuilder, QueryError};
+///
+/// let mut db = Db::new("db.agdb").unwrap();
+///
+/// // Insert single node
+/// db.exec_mut(&QueryBuilder::insert().nodes().count(1).query()).unwrap();
+///
+/// // Insert single node as a transaction
+/// db.transaction_mut(|t| -> Result<(), QueryError> { t.exec_mut(&QueryBuilder::insert().nodes().count(1).query())?; Ok(()) }).unwrap();
+///
+/// // Select single database element with id 1
+/// db.exec(&QueryBuilder::select().ids(1).query()).unwrap();
+///
+/// // Select single database element with id 1 as a transaction
+/// db.transaction(|t| -> Result<(), QueryError> { t.exec(&QueryBuilder::select().ids(1).query())?; Ok(()) }).unwrap();
+///
+/// // Search the database starting at element 1
+/// db.exec(&QueryBuilder::search().from(1).query()).unwrap();
+/// ```
+/// # Transactions
+///
+/// All queries are transactions. Explicit transactions take closures that are passed
+/// the transaction object to record & execute queries. You cannot explicitly commit
+/// nor rollback transactions. To commit a transaction simply return `Ok` from the
+/// transaction closure. Conversely to rollback a transaction return `Err`. Nested
+/// transactions are not allowed.
+///
+/// # Multithreading
+///
+/// The `agdb` is multithreading enabled. It is recommended to use `Arc<RwLock>`:
+///
+/// ```
+/// use std::sync::{Arc, RwLock};
+/// use agdb::Db;
+///
+/// let db = Arc::new(RwLock::new(Db::new("db.agdb").unwrap()));
+/// db.read().unwrap(); //for a read lock allowing Db::exec() and Db::transaction()
+/// db.write().unwrap(); //for a write lock allowing additionally Db::exec_mut() and Db::transaction_mut()
+/// ```
+/// Using the database in the multi-threaded environment is then the same as in a single
+/// threaded application (apart from the locking). Nevertheless, while Rust does prevent
+/// data races, you still need to be on the lookout for potential deadlocks. This is
+/// one of the reasons why nested transactions are not supported by the `agdb`.
+///
+/// Akin to the Rust borrow checker rules the `agdb` can handle an unlimited number
+/// of concurrent reads (transactional or regular) but only a single write operation
+/// at any one time. For that reason the transactions are not database states or objects
+/// but rather a function taking a closure executing the queries in an attempt to limit
+/// their scope as much as possible (and therefore the duration of the [exclusive] lock).
+///
+/// # Storage
+///
+/// The `agdb` is using a single database file to store all of its data. Additionally
+/// a single shadow file with a `.` prefix of the main database file name is used as
+/// a write ahead log (WAL). On drop of the `Db` object the WAL is processed and removed
+/// aborting any unfinished transactions. Furthermore the database data is defragmented.
+///
+/// On load, if the WAL file is present (e.g. due to a crash), it will be processed
+/// restoring any consistent state that existed before the crash. Data is only
+/// written to the main file if the reverse operation has been committed to the
+/// WAL file. The WAL is then purged on commit of a transaction (all queries are
+/// transactional even if the transaction is not explicitly used).
 pub struct Db {
     storage: Rc<RefCell<FileStorage>>,
     graph: DbGraph,
@@ -94,11 +179,12 @@ pub struct Db {
 
 impl std::fmt::Debug for Db {
     fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
-        f.debug_struct("Db").finish_non_exhaustive()
+        f.debug_struct("agdb::Db").finish_non_exhaustive()
     }
 }
 
 impl Db {
+    /// Tries to create or load `filename` file as `Db` object.
     pub fn new(filename: &str) -> Result<Db, DbError> {
         match Self::try_new(filename) {
             Ok(db) => Ok(db),
@@ -110,20 +196,71 @@ impl Db {
         }
     }
 
+    /// Executes immutable query:
+    ///
+    /// - Select elements
+    /// - Select values
+    /// - Select keys
+    /// - Select key count
+    /// - Select aliases
+    /// - Select all aliases
+    /// - Search
+    ///
+    /// It runs the query as a transaction and returns either the result or
+    /// error describing what went wrong (e.g. query error, logic error, data
+    /// error etc.).
     pub fn exec<T: Query>(&self, query: &T) -> Result<QueryResult, QueryError> {
         self.transaction(|transaction| transaction.exec(query))
     }
 
+    /// Executes mutable query:
+    ///
+    /// - Insert nodes
+    /// - Insert edges
+    /// - Insert aliases
+    /// - Insert values
+    /// - Remove elements
+    /// - Remove aliases
+    /// - Remove values
+    ///
+    /// It runs the query as a transaction and returns either the result or
+    /// error describing what went wrong (e.g. query error, logic error, data
+    /// error etc.).
     pub fn exec_mut<T: QueryMut>(&mut self, query: &T) -> Result<QueryResult, QueryError> {
         self.transaction_mut(|transaction| transaction.exec_mut(query))
     }
 
+    /// Executes immutable transaction. The transaction is running a closure `f`
+    /// that will receive `&Transaction` object to run `exec` queries as if run
+    /// on the main database object. You shall specify the return type `T`
+    /// (can be `()`) and the error type `E` that must be constructible from the `QueryError`
+    /// (`E` can be `QueryError`).
+    ///
+    /// Read transactions cannot be committed or rolled back but their main function is to ensure
+    /// that the database data does not change during their duration. Through its generic
+    /// parameters it also allows transforming the query results into a type `T`.
     pub fn transaction<T, E>(&self, f: impl Fn(&Transaction) -> Result<T, E>) -> Result<T, E> {
         let transaction = Transaction::new(self);
 
         f(&transaction)
     }
 
+    /// Executes mutable transaction. The transaction is running a closure `f`
+    /// that will receive `&mut Transaction` to execute `exec` and `exec_mut` queries
+    /// as if run on the main database object. You shall specify the return type `T`
+    /// (can be `()`) and the error type `E` that must be constructible from the `QueryError`
+    /// (`E` can be `QueryError`).
+    ///
+    /// Write transactions are committed if the closure returns `Ok` and rolled back if
+    /// the closure returns `Err`. If the code panics and the program exits the write
+    /// ahead log (WAL) makes sure the data in the main database file is restored to a
+    /// consistent state prior to the transaction.
+    ///
+    /// Typical use case for a write transaction is to insert nodes and edges together.
+    /// When not using a transaction you could end up only with nodes being inserted.
+    ///
+    /// Through its generic parameters the transaction also allows transforming the query
+    /// results into a type `T`.
     pub fn transaction_mut<T, E: From<QueryError>>(
         &mut self,
         f: impl Fn(&mut TransactionMut) -> Result<T, E>,
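
The commit and rollback behaviour described in the `transaction_mut` docs (return `Ok` to commit, `Err` to roll back) can be illustrated with a small self-contained sketch. `ToyDb` and its methods are hypothetical stand-ins and not the `agdb` API. The sketch restores an in-memory snapshot on error, whereas the docs above describe `agdb` as using a write ahead log for this purpose.

```rust
/// Hypothetical in-memory stand-in for a database; not the `agdb::Db` type.
#[derive(Debug, Default)]
pub struct ToyDb {
    nodes: Vec<String>,
}

impl ToyDb {
    /// Inserts a node and returns the new node count; empty names are rejected.
    pub fn insert_node(&mut self, name: &str) -> Result<usize, String> {
        if name.is_empty() {
            return Err("node name must not be empty".to_string());
        }
        self.nodes.push(name.to_string());
        Ok(self.nodes.len())
    }

    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }

    /// Runs the closure `f`; keeps its changes on `Ok` (commit) and restores
    /// the previous state on `Err` (rollback).
    pub fn transaction_mut<T, E>(
        &mut self,
        f: impl FnOnce(&mut ToyDb) -> Result<T, E>,
    ) -> Result<T, E> {
        let snapshot = self.nodes.clone();
        let result = f(self);
        if result.is_err() {
            self.nodes = snapshot;
        }
        result
    }
}
```

A closure that inserts two valid nodes leaves both in place, while a closure that inserts one valid node and then fails leaves the node count unchanged. This mirrors the use case mentioned in the docs of inserting nodes and edges together.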

diff --git a/src/agdb/db/db_element.rs b/src/agdb/db/db_element.rs
@@ -1,9 +1,14 @@
 use super::db_key_value::DbKeyValue;
 use crate::DbId;
 
+/// Database element used in `QueryResult`
+/// that represents a node or an edge.
 #[derive(Debug, PartialEq)]
 pub struct DbElement {
+    /// Element id.
     pub id: DbId,
+
+    /// List of key-value pairs associated with the element.
     pub values: Vec<DbKeyValue>,
 }
 

diff --git a/src/agdb/db/db_error.rs b/src/agdb/db/db_error.rs
@@ -8,6 +8,9 @@ use std::num::TryFromIntError;
 use std::panic::Location;
 use std::string::FromUtf8Error;
 
+/// Universal `agdb` database error. It represents
+/// any error caused by the database processing such as
+/// loading a database, writing data etc.
 #[derive(Debug)]
 pub struct DbError {
     pub description: String,

diff --git a/src/agdb/db/db_float.rs b/src/agdb/db/db_float.rs
@@ -5,6 +5,12 @@ use std::cmp::Ordering;
 use std::hash::Hash;
 use std::hash::Hasher;
 
+/// Database float is a wrapper around `f64` to provide
+/// functionality like comparison. The comparison is
+/// using `total_cmp` standard library function. See its
+/// [docs](https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp)
+/// to understand how it handles NaNs and other edge cases
+/// of floating point numbers.
 #[derive(Clone, Debug)]
 pub struct DbFloat(f64);
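
As a rough illustration of what comparing through `total_cmp` provides, the sketch below wraps an `f64` in a hypothetical `TotalF64` type (not the actual `DbFloat` implementation, which is not shown in this diff) and derives a total ordering from `f64::total_cmp`.

```rust
use std::cmp::Ordering;

/// Hypothetical wrapper giving `f64` a total order via `total_cmp`.
#[derive(Clone, Copy, Debug)]
pub struct TotalF64(pub f64);

impl Ord for TotalF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

impl PartialOrd for TotalF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for TotalF64 {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for TotalF64 {}
```

Unlike the built-in `f64` comparison, this ordering treats a NaN as equal to a NaN with the same bit pattern and orders `-0.0` before `0.0`, which is what allows such a wrapper to implement `Eq` and `Ord`.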
 

diff --git a/src/agdb/db/db_id.rs b/src/agdb/db/db_id.rs
@@ -5,6 +5,11 @@ use crate::utilities::serialize::Serialize;
 use crate::utilities::serialize::SerializeStatic;
 use crate::utilities::stable_hash::StableHash;
 
+/// Database id is a wrapper around `i64`.
+/// The id is an identifier of a database element,
+/// either a node or an edge. Positive ids represent nodes,
+/// negative ids represent edges. The value of `0` is
+/// logically invalid (there cannot be an element with id 0) and is the default.
 #[derive(Clone, Copy, Debug, Default, Eq, Hash, PartialEq, Ord, PartialOrd)]
 pub struct DbId(pub i64);
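
The id convention from the `DbId` docs (positive ids are nodes, negative ids are edges, `0` is invalid) can be sketched as a small helper. `ElementKind` and `element_kind` are hypothetical names used only for illustration and are not part of `agdb`.

```rust
/// Hypothetical classification of a raw id value.
#[derive(Debug, PartialEq)]
pub enum ElementKind {
    Node,
    Edge,
    Invalid,
}

/// Classifies a raw `i64` id following the convention in the docs above.
pub fn element_kind(id: i64) -> ElementKind {
    match id {
        0 => ElementKind::Invalid,
        id if id > 0 => ElementKind::Node,
        _ => ElementKind::Edge,
    }
}
```

With this helper `element_kind(1)` is `Node`, `element_kind(-1)` is `Edge`, and `element_kind(0)`, the default id value, is `Invalid`.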