NMT Review #94

staheri14 · 2023-02-01T21:44:48Z

This is not a formal pull request for review, but rather a place to share my findings, questions, and suggestions regarding the implementation of the NMT as part of the following EPIC (celestiaorg/celestia-app#1296). I am conducting a thorough review of the NMT implementation and design.
The current content of the PR will undergo further updates as I continue writing and updating the code documentation and adding more questions and suggestions. I am using this PR only as a place to communicate my thoughts.

To enhance the code documentation, I have added descriptions for some of the functions that previously lacked proper documentation. I would appreciate it if you could take a look and provide feedback, @liamsi.

My suggestions and questions are annotated as TODO [Me] to distinguish them from existing annotations in the code. I will continue to add comments and questions as I proceed with my review. Upon resolution of any questions, I will replace the annotations with proper documentation. If my suggestions are approved, I will open GitHub issues to implement them in subsequent pull requests.

The primary objectives of my code review are to:

Understand the proof generation algorithm and the structure of the Proof, particularly the order of nodes retrieved in the "nodes" field of the Proof structure and how to interpret the other fields in this struct.
Evaluate the verification procedure.
Identify any opportunities for security or performance improvement.
Improve the code documentation and readability, which would also be beneficial for future maintainers.

cc: @evan-forbes

liamsi · 2023-02-06T19:02:38Z

namespace/data.go

+// TODO [Me] Shouldn't we specify that the first 8 bytes represent the namespace.ID
 type PrefixedData []byte


Yeah, the issue with this is that the library wants to support varying namespace sizes (not only 8 bytes necessarily).

We could instead:

just not use a type alias for byte and simply accept that this is even easier to misuse (would have the nice side effect that the user would not need to cast between bytes and PrefixedData at all)

use a dedicated type; e.g. one of the two options here: namespace: PrefixedData/PrefixedData8 are easy to misuse #71
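
A minimal sketch of what the dedicated-type option could look like, assuming the type carries the namespace length itself rather than hard-coding 8 bytes. The names `PrefixedLeaf` and `NewPrefixedLeaf` are hypothetical and do not come from the library or from #71:

```go
package main

import (
	"errors"
	"fmt"
)

// PrefixedLeaf is a hypothetical replacement for PrefixedData: it stores the
// namespace length next to the raw bytes instead of assuming 8 bytes.
type PrefixedLeaf struct {
	nidLen int
	raw    []byte // namespace ID followed by the data
}

// NewPrefixedLeaf rejects inputs that are too short to contain a namespace ID.
func NewPrefixedLeaf(nidLen int, raw []byte) (PrefixedLeaf, error) {
	if nidLen <= 0 || len(raw) < nidLen {
		return PrefixedLeaf{}, errors.New("leaf shorter than namespace ID")
	}
	return PrefixedLeaf{nidLen: nidLen, raw: raw}, nil
}

// NamespaceID returns the first nidLen bytes of the leaf.
func (p PrefixedLeaf) NamespaceID() []byte { return p.raw[:p.nidLen] }

// Data returns everything after the namespace ID.
func (p PrefixedLeaf) Data() []byte { return p.raw[p.nidLen:] }

func main() {
	leaf, err := NewPrefixedLeaf(3, []byte{0, 0, 1, 'h', 'i'})
	if err != nil {
		panic(err)
	}
	fmt.Printf("nid=%x data=%q\n", leaf.NamespaceID(), leaf.Data())
}
```

The trade-off is the one already described: callers can no longer pass a plain byte slice, but the split point between namespace and data is validated once at construction.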

liamsi · 2023-02-06T19:07:46Z

nmt.go

@@ -115,6 +119,7 @@ func New(h hash.Hash, setters ...Option) *NamespacedMerkleTree {
 		leaves:          make([][]byte, 0, opts.InitialCapacity),
 		leafHashes:      make([][]byte, 0, opts.InitialCapacity),
 		namespaceRanges: make(map[string]leafRange),
+		// TODO [Me] Shouldn't minNID be populated by `0x00`?


Then, this would not work:

nmt/nmt.go

Line 400 in b04eea5

func (n *NamespacedMerkleTree) updateMinMaxID(id namespace.ID) {

That said, I think we can get rid of these two fields entirely!

I see, then, does it mean that the minID is actually equal to 0xFF which corresponds to the Max namespace ID?

> That said, I think we can get rid of these two fields entirely!

Noted!

Update: I just double-checked the code and figured out that these two fields represent the range of namespace IDs of the data stored in a namespaced Merkle tree, and they are accessed here:

nmt/nmt.go

Line 174 in ac87f1c

if nID.Less(n.minNID) || n.maxNID.Less(nID) { // TODO [Me] we could move this entire if block inside the `foundInRange` function

This means that, as it stands, we cannot remove these two fields.
We could, however, retrieve the same information, i.e., the namespace ID range, from the tree's rawRoot field, since it is supposed to embody the min and max namespace IDs of the entire tree. That would require some refactoring so that the root gets calculated after each Push operation, which is not currently the case.

Yes, removing these fields will need some refactoring. I did not think it through beyond the fact that the info about min/max namespaces also materializes in the root.

> I see, then, does it mean that the minID is actually equal to 0xFF which corresponds to the Max namespace ID?

Only to initialize the minID such that the if-block yields true for any namespace smaller than the max one (and hence the minID gets updated here properly):

nmt/nmt.go

Lines 400 to 407 in 29cca3c

func (n *NamespacedMerkleTree) updateMinMaxID(id namespace.ID) {
	if id.Less(n.minNID) {
		n.minNID = id
	}
	if n.maxNID.Less(id) {
		n.maxNID = id
	}
}
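
The initialization argument can be illustrated with a small standalone sketch. It assumes that `Less` is a plain lexicographic byte comparison and that maxNID starts at all `0x00` bytes (the thread only confirms the `0xFF` start for minNID); the `tracker` type is hypothetical and only mirrors the two fields:

```go
package main

import (
	"bytes"
	"fmt"
)

// tracker mirrors only the two fields under discussion.
type tracker struct {
	minNID, maxNID []byte
}

// newTracker starts with minNID at the largest possible ID (all 0xFF) and
// maxNID at the smallest (all 0x00, an assumption), so that the first pushed
// ID replaces both.
func newTracker(size int) *tracker {
	return &tracker{
		minNID: bytes.Repeat([]byte{0xFF}, size),
		maxNID: bytes.Repeat([]byte{0x00}, size),
	}
}

// update follows the same two comparisons as updateMinMaxID.
func (tr *tracker) update(id []byte) {
	if bytes.Compare(id, tr.minNID) < 0 {
		tr.minNID = id
	}
	if bytes.Compare(tr.maxNID, id) < 0 {
		tr.maxNID = id
	}
}

func main() {
	tr := newTracker(2)
	for _, id := range [][]byte{{0, 5}, {0, 2}, {0, 9}} {
		tr.update(id)
	}
	fmt.Printf("min=%x max=%x\n", tr.minNID, tr.maxNID)
}
```

Had minNID started at `0x00`, the first comparison would never succeed and the minimum would stay at zero regardless of what is pushed.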

liamsi · 2023-02-06T19:13:00Z

proof.go

 				return false
 			}
-			leafData := append(gotLeafNid, gotLeaf[nIDLen:]...)
+			leafData := append(gotLeafNid, gotLeaf[nIDLen:]...) // TODO why not just passing the leaf? isn't it the same?


I think you are right!
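
A quick standalone check of that claim, assuming gotLeafNid is a sub-slice of gotLeaf (as in `gotLeaf[:nIDLen]`). The `rebuild` function below is a hypothetical stand-in for the expression in the diff:

```go
package main

import (
	"bytes"
	"fmt"
)

// rebuild mimics the expression from the diff: the namespace prefix of the
// leaf followed by the rest of the same leaf.
func rebuild(leaf []byte, nIDLen int) []byte {
	nid := leaf[:nIDLen]
	// nid shares leaf's backing array and has spare capacity, so append
	// writes the same bytes back into the same positions.
	return append(nid, leaf[nIDLen:]...)
}

func main() {
	leaf := []byte{0, 0, 0, 1, 'd', 'a', 't', 'a'}
	want := append([]byte(nil), leaf...) // independent reference copy
	got := rebuild(leaf, 4)
	fmt.Println(bytes.Equal(got, want), bytes.Equal(leaf, want))
}
```

Under that assumption the result is byte-for-byte identical to the leaf, so passing the leaf directly would avoid the extra append.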

liamsi · 2023-02-06T20:11:19Z

proof.go

 	nth := NewNmtHasher(h, nID.Size(), proof.isMaxNamespaceIDIgnored)
 	min := namespace.ID(MinNamespace(root, nID.Size()))
 	max := namespace.ID(MaxNamespace(root, nID.Size()))
+	// TODO [Me] this never happens, the min and max are exactly nID.Size() bytes


The passed-in nID could be any length though (it's just a byte slice):

nmt/namespace/id.go

Line 5 in b04eea5

type ID []byte

> The passed-in nID could be any length though

Right, however, just to clarify, what I meant is that the MinNamespace(root, nID.Size()) and MaxNamespace(root, nID.Size()) functions are designed to always return a namespace ID with a size of nID.Size().

nmt/nmt.go

Line 425 in b04eea5

func MinNamespace(hash []byte, size namespace.IDSize) []byte {

Thus, considering this, we do not need to check the size of the retrieved namespace IDs i.e., the part below:
if nID.Size() != min.Size() || nID.Size() != max.Size()

I see. Yes, you are right.
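
To make the argument concrete, here is a sketch under the assumption that a namespaced hash is laid out as min ID, then max ID, then digest. The lowercase `minNamespace` and `maxNamespace` are illustrative stand-ins, not the library functions:

```go
package main

import "fmt"

// minNamespace copies the first size bytes of a namespaced hash.
func minNamespace(hash []byte, size int) []byte {
	return append(make([]byte, 0, size), hash[:size]...)
}

// maxNamespace copies the next size bytes of a namespaced hash.
func maxNamespace(hash []byte, size int) []byte {
	return append(make([]byte, 0, size), hash[size:2*size]...)
}

func main() {
	// Assumed layout: min ID (2 bytes) || max ID (2 bytes) || digest.
	root := []byte{0, 1, 0, 7, 0xAA, 0xBB, 0xCC}
	for _, nID := range [][]byte{{0, 3}, {0, 9}} {
		size := len(nID)
		minID, maxID := minNamespace(root, size), maxNamespace(root, size)
		// Both lengths equal size by construction, so comparing them
		// against len(nID) again cannot fail.
		fmt.Println(len(minID) == size, len(maxID) == size)
	}
}
```

Note that in this sketch a root shorter than `2*size` makes the slice expressions panic, so a length check on the root itself may still be worth having even if the size comparison is dropped.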

liamsi · 2023-02-07T08:46:33Z

nmt.go

+		// TODO [Me] Shouldn't we instead return the first or the last node in the tree as the exclusion proof?
+		// TODO [Me] although I think this current logic is based on the premise that the root of the tree is trusted


That is a good question. I thought the root (which commits to the min/max namespace IDs of the tree) is proof enough that the tree does not contain the passed-in nID. Is there a way to construct a tree/root such that the root is not sufficient proof that the namespace is not included? I do not think so, but maybe I am missing something. These are exactly the edge cases that might require further thought.

If all the querying nodes have access to the correct NMT roots, then there should be no problem, and hence no exclusion proof is required (I will still think about it to see whether any corner cases may arise). Thanks for your answer!
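
The reasoning in this exchange can be sketched as a range check against a trusted root. This assumes the root is laid out as min ID, then max ID, then digest, and that IDs compare lexicographically; `outsideRootRange` is a hypothetical name:

```go
package main

import (
	"bytes"
	"fmt"
)

// outsideRootRange reports whether nID lies outside the [min, max] range
// encoded in the first 2*len(nID) bytes of root (assumed layout).
func outsideRootRange(root, nID []byte) bool {
	size := len(nID)
	if len(root) < 2*size {
		return false // malformed root: nothing can be concluded
	}
	minID, maxID := root[:size], root[size:2*size]
	return bytes.Compare(nID, minID) < 0 || bytes.Compare(maxID, nID) < 0
}

func main() {
	root := []byte{0, 2, 0, 7, 0xAA, 0xBB} // min=0002, max=0007, then digest
	fmt.Println(outsideRootRange(root, []byte{0, 1})) // below the range
	fmt.Println(outsideRootRange(root, []byte{0, 5})) // inside the range
	fmt.Println(outsideRootRange(root, []byte{0, 9})) // above the range
}
```

The check is only as strong as the trust in the root: if the verifier holds the correct root, an out-of-range nID needs no further nodes.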

liamsi · 2023-02-07T08:46:48Z

nmt.go

 	if !found {
 		// To generate a proof for an absence we calculate the
 		// position of the leaf that is in the place of where
 		// the namespace would be in:
-		proofStart = n.calculateAbsenceIndex(nID)
+		proofStart = n.calculateAbsenceIndex(nID) // TODO [Me] this could simply return a range, to avoid the line below


liamsi · 2023-02-07T08:52:35Z

nmt.go

 	recurse = func(start, end int, includeNode bool) []byte {
-		if start >= len(n.leafHashes) {
+		if start >= len(n.leafHashes) { // TODO [Me] why against leafHashes? and not leaves?


No particular reason. The lengths of both should be the same.

liamsi · 2023-02-07T09:02:09Z

nmt.go

+		// check whether the subtree representing the [start, end) range of leaves has overlap with the
+		// queried proof range i.e., [proofStart, proofEnd)
+		// if not
+		if (end <= proofStart || start >= proofEnd) && includeNode { //TODO [Me] the `&& includeNode` seems ineffective and unnecessary


Not sure I understand this 🤔

My suggestion is about a small simplification on the if statement.
What I meant is that if we remove the condition on the includeNode from the if statement, we could get the same result, i.e.,

newIncludeNode := includeNode
if (end <= proofStart || start >= proofEnd) {
	newIncludeNode = false
}

The reason is that:
In the original version, if includeNode is false, the if block is never entered, hence newIncludeNode keeps the includeNode value, i.e., false (the new if statement yields the same result).
In the original version, if the includeNode is true, and if the first condition is true i.e., (end <= proofStart || start >= proofEnd) then we enter the if block and toggle the newIncludeNode to false (which is the same result we obtain from the new if statement).
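
The case analysis above can be checked exhaustively. In this sketch, `outside` stands for `(end <= proofStart || start >= proofEnd)`, and the body of the if block is assumed to do nothing other than set newIncludeNode to false:

```go
package main

import "fmt"

// original is the condition as written in the diff.
func original(outside, includeNode bool) bool {
	newIncludeNode := includeNode
	if outside && includeNode {
		newIncludeNode = false
	}
	return newIncludeNode
}

// simplified drops the `&& includeNode` part.
func simplified(outside, includeNode bool) bool {
	newIncludeNode := includeNode
	if outside {
		newIncludeNode = false
	}
	return newIncludeNode
}

// equivalent compares both versions on all four input combinations.
func equivalent() bool {
	for _, outside := range []bool{false, true} {
		for _, include := range []bool{false, true} {
			if original(outside, include) != simplified(outside, include) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(equivalent())
}
```

If the if block in the real code had any other side effect, the two versions would no longer be interchangeable, so that is the one thing to confirm before simplifying.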

liamsi · 2023-02-07T09:05:08Z

proof.go

 			if len(gotLeaf) < int(nIDLen) {
 				// conflicting namespace sizes
 				return false
 			}
-			gotLeafNid := namespace.ID(gotLeaf[:nIDLen])
+			gotLeafNid := namespace.ID(gotLeaf[:nIDLen]) // TODO [Me] a helper function


👍🏼 everything that improves readability (without impacting performance much) should be done.
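
A sketch of such a helper, folding the length check and the slicing into one call. The name `leafNamespace` and its signature are hypothetical, and it uses plain byte slices instead of `namespace.ID`:

```go
package main

import "fmt"

// leafNamespace is a hypothetical helper: it returns the namespace ID prefix
// of leaf, or ok=false when the leaf is shorter than nIDLen.
func leafNamespace(leaf []byte, nIDLen int) (nid []byte, ok bool) {
	if nIDLen < 0 || len(leaf) < nIDLen {
		return nil, false // conflicting namespace sizes
	}
	return leaf[:nIDLen], true
}

func main() {
	if nid, ok := leafNamespace([]byte{0, 0, 0, 1, 'x'}, 4); ok {
		fmt.Printf("namespace=%x\n", nid)
	}
	if _, ok := leafNamespace([]byte{0, 1}, 4); !ok {
		fmt.Println("leaf too short")
	}
}
```

The helper returns a sub-slice rather than a copy, so it should cost no more than the inline version.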

liamsi · 2023-02-07T09:10:51Z

proof.go

 	var leafIndex uint64
+	// TODO [Me] Why called leftSubtrees?


Not sure what would be a better name. LeftSubtreeRoots maybe?

My question was originally about why the data populated inside leftSubtrees represents the left subtrees. Later, by examining the code, I figured out why. I should have removed this comment.
Nevertheless, I like your suggested name, LeftSubtreeRoots; it is more indicative of the content of the variable.

…commendations.

staheri14 · 2023-02-10T20:29:28Z

@liamsi I have added a few other questions and suggestions, would appreciate your review and thoughts on them.

staheri14 · 2023-05-12T21:56:03Z

Closing the PR since it has fulfilled its intended purpose, and several of the questions raised in this conversation have been converted into corresponding GH issues or addressed in separate PRs. If there are any remaining matters that need attention, I will address them by creating additional GH issues.

staheri14 added 21 commits January 25, 2023 16:31

clarifies the unit of nidSize

57481a0

adds function description for foundInRange

d842ff6

adds missing func descriptions or makes further clarifications

5495171

adds TODOs and missing function descriptions

f1f1a53

refactors gotLeafHashes to leafHashes

0d3d318

corrects the description of generateLeafData

be869c9

adds todos and comments and missing description of methods/functions

69dd211

adds some optimization ideas

b02c28b

adds todo for early hash calculation

da16c5d

adds two more questions about the namespace len size

2c82fbb

modifies the Prove documentation

12cc868

updates the comments

66fd0a8

adds godoc for calculateAbsenceIndex

256a705

adds the function name to the description

810de44

adds further clarification about the absence proof

3dbcd83

deletes excess line

184a0a7

removes a todo

bf514d0

elaborates on the return values of foundInRange

d659113

removes a todo

ef7c36f

adds a suggestion for early return in VerifyNamespace

dc838f1

removes some old questions

28b38b3

staheri14 self-assigned this Feb 2, 2023

staheri14 added 3 commits February 1, 2023 17:26

explains the order of nodes in the proof

393db51

removes a question

33cf50f

revises the type of the proof

ac87f1c

evan-forbes requested review from liamsi and evan-forbes February 6, 2023 16:25