Merge branch 'release/v24.0' into main

dgraph-io · Aug 11, 2024 · 1759fe3
2 parents 77de360 + 2302bc7
commit 1759fe3

Showing 16 changed files with 237 additions and 15 deletions.
diff --git a/.github/styles/Vocab/Dgraph/accept.txt b/.github/styles/Vocab/Dgraph/accept.txt
@@ -130,6 +130,7 @@ rebalancing
 unary
 loopback
 snake_case
+semver
 
 Leia
 Skywalker

diff --git a/LICENSE.md b/LICENSE.md
@@ -1,6 +1,6 @@
 ## Dgraph Licensing
 
-Copyright 2016-2021 Dgraph Labs, Inc.
+Copyright 2016-2024 Dgraph Labs, Inc.
 
 Source code in this repository is variously licensed under the Apache Public
 License 2.0 (APL) and the Dgraph Community License (DCL). A copy of each license

diff --git a/README.md b/README.md
@@ -38,7 +38,9 @@ Making our documentation easy to understand includes optimizing it for client-si
 Use hugo shortcode for relref.
 
 Example, to reference a term, use a relref to the glossary :
+```
 >  [entity]({{< relref "dgraph-glossary.md#entity" >}})
+```
 
 ### Staging doc updates locally
 

diff --git a/content/deploy/cli-command-reference.md b/content/deploy/cli-command-reference.md
@@ -128,6 +128,7 @@ The `--badger` superflag allows you to set many advanced [Badger options](https:
 | `--query_edge_limit` | uint64 | `query-edge` | uint64 |`alpha`| Maximum number of edges that can be returned in a query |
 | `--normalize_node_limit` | int | `normalize-node` | int |`alpha`| Maximum number of nodes that can be returned in a query that uses the normalize directive |
 | `--mutations_nquad_limit` | int | `mutations-nquad` | int |`alpha`| Maximum number of nquads that can be inserted in a mutation request |
+| `--max-pending-queries` | int | `max-pending-queries` | int |`alpha`| Maximum number of concurrently processing requests allowed before requests are rejected with 429 Too Many Requests |
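
The new `--max-pending-queries` row says that requests over the limit are rejected with 429 Too Many Requests. A client might react by retrying with exponential backoff. The following is a minimal sketch under stated assumptions: `send` is a hypothetical callable that performs one request and returns its HTTP status code, and the retry count and delays are illustrative values, not Dgraph recommendations.

```python
import time


def send_with_backoff(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns something other than 429.

    send         -- hypothetical callable returning an HTTP status code
    max_attempts -- illustrative cap on the total number of calls
    base_delay   -- first wait in seconds; doubled after every 429
    sleep        -- injectable so the wait can be stubbed in tests
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        # Wait only if another attempt will follow.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.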
 
 ### Raft superflag
 

diff --git a/content/design-concepts/replication-concept.md b/content/design-concepts/replication-concept.md
@@ -9,4 +9,4 @@ weight = 85
 Each Highly-Available (HA) group will be served by at least 3 instances (or two if one is temporarily unavailable). In the case of an alpha instance
 failure, other alpha instances in the same group still handle the load for data in that group. In case of a zero instance failure, the remaining two zeros in the zero group will continue to hand out timestamps and perform other zero functions.
 
-In addition, Dgraph `Learner Nodes` are alpha instances that hold replicas of data, but this replication is to suupport read replicas, often in a different geography from the master cluster. This replication is implemented the same way as HA replication, but the learner nodes do not participate in quorum, and do not take over from failed nodes to provide high availability.
+In addition, Dgraph `Learner Nodes` are alpha instances that hold replicas of data, but this replication is to support read replicas, often in a different geography from the master cluster. This replication is implemented the same way as HA replication, but the learner nodes do not participate in quorum, and do not take over from failed nodes to provide high availability.
diff --git a/content/dql/dql-schema.md b/content/dql/dql-schema.md
@@ -16,6 +16,9 @@ revenue: float .
 running_time: int .
 starring: [uid] .
 director: [uid] .
+description: string .
+
+description_vector: float32vector @index(hnsw(metric:"cosine")) .
 
 type Person {
   name
@@ -28,6 +31,8 @@ type Film {
   running_time
   starring
   director
+  description
+  description_vector
 }
 ```
 
@@ -112,6 +117,15 @@ For all triples with a predicate of scalar types the object is a literal.
 are RFC 3339 compatible which is different from ISO 8601(as defined in the RDF spec). You should
 convert your values to RFC 3339 format before sending them to Dgraph.{{% /notice  %}}
 
+### Vector Type
+
+The `float32vector` type denotes a vector of floating point numbers, i.e., an ordered array of float32. A node type can contain more than one vector predicate.
+
+Vectors are normally used to store embeddings obtained from other information through an ML model. When a `float32vector` is [indexed]({{<relref "dql/predicate-indexing.md">}}), the DQL [similar_to]({{<relref "query-language/functions#vector-similarity-search">}}) function can be used for similarity search.
+
+
+
+
 ### UID Type
 
 The `uid` type denotes a relationship; internally each node is identified by it's UID which is a `uint64`.

diff --git a/content/dql/predicate-indexing.md b/content/dql/predicate-indexing.md
@@ -9,6 +9,15 @@ weight = 4
 
 Filtering on a predicate by applying a [function]({{< relref "query-language/functions.md" >}}) requires an index.
 
+Indices are defined in the [Dgraph types schema]({{<relref "dql/dql-schema.md" >}}) using the `@index` directive.
+
+Here are some examples:
+```
+name: string @index(term) .
+release_date: datetime @index(year) .
+description_vector: float32vector @index(hnsw(metric:"cosine")) .
+```
+
 When filtering by applying a function, Dgraph uses the index to make the search through a potentially large dataset efficient.
 
 All scalar types can be indexed.
@@ -17,6 +26,8 @@ Types `int`, `float`, `bool` and `geo` have only a default index each: with toke
 
 Types `string` and `dateTime` have a number of indices.
 
+Type `float32vector` supports the `hnsw` index.
+
 ## String Indices
 The indices available for strings are as follows.
 
@@ -34,6 +45,30 @@ transaction conflict rate. Use only the minimum number of and simplest indexes
 that your application needs.
 {{% /notice %}}
 
+## Vector Indices
+
+The indices available for `float32vector` are as follows.
+
+| Dgraph function            | Required index / tokenizer             | Notes |
+| :-----------------------   | :------------                          | :---  |
+| `similar_to`                       | `hnsw` | HNSW index supports parameters `metric` and `exponent`. |
+
+
+The `hnsw` (**Hierarchical Navigable Small World**) index supports the following parameters:
+- `metric`: the metric used to compute vector similarity. One of `cosine`, `euclidean`, or `dotproduct`. Default is `euclidean`.
+
+- `exponent`: an integer, represented as a string, giving the approximate number of vectors expected in the index as a power of 10. The exponent value is used to set "reasonable defaults" for HNSW internal tuning parameters. Default is "4" (10^4 vectors).
+
+
+Here are some examples:
+```
+simple_vector: float32vector @index(hnsw) .
+description_vector: float32vector @index(hnsw(metric:"cosine")) .
+large_vector: float32vector @index(hnsw(metric:"euclidean",exponent:"6")) .
+```
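
To make the three `metric` names and the `exponent` parameter concrete, here is a small sketch of the textbook definitions. It is not Dgraph's implementation: the source does not say how Dgraph turns cosine or dot-product scores into a distance, so the functions below only compute the standard quantities, and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalising their lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def expected_vectors(exponent: str) -> int:
    """The exponent is a string holding a power of 10, e.g. "6" means 10^6."""
    return 10 ** int(exponent)
```

For instance, `expected_vectors("6")` corresponds to the `large_vector` example, which is sized for roughly a million vectors.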
+
 ## DateTime Indices
 
 The indices available for `dateTime` are as follows.

diff --git a/content/graphql/mutations/mutations-overview.md b/content/graphql/mutations/mutations-overview.md
@@ -221,6 +221,35 @@ mutation {
 }
 ```
 
+## Vector Embedding mutations
+
+For types with vector embeddings, Dgraph automatically generates the add mutation. This example of an add mutation uses the following schema.
+
+```graphql
+type User {
+    userID: ID!
+    name: String!
+    name_v: [Float!] @embedding @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
+}
+
+mutation {
+addUser(input: [
+{ name: "iCreate with a Mini iPad", name_v: [0.12, 0.53, 0.9, 0.11, 0.32] },
+{ name: "Resistive Touchscreen", name_v: [0.72, 0.89, 0.54, 0.15, 0.26] },
+{ name: "Fitness Band", name_v: [0.56, 0.91, 0.93, 0.71, 0.24] },
+{ name: "Smart Ring", name_v: [0.38, 0.62, 0.99, 0.44, 0.25] }]) 
+  {
+    user {
+      userID
+      name
+      name_v
+    }
+  }
+}
+```
+
+Note: The embeddings are generated outside of Dgraph using any suitable machine learning model.
+
 ## Examples
 
 You can refer to the following [link](https://github.com/dgraph-io/dgraph/tree/main/graphql/schema/testdata/schemagen) for more examples.
diff --git a/content/graphql/queries/aggregate.md b/content/graphql/queries/aggregate.md
@@ -1,7 +1,7 @@
 +++
 title = "Aggregate Queries"
 description = "Dgraph automatically generates aggregate queries for GraphQL schemas. These are compatible with the @auth directive."
-weight = 3
+weight = 4
 [menu.main]
     parent = "graphql-queries"
     name = "Aggregate Queries"

diff --git a/content/graphql/queries/vector-similarity.md b/content/graphql/queries/vector-similarity.md
@@ -0,0 +1,64 @@
++++
+title = "Similarity Search"
+description = "Dgraph automatically generates GraphQL queries for each vector index that you define in your schema. There are two types of queries generated for each index."
+weight = 3
+[menu.main]
+    parent = "graphql-queries"
+    identifier = "vector-queries"
++++
+
+Dgraph automatically generates two GraphQL similarity queries for each type that has at least one [vector predicate](/graphql/schema/types/#vectors) with the `@search` directive.
+
+For example:
+
+```graphql
+type User {
+    id: ID!
+    name: String!
+    name_v: [Float!] @embedding @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
+}
+```
+
+With the above schema, the auto-generated `querySimilar<Object>ByEmbedding` query allows us to run a similarity search using the vector index specified in our schema.
+
+```graphql
+querySimilar<Object>ByEmbedding(
+    by: vector_predicate, 
+    topK: n, 
+    vector: searchVector): [User]
+```
+
+For example, to find the top 3 users with names similar to a given user name embedding, the following query can be used.
+
+```graphql  
+querySimilarUserByEmbedding(by: name_v, topK: 3, vector: [0.1, 0.2, 0.3, 0.4, 0.5]) {
+        id
+        name
+        vector_distance
+     }
+```
+The results obtained for this query include the 3 closest Users ordered by `vector_distance`. The `vector_distance` is the Euclidean distance between the `name_v` embedding vector and the input vector used in our query.
+
+Note: you can omit the `vector_distance` predicate in the query; the result will still be ordered by `vector_distance`.
+
+The distance metric used is specified in the index creation. 
+
+Similarly, the auto-generated `querySimilar<Object>ById` query allows us to search for objects similar to an existing object, given its ID.
+
+```graphql
+querySimilar<Object>ById(
+    by: vector_predicate, 
+    topK: n, 
+    id: userID):  [User]
+```
+
+For example the following query searches for top 3 users whose names are most similar to the name of the user with id "0xef7".
+
+```graphql
+querySimilarUserById(by: name_v, topK: 3, id: "0xef7") {
+    id
+    name
+    vector_distance
+}
+```
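
As a reference model for what the two similarity queries return, the sketch below ranks records by exact Euclidean distance and keeps the closest `topK`. It is a brute-force illustration only: an HNSW index performs an approximate search, so this is not how Dgraph computes its results. The sample records reuse the names and vectors from the add mutation example in this commit; the function name `top_k_similar` is hypothetical.

```python
import math

# Sample records taken from the add mutation example in this commit.
users = [
    {"name": "iCreate with a Mini iPad", "name_v": [0.12, 0.53, 0.9, 0.11, 0.32]},
    {"name": "Resistive Touchscreen", "name_v": [0.72, 0.89, 0.54, 0.15, 0.26]},
    {"name": "Fitness Band", "name_v": [0.56, 0.91, 0.93, 0.71, 0.24]},
    {"name": "Smart Ring", "name_v": [0.38, 0.62, 0.99, 0.44, 0.25]},
]


def top_k_similar(items, vector, top_k):
    """Return the top_k items closest to vector, nearest first."""
    scored = [
        {"name": item["name"], "vector_distance": math.dist(item["name_v"], vector)}
        for item in items
    ]
    # A smaller distance means a closer match, so sort ascending.
    scored.sort(key=lambda row: row["vector_distance"])
    return scored[:top_k]


closest = top_k_similar(users, [0.1, 0.2, 0.3, 0.4, 0.5], 3)
```

Each entry in `closest` carries a `vector_distance` value, mirroring the field that can be requested in the GraphQL queries.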
+
diff --git a/content/graphql/quick-start/index.md b/content/graphql/quick-start/index.md
@@ -67,19 +67,21 @@ You may want to use the introspection capability of the client to explore the sc
 To populate the database,
 1. Open the [API Explorer](https://cloud.dgraph.io/_/explorer) tab
 2. Paste the following code into the text area:
-   ```graphql
-   mutation {
-      addProduct(input: [
-        { name: "GraphQL on Dgraph"},
-        { name: "Dgraph: The GraphQL Database"}
-      ]) {
-        product {
-          productID
-          name
-        }
+  ```graphql
+  mutation {
+    addProduct(
+      input: [
+        { name: "GraphQL on Dgraph" }
+        { name: "Dgraph: The GraphQL Database" }
+      ]
+    ) {
+      product {
+        productID
+        name
       }
-      addCustomer(input: [{ username: "Michael"}]) {
-        customer {
+    }
+    addCustomer(input: [{ username: "Michael" }]) {
+      customer {
         username
       }
     }

diff --git a/content/graphql/schema/directives/_index.md b/content/graphql/schema/directives/_index.md
@@ -38,6 +38,12 @@ Reference: [Deprecation]({{< relref "deprecated.md" >}})
 
 Reference: [@dgraph directive]({{< relref "directive-dgraph" >}})
 
+### @embedding
+
+The `@embedding` directive designates one or more fields as vector embeddings.
+
+Reference: [@embedding directive]({{< relref "embedding" >}})
+
 ### @generate
 
 The `@generate` directive is used to specify which GraphQL APIs are generated for a type.

diff --git a/content/graphql/schema/directives/embedding.md b/content/graphql/schema/directives/embedding.md
@@ -0,0 +1,13 @@
++++
+title = "@embedding"
+weight = 1
+[menu.main]
+    parent = "directives"
++++
+
+
+A Float array can be used as a vector using the `@embedding` directive. It denotes a vector of floating point numbers, i.e., an ordered array of float32.
+
+The embeddings can be defined on one or more predicates of a type and they are generated using suitable machine learning models.
+
+This directive is used in conjunction with the `@search` directive to declare the HNSW index. For more information, see the [@search](/graphql/schema/directives/search/#vector-embedding) directive for vector embeddings.
diff --git a/content/graphql/schema/directives/search.md b/content/graphql/schema/directives/search.md
@@ -624,3 +624,23 @@ query {
   }
 }
 ```
+
+### Vector embedding
+
+The `@search` directive is used in conjunction with the `@embedding` directive to define the HNSW index on vector embeddings. These vector embeddings are obtained from external machine learning models.
+
+```graphql
+type User {
+    userID: ID!
+    name: String!
+    name_v: [Float!] @embedding @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
+}
+```
+
+In this schema, the field `name_v` is an embedding on which the HNSW algorithm is used to create a vector search index. 
+
+The metric used to compute the distance between vectors (in this example) is Euclidean distance. Other possible metrics are `cosine` and `dotproduct`.
+
+The directive, `@embedding`, designates one or more fields as vector embeddings.
+
+The `exponent` value is used to set reasonable defaults for HNSW internal tuning parameters. It is an integer giving the approximate number of vectors expected in the index as a power of 10. Default is “4” (10^4 vectors).
diff --git a/content/graphql/schema/types.md b/content/graphql/schema/types.md
@@ -51,6 +51,26 @@ type User {
 
 Scalar lists in Dgraph act more like sets, so `tags: [String]` would always contain unique tags.  Similarly, `recentScores: [Float]` could never contain duplicate scores.
 
+### Vectors
+
+A Float array can be used as a vector using the `@embedding` directive. It denotes a vector of floating point numbers, i.e., an ordered array of float32. A type can contain more than one vector predicate.
+
+Vectors are normally used to store embeddings obtained from an ML model.
+
+When a Float vector is indexed, the GraphQL `querySimilar<type name>ByEmbedding` and `querySimilar<type name>ById` functions can be used for [similarity search]({{<relref "vector-similarity.md">}}).
+
+A simple example of adding a vector embedding on `name` to the `User` type is shown below.
+
+```graphql
+type User {
+    userID: ID!
+    name: String!
+    name_v: [Float!] @embedding @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
+}
+```
+
+In this schema, the field `name_v` is an embedding on which the [@search](/graphql/schema/directives/search/#vector-embedding) directive for vector embeddings is used.
+
 ### The `ID` type
 
 In Dgraph, every node has a unique 64-bit identifier that you can expose in GraphQL using the `ID` type. An `ID` is auto-generated, immutable and never reused. Each type can have at most one `ID` field.

diff --git a/content/query-language/functions.md b/content/query-language/functions.md
@@ -177,6 +177,21 @@ Same query with a Levenshtein distance of 3.
 }
 {{< /runnable >}}
 
+## Vector Similarity Search
+
+Syntax Examples: `similar_to(predicate, 3, "[0.9, 0.8, 0, 0]")`
+
+Alternatively the vector can be passed as a variable: `similar_to(predicate, 3, $vec)`
+
+This function finds the nodes that have `predicate` close to the provided vector. The search is based on the distance metric specified in the index (`cosine`, `euclidean`, or `dotproduct`). A shorter distance indicates more similarity.
+The second parameter, `3`, specifies that the top 3 matches are returned.
+
+Schema Types: `float32vector`
+
+Index Required: `hnsw`
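
The syntax example above passes the vector as a quoted string. A client that holds the embedding as a list of numbers might format that literal as sketched below. The helper name `similar_to_call` is hypothetical and the function only builds text in the shape shown in the syntax example. It does not talk to Dgraph, and for untrusted input the variable form (`$vec`) is likely the safer choice.

```python
import json


def similar_to_call(predicate, top_k, vector):
    """Format a similar_to(...) call with the vector as a quoted literal."""
    # json.dumps renders a list of numbers as "[0.9, 0.8, 0, 0]".
    literal = json.dumps(vector)
    return f'similar_to({predicate}, {top_k}, "{literal}")'


call = similar_to_call("predicate", 3, [0.9, 0.8, 0, 0])
```

The resulting string can then be placed in the `func:` position of a DQL query block.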
+
+
+
 ## Full-Text Search
 
 Syntax Examples: `alloftext(predicate, "space-separated text")` and `anyoftext(predicate, "space-separated text")`