How to increase cpu usage of query node #38674

park12sj · 2024-12-23T13:25:04Z

I want to maximize the search throughput using the DiskANN method. According to the Milvus benchmark report, search throughput increases proportionally with CPU cores when scaling up query nodes. However, when I attempted scaling up, I observed that the CPU usage of the query nodes remains very low (around 10%).

Which component in the pipeline could be a potential bottleneck before the query nodes?

Test conditions:
- query node
  - replica : 100
  - cpu core : 8
  - memory : 16Gi
- embedding size : 768
- collection Entity Count : 10,000,000
- disk iops
  - Random Read/Write: 98K IOPS / 30K IOPS
  - Sequential Read/Write: 550MB/s / 520MB/s

Currently, we are measuring QPS using multi-vector search (5 vectors per request) with a limit of 10, and the QPS is around 200. It seems unlikely that disk I/O is the bottleneck.

The reasoning is based on this discussion, which assumes about 10 disk I/Os per returned result (that is, per unit of limit).

With a 5-vector search, a limit of 10, and 200 QPS, the I/O generated per second would be 200 * 10 * 10 * 5 = 100K. However, since there are 100 query node replicas, the I/O per query node is only about 1,000 per second.

This is significantly below the IOPS capacity of the disk in use, so I concluded that disk I/O is not the bottleneck. If there is any flaw in this reasoning, please let me know.
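The estimate above can be written out as a short calculation. This is only a sketch of the arithmetic in this post: the figure of 10 disk I/Os per returned result is the assumption taken from the linked discussion, the load is assumed to spread evenly across replicas, and the variable names are illustrative.

```python
# Back-of-the-envelope disk I/O estimate, using the numbers from this post.
qps = 200                       # measured queries per second
vectors_per_query = 5           # multi-vector search with 5 vectors
limit = 10                      # results requested per vector
ios_per_result = 10             # ASSUMPTION: disk I/Os per returned result
replicas = 100                  # query node replicas
disk_random_read_iops = 98_000  # random-read capacity of the disk

# Total read I/O per second generated by the whole workload.
total_iops = qps * vectors_per_query * limit * ios_per_result

# ASSUMPTION: requests are spread evenly across all replicas.
iops_per_node = total_iops / replicas

# Fraction of one disk's random-read capacity that this would use.
utilization = iops_per_node / disk_random_read_iops

print(f"total I/O per second: {total_iops}")
print(f"I/O per second per query node: {iops_per_node:.0f}")
print(f"share of random-read capacity: {utilization:.1%}")
```

If the real I/O count per result is much higher than 10, the conclusion changes quickly, so `ios_per_result` is the value worth verifying against disk metrics.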

Thank you.

yhmo · 2024-12-24T02:36:32Z

Collaborator

To improve CPU usage, you can increase the replica number when loading the collection.

collection.release()
collection.load(replica_number=4)

With replica_number=4, the cluster requires 4x the memory capacity, and QPS will increase to about 800. You also need more clients sending requests in parallel.

DISKANN requires more disk I/O than other index types. If you don't believe disk I/O is the bottleneck, you can try this:

Replace the DISKANN index with IVF_FLAT or HNSW, and test the QPS again.

3 replies

park12sj Dec 24, 2024
Author

I have already tried increasing the replica_number, but it had no effect. I am curious whether, in theory, increasing replica_number generally improves performance with DiskANN. According to the documentation, replicas seem to work in memory.

park12sj Dec 24, 2024
Author

Because the amount of data loaded into the collection exceeds the cluster's available memory, I cannot use in-memory indexes such as IVF_FLAT or HNSW.

Apart from disk I/O, are there any other potential bottlenecks to consider?

As mentioned, the calculated I/O does not reach the IOPS of the hardware in use, so I do not think disk I/O is the bottleneck. Could there be an error in the calculation formula?
Additionally, since CPU usage is low, it does not seem to be a compute bottleneck.
If neither disk I/O nor compute is the bottleneck, there might be another bottleneck in front of the query node, which is why I opened this discussion.

yhmo Dec 24, 2024
Collaborator

I guess your client isn't sending enough requests, so most of the server resources are idle.
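One way to check whether the client is the limit is to drive the same search from an increasing number of concurrent clients and watch whether QPS keeps rising. The following is a minimal sketch of such a load generator, not a Milvus tool: `measure_qps` and `fake_search` are hypothetical names, and `fake_search` only sleeps to imitate latency. In a real test it would be replaced by a function that calls the actual search (for example `collection.search(...)` with your own parameters).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(search_fn, num_clients, requests_per_client):
    """Call search_fn from num_clients threads and return the overall QPS."""

    def worker(_client_id):
        for _ in range(requests_per_client):
            search_fn()

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        # list() waits for all workers and re-raises any worker exception.
        list(pool.map(worker, range(num_clients)))
    elapsed = time.perf_counter() - start
    return num_clients * requests_per_client / elapsed


def fake_search():
    # Stand-in for a real search call; sleeps 5 ms to imitate latency.
    time.sleep(0.005)


if __name__ == "__main__":
    for clients in (1, 4, 16):
        qps = measure_qps(fake_search, clients, 20)
        print(f"{clients} clients: {qps:.0f} QPS")
```

If QPS stops growing as clients are added while query node CPU stays low, the limit is somewhere other than the client. For very high request rates, a single Python process may itself become the limit, so several client processes or machines may be needed.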

xiaobingxia-at · 2024-12-26T02:19:25Z

I had a similar issue. Here's my perspective:

- If you use DiskANN or memory mapping, definitely use a local SSD, whose IOPS can reach about 1 million, versus 3,000-16K IOPS on a regular gp3 EBS disk on AWS.
- I personally don't think one DiskANN search generates such a small number of I/O requests. I ran my test using HNSW + MMAP and checked the disk I/O metrics in Datadog. My rough estimate at that time was 500-1,500 I/Os per request. I guess DiskANN is even worse.
- If you are throttled by disk, your CPU usage likely can't go up. Even if you increase the number of replicas, it will stay low. An easy way to verify this is to use an in-memory index, which has no I/O bottleneck. Usually you will immediately see CPU usage surge. If that happens, you are throttled by disk.
- If you are not throttled by disk but by other factors (for example, the coordinator, if you have too many partitions), it is possible that your query nodes don't receive enough requests.
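To see how much the I/O-per-request figure matters, the numbers above can be turned into a rough ceiling on requests per second per disk. This is a sketch under simplifying assumptions: disk I/O is treated as the only limit, every request is assumed to cost the same number of I/Os, and `max_requests_per_second` is a hypothetical helper name.

```python
def max_requests_per_second(disk_iops, ios_per_request):
    """Upper bound on requests/s one disk can serve if I/O is the only limit."""
    return disk_iops / ios_per_request


# IOPS figures quoted in this thread.
disks = {
    "disk from the original post (98K random read)": 98_000,
    "gp3 EBS, low end (3,000)": 3_000,
    "gp3 EBS, high end (16,000)": 16_000,
    "local SSD (about 1 million)": 1_000_000,
}

for name, iops in disks.items():
    for ios in (500, 1_500):  # rough I/O-per-request range measured with HNSW + MMAP
        ceiling = max_requests_per_second(iops, ios)
        print(f"{name}, {ios} I/O per request: up to {ceiling:.0f} requests/s")
```

At 500-1,500 I/Os per request, a gp3 disk would cap a node at a handful of requests per second, which would leave the CPU mostly idle no matter how many cores it has.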
