
[Feature] Serving embedding and reranking model using vLLM #956
Open
lvliang-intel opened this issue Dec 2, 2024 · 0 comments
Labels: feature (New feature or request)

Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Xeon-GNR
Running nodes: Single Node

Description

Feature: Serving Embedding and Reranking Models Using vLLM on Xeon and Gaudi

Integrate vLLM as a serving framework to improve the performance and scalability of embedding and reranking models. This feature involves:

- Leveraging vLLM's high-throughput serving capabilities to handle embedding and reranking requests efficiently (see the sketch after this list).
- Integrating the vLLM-served models with the ChatQnA pipeline.
- Tuning the vLLM configuration for embedding and reranking use cases to lower latency and improve resource utilization.
- Comparing vLLM's performance against the current TEI (Text Embeddings Inference) deployment to determine the best setup for production.
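
As a concrete illustration, here is a minimal sketch of what the embedding path could look like, assuming vLLM's OpenAI-compatible server is running locally on port 8000 with an embedding model such as BAAI/bge-base-en-v1.5 (the model name, host, and port are placeholders, not decisions from this issue; the exact flag for enabling embedding mode also varies across vLLM releases):

```python
# Query a vLLM OpenAI-compatible /v1/embeddings endpoint.
# Assumed server launch (flag spelling differs between vLLM versions):
#   vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8000
import requests

VLLM_URL = "http://localhost:8000/v1/embeddings"  # assumed host/port

resp = requests.post(
    VLLM_URL,
    json={
        "model": "BAAI/bge-base-en-v1.5",  # placeholder embedding model
        "input": ["What is vLLM?", "vLLM is a high-throughput serving engine."],
    },
    timeout=30,
)
resp.raise_for_status()
# The OpenAI-style response carries one vector per input under "data".
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(f"{len(embeddings)} vectors of dimension {len(embeddings[0])}")
```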

Expected Outcome:

- An alternative serving framework for embedding and reranking models, with better expected performance on Gaudi.
- Improved throughput for embedding and reranking services.
- Flexibility to switch between serving frameworks based on deployment requirements (see the sketch below).
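
Since both TEI and vLLM expose an OpenAI-style /v1/embeddings route, that switch could be a pure configuration change. A hypothetical sketch follows; the EMBEDDING_ENDPOINT variable and the service hostnames are illustrative assumptions, not existing ChatQnA settings:

```python
# Backend-agnostic embedding client: point EMBEDDING_ENDPOINT at either a
# TEI or a vLLM service without touching pipeline code. All names here are
# illustrative assumptions, not actual ChatQnA configuration keys.
import os
import requests

ENDPOINT = os.environ.get(
    "EMBEDDING_ENDPOINT",  # hypothetical setting
    "http://tei-service:80/v1/embeddings",  # or http://vllm-service:8000/v1/embeddings
)

def embed(texts, model="BAAI/bge-base-en-v1.5"):
    """Return one embedding vector per input text from the configured backend."""
    resp = requests.post(ENDPOINT, json={"model": model, "input": texts}, timeout=30)
    resp.raise_for_status()
    return [d["embedding"] for d in resp.json()["data"]]
```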
