
[Feature] Serving embedding and reranking model using vLLM #956
Open
lvliang-intel opened this issue Dec 2, 2024 · 0 comments
Labels: feature (New feature or request)

Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Xeon-GNR
Running nodes: Single Node

Description

Feature: Serving Embedding and Reranking Models Using vLLM on Xeon and Gaudi

Integrate vLLM as a serving framework to improve the performance and scalability of embedding and reranking models. This feature involves:

- Leveraging vLLM's high-throughput serving capabilities to handle embedding and reranking requests efficiently (see the sketch after this list).
- Integrating the vLLM-served models with the ChatQnA pipeline.
- Tuning the vLLM configuration for embedding and reranking use cases to lower latency and improve resource utilization.
- Comparing vLLM's performance against the current TEI (Text Embeddings Inference) deployment to determine the best setup for production.
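
As a concrete illustration, here is a minimal sketch of what the embedding path could look like, assuming vLLM's OpenAI-compatible server is running locally on port 8000 with an embedding model such as BAAI/bge-base-en-v1.5 (the model name, host, and port are placeholders, not decisions from this issue; the exact flag for enabling embedding mode also varies across vLLM releases):

```python
# Query a vLLM OpenAI-compatible /v1/embeddings endpoint.
# Assumed server launch (flag spelling differs between vLLM versions):
#   vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8000
import requests

VLLM_URL = "http://localhost:8000/v1/embeddings"  # assumed host/port

resp = requests.post(
    VLLM_URL,
    json={
        "model": "BAAI/bge-base-en-v1.5",  # placeholder embedding model
        "input": ["What is vLLM?", "vLLM is a high-throughput serving engine."],
    },
    timeout=30,
)
resp.raise_for_status()
# The OpenAI-style response carries one vector per input under "data".
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(f"{len(embeddings)} vectors of dimension {len(embeddings[0])}")
```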

Expected Outcome:

- An alternative serving framework for embedding and reranking models, with better expected performance on Gaudi.
- Improved throughput for embedding and reranking services.
- Flexibility to switch between serving frameworks based on deployment requirements (see the sketch below).
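
Since both TEI and vLLM expose an OpenAI-style /v1/embeddings route, that switch could be a pure configuration change. A hypothetical sketch follows; the EMBEDDING_ENDPOINT variable and the service hostnames are illustrative assumptions, not existing ChatQnA settings:

```python
# Backend-agnostic embedding client: point EMBEDDING_ENDPOINT at either a
# TEI or a vLLM service without touching pipeline code. All names here are
# illustrative assumptions, not actual ChatQnA configuration keys.
import os
import requests

ENDPOINT = os.environ.get(
    "EMBEDDING_ENDPOINT",  # hypothetical setting
    "http://tei-service:80/v1/embeddings",  # or http://vllm-service:8000/v1/embeddings
)

def embed(texts, model="BAAI/bge-base-en-v1.5"):
    """Return one embedding vector per input text from the configured backend."""
    resp = requests.post(ENDPOINT, json={"model": model, "input": texts}, timeout=30)
    resp.raise_for_status()
    return [d["embedding"] for d in resp.json()["data"]]
```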
