docs: update server streaming mode documentation #9519

Merged

merged 1 commit into ggml-org:master from CentricStorm:patch-2 on Dec 11, 2024

Conversation

CentricStorm
Contributor

@CentricStorm commented Sep 17, 2024

Server documentation:

  • Mentioned that streaming mode uses a different response format
  • Added a link to documentation of that response format
  • Reduced n_predict in the existing non-streamed example script, because generating 512 tokens can take a long time on some computers (a sketch of such a request follows this list)
  • Made the style of the existing non-streamed example script more consistent
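
For reference, a rough sketch of what such a non-streamed request might look like; the prompt, port, and n_predict value here are illustrative placeholders, not necessarily the exact values used in the server README:

// Illustrative sketch of a non-streamed /completion request; the prompt,
// port, and n_predict value are placeholders rather than the README's exact values.
// Run as an ES module (e.g. node script.mjs) so top-level await is available.
const response = await fetch("http://localhost:8080/completion", {
    method: "POST",
    body: JSON.stringify({
        prompt: "Building a website can be done in 10 simple steps:",
        n_predict: 64 // much smaller than 512, so it finishes quickly on slower machines
    })
})
const data = await response.json()
console.log(data.content)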

@CentricStorm force-pushed the patch-2 branch 4 times, most recently from aa3b48a to aa32f90, on November 28, 2024 03:54
@CentricStorm
Contributor Author

CentricStorm commented Nov 28, 2024

Updated the streaming mode example script with split data handling, which has been tested with unit tests using the following chunks:

// Chunks with a total of 9 tokens: " token token token token, token, token,", split at different positions.
const chunks = [
    `data: {"co`,
    `ntent":" token"}\n\n`,
    `data: {"content":" token"}\n`,
    `\n`,
    `data: {"content":" token"}\n\n`,
    `data: {"content":" token"}\n\ndata: {"co`,
    `ntent":","}\n\n`,
    `data: {"content":" token"}\n\ndata: {"content":","}\n`,
    `\n`,
    `data: {"content":" token"}\n\ndata: {"content":","}\n\n`
]

Avoided using Node.js readline so that the script can also work in browsers with minimal modification.
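
For reference, a minimal sketch of how split data handling along those lines could work, assuming the /completion endpoint's `data: {...}` events terminated by blank lines; the parseSSE name and buffering details are illustrative, not the exact script proposed in this PR:

// Minimal sketch (not the exact PR script): buffer incoming bytes and only
// parse an event once its terminating blank line ("\n\n") has arrived, so
// events split across chunks at arbitrary positions are handled correctly.
async function* parseSSE(response) {
    const decoder = new TextDecoder()
    const reader = response.body.getReader()
    let buffer = ""
    while (true) {
        const { done, value } = await reader.read()
        if (done) break
        buffer += decoder.decode(value, { stream: true })
        let boundary
        while ((boundary = buffer.indexOf("\n\n")) !== -1) {
            const event = buffer.slice(0, boundary)
            buffer = buffer.slice(boundary + 2)
            if (event.startsWith("data: ")) {
                yield JSON.parse(event.slice("data: ".length))
            }
        }
    }
}

// Example usage with the streamed /completion endpoint (run as an ES module):
const response = await fetch("http://localhost:8080/completion", {
    method: "POST",
    body: JSON.stringify({ prompt: "Hello", n_predict: 16, stream: true })
})
for await (const data of parseSSE(response)) {
    // In a browser, append data.content to the page instead of using process.stdout.
    process.stdout.write(data.content)
}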

@ngxson
Collaborator

ngxson commented Nov 28, 2024

Btw, have you been able to test it with the latest version on the master branch? We added [DONE] as the last data chunk at some point, to align with the OpenAI implementation.

@CentricStorm
Contributor Author

Btw, have you been able to test it with the latest version on the master branch? We added [DONE] as the last data chunk at some point, to align with the OpenAI implementation.

It seems like #9459 only added the [DONE] event for the OpenAI-compatible /chat/completions API, not for the /completion API that this example uses:

https://github.com/ggerganov/llama.cpp/blob/678d7994f4da0af3d29046be99950ac999ee9762/examples/server/server.cpp#L3027
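
For anyone adapting the example to the OpenAI-compatible /chat/completions endpoint, a minimal sketch (names illustrative) of how the trailing [DONE] sentinel could be filtered out before JSON parsing:

// Illustrative sketch: on /chat/completions the final SSE event carries the
// literal payload "[DONE]" instead of JSON, so it must be skipped before parsing.
function parseEventPayload(event) {
    if (!event.startsWith("data: ")) return null
    const payload = event.slice("data: ".length).trim()
    if (payload === "[DONE]") return null // end-of-stream sentinel, not JSON
    return JSON.parse(payload)
}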

@CentricStorm deleted the patch-2 branch December 9, 2024 04:43
@CentricStorm restored the patch-2 branch December 9, 2024 04:43
@CentricStorm deleted the patch-2 branch December 9, 2024 04:43
@CentricStorm restored the patch-2 branch December 9, 2024 04:45
@CentricStorm deleted the patch-2 branch December 9, 2024 04:46
@CentricStorm restored the patch-2 branch December 9, 2024 04:53
@CentricStorm reopened this Dec 9, 2024
@CentricStorm
Contributor Author

The example script still works with b4291 (ce8784b), but localhost was changed to 127.0.0.1 because of a separate issue with the latest version of Node.js.
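
As a hedged illustration of that change (the underlying Node.js issue is not detailed in this thread), the base URL simply switches from the localhost hostname to the IPv4 loopback address:

// Illustrative: use the IPv4 loopback address instead of "localhost", which
// newer Node.js versions may resolve differently (e.g. to ::1) than the
// address the server is actually listening on.
const baseUrl = "http://127.0.0.1:8080" // previously "http://localhost:8080"
const health = await fetch(`${baseUrl}/health`)
console.log(await health.json())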

Collaborator

@ngxson left a comment


On second thought, I think it's not a good idea to add this to our documentation. Because we already follow the SSE standard (except for the POST method), client code should be trivial to implement.

The documentation should be reserved for things that can only be found in llama.cpp and not on the internet.

In this case, the code you provided is the same as the OpenAI implementation (because they also use the SSE+POST method), and there are many libraries on npm that can handle this (for example, this). So adding it here brings no additional info to the docs, while adding maintenance cost in the future.

@CentricStorm
Contributor Author

On second thought, I think it's not a good idea to add this to our documentation. Because we already follow the SSE standard (except for the POST method), client code should be trivial to implement.

The documentation should be reserved for things that can only be found in llama.cpp and not on the internet.

Removed example code.

Provide more documentation for streaming mode.
@CentricStorm
Contributor Author

Suggestions implemented.

@ngxson merged commit 5555c0c into ggml-org:master Dec 11, 2024
6 checks passed
@CentricStorm deleted the patch-2 branch December 12, 2024 02:18
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Provide more documentation for streaming mode.
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
Provide more documentation for streaming mode.
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
Provide more documentation for streaming mode.