
Can you migrate the new server from llama.cpp to here? #51

maddes8cht opened this issue Jul 8, 2023 · 3 comments

@maddes8cht

In the past few days, the server example from llama.cpp has become a really useful piece of software - so much so that for many things it could replace the main program as the primary tool for interacting with a model.

How difficult will it be to make this server available for falcon as well?
I have no idea how much falcon-specific code is actually in falcon-main - shouldn't most of the specific stuff be in the libraries, especially falcon_common and libfalcon?
How much is left to do once you've changed all the external calls in server.cpp to the corresponding calls from falcon_common and libfalcon?
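
Just to illustrate the kind of substitution I mean - a rough sketch, where the falcon_* names are only my assumption of what the libfalcon equivalents are called, not something I have checked:

```cpp
// Rough sketch of the kind of call substitution I imagine in server.cpp.
// The llama_* calls are what the llama.cpp server uses today; the falcon_*
// names are only my guess at the libfalcon equivalents (not verified).

// llama.cpp server today:
//   llama_context * ctx = llama_init_from_file(params.model.c_str(), lparams);
//   llama_eval(ctx, tokens.data(), tokens.size(), n_past, params.n_threads);
//   llama_free(ctx);

// hoped-for falcon port:
//   falcon_context * ctx = falcon_init_from_file(params.model.c_str(), lparams);
//   falcon_eval(ctx, tokens.data(), tokens.size(), n_past, params.n_threads);
//   falcon_free(ctx);
```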

@cmp-nct

cmp-nct commented Jul 8, 2023

I'm just working on 3 larger extensions.
One of them will close some of the gaps of the main application; the others will widen it further.

By now falcon is quite a bit more advanced in some features: finetune handling and syntax, stopwords, some bugfixes. All of that is likely missing from the llama.cpp server.
I haven't looked at it more closely yet. I think having a server is useful, but it's not a top priority before we have the core functionality I have in mind.

I'm working on large-context tests (such as processing 64k and more while keeping the model sane), a fully evaluated system prompt, and replacing context swapping with continued generation.
It's all time-consuming, and all of it interacts heavily with the main app.

I think the server is useful but maybe not needed.
I have other improvements planned that would make a dedicated server obsolete by building server-like features into the core application itself.

Porting the server is probably not a huge task, but properly integrating the new features as well is a bit more work.

@maddes8cht

> By now falcon is quite a bit more advanced in some features: finetune handling and syntax, stopwords, some bugfixes. All of that is likely missing from the llama.cpp server.

I can see that, and it's not the only thing that makes ggllm.cpp the more interesting project for me at the moment.

In the end, ggllm.cpp may have a working large-context build for falcon sooner than llama.cpp has one for llamas.

As for the server:
I like the idea of having the server integrated directly into main - just turn it on with additional parameters --host and --port, and provide an optional custom index.html with --path PUBLIC_PATH - that would be great.
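
For illustration only, roughly how I picture those flags landing in falcon_main - a minimal sketch with made-up parsing code; only the flag names (--host, --port, --path) are taken from llama.cpp's server example:

```cpp
// Hypothetical sketch: optional server flags for falcon_main.
// Nothing like this exists yet; the struct and function names are placeholders.
#include <string>

struct server_params {
    std::string host        = "127.0.0.1";
    int         port        = 8080;
    std::string public_path = ".";      // directory with a custom index.html
    bool        enabled     = false;    // stay in normal CLI mode unless requested
};

static void parse_server_args(int argc, char ** argv, server_params & sp) {
    for (int i = 1; i < argc; i++) {
        std::string arg = argv[i];
        if (arg == "--host" && i + 1 < argc) {
            sp.host = argv[++i];
            sp.enabled = true;
        } else if (arg == "--port" && i + 1 < argc) {
            sp.port = std::stoi(argv[++i]);
            sp.enabled = true;
        } else if (arg == "--path" && i + 1 < argc) {
            sp.public_path = argv[++i];
        }
    }
}
```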

If this server could be addressed in the same way as the llama.cpp one, programs that already use that server could use falcon immediately, without any changes - and #52 would resolve itself.

I started looking at the code over the last two days, and porting the server seemed like a good opportunity because it might work even though I don't understand large parts of the code yet.
Even if the server won't be needed as a standalone program in the medium term, I at least came across the following:

  1. When examining libraries like libfalcon and falcon_common, most of the names have been changed from llama-something to falcon-something, but many have not - the first example I came across is llama_free(ctx).
    I guess the renaming just isn't finished yet, although I can also imagine that it is intentional for certain names to remain llama.

  2. One of the first preparations for the new server was the PR "llama : make model stateless and context stateful (llama_state)" (ggml-org/llama.cpp#1797, commit ggml-org/llama.cpp@527b6fb).

If I understand it correctly, this PR ensures that when the server is accessed by different clients, each client gets its own context. The commit is important in itself for offering a server function in a meaningful way - and its implementation touched all the important libraries and example programs.
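
If I read the commit correctly, the usage pattern it enables is roughly this - a sketch only, using the new llama.cpp names llama_load_model_from_file / llama_new_context_with_model from that PR:

```cpp
// Sketch of the pattern from ggml-org/llama.cpp@527b6fb as I understand it:
// the weights are loaded once into a llama_model, and each client of the
// server gets its own llama_context (own KV cache and sampling state).
#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();

    llama_model * model = llama_load_model_from_file("model.bin", cparams);
    if (model == NULL) {
        return 1;
    }

    // one context per connected client, all sharing the same weights
    llama_context * ctx_a = llama_new_context_with_model(model, cparams);
    llama_context * ctx_b = llama_new_context_with_model(model, cparams);

    // ... each client evaluates and samples in its own context ...

    llama_free(ctx_a);
    llama_free(ctx_b);
    llama_free_model(model);
    return 0;
}
```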

I don't know yet whether this would be a suitable project for getting more familiar with the codebase as a whole (though the machine-learning and neural-network-layer parts remain largely foreign to me), or whether such a commit would simply overwhelm me - but if I attempt it:
Would it be helpful?
Would it be better to first do the "clean-up" from point 1 in a separate PR that doesn't touch any functionality, and only afterwards start a separate PR for falcon-state?
Is "tidying up" the llama terms desired at all, or should they just stay as they are?

@cmp-nct

cmp-nct commented Jul 10, 2023

It would probably be best to modularize it slightly, so that main() is always a server-type loop.
Depending on your parameters, it adds a module that interacts in the loop.
One module could be "respond to prompt and exit", another "spawn a server and wait for commands", another "chat interaction mode".
The modules would be kept external, so you can always just add a small piece of code and have a custom application.
Much better than the mess of examples llama.cpp is getting into. It's not a ggml example anymore, so gradually we should also move out of "examples" and into just being an awesome application :)
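
As a very rough sketch of the direction (placeholder names, nothing of this is implemented):

```cpp
// Placeholder sketch of the modular main() idea - names are not final.
#include <memory>

// A module plugs into the main loop and decides how the app interacts.
struct interaction_module {
    virtual ~interaction_module() {}
    virtual bool step() = 0;  // one loop iteration; return false to exit
};

// "respond to prompt and exit"
struct oneshot_module : interaction_module {
    bool step() override { /* evaluate the prompt, print the result */ return false; }
};

// "spawn a server and wait for commands"
struct server_module : interaction_module {
    bool step() override { /* accept a request, run generation, reply */ return true; }
};

// "chat interaction mode"
struct chat_module : interaction_module {
    bool step() override { /* read user input, generate, print */ return true; }
};

int main(int argc, char ** argv) {
    // main() is always the same server-type loop; the parameters decide
    // which module gets plugged in (one-shot by default in this sketch).
    std::unique_ptr<interaction_module> mod = std::make_unique<oneshot_module>();

    while (mod->step()) {
        // keep looping until the active module says it is done
    }
    return 0;
}
```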

Regarding the name changes: whenever I touch something that needs a modification from the original behavior, I usually rename it to falcon.
So if you see something with falcon naming it means the behavior is likely adapted.
It would probably be best to keep it more generalized (no falcon, no llama - more like ggml itself) but that's too much work for now.

Regarding PRs: if you take the llama server example, look at the differences between that loop and the falcon_main loop, implement/adapt as much as you can, and provide a working version, that would be appreciated. It's basically a separate example to work on.
Changes to libfalcon or the main loop/common would need to be done more carefully.

Regarding the "stateless" part: we are quite there already. model loading and context creation is split.
I am not sure if it covers all from those PRs but it's quite close to it.
Loading two different models probably won't work yet but working with the same model in multiple contexts with each having their own rnd's, KV caches works fine.
