
Can you migrate the new server from llama.cpp to here? #51

maddes8cht opened this issue Jul 8, 2023 · 3 comments

@maddes8cht

In the past few days, the server example from llama.cpp has become a really useful piece of software - so much so that for many things it could replace the main program as the primary tool for interacting with a model.

How difficult will it be to make this server available for falcon as well?
I have no idea how much falcon-specific code is actually in falcon-main - shouldn't most of the specific stuff be in the libraries, especially falcon_common and libfalcon?
How much is left to do once you've changed all the external calls in server.cpp to the corresponding calls from falcon_common and libfalcon?
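
Just to illustrate the kind of substitution I mean - a rough sketch, where the falcon_* names are only my assumption of what the libfalcon equivalents are called, not something I have checked:

```cpp
// Rough sketch of the kind of call substitution I imagine in server.cpp.
// The llama_* calls are what the llama.cpp server uses today; the falcon_*
// names are only my guess at the libfalcon equivalents (not verified).

// llama.cpp server today:
//   llama_context * ctx = llama_init_from_file(params.model.c_str(), lparams);
//   llama_eval(ctx, tokens.data(), tokens.size(), n_past, params.n_threads);
//   llama_free(ctx);

// hoped-for falcon port:
//   falcon_context * ctx = falcon_init_from_file(params.model.c_str(), lparams);
//   falcon_eval(ctx, tokens.data(), tokens.size(), n_past, params.n_threads);
//   falcon_free(ctx);
```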

@cmp-nct

cmp-nct commented Jul 8, 2023

I'm just working on 3 larger extensions.
One of them will close some of the gaps of the main application; the others will widen it further.

By now falcon is quite a bit more advanced in some features: finetune handling and syntax, stopwords, some bugfixes. All of that is likely missing from the llama.cpp server.
I haven't looked at it more closely yet. I think having a server is useful, but it's not a top priority before we have the core functionality I have in mind.

I'm working on large-context tests (such as processing 64k and more while keeping the model sane), a fully evaluated system prompt, and replacing context swapping with continued generation.
It's all time-consuming, and all of it interacts heavily with the main app.

I think the server is useful but maybe not needed.
I have other improvements planned that would make a dedicated server obsolete by building server-like features into the core application itself.

Porting the server is probably not a huge task, but properly integrating the new features as well is a bit more work.

@maddes8cht

> By now falcon is quite a bit more advanced in some features: finetune handling and syntax, stopwords, some bugfixes. All of that is likely missing from the llama.cpp server.

I can see that, and it's not the only thing that makes ggllm.cpp the more interesting project for me at the moment.

In the end, ggllm.cpp may have a working large-context build for falcon sooner than llama.cpp has one for llamas.

As for the server:
I like the idea of having the server integrated directly into main - just turn it on with additional parameters --host and --port, and provide an optional custom index.html with --path PUBLIC_PATH - that would be great.
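
For illustration only, roughly how I picture those flags landing in falcon_main - a minimal sketch with made-up parsing code; only the flag names (--host, --port, --path) are taken from llama.cpp's server example:

```cpp
// Hypothetical sketch: optional server flags for falcon_main.
// Nothing like this exists yet; the struct and function names are placeholders.
#include <string>

struct server_params {
    std::string host        = "127.0.0.1";
    int         port        = 8080;
    std::string public_path = ".";      // directory with a custom index.html
    bool        enabled     = false;    // stay in normal CLI mode unless requested
};

static void parse_server_args(int argc, char ** argv, server_params & sp) {
    for (int i = 1; i < argc; i++) {
        std::string arg = argv[i];
        if (arg == "--host" && i + 1 < argc) {
            sp.host = argv[++i];
            sp.enabled = true;
        } else if (arg == "--port" && i + 1 < argc) {
            sp.port = std::stoi(argv[++i]);
            sp.enabled = true;
        } else if (arg == "--path" && i + 1 < argc) {
            sp.public_path = argv[++i];
        }
    }
}
```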

If this server could be addressed in the same way as the llama.cpp one, programs that already use that server could use falcon immediately, without any changes - and #52 would resolve itself.

I started looking at the code over the last two days, and porting the server seemed like a good opportunity because it might work even though I don't understand large parts of the code yet.
Even if the server won't be needed as a standalone program in the medium term, I at least came across the following:

  1. When examining libraries like libfalcon and falcon_common, most of the names have been changed from llama-something to falcon-something, but many have not - the first example I came across is llama_free(ctx).
    I guess the renaming just isn't finished yet, although I can also imagine that it is intentional for certain names to remain llama.

  2. One of the first preparations for the new server was the PR "llama : make model stateless and context stateful (llama_state)" (ggml-org/llama.cpp#1797, commit ggml-org/llama.cpp@527b6fb).

If I understand it correctly, this PR ensures that when the server is accessed by different clients, each client gets its own context. The commit is important in itself for offering a server function in a meaningful way - and its implementation touched all the important libraries and example programs.
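
If I read the commit correctly, the usage pattern it enables is roughly this - a sketch only, using the new llama.cpp names llama_load_model_from_file / llama_new_context_with_model from that PR:

```cpp
// Sketch of the pattern from ggml-org/llama.cpp@527b6fb as I understand it:
// the weights are loaded once into a llama_model, and each client of the
// server gets its own llama_context (own KV cache and sampling state).
#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();

    llama_model * model = llama_load_model_from_file("model.bin", cparams);
    if (model == NULL) {
        return 1;
    }

    // one context per connected client, all sharing the same weights
    llama_context * ctx_a = llama_new_context_with_model(model, cparams);
    llama_context * ctx_b = llama_new_context_with_model(model, cparams);

    // ... each client evaluates and samples in its own context ...

    llama_free(ctx_a);
    llama_free(ctx_b);
    llama_free_model(model);
    return 0;
}
```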

I don't know yet whether this would be a suitable project for getting more familiar with the codebase as a whole (though the machine-learning and neural-network-layer parts remain largely foreign to me), or whether such a commit would simply overwhelm me - but if I attempt it:
Would it be helpful?
Would it be better to first do the "clean-up" from point 1 in a separate PR that doesn't touch any functionality, and only afterwards start a separate PR for falcon-state?
Is "tidying up" the llama terms desired at all, or should they just stay as they are?

@cmp-nct

cmp-nct commented Jul 10, 2023

It would probably be best to modularize it slightly, so that main() is always a server-type loop.
Depending on your parameters, it adds a module that interacts in the loop.
One module could be "respond to prompt and exit", another "spawn a server and wait for commands", another "chat interaction mode".
The modules would be kept external, so you can always just add a small piece of code and have a custom application.
Much better than the mess of examples llama.cpp is getting into. It's not a ggml example anymore, so gradually we should also move out of "examples" and into just being an awesome application :)
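
As a very rough sketch of the direction (placeholder names, nothing of this is implemented):

```cpp
// Placeholder sketch of the modular main() idea - names are not final.
#include <memory>

// A module plugs into the main loop and decides how the app interacts.
struct interaction_module {
    virtual ~interaction_module() {}
    virtual bool step() = 0;  // one loop iteration; return false to exit
};

// "respond to prompt and exit"
struct oneshot_module : interaction_module {
    bool step() override { /* evaluate the prompt, print the result */ return false; }
};

// "spawn a server and wait for commands"
struct server_module : interaction_module {
    bool step() override { /* accept a request, run generation, reply */ return true; }
};

// "chat interaction mode"
struct chat_module : interaction_module {
    bool step() override { /* read user input, generate, print */ return true; }
};

int main(int argc, char ** argv) {
    // main() is always the same server-type loop; the parameters decide
    // which module gets plugged in (one-shot by default in this sketch).
    std::unique_ptr<interaction_module> mod = std::make_unique<oneshot_module>();

    while (mod->step()) {
        // keep looping until the active module says it is done
    }
    return 0;
}
```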

Regarding the name changes: whenever I touch something that needs a modification from the original behavior, I usually rename it to falcon.
So if you see something with falcon naming it means the behavior is likely adapted.
It would probably be best to keep it more generalized (no falcon, no llama - more like ggml itself) but that's too much work for now.

Regarding PRs: if you take the llama server example, look at the differences between that loop and the falcon_main loop, implement/adapt as much as you can, and provide a working version, that would be appreciated. It's basically a separate example to work on.
Changes to libfalcon or the main loop/common would need to be done more carefully.

Regarding the "stateless" part: we are quite there already. model loading and context creation is split.
I am not sure if it covers all from those PRs but it's quite close to it.
Loading two different models probably won't work yet but working with the same model in multiple contexts with each having their own rnd's, KV caches works fine.
