Interrogate data with LLM #2

mrocklin · 2024-02-29T23:36:13Z

People are curious about LLMs. It would be nice if we could go through the lifecycle that we expect other groups with large data corpi to go through. We have Terabytes of github data, the textual nature of which is mostly commit messages and issue comments (but not code). What can we do here?

I suspect that this is two inter-related questions:

What kinds of conversations would we want to ask of this data?
What is the right way to feed this data into an LLM in order to be able to ask those questions.

Technologically speaking, I'm hopeful that this involves both some training (maybe just a big GPU, but maybe several) on a regular basis, as well as some simple serving.

shughes-uk · 2024-03-02T06:08:39Z

Perhaps we can fine tune mistral 7b to act as a better search for issues?

This is an interesting article where they use an LLM to generate question/answer pairs to act as training data for the fine tuning

https://helixml.substack.com/p/how-we-got-fine-tuning-mistral-7b

Then you might be able to ask it things like "has anyone encountered x bug before?"

mrocklin · 2024-03-02T14:52:55Z

That would be fun. I have no idea how to do it.

Currently I'm playing around with vector databases and looking into RAG. I'm out of my depth here though if anyone has hands-on expertise.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interrogate data with LLM #2

Interrogate data with LLM #2

mrocklin commented Feb 29, 2024

shughes-uk commented Mar 2, 2024

mrocklin commented Mar 2, 2024

Interrogate data with LLM #2

Interrogate data with LLM #2

Comments

mrocklin commented Feb 29, 2024

shughes-uk commented Mar 2, 2024

mrocklin commented Mar 2, 2024