Interrogate data with LLM #2

Open
mrocklin opened this issue Feb 29, 2024 · 2 comments

Comments

@mrocklin
Member

People are curious about LLMs. It would be nice if we could go through the lifecycle that we expect other groups with large data corpora to go through. We have terabytes of GitHub data, the text of which is mostly commit messages and issue comments (but not code). What can we do here?

I suspect that this is two inter-related questions:

  1. What kinds of questions would we want to ask of this data?
  2. What is the right way to feed this data into an LLM so that we can ask those questions?

Technologically speaking, I'm hopeful that this involves both some training (maybe just a big GPU, but maybe several) on a regular basis, as well as some simple serving.

@shughes-uk

Perhaps we can fine-tune Mistral 7B to act as a better search for issues?

This is an interesting article in which they use an LLM to generate question/answer pairs that serve as training data for the fine-tuning:

https://helixml.substack.com/p/how-we-got-fine-tuning-mistral-7b

Then you might be able to ask it things like "has anyone encountered x bug before?"
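As a rough sketch of the data-preparation step described above (not the method from the linked article), something like the following could turn issues into question/answer training records. Everything named here is an assumption for illustration: the issue fields (`number`, `title`, `body`), the `prompt`/`completion` record shape, and `stub_generate_qa`, which is only a placeholder for a real LLM call.

```python
import json
from typing import Callable, Iterable

# Hypothetical shapes: an issue is a dict with "number", "title", "body";
# a Q/A pair is a (question, answer) tuple.
Issue = dict
QAPair = tuple[str, str]


def build_training_records(
    issues: Iterable[Issue],
    generate_qa: Callable[[str], list[QAPair]],
) -> list[dict]:
    """Turn issues into prompt/completion records using a Q/A generator."""
    records = []
    for issue in issues:
        text = f"{issue['title']}\n\n{issue['body']}"
        for question, answer in generate_qa(text):
            records.append(
                {
                    "prompt": question,
                    "completion": answer,
                    # Keep a pointer back to the issue the pair came from.
                    "source_issue": issue["number"],
                }
            )
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


def stub_generate_qa(text: str) -> list[QAPair]:
    """Placeholder for an LLM call; emits one trivial pair per issue."""
    title = text.split("\n", 1)[0]
    return [(f"Has anyone reported: {title}?", text)]
```

Swapping `stub_generate_qa` for a function that prompts a model to write several question/answer pairs per issue would be the interesting part; the record format a given fine-tuning tool expects may differ from the one assumed here.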

@mrocklin
Member Author

mrocklin commented Mar 2, 2024

That would be fun. I have no idea how to do it.

Currently I'm playing around with vector databases and looking into RAG. I'm out of my depth here, though, so I'd welcome input from anyone with hands-on expertise.
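For anyone else new to this, here is a minimal sketch of the retrieval half of RAG, under heavy simplifying assumptions: a word-count vector stands in for a learned embedding model, and an in-memory list stands in for a vector database. The names `embed`, `InMemoryIndex`, and `build_prompt` are hypothetical, not from any particular library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real setup would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Stand-in for a vector database: stores (document, vector) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, doc: str) -> None:
        self._items.append((doc, embed(doc)))

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k documents most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble retrieved comments and the question into one LLM prompt."""
    joined = "\n---\n".join(context)
    return (
        "Answer using only the comments below.\n\n"
        f"{joined}\n\nQuestion: {question}"
    )
```

The idea is that issue comments go into the index once, each question pulls back the few closest comments, and the assembled prompt is what gets sent to the model. No training is needed for this path, which is the main contrast with the fine-tuning approach.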
