
Implement whisper.cpp #51

Open
ericcurtin opened this issue Aug 21, 2024 · 24 comments

@ericcurtin
Collaborator

ericcurtin commented Aug 21, 2024

If there is a way to auto-detect between language model files and ASR model files, we should do that. If that's not possible, we should just use a runtime flag; some options for the runtime flag would be:

ramalama --runtime vllm
ramalama --runtime llama.cpp
ramalama --runtime whisper.cpp
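
A minimal sketch of what that auto-detection could look like, assuming the only distinction needed is GGUF language models versus legacy ggml whisper checkpoints, told apart by their magic bytes (the helper name and the mapping below are illustrative, not existing RamaLama code):

    import struct

    # Hypothetical helper, not RamaLama code: guess a runtime from the model
    # file's magic bytes. GGUF files begin with the ASCII magic b"GGUF", while
    # legacy ggml checkpoints (e.g. whisper.cpp's ggml-*.bin) begin with the
    # little-endian uint32 0x67676d6c ("ggml").
    def guess_runtime(model_path):
        with open(model_path, "rb") as f:
            magic = f.read(4)
        if magic == b"GGUF":
            return "llama.cpp"
        if len(magic) == 4 and struct.unpack("<I", magic)[0] == 0x67676D6C:
            return "whisper.cpp"
        raise ValueError(f"unrecognized model format: {model_path}")

If a check like that can't disambiguate (for example, whisper models repackaged as GGUF), the explicit --runtime flag above remains the fallback.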
@ericcurtin
Collaborator Author

We added a basic version of whisper.cpp to the Container image here:

#49

@rhatdan
Member

rhatdan commented Oct 14, 2024

Can this be closed? Do we have this functionality now?

@ericcurtin
Collaborator Author

There's still more work for both --runtime whisper.cpp and --runtime vllm ... It can be tracked under this issue or somewhere else.

@FNGarvin
Contributor

What do you see an example command-line looking like here? I've toyed a bit with whisper.cpp and find its use to be very different than launching a gradio server or a chat instance. AFAICT, you're usually calling the main executable with an argument for the model and one or more additional arguments describing the file to transcribe and options.

How would you feel about, as an alternative, dropping users into a [containerized] shell with access to input/output/model volumes and possibly some helper scripts to accomplish simple tasks? So, perhaps, ramalama run --runtime whisper.cpp ggml-large-v3-turbo.bin gets you a bash prompt with a motd saying type 'transcribe /audio/jfk.wav' to blah blah blah etc? Crude compared to running or serving the chat-centric models, but still adds value to Ramalama and also to Whisper (which has had broken CUDA support for nine+ months and isn't as slick wrt pulling images and mounting volumes).

Also, should the container files be pulling the latest whisper.cpp instead of the latest known to be good for Ramalama?

@ericcurtin
Collaborator Author

ericcurtin commented Dec 17, 2024

I would say just this to start:

ramalama --runtime whisper.cpp run ggml-large-v3-turbo.bin jfk.wav

which would just perform a fairly standard whisper.cpp command.

No interactive support, it's not the same as a chat bot workflow.

We have Renovate controlling what version of whisper.cpp we build against, and it runs everything through CI before we rebase. I'd like to keep this; at least there's a CI run before we suddenly change the version. I'd rather not just clone main/master without CI.

@ericcurtin
Collaborator Author

@p5 kindly set up renovate for us.

@FNGarvin
Contributor

> ramalama --runtime run whisper.cpp ggml-large-v3-turbo.bin

> which would just perform a fairly standard whisper.cpp command.

Could you elaborate, please? What command would be performed? Would that be provided through additional command-line arguments to the command you've just given? Would you pass that as though it were a prompt to another model?

The boilerplate syntax for using whisper.cpp directly is something like

./main -m /models/ggml-large-v3-turbo.bin -f /audios/jfk.wav

Could you please give me an example of the kind of command-line you're envisioning to complete the same task via Ramalama?

> No interactive support, it's not the same as a chat bot workflow.

Sorry to be dense, but I can't tell if you're reiterating that whisper.cpp does not provide interactive support or if you're saying that you do not like the concept of dropping the user to a shell prompt with some workspace-like features.

@ericcurtin
Collaborator Author

I meant something like this, corrected:

ramalama --runtime whisper.cpp run ggml-large-v3-turbo.bin jfk.wav

@ericcurtin
Collaborator Author

> Sorry to be dense, but I can't tell if you're reiterating that whisper.cpp does not provide interactive support or if you're saying that you do not like the concept of dropping the user to a shell prompt with some workspace-like features.

We are always open to ideas. It's just less relevant with whisper.cpp, one doesn't speak to an interactive prompt. But if someone finds that useful for some reason, always happy to look at PRs, etc.

@ericcurtin
Collaborator Author

This PR is a perfect example of why we don't just clone main/master:

#474

@FNGarvin
Contributor

Yes, failing each time whisper.cpp is updated is perhaps a better solution than blindly updating along with it. Anyway, back to the meat of my question...

> It's just less relevant with whisper.cpp, one doesn't speak to an interactive prompt. But if someone finds that useful for some reason, always happy to look at PRs, etc.

I don't care at all about that. I just saw the seemingly abandoned whisper.cpp stub in the container and thought to make it usable. It is not currently usable, right? Do you care to see it made usable?

@ericcurtin
Collaborator Author

> Yes, failing each time whisper.cpp is updated is perhaps a better solution than blindly updating along with it. Anyway, back to the meat of my question...

> It's just less relevant with whisper.cpp, one doesn't speak to an interactive prompt. But if someone finds that useful for some reason, always happy to look at PRs, etc.

> I don't care at all about that. I just saw the seemingly abandoned whisper.cpp stub in the container and thought to make it usable. It is not currently usable, right? Do you care to see it made usable?

Yup we sure do, this issue is open for someone to complete it :)

@FNGarvin
Contributor

> this issue is open for someone to complete it :)

Great. I'm just having trouble understanding what "complete it" means to you. That's why I'm asking about the example command-lines you're envisioning.

> The boilerplate syntax for using whisper.cpp directly is something like ./main -m /models/ggml-large-v3-turbo.bin -f /audios/jfk.wav Could you please give me an example of the kind of command-line you're envisioning to complete the same task via Ramalama?

@ericcurtin
Collaborator Author

> this issue is open for someone to complete it :)

> Great. I'm just having trouble understanding what "complete it" means to you. That's why I'm asking about the example command-lines you're envisioning.

We are open to ideas. Rome wasn't built in a day, one PR at a time.

> The boilerplate syntax for using whisper.cpp directly is something like ./main -m /models/ggml-large-v3-turbo.bin -f /audios/jfk.wav Could you please give me an example of the kind of command-line you're envisioning to complete the same task via Ramalama?

Getting this to execute the main command you specified would be a start:

ramalama --runtime whisper.cpp run ggml-large-v3-turbo.bin jfk.wav

@FNGarvin
Contributor

FNGarvin commented Dec 17, 2024

Thanks. And in the case that Ramalama is running the inference in a container, how do we get the input media into the model? Would you infer a directory to bind based on the working directory/file given as an argument?

@rhatdan
Member

rhatdan commented Dec 17, 2024

Yes, the wav file would need to be volume-mounted into the container with a :z option.

Would it make sense to also allow stdin for a wav file?

cat jfk.wav | ramalama --runtime whisper.cpp run ggml-large-v3-turbo.bin -

I think it makes sense to allow grabbing output from stdout.

cat jfk.wav | ramalama --runtime whisper.cpp run ggml-large-v3-turbo.bin - > /tmp/output

Not sure what ramalama serve would do?

@FNGarvin
Contributor

Thank you for helping to bring me up to speed with some mock usage examples. I'm still pretty much a novice at containers, but could we bind the directory read-only instead of using :z? A dry run (love that feature, btw) of something like this:

    podman run --rm -i --device nvidia.com/gpu=all \
    --mount=type=image,src=WHISPER.CPP-MODEL-SHORTNAME,destination=/mnt/models,rw=false,subpath=/models \
    --mount=type=bind,src=/[FQual-dir-of-user-inputfilewav]/,destination=/mnt/audio,rw=false \
    ramalama-image /bin/sh -c "whisper-main -m /mnt/models/model.file -f /mnt/audio/USERFILE.wav"

> Would it make sense to also allow stdin for a wav file?

It's more appealing in principle than binding directories, but whether or not it is practical is currently unknown to me. It has been a verrrrry long time since I've relied on a shell to create pipes for large data and I don't quite remember what the gotchas were, but I'm thinking there were at least a few. I know even less about what might happen trying to pipe potentially massive, uncompressed audio files into the container on STDIN.

> I think it makes sense to allow grabbing output from stdout.

Redirects aren't adequate?

> Not sure what ramalama serve would do?

If the documentation on the whisper.cpp github is accurate, it seems like the provided web service is very basic. All the examples are making requests via curl from the command-line. I totally get it and that isn't criticism, but we're not talking about a convenient UI AFAICT. I am aware that there exist some third-party front-ends, like https://github.com/litongjava/whisper-cpp-server + https://github.com/litongjava/listen-know-web, but I don't really know anything about them.

For me, personally, the most likely use-case for whisper.cpp is in generating subtitles for arbitrary videos (upon audio extracted w/ ffmpeg) or for transcribing and translating arbitrary audio sequences. Command-line invocation over a bound directory seems adequate and possibly ideal. It probably doesn't require any code overlap with whisper.cpp, using Ramalama as scaffolding to bring all the pieces together and offering a layer of abstraction wrt GPU config, system libraries, etc. But I'm not deep into any of these techs or projects, so if there's a better vision / direction I'd love to hear about it.

@rhatdan
Member

rhatdan commented Dec 17, 2024

The stdout stuff should just work; we could even have whisper grab /dev/stdin when it sees the "-".

The volume mount can also be marked ro,z so it is read-only and relabeled so that SELinux allows the container to read it.

@rhatdan
Member

rhatdan commented Dec 17, 2024

podman run --rm -i --device nvidia.com/gpu=all \
--mount=type=image,src=WHISPER.CPP-MODEL-SHORTNAME,destination=/mnt/models,rw=false,subpath=/models \
--mount=type=bind,src=/[FQual-dir-of-user-inputfilewav]/,destination=/mnt/audio/USERFILE.wav,rw=false,z \
ramalama-image /bin/sh -c "whisper-main -m /mnt/models/model.file -f /mnt/audio/USERFILE.wav"

@rhatdan
Member

rhatdan commented Dec 17, 2024

Slightly simpler.

podman run --rm -i --device nvidia.com/gpu=all \
--mount=type=image,src=WHISPER.CPP-MODEL-SHORTNAME,destination=/mnt/models,rw=false,subpath=/models \
-v/[FQual-dir-of-user-inputfilewav]/:/mnt/audio/USERFILE.wav:ro,z \
ramalama-image /bin/sh -c "whisper-main -m /mnt/models/model.file -f /mnt/audio/USERFILE.wav"

@rhatdan
Member

rhatdan commented Dec 17, 2024

Looks like whisper -f - is supported now, although the entire file needs to be flushed through the pipe I believe.

@FNGarvin
Contributor

> Looks like whisper -f - is supported

It does look that way, though I haven't prepared any large wav files to test with. Thank you - your way is much better, I think, than binding a directory for a process that will only require one input. Especially if it creates the possibility of chaining ffmpeg without intermediate conversion files.

So, consensus seems to be that

cat jfk.wav | ramalama --runtime whisper.cpp run ggml-tiny.bin

should produce and run something vaguely like

podman run --rm -i --device nvidia.com/gpu=all --mount=type=bind,src=models/,destination=/mnt/models,rw=false ramalama /bin/sh -c "whisper-main -m /mnt/models/ggml-tiny.bin -f -"

That doesn't seem too difficult at first blush. I'll see what I can do in the coming days.

Thanks
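
As a starting point, a rough sketch of that translation might look like the following. All names here are hypothetical, the GPU/device flags are omitted, and this is not RamaLama code; it only illustrates building the podman argv from the model path, an optional audio file, and stdin:

    import os
    import subprocess
    import sys

    # Hypothetical sketch of the consensus above, not RamaLama code: turn
    # "ramalama --runtime whisper.cpp run MODEL [AUDIO]" into a podman command.
    def run_whisper(model_path, audio=None, image="ramalama-image"):
        model_dir = os.path.dirname(os.path.abspath(model_path))
        model_name = os.path.basename(model_path)
        cmd = ["podman", "run", "--rm", "-i",
               f"-v{model_dir}:/mnt/models:ro,z"]
        whisper_args = ["whisper-main", "-m", f"/mnt/models/{model_name}"]
        if audio is None or audio == "-":
            # Audio arrives on stdin; whisper.cpp reads it via "-f -".
            whisper_args += ["-f", "-"]
            stdin = sys.stdin.buffer
        else:
            # Bind-mount the directory containing the input file, read-only.
            audio_dir = os.path.dirname(os.path.abspath(audio))
            audio_name = os.path.basename(audio)
            cmd.append(f"-v{audio_dir}:/mnt/audio:ro,z")
            whisper_args += ["-f", f"/mnt/audio/{audio_name}"]
            stdin = None
        cmd += [image] + whisper_args
        return subprocess.run(cmd, stdin=stdin, check=True)

With something like that, cat jfk.wav | ramalama --runtime whisper.cpp run ggml-tiny.bin maps to the stdin branch, and ramalama --runtime whisper.cpp run ggml-tiny.bin jfk.wav maps to the bind-mount branch.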

@ericcurtin
Collaborator Author

ericcurtin commented Dec 18, 2024

stdin seems fine to me. We can also autodetect when stdin is coming in, which is better for usability since you don't need the explicit '-' then, although we can still keep the ability to explicitly request stdin with '-'. grep is an example of a command that does this, and so is llama-run. Once llama-run is integrated into RamaLama, I plan on adding this to ramalama run:

git diff | ramalama run granite-code "Write a git commit message for this change"

llm-gguf and ollama have this feature.
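
A minimal sketch of that autodetection (the helper is illustrative only, not RamaLama code): treat a non-tty stdin as piped input while still honouring an explicit '-':

    import sys

    # Hypothetical helper, not RamaLama code: detect piped stdin the way grep
    # or llama-run do, so "git diff | ramalama run model 'prompt'" just works,
    # while an explicit '-' argument still forces reading from stdin.
    def read_piped_input(args):
        if "-" in args or not sys.stdin.isatty():
            return sys.stdin.read()
        return None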

@rhatdan
Member

rhatdan commented Dec 18, 2024

Cool
