callm
enables you to run Generative AI models (such as Large Language Models) directly on your hardware, offline.
Under the hood, callm
heavily relies on the candle crate and is written in pure Rust 🦀.
Model | Safetensors | GGUF (quantized) |
---|---|---|
Llama | ✅ | ✅ |
Mistral | ✅ | ✅ |
Phi3 | ✅ | ❌ |
Qwen2 | ✅ | ❌ |
While pipelines are safe to send between threads, callm
has not undergone extensive testing for thread-safety.
Caution is advised.
callm
is known to run on Linux and macOS, and has been tested on these platforms. While Windows has not been extensively tested, it is expected to work out-of-the-box without issues.
callm
is still in an early development stage and is not production-ready yet.
Add callm
to your dependencies:
$ cargo add callm
callm
uses features to selectively enable support for GPU acceleration.
Enable the cuda
feature to include support for CUDA devices.
$ cargo add callm --features cuda
Enable the metal
feature to include support for Metal devices.
$ cargo add callm --features metal
callm
uses builder pattern to create inference pipelines.
use callm::pipelines::PipelineText;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Build pipeline
let mut pipeline = PipelineText::builder()
.with_location("/path/to/model")
.build()?;
// Run inference
let text_completion = pipeline.run("Tell me a joke about Rust borrow checker")?;
println!("{text_completion}");
Ok(())
}
Override default sampling parameters during pipeline build or afterwards.
use callm::pipelines::PipelineText;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Build pipeline with custom sampling parameters
let mut pipeline = PipelineText::builder()
.with_location("/path/to/model")
.with_temperature(0.65)
.with_top_k(25)
.build()?;
// Adjust sampling parameters later on
pipeline.set_seed(42);
pipeline.set_top_p(0.3);
// Run inference
let text_completion = pipeline.run("Write an article about Pentium F00F bug")?;
println!("{text_completion}");
Ok(())
}
If the model you are loading includes a chat template, you can use conversation-style inference via run_chat()
. It accepts a slice of tuples in the form: (MessageRole, String)
.
use callm::pipelines::PipelineText;
use callm::templates::MessageRole;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Build pipeline
let mut pipeline = PipelineText::builder()
.with_location("/path/to/model")
.with_temperature(0.1)
.build()?;
// Prepare conversation messages
let messages = vec![
(
MessageRole::System,
"You are impersonating Linus Torvalds.".to_string(),
),
(
MessageRole::User,
"What is your opinion on Rust for Linux kernel development?".to_string(),
),
];
// Run chat-style inference
let assistant_response = pipeline.run_chat(&messages)?;
println!("{assistant_response}");
Ok(())
}
Consult the documentation for a full API reference.
Several examples and tools can be found in a separate callm-demos repo.
Thank you for your interest in contributing to callm
!
As this project is still in its early stages, your help is invaluable. Here are some ways you can get involved:
- Report issues: If you encounter any bugs or unexpected behavior, please file an issue on GitHub. This will help us track and fix problems.
- Submit a pull request: If you'd like to contribute code, please fork the repository, make your changes, and submit a pull request. We'll review and merge your changes as soon as possible.
- Help with documentation: If you have expertise in a particular area, please help us improve our documentation.
Thank you for your contributions! 💪