
High latency #33

Open
ferdinandl007 opened this issue Aug 21, 2024 · 0 comments
Labels
question: Further information is requested

Comments

@ferdinandl007

Describe the bug
Inference latency seems to be much higher when using LLM.swift than when running the same model through LM Studio: roughly 2x the time to first token and 5x the time per token.
To Reproduce
Minimal code that reproduces the behavior:

import SwiftUI
import LLM


// Subclass LLM, pointing it at the quantized GGUF model bundled with the app
// and wrapping the system prompt in a chatML template.
class ChatBot: LLM {
    convenience init() {
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        let systemPrompt = "you are helpful, highly intelligent assistant!"
        self.init(from: url, template: .chatML(systemPrompt))
    }
}

struct ChatView: View {
    @ObservedObject var bot: ChatBot
    @State var input = "Give me seven national flag emojis people use the most; You must include South Korea."
    init(_ bot: ChatBot) { self.bot = bot }
    func respond() { Task { await bot.respond(to: input) } }
    func stop() { bot.stop() }
    var body: some View {
        VStack(alignment: .leading) {
            ScrollView { Text(bot.output).monospaced() }
            Spacer()
            HStack {
                ZStack {
                    RoundedRectangle(cornerRadius: 8).foregroundStyle(.thinMaterial).frame(height: 40)
                    TextField("input", text: $input).padding(8)
                }
                Button(action: respond) { Image(systemName: "paperplane.fill") }
                Button(action: stop) { Image(systemName: "xmark") }
            }
        }.frame(maxWidth: .infinity).padding()
    }
}
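
To put numbers on the claim, a rough timing harness built only on the code above plus Foundation could look like the sketch below. It measures the whole response; the 4-characters-per-token divisor is an assumed approximation for throughput, not a real tokenizer count, and time to first token would additionally require observing the first change to bot.output.

import Foundation

// Wall-clock the full response, then estimate per-token throughput
// from the output length. The 4 chars/token divisor is an assumption,
// not a real tokenizer count.
func benchmark(_ bot: ChatBot, prompt: String) async {
    let start = Date()
    await bot.respond(to: prompt)
    let elapsed = Date().timeIntervalSince(start)
    let approxTokens = max(1, bot.output.count / 4)
    print("total \(elapsed)s, ~\(approxTokens) tokens, ~\(elapsed / Double(approxTokens))s/token")
}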

Expected behavior
As both run on llama.cpp, I would expect the latency to be comparable.


Desktop (please complete the following information):

  • Chip: [e.g. Apple M1]
  • Memory: [e.g. 16GB]
  • OS: [e.g. macOS 14.0]

Additional context
I tried making the inference settings identical as well, but it did not help; latency was still significantly slower. Am I missing anything here?
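
For reference, matching the sampling settings on the LLM.swift side would look roughly like the sketch below. The parameter names (topK, topP, temp, maxTokenCount) follow the LLM.swift README and may differ between versions, and the values shown are placeholders to be mirrored from whatever LM Studio reports, not recommendations.

// Hypothetical sketch: parameter names per the LLM.swift README; verify
// against the version actually built. Values are placeholders to match
// against LM Studio's settings, not recommendations.
class TunedChatBot: LLM {
    convenience init() {
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        self.init(
            from: url,
            template: .chatML("you are helpful, highly intelligent assistant!"),
            topK: 40,            // assumed default, match LM Studio's value
            topP: 0.95,          // assumed default, match LM Studio's value
            temp: 0.8,           // assumed default, match LM Studio's value
            maxTokenCount: 2048
        )
    }
}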

@eastriverlee added the question (Further information is requested) label and removed the bug, unconfirmed labels on Oct 19, 2024