Describe the bug
Inference latency is a lot higher when using LLM.swift than when running the same model through LM Studio: roughly 2x the time to first token and about 5x the latency per token.
To Reproduce
You must include minimal code that can reproduce the behavior, for example:
import SwiftUI
import LLM
class ChatBot: LLM {
    convenience init() {
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        let systemPrompt = "you are helpful, highly intelligent assistant!"
        self.init(from: url, template: .chatML(systemPrompt))
    }
}

struct ChatView: View {
    @ObservedObject var bot: ChatBot
    @State var input = "Give me seven national flag emojis people use the most; You must include South Korea."
    init(_ bot: ChatBot) { self.bot = bot }
    func respond() { Task { await bot.respond(to: input) } }
    func stop() { bot.stop() }
    var body: some View {
        VStack(alignment: .leading) {
            ScrollView { Text(bot.output).monospaced() }
            Spacer()
            HStack {
                ZStack {
                    RoundedRectangle(cornerRadius: 8)
                        .foregroundStyle(.thinMaterial)
                        .frame(height: 40)
                    TextField("input", text: $input)
                        .padding(8)
                }
                Button(action: respond) { Image(systemName: "paperplane.fill") }
                Button(action: stop) { Image(systemName: "xmark") }
            }
        }
        .frame(maxWidth: .infinity)
        .padding()
    }
}
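A rough way to time a full response, using only the respond(to:) and output members from the code above; whitespace word count is only a crude stand-in for token count, so treat the numbers as approximate:

import Foundation

// Rough wall-clock timing of one full response; this is a sketch,
// not how LM Studio or LLM.swift compute their own statistics.
func measureLatency(of bot: ChatBot, prompt: String) async {
    let start = Date()
    await bot.respond(to: prompt)
    let elapsed = Date().timeIntervalSince(start)
    // Whitespace-separated word count as a crude proxy for token count.
    let words = max(bot.output.split(whereSeparator: \.isWhitespace).count, 1)
    print("total: \(String(format: "%.2f", elapsed))s, ~\(String(format: "%.3f", elapsed / Double(words)))s per word")
}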
Expected behavior
As both run on llama.cpp, I would expect the latency to be comparable.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Chip: [e.g. Apple M1]
Memory: [e.g. 16GB]
OS: [e.g. macOS 14.0]
Additional context
I tried making the inference settings identical as well, but it did not help; latency was still significantly slower. Am I missing anything here?
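For reference, a sketch of how matching sampling settings might be passed to the initializer; the parameter names below (topK, topP, temp, maxTokenCount) are assumed from the library's convenience initializer and may not match the exact signature of the release you are on:

// Sketch only: parameter names and defaults are assumptions, verify against
// the LLM.swift initializer in your installed version.
class TunedChatBot: LLM {
    convenience init() {
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        self.init(
            from: url,
            template: .chatML("you are helpful, highly intelligent assistant!"),
            topK: 40,            // match LM Studio's top-k
            topP: 0.95,          // match LM Studio's top-p
            temp: 0.8,           // match LM Studio's temperature
            maxTokenCount: 2048  // context / generation limit used in LM Studio
        )
    }
}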