Example with stream = True? #319
-
Hi, is there an example on how to use stream=True? (In general, I think a few more examples in the documentation would be great.)
Replies: 4 comments 1 reply
-
Here is one way to do it.

from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf")  # placeholder: point this at your GGUF model file

prompt = """
# Task
Name the planets in the solar system?
# Answer
"""

# With stream=True, the output is of type `Iterator[CompletionChunk]`.
output = llm.create_completion(prompt, stop=["# Question"], echo=True, stream=True)

# Iterate over the output and print each chunk's text as it arrives.
for item in output:
    print(item['choices'][0]['text'], end='')
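If you also want the complete text after streaming, you can accumulate the chunks as you print them. A minimal sketch, using a stubbed generator in place of a real model so the accumulation logic is self-contained (`fake_stream` is illustrative, not part of llama-cpp-python; each chunk mimics the CompletionChunk dict shape):

```python
# Stub standing in for llm.create_completion(..., stream=True).
def fake_stream():
    for text in ["Mercury", ", Venus", ", Earth"]:
        yield {'choices': [{'text': text}]}

pieces = []
for item in fake_stream():
    piece = item['choices'][0]['text']
    print(piece, end='')   # stream to the terminal as chunks arrive
    pieces.append(piece)   # keep each chunk for later

full_text = ''.join(pieces)
print()
print(f"Full text: {full_text}")
```

The same pattern works with the real iterator: print for responsiveness, append for the final string.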
-
Here is how streaming works for me:
-
And here is how to use it with llama-cpp-python[server]:

import time, requests, json

# record the time before the request is sent
start_time = time.time()

# prepare the request payload
payload = {
    'messages': [
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    'temperature': 0,
    'stream': True  # streaming
}

# send the request to the llama-cpp-python server
response = requests.post('http://localhost:8000/v1/chat/completions', json=payload, stream=True)

# create variables to collect the stream of chunks
collected_chunks = []
collected_messages = []

# iterate through the stream of events
for line in response.iter_lines():
    if line:
        decoded_line = line.decode('utf-8')
        print(f"Raw line received: {decoded_line}")
        if decoded_line.startswith('data:'):
            json_data = decoded_line[len('data:'):].strip()  # remove the 'data:' prefix and whitespace
            if json_data == '[DONE]':  # end of the stream
                print("Stream completed")
                break
            try:
                chunk = json.loads(json_data)
                chunk_time = time.time() - start_time
                collected_chunks.append(chunk)
                if 'choices' in chunk and chunk['choices']:
                    chunk_message = chunk['choices'][0]['delta'].get('content', '')  # get the message delta
                    if chunk_message:
                        collected_messages.append(chunk_message)  # save the message
                        print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")
            except json.JSONDecodeError as e:
                print(f"Failed to decode JSON: {e}")
                continue

if 'chunk_time' in locals():
    print(f"Full response received {chunk_time:.2f} seconds after request")
else:
    print("No valid response received from the server")

collected_messages = [m for m in collected_messages if m]
full_reply_content = ''.join(collected_messages)
print(f"Full conversation received: {full_reply_content}")
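The 'data:'-line handling above can be factored into a small helper and exercised without a running server. A minimal sketch under that assumption (`parse_sse_line` is a name introduced here, not part of llama-cpp-python; the sample lines mimic the server's SSE output shape):

```python
import json

def parse_sse_line(decoded_line):
    """Parse one SSE line. Returns ('done', None) at end of stream,
    ('chunk', dict) for a data chunk, or ('skip', None) otherwise."""
    if not decoded_line.startswith('data:'):
        return ('skip', None)
    json_data = decoded_line[len('data:'):].strip()
    if json_data == '[DONE]':
        return ('done', None)
    try:
        return ('chunk', json.loads(json_data))
    except json.JSONDecodeError:
        return ('skip', None)

# Exercise the helper on sample lines shaped like the server's output.
lines = [
    'data: {"choices": [{"delta": {"content": "1, "}}]}',
    'data: {"choices": [{"delta": {"content": "2"}}]}',
    ': a comment line, ignored',
    'data: [DONE]',
]
messages = []
for line in lines:
    kind, chunk = parse_sse_line(line)
    if kind == 'done':
        break
    if kind == 'chunk' and chunk['choices']:
        messages.append(chunk['choices'][0]['delta'].get('content', ''))

print(''.join(messages))  # 1, 2
```

Keeping the parsing separate from the request loop makes it easy to unit-test the stream handling before pointing it at a live endpoint.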
-
I'm using this hyperparameter in conjunction with the enumerate function to capture the order of each generated token. The strings in place of the values of each hyperparameter are variables adjusted by the user in the LLM interface we are developing, the Samantha Interface Assistant project:
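Capturing the order of each generated token with enumerate can be sketched like this, with a stub generator standing in for the real model stream (`fake_token_stream` and the variable names are illustrative, not from the project):

```python
# Stub standing in for a stream=True completion iterator.
def fake_token_stream():
    for text in ["The", " planets", " orbit"]:
        yield {'choices': [{'text': text}]}

ordered_tokens = []
for index, item in enumerate(fake_token_stream()):
    token = item['choices'][0]['text']
    ordered_tokens.append((index, token))  # record (position, token) in generation order
    print(f"{index}: {token!r}")
```

After the loop, ordered_tokens holds each token paired with its position, which the interface can use to display or replay the generation sequence.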