| Model name | Model source | Sample workspace | Kubernetes Workload | Distributed inference |
|---|---|---|---|---|
| phi-3-mini-4k-instruct | microsoft | link | Deployment | false |
| phi-3-mini-128k-instruct | microsoft | link | Deployment | false |
| phi-3-medium-4k-instruct | microsoft | link | Deployment | false |
| phi-3-medium-128k-instruct | microsoft | link | Deployment | false |
- Public: KAITO maintainers manage the lifecycle of the inference service images, which contain the model weights. The images are available in Microsoft Container Registry (MCR).
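For reference, each of these presets is deployed through a KAITO Workspace custom resource. Below is a minimal sketch, assuming the `kaito.sh/v1alpha1` Workspace CRD is installed in the cluster; the workspace name, instance type, and label selector are illustrative assumptions, not values taken from the table above:

```bash
# Sketch: create a Workspace for the phi-3-mini-4k-instruct preset.
# Instance type and labels are illustrative; adjust them to your cluster.
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3
inference:
  preset:
    name: phi-3-mini-4k-instruct
EOF
```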
Phi-3 Mini models are best suited for prompts in the chat format. You can provide a prompt as a question using the following generic template:

```
<|user|>\nQuestion<|end|>\n<|assistant|>
```
For more information on usage, see the Phi-3 Repo.
The inference service endpoint is `/chat`.

A basic example:

```bash
curl -X POST "http://<SERVICE>:80/chat" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"<|user|> How to explain Internet for a medieval knight?<|end|><|assistant|>"}'
```
A more complete request that sets the generation parameters explicitly:

```bash
curl -X POST \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
        "prompt":"<|user|> What is the meaning of life?<|end|><|assistant|>",
        "return_full_text": false,
        "clean_up_tokenization_spaces": false,
        "prefix": null,
        "handle_long_generation": null,
        "generate_kwargs": {
          "max_length": 200,
          "min_length": 0,
          "do_sample": true,
          "early_stopping": false,
          "num_beams": 1,
          "num_beam_groups": 1,
          "diversity_penalty": 0.0,
          "temperature": 1.0,
          "top_k": 10,
          "top_p": 1,
          "typical_p": 1,
          "repetition_penalty": 1,
          "length_penalty": 1,
          "no_repeat_ngram_size": 0,
          "encoder_no_repeat_ngram_size": 0,
          "bad_words_ids": null,
          "num_return_sequences": 1,
          "output_scores": false,
          "return_dict_in_generate": false,
          "forced_bos_token_id": null,
          "forced_eos_token_id": null,
          "remove_invalid_values": null
        }
      }' \
  "http://<SERVICE>:80/chat"
```
The supported request fields and `generate_kwargs` are:

- `prompt`: The initial text provided by the user, from which the model will continue generating text.
- `return_full_text`: If false, only the generated text is returned; otherwise the full text, including the prompt, is returned.
- `clean_up_tokenization_spaces`: True/False; determines whether to remove potential extra spaces in the text output.
- `prefix`: Prefix added to the prompt.
- `handle_long_generation`: Provides strategies to address generations beyond the model's maximum length capacity.
- `max_length`: The maximum total number of tokens in the generated text.
- `min_length`: The minimum total number of tokens that should be generated.
- `do_sample`: If true, sampling methods are used for text generation, which can introduce randomness and variation.
- `early_stopping`: If true, generation stops early when certain conditions are met, for example when a satisfactory number of candidates has been found in beam search.
- `num_beams`: The number of beams used in beam search. More beams can lead to better results but are more computationally expensive.
- `num_beam_groups`: Divides the beams into groups to promote diversity in the generated results.
- `diversity_penalty`: Penalizes the score of tokens that make the current generation too similar to other groups, encouraging diverse outputs.
- `temperature`: Controls the randomness of the output by scaling the logits before sampling.
- `top_k`: Restricts sampling to the k most likely next tokens.
- `top_p`: Uses nucleus sampling to restrict the sampling pool to the tokens comprising the top p probability mass.
- `typical_p`: Adjusts the probability distribution to favor tokens that are "typically" likely given the context.
- `repetition_penalty`: Penalizes tokens that have been generated previously, aiming to reduce repetition.
- `length_penalty`: Modifies scores based on sequence length to encourage shorter or longer outputs.
- `no_repeat_ngram_size`: Prevents the generation of any n-gram more than once.
- `encoder_no_repeat_ngram_size`: Similar to `no_repeat_ngram_size`, but applies to the encoder part of encoder-decoder models.
- `bad_words_ids`: A list of token IDs that should not be generated.
- `num_return_sequences`: The number of different sequences to generate.
- `output_scores`: Whether to output the prediction scores.
- `return_dict_in_generate`: If true, the method returns a dictionary containing additional information.
- `pad_token_id`: The token ID used for padding sequences to the same length.
- `eos_token_id`: The token ID that signifies the end of a sequence.
- `forced_bos_token_id`: The token ID that is forcibly used as the first generated token.
- `forced_eos_token_id`: The token ID that is forcibly used as the last generated token when `max_length` is reached.
- `remove_invalid_values`: If true, filters out invalid values such as NaNs or infs from model outputs to prevent crashes.
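You do not need to send every parameter on each request. As a sketch, the request below turns off sampling for reproducible (greedy) output, on the assumption that omitted `generate_kwargs` keep their server-side defaults:

```bash
# Sketch: deterministic generation by disabling sampling (greedy decoding).
# Omitted generate_kwargs are assumed to fall back to the model defaults.
curl -X POST "http://<SERVICE>:80/chat" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
        "prompt":"<|user|> Summarize Kubernetes in one sentence.<|end|><|assistant|>",
        "return_full_text": false,
        "generate_kwargs": { "max_length": 100, "do_sample": false }
      }'
```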