The models used in our paper can be downloaded from the following links (compatible speculative models are listed as children of their respective target models):
-
The above links host files for several different quantization levels. The specific quantizations used can be found in Table 1 of our paper.
When running PipeInfer, pass the large target model as the `-m` parameter and the smaller speculative model as the `-md` parameter. For example, to run Dolphin 70B with TinyLlama as the speculative model, given that each model is downloaded to a `models/` folder:

```
-m models/dolphin-2.2-70b.Q3_K_M.gguf -md models/tinyllama-1.1b-1t-openorca.Q4_K_M.gguf
```
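As a sketch, a complete invocation might look like the following. Note that the binary name `pipeinfer` and the prompt and token-count flags (`-p`, `-n`) are assumptions for illustration, not confirmed details; consult the repository's build output and `--help` for the actual binary name and supported options.

```shell
# Hypothetical full invocation: binary name and extra flags are assumptions.
# -m  : large target model
# -md : smaller speculative (draft) model
./pipeinfer \
    -m models/dolphin-2.2-70b.Q3_K_M.gguf \
    -md models/tinyllama-1.1b-1t-openorca.Q4_K_M.gguf \
    -p "Once upon a time" \
    -n 128
```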