Command line parameters are organized into base arguments, atomic arguments, and integration arguments. The integration arguments are the final set of arguments used on the command line; they inherit from the base arguments and from some of the atomic arguments.
- 🔥tuner_backend: Optional values are 'peft' and 'unsloth', default is 'peft'
- 🔥train_type: Default is 'lora'. Optional values: 'lora', 'full', 'longlora', 'adalora', 'llamapro', 'adapter', 'vera', 'boft', 'fourierft', 'reft'
- seed: Default is 42
- model_kwargs: Extra parameters specific to the model. This parameter list will be logged during training for reference.
- load_dataset_config: When resume_from_checkpoint/ckpt_dir is specified, the `args.json` saved in that directory is read and any parameters that are None are filled from it (manual input takes precedence). If this parameter is set to True, the data parameters are read as well. Default is False.
- use_hf: Default is False. Controls model and dataset downloading, and model pushing to the hub.
- hub_token: Hub token. You can check the modelscope hub token here.
- custom_register_path: The path to the `.py` file that registers custom models, chat templates, and datasets.
- 🔥model: Model ID or local path to the model. For a custom model, use it together with `model_type` and `template`; see the Custom Model documentation for details.
- model_type: Model type. Models that share the same architecture, template, and loading process share a model_type.
- model_revision: Model version.
- 🔥torch_dtype: Data type for model weights, supports `float16`, `bfloat16`, `float32`. Default is read from the config file.
- attn_impl: Attention type, supports `flash_attn`, `sdpa`, `eager`. Default is `sdpa`.
- rope_scaling: RoPE type, supports `linear` and `dynamic`, to be used with `max_length`.
- device_map: Device map configuration used by the model, e.g., 'auto', 'cpu', a JSON string, or a JSON file path.
- local_repo_path: Some models require a GitHub repo when loading. To avoid network issues during `git clone`, you can use a local repo directly. This parameter takes the local repo path; default is `None`.
- 🔥dataset: Dataset ID or path. The format is `dataset_id or dataset_path:sub_dataset#sample_count`, where sub_dataset and sample_count are optional. Separate multiple datasets with spaces. Local datasets support jsonl, csv, json, folders, etc. For custom datasets, refer to the Custom Dataset documentation.
- 🔥val_dataset: Validation dataset ID or path.
- 🔥split_dataset_ratio: Ratio used to split a validation set from the training set when val_dataset is not specified, default is 0.01.
- data_seed: Random seed for the dataset, default is 42.
- 🔥dataset_num_proc: Number of processes for dataset preprocessing, default is 1.
- 🔥streaming: Stream read and process the dataset, default is False.
- load_from_cache_file: Use the preprocessing cache for the dataset, default is False.
- Note: if set to True and the dataset changes, the stale cache may be reused and the changes may not take effect. If this parameter causes issues during training, consider setting it to False.
- download_mode: Dataset download mode, either `reuse_dataset_if_exists` or `force_redownload`. Default is `reuse_dataset_if_exists`.
- strict: If True, an error is thrown whenever a dataset row has a problem; otherwise, erroneous rows are discarded. Default is False.
- 🔥model_name: For self-awareness tasks, input the model's Chinese and English names separated by space.
- 🔥model_author: For self-awareness tasks, input the model author's Chinese and English names separated by space.
- custom_dataset_info: Custom simple dataset registration, refer to the Custom Dataset Documentation.
- 🔥template: Type of dialogue template; defaults to the template type corresponding to the model. `swift pt` converts the dialogue template into a generation template.
- 🔥system: Custom system field, default is None, which uses the template's default system.
- 🔥max_length: Maximum length of tokens for a single sample, default is None (no limit).
- truncation_strategy: How to handle overly long token sequences, supports `delete` and `left`, representing deletion and left-side truncation; default is `left`.
- 🔥max_pixels: Maximum pixel count (H*W) for pre-processing images in multimodal models; by default no scaling is applied.
- tools_prompt: The format used to convert the tools list into the system prompt for agent training; refer to Agent Training. Default is 'react_en'.
- loss_scale: How token loss weights are assigned during training. Default is `'default'`, meaning all response tokens (including history) are weighted 1 in the cross-entropy loss. For specifics, see the Pluginization and Agent Training documentation.
- sequence_parallel_size: Sequence parallelism degree. Refer to the example.
- use_chat_template: Whether to use the chat template or the generation template, default is `True`. `swift pt` automatically switches to the generation template.
- template_backend: Use swift or jinja for inference. If jinja is used, transformers' `apply_chat_template` is utilized. Default is swift.
Refer to the generation_config documentation.
- 🔥max_new_tokens: Maximum new token count supported during inference, default is None (no limit).
- temperature: Temperature parameter, default is None, read from generation_config.json.
- Note: The do_sample parameter has been removed in this version; set temperature to 0 for the same effect.
- top_k: Top_k parameter, default is None, read from generation_config.json.
- top_p: Top_p parameter, default is None, read from generation_config.json.
- repetition_penalty: Penalty for repetition, default is None, read from generation_config.json.
- num_beams: Number of beams for beam search, default is 1.
- 🔥stream: Stream output, default is `False`.
- stop_words: Additional stop words, default is `[]`.
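To make the relationships among these base arguments concrete, here is a minimal sketch of a fine-tuning invocation; the model ID and dataset path are placeholders, not recommendations:

```bash
# Minimal sketch combining model, data, and template arguments.
# <model-id-or-path> and my_data.jsonl are placeholders.
swift sft \
    --model <model-id-or-path> \
    --torch_dtype bfloat16 \
    --attn_impl sdpa \
    --dataset my_data.jsonl#2000 \
    --split_dataset_ratio 0.01 \
    --dataset_num_proc 4 \
    --max_length 2048   # template defaults to the model's own template type
```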
The following are quantization parameters used when loading models. For specific meanings, see the Quantization documentation. These do not include the quantization parameters used by `swift export`, such as `gptq` and `awq`.
- 🔥quant_method: Quantization method used when loading the model, options are `bnb`, `hqq`, `eetq`.
- 🔥quant_bits: Number of bits for quantization, default is None.
- hqq_axis: HQQ quantization axis, default is None.
- bnb_4bit_compute_dtype: BNB quantization compute dtype, options are `float16`, `bfloat16`, `float32`. Defaults to `torch_dtype`.
- bnb_4bit_quant_type: BNB quantization type, supports `fp4` and `nf4`, default is `nf4`.
- bnb_4bit_use_double_quant: Whether to use double quantization, default is `True`.
- bnb_4bit_quant_storage: BNB quantization storage type, default is None.
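As an illustration of how the quantization loading arguments above might be combined for 4-bit bnb loading during LoRA training (a sketch only; the model ID is a placeholder):

```bash
# Sketch: load the model in 4-bit with bitsandbytes (QLoRA-style setup).
swift sft \
    --model <model-id-or-path> \
    --train_type lora \
    --quant_method bnb \
    --quant_bits 4 \
    --bnb_4bit_compute_dtype bfloat16 \
    --bnb_4bit_quant_type nf4 \
    --bnb_4bit_use_double_quant true
```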
This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with some default values overridden by ms-swift. For unlisted items, refer to the HF official documentation.
- 🔥output_dir: Default is `output/<model_name>`.
- 🔥gradient_checkpointing: Whether to use gradient checkpointing, default is True.
- 🔥deepspeed: Default is None. Can be set to 'zero2', 'zero3', 'zero2_offload', 'zero3_offload' to use the built-in deepspeed configuration files from ms-swift.
- 🔥per_device_train_batch_size: Default is 1.
- 🔥per_device_eval_batch_size: Default is 1.
- weight_decay: Weight decay coefficient, default value is 0.1.
- 🔥learning_rate: Learning rate, default is 1e-5 for full-parameter training and 1e-4 for tuner training.
- lr_scheduler_type: LR scheduler type, default is cosine.
- lr_scheduler_kwargs: Other parameters for the LR scheduler.
- report_to: Default is `tensorboard`.
- remove_unused_columns: Default is False.
- logging_first_step: Whether to log the first step, default is True.
- logging_steps: Logging interval in steps, default is 5.
- metric_for_best_model: Default is None. When `predict_with_generate` is set to False, it is 'loss'; otherwise, it is 'rouge-l'.
- greater_is_better: Default is None. When `metric_for_best_model` contains 'loss', it is set to False; otherwise, True.
Other important parameters:
- 🔥num_train_epochs: Number of training epochs, default is 3.
- 🔥gradient_accumulation_steps: Gradient accumulation, default is 1.
- 🔥gradient_checkpointing_kwargs: Parameters passed to `torch.utils.checkpoint`. For example, set to `{"use_reentrant": false}`.
- 🔥save_strategy: Strategy for saving the model, options are 'no', 'steps', 'epoch', default is 'steps'.
- 🔥save_steps: Default is 500.
- 🔥save_total_limit: Default is None, saving all checkpoints.
- 🔥eval_strategy: Evaluation strategy, follows `save_strategy` by default.
- 🔥eval_steps: Default is None. If an evaluation dataset exists, follows `save_steps`.
- max_steps: Maximum number of training steps, default is -1. Must be set when the dataset is streaming.
- 🔥warmup_ratio: Default is 0.
- save_on_each_node: Default is False. Only relevant in multi-node training.
- save_only_model: Default is False. Whether to save only model weights.
- 🔥resume_from_checkpoint: Checkpoint resume parameter, default is None.
- 🔥ddp_backend: Default is None, optional values are "nccl", "gloo", "mpi", "ccl", "hccl", "cncl", "mccl".
- 🔥ddp_find_unused_parameters: Default is None.
- 🔥dataloader_num_workers: Default is 0.
- 🔥neftune_noise_alpha: Noise coefficient added by neftune, default is 0. Generally can be set to 5, 10, 15.
- push_to_hub: Push training weights to hub, default is False.
- hub_model_id: Default is None.
- hub_private_repo: Default is False.
- 🔥freeze_llm: Freeze LLM. Default is False. Applicable for full parameters and LoRA.
- 🔥freeze_vit: Freeze ViT. Default is True. Applicable for full parameters and LoRA.
- 🔥freeze_aligner: Freeze aligner. Default is True, applicable for full parameters and LoRA.
- 🔥target_modules: Modules to attach the tuner to. Default is `all-linear`, which automatically finds all linear layers except lm_head and attaches the tuner. This parameter is not limited to LoRA.
- 🔥target_regex: Regex expression specifying the modules to attach the tuner to. Default is `None`; if provided, target_modules does not take effect. This parameter is not limited to LoRA.
- 🔥init_weights: Method for initializing tuner weights. For LoRA the accepted values are `true`, `false`, `gaussian`, `pissa`, `pissa_niter_[number of iters]`; for Bone they are `true`, `false`, `bat`. Default is `true`.
- modules_to_save: Modules of the original model that remain trainable and are saved after the tuner is attached, default is `[]`. This parameter is not limited to LoRA.
- freeze_parameters: Prefixes of parameters to be frozen, default is `[]`.
- freeze_parameters_ratio: Ratio of parameters to freeze from the bottom up, default is 0. Setting it to 1 freezes all parameters; combine with `trainable_parameters` to mark specific parameters as trainable.
- trainable_parameters: Prefixes of trainable parameters, default is `[]`.
- 🔥lora_rank: Default is `8`.
- 🔥lora_alpha: Default is `32`.
- lora_dropout: Default is `0.05`.
- lora_bias: Default is `'none'`, selectable values are 'none' and 'all'. To make all biases trainable, set it to `'all'`.
- lora_dtype: dtype of the LoRA modules. Supports 'float16', 'bfloat16', 'float32'; defaults to the dtype of the original model.
- 🔥use_dora: Default is `False`, whether to use `DoRA`.
- use_rslora: Default is `False`, whether to use `RS-LoRA`.
- 🔥lorap_lr_ratio: LoRA+ parameter, default is `None`, recommended values are `10~16`. Setting this parameter enables LoRA+ when using LoRA.
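A sketch of how the tuner and LoRA arguments above fit together; the model ID and dataset path are placeholders:

```bash
# Sketch: LoRA fine-tuning with explicit rank/alpha and target-module selection.
swift sft \
    --model <model-id-or-path> \
    --dataset my_data.jsonl \
    --train_type lora \
    --target_modules all-linear \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout 0.05
```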
FourierFt uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
- fourier_n_frequency: Number of frequencies in the Fourier transform, an `int`, similar to `r` in LoRA. Default is `2000`.
- fourier_scaling: Scaling value of the W matrix, a `float`, similar to `lora_alpha` in LoRA. Default is `300.0`.
BOFT uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
- boft_block_size: Size of BOFT blocks, default value is 4.
- boft_block_num: Number of BOFT blocks, cannot be used simultaneously with `boft_block_size`.
- boft_dropout: Dropout value for BOFT, default is 0.0.
Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
- vera_rank: Size of Vera Attention, default value is 256.
- vera_projection_prng_key: Whether to store the Vera mapping matrix, default is True.
- vera_dropout: Dropout value for Vera, default is `0.0`.
- vera_d_initial: Initial value of Vera's d matrix, default is `0.1`.
- 🔥use_galore: Default value is False, whether to use GaLore.
- galore_target_modules: Default is None, if not provided, applies GaLore to attention and MLP.
- galore_rank: Default value is 128, GaLore rank value.
- galore_update_proj_gap: Default is 50, interval for updating decomposed matrices.
- galore_scale: Default is 1.0, matrix weight coefficient.
- galore_proj_type: Default is `std`, type of GaLore matrix decomposition.
- galore_optim_per_parameter: Default is False, whether to set a separate optimizer for each GaLore target parameter.
- galore_with_embedding: Default value is False, whether to apply GaLore to embedding.
- galore_quantization: Whether to use Q-GaLore, default is `False`.
- galore_proj_quant: Whether to quantize the SVD decomposition matrix, default is `False`.
- galore_proj_bits: Number of bits for SVD quantization.
- galore_proj_group_size: Number of groups for SVD quantization.
- galore_cos_threshold: Cosine similarity threshold for updating projection matrices. Default value is 0.4.
- galore_gamma_proj: As the projection matrix becomes more similar over time, this parameter is the coefficient for extending the update interval. Default value is 2.
- galore_queue_size: Length of the queue for calculating projection matrix similarity, default is 5.
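A sketch of enabling GaLore during full-parameter training (model and dataset paths are placeholders):

```bash
# Sketch: full-parameter training with GaLore low-rank gradient projection.
swift sft \
    --model <model-id-or-path> \
    --dataset my_data.jsonl \
    --train_type full \
    --use_galore true \
    --galore_rank 128 \
    --galore_update_proj_gap 50 \
    --galore_scale 1.0
```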
Note: LISA only supports full-parameter training, i.e., `train_type full`.
- 🔥lisa_activated_layers: Default is `0`, meaning LISA is not used. Setting a non-zero value activates that many layers; 2 or 8 is recommended.
- lisa_step_interval: Default is `20`, the number of iterations between switching the set of layers that receive gradients.
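A sketch of turning LISA on during full-parameter training (model and dataset are placeholders; the layer count of 2 follows the recommendation above):

```bash
# Sketch: full-parameter training with LISA activating 2 layers at a time.
swift sft \
    --model <model-id-or-path> \
    --dataset my_data.jsonl \
    --train_type full \
    --lisa_activated_layers 2 \
    --lisa_step_interval 20
```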
🔥Unsloth introduces no new parameters; adjusting the existing ones is enough to enable it:

```
--tuner_backend unsloth
--train_type full/lora
--quant_bits 4
```
- 🔥llamapro_num_new_blocks: Default is `4`, total number of new layers to insert.
- llamapro_num_groups: Default is `None`, number of groups into which the new blocks are inserted. If `None`, it equals `llamapro_num_new_blocks`, meaning each new layer is inserted into the original model separately.
The following parameters take effect when `train_type` is set to `adalora`. The `adalora` parameters such as `target_modules` inherit from the corresponding `lora` parameters, but the `lora_dtype` parameter does not take effect.
- adalora_target_r: Default is `8`, average rank of AdaLoRA.
- adalora_init_r: Default is `12`, initial rank of AdaLoRA.
- adalora_tinit: Default is `0`, initial warmup steps of AdaLoRA.
- adalora_tfinal: Default is `0`, final warmup steps of AdaLoRA.
- adalora_deltaT: Default is `1`, step interval of AdaLoRA.
- adalora_beta1: Default is `0.85`, EMA parameter of AdaLoRA.
- adalora_beta2: Default is `0.85`, EMA parameter of AdaLoRA.
- adalora_orth_reg_weight: Default is `0.5`, orthogonal regularization parameter of AdaLoRA.
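A sketch combining the AdaLoRA parameters above (model and dataset are placeholders; the warmup and interval values are illustrative, not defaults):

```bash
# Sketch: AdaLoRA training targeting an average rank of 8.
swift sft \
    --model <model-id-or-path> \
    --dataset my_data.jsonl \
    --train_type adalora \
    --adalora_target_r 8 \
    --adalora_init_r 12 \
    --adalora_tinit 100 \
    --adalora_tfinal 500 \
    --adalora_deltaT 10
```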
The following parameters take effect when `train_type` is set to `reft`.
- ReFT cannot merge tuners.
- ReFT is not compatible with gradient checkpointing.
- If experiencing issues while using DeepSpeed, please uninstall DeepSpeed temporarily.
- 🔥reft_layers: Which layers ReFT is applied to, default is `None`, meaning all layers. You can pass a list of layer numbers, e.g., `reft_layers 1 2 3 4`.
- 🔥reft_rank: Rank of the ReFT matrix, default is `4`.
- reft_intervention_type: Type of ReFT intervention, supports 'NoreftIntervention', 'LoreftIntervention', 'ConsreftIntervention', 'LobireftIntervention', 'DireftIntervention', 'NodireftIntervention', default is `LoreftIntervention`.
- reft_args: Other supported parameters for the ReFT intervention, passed in JSON string format.
- use_liger: Use liger-kernel for training.
Parameter meanings can be found in the lmdeploy documentation.
- 🔥tp: Tensor parallelism degree, default is `1`.
- session_len: Default is `None`.
- cache_max_entry_count: Default is `0.8`.
- quant_policy: Default is `0`.
- vision_batch_size: Default is `1`.
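For example, the lmdeploy backend might be selected at inference time as follows (a sketch; the model ID is a placeholder):

```bash
# Sketch: inference through the lmdeploy backend with tensor parallelism of 2.
swift infer \
    --model <model-id-or-path> \
    --infer_backend lmdeploy \
    --tp 2 \
    --cache_max_entry_count 0.8
```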
Parameter meanings can be found in the vllm documentation.
- 🔥gpu_memory_utilization: Default is `0.9`.
- 🔥tensor_parallel_size: Default is `1`.
- pipeline_parallel_size: Default is `1`.
- max_num_seqs: Default is `256`.
- 🔥max_model_len: Default is `None`.
- disable_custom_all_reduce: Default is `False`.
- enforce_eager: Whether vLLM uses PyTorch eager mode or builds a CUDA graph. Default is `False`. Setting it to True saves memory but may reduce efficiency.
- 🔥limit_mm_per_prompt: Controls the number of multimodal inputs (e.g., multiple images) vLLM accepts per prompt, default is `None`. For example, use `--limit_mm_per_prompt '{"image": 10, "video": 5}'`.
- vllm_max_lora_rank: Default is `16`. The LoRA parameter supported by vLLM.
- lora_modules: Used to support dynamic switching between multiple LoRAs, default is `[]`.
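Similarly, a sketched vLLM-backed invocation (the model ID is a placeholder):

```bash
# Sketch: inference through the vLLM backend, capping context length and image count.
swift infer \
    --model <model-id-or-path> \
    --infer_backend vllm \
    --gpu_memory_utilization 0.9 \
    --tensor_parallel_size 2 \
    --max_model_len 8192 \
    --limit_mm_per_prompt '{"image": 10, "video": 5}'
```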
- 🔥merge_lora: Whether to merge LoRA. This parameter supports LoRA, llamapro, and longlora; default is False.
- safe_serialization: Whether to store safetensors, default is True.
- max_shard_size: Maximum size of a single storage file, default is '5GB'.
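A sketch of merging LoRA weights back into the base model after training (the checkpoint directory is a placeholder):

```bash
# Sketch: merge LoRA weights and store the result as safetensors shards.
swift export \
    --ckpt_dir output/<checkpoint-dir> \
    --merge_lora true \
    --safe_serialization true \
    --max_shard_size 5GB
```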
Training arguments include the base arguments, Seq2SeqTrainer arguments, tuner arguments, and also include the following parts:
- add_version: Append a `'<version>-<timestamp>'` subdirectory to output_dir to prevent weights from being overwritten, default is True.
- resume_only_model: If resume_from_checkpoint is set, only resume the model weights, default is False.
- check_model: Check local model files for corruption or modification and print a prompt, default is True. In an offline environment, please set this to False.
- loss_type: Type of loss; by default the model's built-in loss function is used.
- num_labels: Must be specified for classification models, representing the number of labels, default is None.
- packing: Whether to use packing, default is False.
- 🔥lazy_tokenize: Whether to use lazy_tokenize. Default is False for LLM training and True for MLLM training.
- acc_strategy: Strategy for computing training accuracy, either `seq`-level or `token`-level accuracy, default is `token`.
- max_new_tokens: Maximum number of generated tokens when `predict_with_generate=True`, default is 64.
- temperature: Temperature when `predict_with_generate=True`, default is 0.
- optimizer: Custom optimizer name for the plugin.
- metric: Custom metric name for the plugin.
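Putting the training arguments together with the trainer arguments above, a sketched end-to-end fine-tuning command might look like this (model and dataset are placeholders):

```bash
# Sketch: LoRA fine-tuning with common trainer settings.
swift sft \
    --model <model-id-or-path> \
    --dataset my_data.jsonl \
    --train_type lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.05 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output
```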
RLHF arguments inherit from the training arguments.
- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`.
- ref_model: Reference model used for comparison in algorithms such as DPO.
- ref_model_type: Same as model_type.
- ref_model_revision: Same as model_revision.
- 🔥beta: KL regularization coefficient, default is `None`, i.e., `2.` for the `simpo` algorithm and `0.1` for the other algorithms. Refer to the documentation for specifics.
- label_smoothing: Whether to use DPO smoothing, default is `0`, generally set between 0 and 0.5.
- 🔥rpo_alpha: Weight of the sft_loss added in DPO, default is `1`. The final loss is `KL_loss + rpo_alpha * sft_loss`.
- cpo_alpha: Coefficient of the NLL loss in the CPO/SimPO loss, default is `1.`.
- simpo_gamma: Reward margin term in the SimPO algorithm; the paper recommends a value between 0.5 and 1.5, default is `1.`.
- desirable_weight: Loss weight $\lambda_D$ for desirable responses in the KTO algorithm, default is `1.`.
- undesirable_weight: Loss weight $\lambda_U$ for undesirable responses in the KTO paper, default is `1.`.
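A sketched DPO invocation using the arguments above (the SFT model and preference dataset are placeholders):

```bash
# Sketch: DPO training with an SFT-loss term mixed in via rpo_alpha.
swift rlhf \
    --rlhf_type dpo \
    --model <sft-model-id-or-path> \
    --dataset my_preference_data.jsonl \
    --train_type lora \
    --beta 0.1 \
    --rpo_alpha 1
```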
Inference arguments include the base arguments, merge arguments, vLLM arguments, LMDeploy arguments, and also contain the following:
- 🔥ckpt_dir: Path to the model checkpoint folder, default is None.
- 🔥infer_backend: Inference backend, supports 'pt', 'vllm', 'lmdeploy', default is 'pt'.
- 🔥max_batch_size: Batch size for pt backend, default is 1.
- result_path: Path to store inference results (jsonl), default is None, saved in the checkpoint directory or './result' directory.
- val_dataset_sample: Number of samples from the inference dataset, default is None.
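For example, a sketched batch-inference run over a validation file (all paths are placeholders):

```bash
# Sketch: batch inference with the pt backend, writing results to a jsonl file.
swift infer \
    --ckpt_dir output/<checkpoint-dir> \
    --infer_backend pt \
    --max_batch_size 8 \
    --val_dataset my_val.jsonl \
    --result_path result/infer_result.jsonl
```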
Deployment Arguments inherit from the inference arguments.
- host: Service host, default is '0.0.0.0'.
- port: Port number, default is 8000.
- api_key: Access key required for access.
- owned_by: Default is `swift`.
- 🔥served_model_name: Model name used for serving, defaults to the model's suffix.
- verbose: Print access logs, default is True.
- log_interval: Interval for printing tokens/s statistics, default is 20 seconds. If set to -1, it will not be printed.
- max_logprobs: Maximum number of logprobs to return, default is 20.
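A sketched deployment command (the model ID, API key, and served name are placeholders):

```bash
# Sketch: serve a model on port 8000 with an access key and a custom served name.
swift deploy \
    --model <model-id-or-path> \
    --infer_backend vllm \
    --host 0.0.0.0 \
    --port 8000 \
    --api_key <your-api-key> \
    --served_model_name my-model
```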
Evaluation Arguments inherit from the deployment arguments.
- 🔥eval_dataset: Evaluation dataset, refer to Evaluation documentation.
- eval_limit: Number of samples for each evaluation set, default is None.
- eval_output_dir: Folder for storing evaluation results, default is 'eval_output'.
- temperature: Default is 0.
- verbose: This parameter is passed to DeployArguments during local evaluation, default is `False`.
- max_batch_size: Maximum batch size, default is 256 for text evaluation and 16 for multimodal.
- 🔥eval_url: Evaluation URL. Default is None, uses local deployment for evaluation.
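A sketched evaluation against an already deployed service (the URL and dataset name are placeholders):

```bash
# Sketch: evaluate a deployed endpoint on a limited number of samples.
swift eval \
    --eval_url <deployed-service-url> \
    --eval_dataset <eval-dataset-name> \
    --eval_limit 100 \
    --eval_output_dir eval_output
```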
Export Arguments include the basic arguments and merge arguments, and also contain the following:
- 🔥ckpt_dir: Checkpoint path, default is None.
- 🔥output_dir: Path for storing export results, default is None.
- 🔥quant_method: Options are 'gptq' and 'awq', default is None.
- quant_n_samples: Number of calibration-set samples for gptq/awq, default is 256.
- max_length: Max length for the calibration set, default is 2048.
- quant_batch_size: Quantization batch size, default is 1.
- group_size: Group size for quantization, default is 128.
- 🔥push_to_hub: Whether to push to the hub, default is False.
- hub_model_id: Model ID for pushing, default is None.
- hub_private_repo: Whether it is a private repo, default is False.
- commit_message: Commit message, default is 'update files'.
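Finally, a sketched GPTQ quantization export; the checkpoint directory and calibration dataset are placeholders:

```bash
# Sketch: export a GPTQ-quantized copy of a trained model.
swift export \
    --ckpt_dir output/<checkpoint-dir> \
    --quant_method gptq \
    --quant_bits 4 \
    --dataset my_calibration.jsonl \
    --quant_n_samples 256 \
    --max_length 2048 \
    --output_dir output/<checkpoint-dir>-gptq-int4
```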