Hi,

Thank you for the great contributions! I recently tried to reproduce the experiments of TIES, and I found that a small portion of the merged parameters (less than 3%) deviates from my expectations.
Compared to Algorithm 1 in TIES (page 4 of the arXiv paper), this toolkit adds a `weight` variable for each model. Intuitively, I expected `weight` to affect only the disjoint merging step, replacing the equation

$$\tau_m^p=\frac{1}{\vert A_p\vert}\sum_{t\in A_p}\hat{\tau}^p_t$$

with

$$\tau_m^p=\frac{1}{\sum_{t\in A_p} w_t}\sum_{t\in A_p}w_t\cdot\hat{\tau}^p_t$$

(where $w_1, \cdots, w_n$ denote the weights for models 1 to $n$).
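For reference, here is a minimal NumPy sketch of that weighted disjoint mean (my own illustration, not code from this toolkit; `deltas` are the trimmed task vectors and `elected_sign` the per-parameter elected signs):

```python
import numpy as np

def weighted_disjoint_mean(deltas, weights, elected_sign):
    """Average only the entries whose sign matches the elected sign,
    weighting each model's contribution by its `weight` value."""
    num = np.zeros_like(deltas[0])
    den = np.zeros_like(deltas[0])
    for tau, w in zip(deltas, weights):
        in_consensus = np.sign(tau) == elected_sign   # membership in A_p
        num += np.where(in_consensus, w * tau, 0.0)
        den += np.where(in_consensus, w, 0.0)
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
```

With all weights equal, this reduces to the plain mean from the paper.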
However, I found that `weight` affects not only the averaging of the retained parameters but also the behavior of sign election. After carefully reviewing the code, I found that this toolkit implements TIES on top of Task Arithmetic, which introduces a rescaling variable $\lambda$ (i.e., `weight` here) for each task vector.
Following Task Arithmetic, this toolkit multiplies the task vectors by `weight` once, right after sparsification. This means the magnitudes used during sign election are already rescaled. For example, suppose I merge a French model and a Math model with the following config:
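(The values below are illustrative rather than my exact config; the relevant detail is only that `llama-2-mathsft` is weighted 4 times higher than `llama-2-french`, and the paths and densities are placeholders.)

```yaml
merge_method: ties
base_model: meta-llama/Llama-2-7b-hf
models:
  - model: llama-2-french      # hypothetical path
    parameters:
      weight: 0.2
      density: 0.5
  - model: llama-2-mathsft     # hypothetical path
    parameters:
      weight: 0.8
      density: 0.5
dtype: float16
```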
Then, for this toolkit to elect the sign of the French model, the original magnitude of a delta parameter in `llama-2-french` has to be more than 4 times the magnitude of the corresponding one in `llama-2-mathsft`.
I am not sure whether this is the right way to extend TIES. I suspect there are several people who expect the `weight` variables to be active only during averaging and not to affect sign election.
Thanks!
You're right that per-model weight is an extension of the method defined in the paper. This was a common request when I first implemented it and I kinda just came up with what I thought made sense.
In the case where `weight` is equal across all models, the approach in mergekit should give identical results to the algorithm as defined. I suspect the small percentage of deviation you're seeing can be attributed to floating point precision. If you try the merge with `dtype: float32` (with or without `out_dtype: float16`), do you see results in line with what you expect?
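For reference, both settings sit at the top level of the config, roughly like this:

```yaml
merge_method: ties
# ... models and parameters as in your config ...
dtype: float32       # dtype used while computing the merge
out_dtype: float16   # dtype the merged weights are saved in
```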
As for the unequal contribution to sign election from models with different weight values, this was a deliberate choice based on what I think people probably want - consider a model that disagrees extensively with the consensus but is weighted very low (think a few percent). Should it really have as much sway as a model with weight 1.0? My intuition says no.
This is totally a judgement call though and either interpretation could be valid. If that's specifically the behavior you want then it shouldn't be too hard to introduce a variant that does sign election before rescaling.
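Just to sketch what I mean (toy numbers and my own function names, not the actual mergekit code):

```python
import numpy as np

# toy trimmed task vectors for two models, plus their weights
deltas  = [np.array([ 0.3, -0.2]),   # e.g. the "french" delta
           np.array([-0.1,  0.5])]   # e.g. the "math" delta
weights = [0.2, 0.8]

def elect_sign(deltas, weights=None):
    """Per-parameter majority sign by total (optionally weighted) magnitude."""
    stacked = np.stack(deltas)                        # (n_models, n_params)
    if weights is not None:
        stacked = stacked * np.asarray(weights)[:, None]
    return np.sign(stacked.sum(axis=0))

# current behavior: rescale by weight first, then elect signs
print(elect_sign(deltas, weights))   # elected signs: [-1, 1]

# possible variant: elect signs from the unscaled deltas,
# apply weights only in the disjoint averaging afterwards
print(elect_sign(deltas))            # elected signs: [1, 1] - first parameter flips
```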
I will check whether floating point precision is responsible for the deviation between the two implementations.
I agree that it is easy to adjust the order of sign election and rescaling. I have tried both orderings, and the results show that rescaling before sign election performs better on my datasets.