seems that MoA does not work on MATH and QA with both weak and strong LLMs #41

yananchen1989 opened this issue Aug 7, 2024 · 4 comments


yananchen1989 commented Aug 7, 2024

I have thoroughly tested MoA (with one layer) on some objective benchmarks (less subjective than MT-Bench), such as GSM8K and HotpotQA.
It seems that when the LLMs are 7B-level, MoA no longer works.
In my setting, the three LLMs in layer one are mistralai/Mistral-7B-Instruct-v0.1/2/3, while the aggregator is meta-llama/Meta-Llama-3.1-8B-Instruct.
(Before the experiment, I tested each model's individual capability to solve the problems; the most powerful one is Llama-3.1-8B.)

Then, when applying MoA, I find that performance decreases. For example, on GSM8K, accuracy drops from 75.1 to 61.3: Llama-3.1 alone achieves 75.1 (rounds=0), while 61.3 comes from rounds=1, where the intermediate layer consists of Mistral-7B v0.1/2/3.

This finding also applies to HotpotQA.

Does anyone observe something similar? Any suggestions on how to use 7B-level LLMs?
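
For reference, here is a minimal sketch of the one-layer pipeline I am describing (assuming an OpenAI-compatible serving endpoint; the aggregation prompt below is illustrative, not the exact one from the MoA repo):

```python
# Minimal one-layer MoA sketch: three proposers answer independently,
# then an aggregator synthesizes their answers into a final response.
from openai import OpenAI

client = OpenAI()  # point base_url / api_key at your own serving endpoint

proposers = [
    "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
aggregator = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def ask(model, messages):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def moa_answer(question):
    # rounds=1: collect one reference answer from each proposer
    references = [ask(m, [{"role": "user", "content": question}])
                  for m in proposers]
    refs = "\n\n".join(f"Agent #{i + 1}: {r}" for i, r in enumerate(references))
    # The aggregator sees the original question plus all proposer answers
    system = ("You have been provided with responses from other agents to the "
              "user's query. Some may be wrong; synthesize them into a single, "
              "accurate answer. Responses:\n\n" + refs)
    return ask(aggregator, [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ])
```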

yananchen1989 (Author) commented:

Here I paste one example, for your information.
In GSM8K, the oracle answer for one example is:

Since each apple costs $4, it means that on Monday, Lucy picked 56/4 = <<56/4=14>>14 apples
Combining Monday and Tuesday, Lucy picked 14 + 12 = <<14+12=26>>26 apples
On Wednesday, she picked 2 * 12 = <<2*12=24>>24 apples
All together, Lucy picked 26 + 24 = <<26+24=50>>50 apples.
#### 50

And Llama-3.1-8B's answer (which is correct) is:

On Monday, Lucy sold 56/4 = 14 apples.
On Tuesday, she picked 12 apples.
On Wednesday, she picked double the number of apples she did the previous day, which is 2*12 = 24 apples.
Total apples picked over the three days = 14 + 12 + 24 = 50 apples.
#### 50

But when applying MoA, Llama-3.1-8B receives not only the original question but also the three answers from the intermediate layer. Here,
Mistral-v0.1's answer (which is wrong) is:

On Monday, Lucy sold all the apples picked, so she picked 12 apples on Tuesday and 24 apples on Wednesday.
The total number of apples picked over the three days is 12 + 24 + 24 = <<12+24+24=60>>60 apples.
The total amount of money Lucy got from selling the apples picked over the three days is $4/piece * 60 pieces = $<<4*60=240>>240.
The total amount of money Lucy got from selling the apples picked on Monday is $56.
Therefore, the total amount of money Lucy got from selling the apples picked over the three days is $56 + $240 = $<<56+240=296>>296.
#### 296

Mistral-v0.2's answer (which is wrong) is:

Let's denote the number of apples picked on Monday as x.
We know that x = $56 / $4 = <<56/4=14>>14 apples

On Tuesday, Lucy picked 12 apples.

On Wednesday, she picked double the number of apples she picked on Tuesday, so she picked 12 * 2 = <<12*2=24>>24 apples.

The total number of apples picked over the three days is 14 + 12 + 24 = <<14+12+24=42>>42 apples.
#### 42

Mistral-v0.3's answer (which is correct) is:

Let's think step by step:

1. On Monday, Lucy sold all the apples picked, and she got $56. So, the number of apples picked on Monday is $56 / $4 = <<56/4=14>>14 apples.
2. On Tuesday, Lucy picked 12 apples.
3. On Wednesday, Lucy picked double the number of apples she did the previous day, which is 12 * 2 = <<12*2=24>>24 apples.
4. To find the total number of apples picked over the three days, we add the number of apples picked on each day: 14 + 12 + 24 = <<14+12+24=50>>50 apples.

#### 50

After ingesting the answers from the Mistral models, Llama-3.1 changes its answer to the following (which is wrong):

Let's think step by step:

1. On Monday, Lucy sold all the apples picked, and she got $56. So, the number of apples picked on Monday is $56 / $4 = 14 apples.
2. On Tuesday, Lucy picked 12 apples.
3. On Wednesday, Lucy picked double the number of apples she did the previous day, which is 12 * 2 = 24 apples.
4. To find the total number of apples picked over the three days, we add the number of apples picked on each day: 14 + 12 + 24 = 50 apples.

However, the problem states that on Monday, she sold all the apples picked, and on Tuesday, she picked 12 apples. This means that the number of apples picked on Monday is actually 56 / 4 = 14, but this is the number of apples sold, not picked. The number of apples picked on Monday is actually unknown.

Let's denote the number of apples picked on Monday as x. We know that x - 12 = 14, because she sold 14 apples on Monday, and she picked 12 apples on Tuesday. Solving for x, we get x = 26.

On Tuesday, Lucy picked 12 apples.

On Wednesday, she picked double the number of apples she picked on Tuesday, so she picked 12 * 2 = 24 apples.

The total number of apples picked over the three days is 26 + 12 + 24 = 62 apples.

#### 62

yananchen1989 (Author) commented:

Among other test cases, Mistral-v0.3 is not always the one that is correct; it does outperform v0.1 and v0.2 a little, but not by a significant margin.
In my setting, I simply simulate the full MoA setup, where the aggregator is the most capable model while the LLMs in the intermediate layers are weaker.

My expectation is that with MoA, that is, with the intermediate opinions, the final aggregator should produce a better response than the version without them.

yananchen1989 changed the title from "seems that MoA does not work on MATH and QA with weak LLMs" to "seems that MoA does not work on MATH and QA with both weak and strong LLMs" on Aug 8, 2024.

yananchen1989 commented Aug 8, 2024

Update:
Now I use stronger LLMs.
Intermediate layer:

reference_models = ['microsoft/WizardLM-2-8x22B',
                    'mistralai/Mixtral-8x7B-Instruct-v0.1', 
                    'Qwen/Qwen2-72B-Instruct', 
                    'meta-llama/Meta-Llama-3-70B-Instruct-Turbo', 
                    'deepseek-ai/deepseek-llm-67b-chat']

aggregator: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo

The experimental result is that the MoA setting does not outperform the setting without MoA; they achieve roughly equal accuracy.
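
For scoring, I compare the number after the "#### " marker. Here is a minimal sketch of the extraction, assuming the standard GSM8K answer format (the helper names are illustrative):

```python
import re

def extract_answer(text):
    # GSM8K convention: the final numeric answer follows "#### "
    m = re.search(r"####\s*([-\d,.]+)", text)
    return m.group(1).replace(",", "").rstrip(".") if m else None

def accuracy(predictions, oracles):
    hits = 0
    for pred, oracle in zip(predictions, oracles):
        a, b = extract_answer(pred), extract_answer(oracle)
        hits += (a is not None) and (a == b)
    return hits / len(oracles)
```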

Here I also paste one example from GSM8K

oracle answer:

Pat's flower bed is 111 feet / 1.5 feet per plant = <<111/1.5=74>>74 plants needed.
Pat needs to buy 74 plants - 17 plants = <<74-17=57>>57 plants needed to purchase.
Pat will spend 57 plants * $6 = $<<57*6=342>>342.
#### 342

answer from the aggregator directly, without MoA:

First, convert the length of the flower bed to inches: 111 feet * 12 inches/foot = 1332 inches.
Each plant takes up 12 inches + 1.5 feet * 12 inches/foot = 12 + 18 = 30 inches.
The total number of plants needed is 1332 inches / 30 inches/plant = 44.4, round up to 45 plants.
Pat already owns 17 flowers, so she needs to buy 45 - 17 = 28 flowers.
Each flower costs $6, so Pat will spend 28 flowers * $6/flower = $168.
#### 168

answer from the aggregator with MoA, where the prompt includes the responses from the LLMs in the intermediate layer:

After reviewing the responses from other agents, I agree with agents #1, #3, and #4 that the correct calculation is:
The space needed for each plant, including the gap, is 1 foot (for the plant) + 1.5 feet (for the gap) = 2.5 feet.
The total number of plants Pat can fit in her flower bed is 111 feet / 2.5 feet per plant = 44.4 plants. Since Pat can't plant a fraction of a plant, she can only plant 44 plants.
Pat already owns 17 plants, so she needs to buy 44 - 17 = 27 more plants.
Each flowering plant costs $6, so Pat will spend 27 * $6 = $162.
I disagree with agent #2, who rounded up to 45 plants, and agent #5, who incorrectly calculated the number of plants that can fit in the flower bed.
#### 162

answer from microsoft/WizardLM-2-8x22B in the intermediate layer:

1.5 feet is the space needed between each plant, which includes the space taken by the plant itself. Since the plants grow to be 12 inches wide, that's 1 foot wide. So, the total space needed for each plant including the gap is 1 foot (for the plant) + 1.5 feet (for the gap) = 2.5 feet.

Now, let's calculate how many more plants Pat needs to buy to fill the 111 feet long flower bed. Pat already has 17 plants. Since each plant requires 2.5 feet of space, the total number of spaces (including the gap) that can be filled in a 111 feet long bed is 111 / 2.5 = <<111/2.5=44.4>>44.4, but since we can't have a fraction of a plant, we round down to 44 plants.

Pat already has 17 plants, so she needs to buy 44 - 17 = <<44-17=27>>27 more plants.

Each flowering plant costs $6, so Pat will spend 27 * $6 = $<<27*6=162>>162 on new plants.
#### 162

answer from mistralai/Mixtral-8x7B-Instruct-v0.1 in the intermediate layer:

First, we need to determine how many plants Pat needs to fill her flower bed.

Pat's flower bed is 111 feet long, and her plants grow 12 inches wide, which is equivalent to 1 foot. So, the space that each plant occupies is 1 foot + 1.5 feet = <<1+1.5=2.5>>2.5 feet.

The number of plants Pat needs is 111 feet / 2.5 feet per plant = <<111/2.5=44.4>>44.4 plants.

Since Pat cannot have a fraction of a plant, she will need to round up to the nearest whole number, which is 45 plants.

Pat already owns 17 plants, so she needs to buy 45 - 17 = <<45-17=28>>28 plants.

Each plant costs $6, so Pat will spend $6 * 28 plants = $<<6*28=168>>168.
#### 168

Note: I assumed that the space between the plants is 1.5 feet between the center of each plant. If the space is between the edges of the plants, the space would be 1.5 feet + 1 foot = 2.5 feet, and the number of plants needed would be 111 feet / 2.5 feet per plant = 44.4 plants, which would still round up to 45 plants. The final answer would still be $168.

answer from Qwen/Qwen2-72B-Instruct in the intermediate layer:

To find out how many flowers Pat can plant in her flower bed, we need to divide the length of the flower bed by the width of each flower plus the space between each flower. The width of each flower is 12 inches, which is equal to 1 foot, and the space between each flower is 1.5 feet. So, the total space needed for each flower is 1 + 1.5 = 2.5 feet.
The length of the flower bed is 111 feet, so Pat can plant 111 / 2.5 = 44.4 flowers. Since Pat can't plant a fraction of a flower, she can only plant 44 flowers.
Pat already owns 17 flowers, so she needs to buy 44 - 17 = 27 more flowers.
Each flowering plant costs $6, so Pat will spend 27 * $6 = $162 at the store to fill up her flower bed.
#### 162

answer from meta-llama/Meta-Llama-3-70B-Instruct-Turbo in the intermediate layer:

First, convert the length of the flower bed from feet to inches: 111 feet * 12 inches/foot = 1332 inches.
The width of each plant is 12 inches, and Pat needs to leave 1.5 feet (18 inches) between each plant, so the total space per plant is 12 + 18 = 30 inches.
The total number of plants Pat can fit in her flower bed is 1332 inches / 30 inches/plant = 44.4 plants. Since Pat already has 17 plants, she needs to buy 44 - 17 = 27 more plants.
Each plant costs $6, so Pat will spend 27 plants * $6/plant = $162.
#### 162

answer from deepseek-ai/deepseek-llm-67b-chat in the intermediate layer:

Pat's flower bed is 111 feet long and she wants to leave 1.5 feet between each plant, so she can fit 111/1.5 = <<111/1.5=74>>74 plants in her flower bed.
Pat already owns 17 flowers, so she needs to buy 74 - 17 = <<74-17=57>>57 more plants.
Each plant costs $6, so Pat will spend 57 * $6 = $<<57*6=342>>342 at the store.
The answer is: 342

ikamensh commented:

This is not surprising after reading the MoA paper (https://arxiv.org/pdf/2406.04692). Table 4 there shows the effect of using different models either as proposers or as aggregators: weaker models drop a lot as aggregators while still being useful proposers.

| Model | As aggregator | As proposer |
| --- | --- | --- |
| Qwen1.5-110B-Chat | 61.3% | 56.7% |
| Qwen1.5-72B-Chat | 59.3% | 53.3% |
| LLaMA-3-70b-Instruct | 45.0% | 60.6% |
| WizardLM 8x22B | 52.9% | 63.8% |
| Mixtral-8x22B-Instruct | 48.4% | 54.8% |
| dbrx-instruct | 41.5% | 55.1% |

From experience, when you overwhelm a weak model with a lot of information, it will struggle to pick out the useful bits; perhaps something like this is happening in your experiments.
