Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INSERT sometimes unexpectedly mutates input when Returning is not specified #1029

Open
david-bezero opened this issue Oct 15, 2024 · 4 comments
Assignees

Comments

@david-bezero
Copy link

david-bezero commented Oct 15, 2024

Bun version: 1.2.3

In some specific circumstances, an insert statement can mutate its input even though Returning was not specified. This can lead to surprising results:

package main_test

import (
	"context"
	"github.com/stretchr/testify/require"
	"github.com/uptrace/bun"
	"github.com/uptrace/bun/dialect/pgdialect"
	"log/slog"
	"testing"
)

func TestTheory(t *testing.T) {
	connection := // get a Postgres database connection by whatever means

	db := bun.NewDB(connection, pgdialect.New())
	_, err = db.Exec("CREATE TABLE test (foo INT PRIMARY KEY, bar TEXT)")
	require.NoError(t, err)

	// set up some "constants":
	bar1 := "v1"
	bar2 := "v2"
	bar3 := "v3"

	// seed some data:
	require.NoError(t, mySaveFunction(db, []pocEntity{
		{Foo: 1, Bar: &bar1},
		{Foo: 2, Bar: &bar2},
	}))

	// update with some new data:
	require.NoError(t, mySaveFunction(db, []pocEntity{
		{Foo: 1, Bar: &bar1},
		{Foo: 2, Bar: nil}, // this NULL value is required to trigger the behaviour
		{Foo: 3, Bar: &bar3},
	}))

	// check constants have not mutated
	require.Equal(t, "v1", bar1)
	require.Equal(t, "v2", bar2)
	require.Equal(t, "v3", bar3)
}

func mySaveFunction(db *bun.DB, data []pocEntity) error {
	slog.Info("saving", slog.Any("entities", data))
	_, err := db.NewInsert().
		Model(&data).
		On("CONFLICT (foo) DO NOTHING").
		//Returning("NULL").
		//Returning("*").
		Exec(context.Background())
	if err != nil {
		return err
	}
	slog.Info("got", slog.Any("entities", data))
	return nil
}

type pocEntity struct {
	bun.BaseModel `bun:"table:test"`

	Foo int     `bun:"foo,pk"`
	Bar *string `bun:"bar"`
}

INFO saving entities="[{BaseModel:{} Foo:1 Bar:0x1400036ea40} {BaseModel:{} Foo:2 Bar:0x1400036ea60}]"
INFO got entities="[{BaseModel:{} Foo:1 Bar:0x1400036ea40} {BaseModel:{} Foo:2 Bar:0x1400036ea60}]"
INFO saving entities="[{BaseModel:{} Foo:1 Bar:0x1400036ea40} {BaseModel:{} Foo:2 Bar:} {BaseModel:{} Foo:3 Bar:0x1400036ea70}]"
INFO got entities="[{BaseModel:{} Foo:1 Bar:0x1400036ea40}]"
Error: Not equal: expected: "v1" actual: "v3"


There are 2 issues revealed here. One is quite clear:

The default behaviour of an insert should be to leave the input unchanged, but in the example above, the input is partially mutated in a way which is not consistent with either Returning("NULL") nor Returning("*"). Specifically: the 2 entries which are unchanged are removed from the input, and the remaining entry is a combination of the first value's key and the updated value's data.

Specifying Returning("NULL") explicitly fixes this, but should not be required.


The second issue is that if the data returned by an INSERT ... RETURNING is not in the same order as the original data (which can happen due to rows being omitted, and also due to the order simply being different, as the order is not guaranteed to match), values are mutated to match the order of the output instead of being matched by their primary key before being mutated. This can lead to surprising results, especially if pointers are involved at any level of the input data (as they will be mutated despite the returned data appearing to be consistent with the sent data).

@david-bezero david-bezero changed the title INSERT unexpectedly mutates input when Returning is not specified INSERT sometimes unexpectedly mutates input when Returning is not specified Oct 15, 2024
@j2gg0s j2gg0s self-assigned this Oct 30, 2024
@j2gg0s
Copy link
Collaborator

j2gg0s commented Oct 30, 2024

  1. Postgresql's RETURNING only return rows be affected.
  2. bun's Return("*") enable RETURNING, Return("NULL") disable RETURNING
  3. When RETURNING is enabled, bun will update the returned data to dest
  4. You can provide dest in method Scan
  5. If you dont provide dest, or if you use the method Exec, bun will use Model as default dest

I don't think point 5 is entirely corret, but it's difficult to change it as this point. It will breaking api.
I suggest that if you want to enable RETURNING, use method Scan

@david-bezero
Copy link
Author

My understanding of the issue has improved since I wrote the report above, so to clarify the exact bug here:

By default (if Returning is not explicitly specified), Bun will try to figure out which columns might be auto-generated (seems to be by checking for places where we insert nils?). The exact mechanism for this isn't important here; the important thing is that Bun ends up writing a SQL statement which has RETURNING <some but not all columns>. The bug also applies if a user explicitly specifies Returning with a subset of columns.

As you note: Postgres only returns rows that were actually touched by the statement, meaning the returned rows do not necessarily match up with the sent rows.

It seems that Bun's intended behaviour is to augment the given records with the generated values from the database, but this is where the bug lies: Bun makes a fundamental assumption that the rows will match up exactly, and does not attempt to identify which input record each output row refers to. This is fine (ish) if it's returning all columns (because the resulting records are still internally consistent, though the pointer issues can still apply). But it's a big problem when returning only some columns, because the records end up being a mix of two unrelated records.

It seems the most direct fix would be for Bun to find the matching input record for each returned row to update when using Returning (and leave other records from the input which do not appear in the output unchanged). In cases where columns marked as pk have been given, this is easy to do. In cases where the primary key itself is generated, this is more difficult and perhaps not possible to fix in general, but it is also rarer to specify ON CONFLICT when not setting the primary key explicitly.

The "best" non-breaking fix might be to change the default to return every column, and not attempt to augment the provided records but instead replace them entirely (i.e. truncate the list then add items as if we had performed a Scan). For most users this will be invisible (if slightly less efficient), while entirely avoiding this whole class of error. Users who want greater efficiency can specify the columns to return explicitly, and with knowledge of the potential risks.

Though I do think a better default (albeit breaking) would be to return no columns unless the user has explicitly requested them.

Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. If there is no update within the next 7 days, this issue will be closed.

@vmihailenco
Copy link
Member

The "best" non-breaking fix might be to change the default to return every column, and not attempt to augment the provided records but instead replace them entirely (i.e. truncate the list then add items as if we had performed a Scan). For most users this will be invisible (if slightly less efficient), while entirely avoiding this whole class of error. Users who want greater efficiency can specify the columns to return explicitly, and with knowledge of the potential risks.

Though I do think a better default (albeit breaking) would be to return no columns unless the user has explicitly requested them.

Agreed with both points, but we probably can only do the non-breaking fix at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants