feat: Implement option 'delete_rows' of argument 'if_exists' in 'DataFrame.to_sql' API. #60376

Open · wants to merge 2 commits into base: main
Conversation

gmcrocetti (Author):
@WillAyd I chose the name delete_rows instead of delete_replace because the current behavior of replace is to recreate the table, as you mentioned, and delete_rows describes exactly what happens behind the scenes.

@erfannariman tagging you due to your help/interest during the lifecycle of this issue.
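The intended semantics can be sketched with the stdlib sqlite3 module (table name and columns here are made up for illustration): today's "replace" drops and recreates the table, losing constraints or indexes defined outside pandas, while the proposed "delete_rows" would only clear the contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A table with a constraint pandas would not recreate on its own
conn.execute("CREATE TABLE frame (id INTEGER PRIMARY KEY, val TEXT NOT NULL)")
conn.executemany("INSERT INTO frame VALUES (?, ?)", [(1, "a"), (2, "b")])

# if_exists="replace" today: DROP TABLE + CREATE TABLE (constraints lost).
# The proposed if_exists="delete_rows" issues a plain ANSI-SQL DELETE instead:
conn.execute("DELETE FROM frame")

rows_left = conn.execute("SELECT COUNT(*) FROM frame").fetchone()[0]
# The table and its original schema survive; only the rows are gone
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'frame'"
).fetchone()[0]
```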

@gmcrocetti gmcrocetti marked this pull request as draft November 20, 2024 14:04
@@ -2698,6 +2700,20 @@ def test_drop_table(conn, request):
assert not insp.has_table("temp_frame")


@pytest.mark.parametrize("conn", sqlalchemy_connectable + ["sqlite_buildin"])
Member:

Can you use the all_connectable here? We should have it work for ADBC as well

gmcrocetti (Author), Nov 20, 2024:

Hey o/. Thanks a ton for the fast review.
I'm unit testing the delete_rows method here. I didn't implement this method for ADBCDatabase, to follow the existing convention.

Do you believe we should implement delete_rows for ADBCDatabase?

Do you see value in unit testing delete_rows, given that its functionality is already covered by test_to_sql_exist?

Member:

Yea generally we strive to have the same behavior across all drivers; we should only really pare back from using all_connectable for very specific use cases

I think the fixture in test_to_sql_exist is a good start, but it is missing assertions around functionality specific to delete_rows; that test would pass if you re-used the same behavior of "replace", but that's not what we are after

gmcrocetti (Author):

Perfect. We're 100% aligned then. That's why I created the unit test.

I'm going to refactor it so the ADBC backend implements this new functionality.

Member:

Note that ADBC doesn't use SQLAlchemy, so I think we will have to write the SQL statement ourselves. I think that's OK given DELETE FROM is ANSI SQL, but we will also want to be careful that we aren't at risk of SQL injection.

gmcrocetti (Author):

I was able to implement the method, but using parameters with the ADBC driver didn't work. While cursor.execute("SELECT ?", parameters=(1,)) works like a charm, the query DELETE FROM ? didn't work at all. I couldn't figure out why; it is related to the internals of the driver. Maybe that's the reason no query in the ADBC backend uses parameters.
I believe we should create a new issue to investigate it.

Member:

I don't think a prepared statement is the right approach; I believe the syntax you are using is also specific to SQLite.

I think you can just use f-strings. Ultimately, the driver should be protecting against SQL injection, so as long as we test that is the case I don't think we need any special handling on our end.

gmcrocetti (Author):

Yeah, just out of curiosity: it is failing for the SQLite database.
[screenshot of the SQLite failure]

> Ultimately, the driver should be protecting against SQL injection, so as long as we test that is the case I don't think we need any special handling on our end

Hmm... I'm not sure about that statement. AFAIK no driver protects against SQL injection out of the box. That responsibility lies with whoever is using it, and we're not protected, since f-strings are used all over the ADBC backend. Might be worth creating a ticket to check this?

Member:

Prepared statements do not bind table names, which is the issue you are running into.

> Hmm... I'm not sure about that statement. AFAIK no driver protects against SQL injection out of the box. That responsibility lies with whoever is using it, and we're not protected, since f-strings are used all over the ADBC backend. Might be worth creating a ticket to check this?

The presence of f-strings is a red herring, as the ADBC drivers are implemented in lower-level languages. Upstream, there are PRs like:

apache/arrow-adbc#643

and

apache/arrow-adbc#1340

that attempt to guard against injection. Are you able to circumvent those safeguards?
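The asymmetry described above can be reproduced with the stdlib sqlite3 driver (illustrative only; the ADBC drivers differ internally, but they share the limitation that placeholders bind values, never identifiers such as table names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# Placeholders bind *values*, so this works:
value = conn.execute("SELECT ?", (1,)).fetchone()[0]

# ...but identifiers cannot be bound: a placeholder where a table
# name belongs is a syntax error, not a missing-parameter error.
try:
    conn.execute("DELETE FROM ?", ("t",))
    failed = False
except sqlite3.OperationalError:
    failed = True
```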

gmcrocetti (Author):

Today I learned 🙇 . Thanks a lot - for real. I'll submit a patch in a few.

assert (
    pandasSQL.to_sql(test_frame1, table_name, if_exists="replace")
    == test_frame1.shape[0]
)
assert pandasSQL.delete_rows(table_name) is None
Member:

Can you add an assertion that the table does not actually get dropped?

gmcrocetti (Author):

For sure, good catch!

assert pandasSQL.delete_rows(table_name) is None
assert count_rows(conn, table_name) == 0


@pytest.mark.parametrize("conn", all_connectable)
Member:

Can you add a test that ensures that this is atomic?

gmcrocetti (Author):

Sorry, the location of this comment got me confused. You mean testing that delete_rows is atomic?

Member:

Yeah; I think you can add a new test (unless you see a good existing one that can be parametrized).

gmcrocetti (Author):

I'll need some help here. I couldn't get hold of the transaction in which delete_rows is running.

Member:

I don't think you need to inspect the transaction; that's probably a few layers removed from where we are operating.

I think you can just write some data, change some integral columns to string, and try to rewrite the same data. I think this should fail, and afterwards you can inspect the table to make sure the data that was successfully written remains
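A sketch of the kind of atomicity check being suggested, using plain sqlite3 rather than the pandas fixtures (the table and values are hypothetical): a failure partway through the delete-and-rewrite must leave the original rows in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# Delete-then-rewrite in one transaction; the second insert violates
# the PRIMARY KEY, so the whole transaction must roll back.
try:
    with conn:  # sqlite3 Connection: commit on success, rollback on error
        conn.execute("DELETE FROM t")
        conn.execute("INSERT INTO t VALUES (2)")
        conn.execute("INSERT INTO t VALUES (2)")  # duplicate -> IntegrityError
except sqlite3.IntegrityError:
    pass

# The original row survived: the partial delete/insert was rolled back
remaining = [row[0] for row in conn.execute("SELECT x FROM t")]
```

If the backend's transaction handling is broken (as observed with ADBC above), `remaining` would come back empty instead.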

gmcrocetti (Author), Nov 21, 2024:

Yes, I was surprised by this behavior too, but learned it is a feature. It is possible to use strict mode when creating the table, but neither pandas nor SQLAlchemy supports that as of now.

[screenshots of SQLite flexible-typing documentation]

Other than that, I've been scratching my head for the last couple of hours. I was able to assert the atomic behavior for the SQLAlchemy backend but failed miserably for the ADBC ones. Still trying to grok what's going on behind the scenes. Running the exact same code, ADBC does not roll back on failure; we end up with an empty table. Any hunch?
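The flexible typing under discussion is easy to reproduce with the stdlib sqlite3 module; in SQLite, a declared column type is only an affinity, not a constraint, unless the table is created as STRICT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# SQLite happily stores a TEXT value in an INTEGER column; the value
# keeps its own storage class rather than being rejected or coerced.
conn.execute("INSERT INTO t VALUES ('not a number')")
stored_type = conn.execute("SELECT typeof(x) FROM t").fetchone()[0]
```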

Member:

Thanks for the clarification on SQLite. So if you were to inspect the type of the column after performing a DELETE operation, would it retain the data type prior to the DELETE or what you append after? I suppose there's some cause for concern there, as this DELETE / Append might introduce different behaviors across different databases, which is something we need to avoid.

Not sure about the ADBC portion, but sounds like the transaction management in the implementation is off

gmcrocetti (Author), Nov 21, 2024:

> Thanks for the clarification on SQLite. So if you were to inspect the type of the column after performing a DELETE operation, would it retain the data type prior to the DELETE or what you append after? I suppose there's some cause for concern there, as this DELETE / Append might introduce different behaviors across different databases, which is something we need to avoid.

Yes, it would retain the original type. This affects all databases with dynamic typing (SQLite is the only one I know of). Are you thinking of adding safeguards to both delete and append to avoid that? Maybe we should document that behavior instead of blocking something a user might want, since it is provided by the database itself?

> Not sure about the ADBC portion, but sounds like the transaction management in the implementation is off

Yes, and this is what is bothering me. I'll push the latest changes; it would be great if you could take a look.

Member:

Hmm, if we can't get consistent behavior I'm not sure it's worth pursuing this here. It is just a small convenience over the user calling DELETE themselves, and for us to take on and manage expectations around diverging database behavior would be pretty far out of scope for the pandas project.

gmcrocetti (Author):

Hello again o/,

Tests are working now.

> Hmm if we can't get consistent behavior I'm not sure its worth pursuing here.

Pandas is consistent, whether using append (prone to the same problem) or delete_rows. The name delete_rows implies we're not going to change the schema of the table. If the database you're using supports such "non-standard" behavior (appending or deleting rows with different types), that is up to you to know; pandas should not interfere in that matter.

> This is just a small convenience of the user calling DELETE themselves

I do agree, and we're consistent on that matter: we delete all rows of the table in a transaction, no matter the database, and then insert the new records.

@WillAyd WillAyd added the IO SQL to_sql, read_sql, read_sql_query label Nov 20, 2024
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 2 times, most recently from b1c0e03 to 3c33249 Compare November 21, 2024 18:39
@gmcrocetti gmcrocetti marked this pull request as ready for review November 22, 2024 12:26
Labels
IO SQL to_sql, read_sql, read_sql_query

Successfully merging this pull request may close these issues.

ENH: DataFrame.to_sql with if_exists='replace' should do truncate table instead of drop table
2 participants