Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Seed data for DbContainer #541

Open
OverkillGuy opened this issue Apr 16, 2024 · 12 comments
Open

Feature: Seed data for DbContainer #541

OverkillGuy opened this issue Apr 16, 2024 · 12 comments

Comments

@OverkillGuy
Copy link
Contributor

OverkillGuy commented Apr 16, 2024

What are you trying to do?

In order to do meaningful tests using my fresh database, I wish to load arbitrary SQL scripts from my DbContainer-s.

I can do (and have done) this per-project, in test folders, but believe there's value in allowing an arbitrary list of SQL scripts to be run before the testcontainer is yielded, and making this a pytest fixture to reuse.

I do this usually by first passing a path to a folder with scripts, volume-mount it to /seeds path in the database container, and also pass a list of script files in that folder to be executed. Then before yielding the ready db-container, I loop over each script filename and do container.exec_run(["mysql", "-e" f"source /seeds/{script}").

The list of scripts means we can feed both schema first, then arbitrary sample data files.

I suggest a new DbContainer._seed() method, defaulting to raise NotImplementedError, to be overriden per database, allowing a new optional parameter seed.

Each database then just has to define their _seed() function to the specific way to exec_run the scripts (psql for Postgres, mysql for MySQL CLI...)

Example from #542 implementation draft:

>>> import sqlalchemy
>>> from testcontainers.mysql import MySqlContainer
>>> seed_data = ("db/", ["schema.sql", "data.sql"])
>>> with MySqlContainer(seed=seed_data) as mysql:
...     engine = sqlalchemy.create_engine(mysql.get_connection_url())
...     with engine.begin() as connection:
...         query = "select * from stuff"  # Can now rely on schema/data
...         result = connection.execute(sqlalchemy.text(query))
...         first_stuff, = result.fetchone()

Why should it be done this way?

I believe most users of testcontainers, beyond just testing connectivity with a database, want a simple-to-reproduce test environment with realistic datastructure and mock data. This is currently annoying to do locally, and can be upstreamed into a simple-looking seed parameter, which could enable the feature for most databases, by pushing it into DbContainer itself.

See #542 for a draft implementation using MySQL as target (generalizeable to others of course)

@alexanderankin
Copy link
Member

how does this compare to this functionality in the mysql and postgres images?

@OverkillGuy
Copy link
Contributor Author

how does this compare to this functionality in the mysql and postgres images?

Good shout, I didn't know these existed (there goes my evening, reading the manual like I should have done first thing)

Let's say, at best, this proposed seed option can get reduced to just the local path, and we mount it to those standard /docker-entrypoint-initdb.d/ folders, simplifying most of the implementation
This can harmonize the feature for images that support it?

Or we could just scrap the feature request altogether because it exists already as part of most images, haha. Either way works for me, the feature gap itself is solved by the links provided.

@alexanderankin
Copy link
Member

i think theoretically it could be a neat thing to highlight the feature in the library images, but since it seems like we are now discussing quite a different implementation than proposed in the PR, im going to close the PR

@OverkillGuy
Copy link
Contributor Author

OverkillGuy commented Apr 21, 2024

Makes sense!

Just to be clear, the only convenient way I see of exposing the /docker-entrypoint-initdb.d/ folder is to mount it as .with_volume_mapping() before the container is created in docker, which requires a change in the DbContainer derivatives themselves?

The less convenient way that I don't recommend is to have some kind of Dockerfile with:

FROM somedatabaseimage
COPY db/seeds/ /docker-entrypoint-initdb.d/

and then using that image? That doesn't feel very convenient, building an intermediary image we don't care about (long term) just to get this one test to work?

With that in mind, I will explore a similar seed parameter as before, but with the intent to mount a volume to that well known /docker-entrypoint-initdb.d/ folder, during the DbContainer creation, but of course I welcome different solutions.

I'm aiming for:

with MysqlContainer(seed="db/seeds/"):
# equivalent to:
container.with_volume_mapping("db/seeds/", "/docker-entrypoint-initdb.d/")

I'll need to confirm if this folder works for ALL DbContainer variants (in which case it's nice to move it to that level) or if it's an image-specific thing, in which case I leave it to the mysql or postgres level.

@OverkillGuy
Copy link
Contributor Author

OverkillGuy commented Apr 21, 2024

I'm aiming for:

with MysqlContainer(seed="db/seeds/"):
# equivalent to:
container.with_volume_mapping("db/seeds/", "/docker-entrypoint-initdb.d/")

Implemented in #552 for MySQL, to further this discussion with specific example.
If that's deemed useful, I can do the same for postgres and then look for other images supporting this folder structure.

Edit to add: Note I'm not attached to any of these PRs, so feel free to chuck them out if that's not useful. I'm just happy to get familiar with the innards of this codebase for my own future uses as I explore this feature request, hoping the PRs make the discussion easier.

@alexanderankin
Copy link
Member

i think theoretically there is a preference for is for using transferable over volume mapping, because volume mapping only works locally (local to the docker engine), but we dont have that in python implementation yet - it looks something like this - theoretically my implementation could be even more tied to tar equivalent and also there's an even more concise version of "copy file to container" algorithm floating around in the codebase already, like in the kafka module. havent had time to really think about this too much but i approved your run and it looks like all the checks passed.

@OverkillGuy
Copy link
Contributor Author

OverkillGuy commented Apr 26, 2024

So, I was curious about put_archive and the "transferable" system, so I implemented it experimentally, and it didn't work:
The files did make it to the folder, but the tables weren't loaded in any of the tests, which I can see by running docker exec on breakpoints.

Can't prove it, but I think that since the archive can only be loaded once the container has started (compared to volume mounts which become part of the creation of container), the mysql server is no longer looking at the /docker-entrypoint-initdb.d/ folder by the time data gets in, so the test table stays empty and fails test.

Attached is the patch (applies on top of the linked PR), in case it's useful anyway?

Patch: transferring using tar archives. Click to expand
From ac0fc34717c27b53d3f291ca59b3218d2036c762 Mon Sep 17 00:00:00 2001
From: Jb DOYON <[email protected]>
Date: Fri, 26 Apr 2024 02:06:06 +0100
Subject: [PATCH] Experiment with transfering via tar archives

---
 core/testcontainers/core/generic.py            |  4 ++++
 modules/mysql/testcontainers/mysql/__init__.py | 14 +++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/core/testcontainers/core/generic.py b/core/testcontainers/core/generic.py
index 515c283..23c24ea 100644
--- a/core/testcontainers/core/generic.py
+++ b/core/testcontainers/core/generic.py
@@ -71,7 +71,11 @@ class DbContainer(DockerContainer):
         self._configure()
         super().start()
         self._connect()
+        self._transfer_seed()
         return self
 
     def _configure(self) -> None:
         raise NotImplementedError
+
+    def _transfer_seed(self) -> None:
+        pass
diff --git a/modules/mysql/testcontainers/mysql/__init__.py b/modules/mysql/testcontainers/mysql/__init__.py
index e95b87d..c7ed97b 100644
--- a/modules/mysql/testcontainers/mysql/__init__.py
+++ b/modules/mysql/testcontainers/mysql/__init__.py
@@ -11,7 +11,10 @@
 #    License for the specific language governing permissions and limitations
 #    under the License.
 import re
+import tarfile
+from io import BytesIO
 from os import environ
+from pathlib import Path
 from typing import Optional
 
 from testcontainers.core.generic import DbContainer
@@ -85,7 +88,7 @@ class MySqlContainer(DbContainer):
         if self.username == "root":
             self.root_password = self.password
         if seed is not None:
-            self.with_volume_mapping(seed, "/docker-entrypoint-initdb.d/")
+            self.seed = seed
 
     def _configure(self) -> None:
         self.with_env("MYSQL_ROOT_PASSWORD", self.root_password)
@@ -105,3 +108,12 @@ class MySqlContainer(DbContainer):
         return super()._create_connection_url(
             dialect="mysql+pymysql", username=self.username, password=self.password, dbname=self.dbname, port=self.port
         )
+
+    def _transfer_seed(self) -> None:
+        src_path = Path(self.seed)
+        dest_path = "/docker-entrypoint-initdb.d/"
+        with BytesIO() as archive, tarfile.TarFile(fileobj=archive, mode="w") as tar:
+            for filename in src_path.iterdir():
+                tar.add(filename.absolute(), arcname=filename.relative_to(src_path))
+            archive.seek(0)
+            self.get_wrapped_container().put_archive(dest_path, archive)
-- 
2.34.1

Curious if you have an idea on how to work around that.

EDIT: Wrong again, it's because I waited to do the archive transfer till AFTER the connect, not before, which waits for container ready, and moving it before makes it work again. Updated #552 to use this new system!

@alexanderankin
Copy link
Member

alexanderankin commented Apr 26, 2024 via email

@OverkillGuy
Copy link
Contributor Author

Small bump:
Implementation of the proof of concept of MySQL in #552 is using transferable now (as stated preference) and the feature was agreed to be of (vague hehe) interest.

Should we get into a review phase for this proof of concept PR?

alexanderankin pushed a commit that referenced this issue May 11, 2024
Ref #541.
New capability of "seeding" a db container using image's support for
/docker-entrypoint-initdb.d/ folder.

Using the "transferable" system, borrowed from Kafka. 

Updates DbContainer to have a new (NOOP-default) `_transfer_seed()`
method, run after `_start()` and before `_connect()`, to allow the
folder transfer.

Currently implemented only in MySQL, but extensible to others that use
the `/docker-entrypoint-initdb.d/` system.

---------

Co-authored-by: Jb DOYON <[email protected]>
@OverkillGuy
Copy link
Contributor Author

Wow, thank you for merging that PR! (I expected more of a fight, haha)
I feel like our Postgres friends would be missing out on this, so I'll try to draft a similar PR for pg, will link here.

Can anyone think of other DbContainer derivatives that may benefit from this feature too?

@alexanderankin
Copy link
Member

alexanderankin commented May 11, 2024 via email

@OverkillGuy
Copy link
Contributor Author

OverkillGuy commented May 21, 2024

See #576 which fixes the doctest of mysql, which acted as documentation for the feature, and adds the same feature to postgres.

I'm interested in better documentation for the feature, but I'm not sure how that could look like?
This is a per-image feature worth a paragraph or two, as I've got it set up right now, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants