frictionlessdata#1622: Fix issues with nested context manager calls
# Summary

## Problem statement

The `Resource` class is also a [Context Manager](https://docs.python.org/3/reference/datamodel.html#context-managers). That is, it implements the
`__enter__()` and `__exit__()` methods to allow the use of
`with Resource(...)` statements.

Prior to this PR, there was no limit on nesting `with` statements on
the same `Resource`. This caused problems because, while the second
`__enter__()` tolerated the `Resource` already being open, the first
`__exit__()` to run (the inner one) would `close()` the `Resource` while
the outer context still expected it to be open.

This would cause errors like `ValueError: I/O operation on closed file`,
or an iterator that appeared to start partway through a file rather than
at the beginning, and other similar behaviour depending on the exact
locations of the nested contexts.
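The failure mode can be sketched with a minimal stand-in class. This is not the real `Resource` implementation, just the same pre-fix `__enter__`/`__exit__` pattern ("open if closed" on enter, "always close" on exit):

```python
# Minimal stand-in mimicking the pre-fix pattern; not the real Resource class.
class FileLike:
    def __init__(self):
        self.closed = True

    def __enter__(self):
        if self.closed:
            self.closed = False  # "open" the underlying file
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True  # "close" it unconditionally


f = FileLike()
with f:
    with f:  # second __enter__ finds f already open, so it does nothing...
        pass
    # ...but its __exit__ has closed f, even though the outer context
    # still expects it to be open.
    assert f.closed is True
```

Any read attempted in the outer context after the inner block fails, because the inner exit already closed the shared state.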

This was made more complex because these `with` statements were often
far removed from each other in the code, hidden behind iterators driven
by generators, and so on. The behaviour could also vary with the number
of rows read, the type of `Resource` (local file vs. inline, etc.), the
different steps in a pipeline, and more.

All this meant that the problem was rare, hard to reduce down to an
obvious reproduction case, and not realistic to expect developers to
understand while developing new functionality.
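As a hypothetical illustration of how the nesting hides behind generators (again using a stand-in class, not the real library code): a generator enters its `with` lazily on first iteration, so the two contexts never appear together in the source.

```python
class FileLike:
    """Stand-in resource: __enter__ opens if closed, __exit__ always closes."""

    def __init__(self):
        self.closed = True

    def __enter__(self):
        if self.closed:
            self.closed = False
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True


def row_stream(resource):
    # The generator holds a context open across yields.
    with resource:
        for value in ("row-1", "row-2"):
            if resource.closed:
                raise ValueError("I/O operation on closed file")
            yield value


f = FileLike()
rows = row_stream(f)
next(rows)  # starts the generator, which enters its own `with` on f
with f:     # looks harmless, but silently nests with the generator's context
    pass    # ...and exiting here closes f under the generator's feet
try:
    next(rows)
except ValueError as error:
    message = str(error)
print(message)  # I/O operation on closed file
```

The `with f:` block and the generator's `with resource:` can live in entirely different modules, which is why the bug was so hard to reproduce.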

## Solution

This PR prevents nested contexts from being created by raising an
exception when the second, nested, `with` is attempted. This means that
code that risks these issues can be quickly identified and resolved
during development.

The best way to resolve it is to use `Resource.to_copy()` so that the
nested `with` acts on an independent view of the same `Resource`, which
is likely what was intended in most cases anyway.
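Under the same stand-in assumptions as above, the copy-based fix looks like this: the copy carries its own open/closed state, so the nested `with` no longer touches the outer context's view.

```python
import copy


class FileLike:
    """Stand-in resource with a to_copy() returning an independent view."""

    def __init__(self):
        self.closed = True

    def to_copy(self):
        fresh = copy.copy(self)
        fresh.closed = True  # the copy starts closed, independent of self
        return fresh

    def __enter__(self):
        self.closed = False
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True


original = FileLike()
with original:
    with original.to_copy() as view:  # nested `with` acts on the copy
        assert view.closed is False
        assert original.closed is False  # original is untouched
    assert original.closed is False  # inner exit closed only the copy
assert original.closed is True
```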

This PR also updates a number of the internal uses of `with` to work
on a copy of the Resource they are passed so that they are independent
of any external code and what it might have done with the Resource prior
to the library methods being called.

## Breaking Change

This is technically a breaking change, as any external code that was
developed using nested `with` statements - possibly deliberately, but
more likely unknowingly, without falling into the error cases - will
have to be updated to use `to_copy()` or similar.

However, the library functions have all been updated in a way that
doesn't change their signature or their expected behaviour as documented
by the unit tests. All pre-existing unit tests pass with no changes,
and the unit tests added for the updated behaviour do not require any
unusual constructs. It is still possible that some undocumented and
untested side-effect behaviours differ from before, and any code relying
on those may be affected (e.g. `to_petl()` iterators are now independent
rather than interfering with each other).

So it is likely that very few real-world impacts will occur, and the
exception thrown does its best to explain the issue and suggest
resolutions.

# Tests

- All existing unit tests run and pass unchanged
- New unit tests were added to cover the updated behaviour
    - These unit tests were confirmed to fail without the updates
      in this PR (where appropriate).
    - These unit tests now pass with the updated code.
- The original script that identified the issue in frictionlessdata#1622 was run and
  now gives the correct result (all rows appropriately converted and
  saved to file)
richardt-engineb committed Jan 25, 2024
1 parent ae3763d commit 985cfc0
Showing 14 changed files with 274 additions and 29 deletions.
3 changes: 2 additions & 1 deletion frictionless/formats/csv/parser.py
@@ -63,7 +63,8 @@ def write_row_stream(self, source: TableResource):
"wt", delete=False, encoding=self.resource.encoding, newline=""
) as file:
writer = csv.writer(file, **options) # type: ignore
with source:
# Use a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header:
writer.writerow(source.schema.field_names)
for row in source.row_stream:
3 changes: 2 additions & 1 deletion frictionless/formats/excel/parsers/xls.py
@@ -109,7 +109,8 @@ def write_row_stream(self, source: TableResource):
if isinstance(title, int):
title = f"Sheet {control.sheet}"
sheet = book.add_sheet(title)
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header:
for field_index, name in enumerate(source.schema.field_names):
sheet.write(0, field_index, name)
3 changes: 2 additions & 1 deletion frictionless/formats/excel/parsers/xlsx.py
@@ -148,7 +148,8 @@ def write_row_stream(self, source: TableResource):
if isinstance(title, int):
title = f"Sheet {control.sheet}"
sheet = book.create_sheet(title)
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header:
sheet.append(source.schema.field_names)
for row in source.row_stream:
3 changes: 2 additions & 1 deletion frictionless/formats/inline/parser.py
@@ -91,7 +91,8 @@ def read_cell_stream_create(self): # type: ignore
def write_row_stream(self, source: TableResource):
data: List[Any] = []
control = InlineControl.from_dialect(self.resource.dialect)
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header and not control.keyed:
data.append(source.schema.field_names)
for row in source.row_stream:
3 changes: 2 additions & 1 deletion frictionless/formats/json/parsers/json.py
@@ -54,7 +54,8 @@ def read_cell_stream_create(self) -> types.ICellStream:
def write_row_stream(self, source: TableResource):
data: List[Any] = []
control = JsonControl.from_dialect(self.resource.dialect)
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header and not control.keyed:
data.append(source.schema.field_names)
for row in source.row_stream:
3 changes: 2 additions & 1 deletion frictionless/formats/json/parsers/jsonl.py
@@ -46,7 +46,8 @@ def write_row_stream(self, source: TableResource):
control = JsonControl.from_dialect(self.resource.dialect)
with tempfile.NamedTemporaryFile(delete=False) as file:
writer = platform.jsonlines.Writer(file)
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header and not control.keyed:
writer.write(source.schema.field_names)
for row in source.row_stream:
7 changes: 4 additions & 3 deletions frictionless/formats/ods/parser.py
@@ -82,15 +82,16 @@ def write_row_stream(self, source: TableResource):
file.close()
book = platform.ezodf.newdoc(doctype="ods", filename=file.name)
title = f"Sheet {control.sheet}"
# Get size
with source:
# Get size. Use a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
row_size = 1
col_size = len(source.schema.fields)
for _ in source.row_stream:
row_size += 1
book.sheets += platform.ezodf.Sheet(title, size=(row_size, col_size))
sheet = book.sheets[title]
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header:
for field_index, name in enumerate(source.schema.field_names):
sheet[(0, field_index)].set_value(name)
3 changes: 2 additions & 1 deletion frictionless/formats/yaml/parser.py
@@ -52,7 +52,8 @@ def read_cell_stream_create(self) -> types.ICellStream:
def write_row_stream(self, source: TableResource):
data: List[Any] = []
control = YamlControl.from_dialect(self.resource.dialect)
with source:
# Write from a copy of the source to avoid side effects (see #1622)
with source.to_copy() as source:
if self.resource.dialect.header and not control.keyed:
data.append(source.schema.field_names)
for row in source.row_stream:
52 changes: 50 additions & 2 deletions frictionless/resource/resource.py
@@ -238,6 +238,7 @@ def __attrs_post_init__(self):
# Internal
self.__loader: Optional[Loader] = None
self.__buffer: Optional[types.IBuffer] = None
self.__context_manager_entered: bool = False

# Detect resource
system.detect_resource(self)
@@ -257,11 +258,58 @@ def __attrs_post_init__(self):
# TODO: shall we guarantee here that it's at the beginning of the file?
# TODO: maybe it's possible to do type narrowing here?
def __enter__(self):
if self.closed:
self.open()
"""
Enters a context manager for the resource.
We need to be careful with contexts because they open and close the Resource
(and thus any underlying files) and we don't want to close a file that is
being used somewhere higher up the call stack.
e.g. if nested contexts were allowed then:
with Resource("in.csv") as resource:
with resource:
# use resource
resource.write("out.csv")
would result in errors because the second context would close the file
before the write happened. While the above code is obvious, similar
things can happen when composing steps in pipelines, calling petl code etc.
where the various functions may have no knowledge of each other.
See #1622 for more details.
So we only allow a single context to be open at a time, and raise an
exception if nested context is attempted. For similar reasons, we
also raise an exception if a context is attempted on an open resource.
The above code can be successfully written as:
with Resource("in.csv") as resource:
with resource.to_copy() as resource2:
# use resource2
resource.write("out.csv")
which keeps resource and resource2 as independent views on the same file.
Note that if you absolutely need to use a resource in a manner where you
don't care if it is "opened" multiple times and closed once then you
can directly use `open()` and `close()` but you also become responsible
for ensuring the file is closed at the correct time.
"""
if self.__context_manager_entered:
note = "Resource has previously entered a context manager (`with` statement) and does not support nested contexts. To use in a nested context use `to_copy()` then use the copy in the `with`."
raise FrictionlessException(note)
if not self.closed:
note = "Resource is currently open, and cannot be used in a `with` statement (which would reopen the file). To use `with` on an open Resource, use `to_copy()` then use the copy in the `with`."
raise FrictionlessException(note)

self.__context_manager_entered = True

self.open()
return self

def __exit__(self, type, value, traceback): # type: ignore
# Mark the context manager as closed so that sequential contexts are allowed.
self.__context_manager_entered = False
self.close()

@property
10 changes: 7 additions & 3 deletions frictionless/resources/table.py
@@ -254,7 +254,8 @@ def __open_lookup(self):
self.__lookup[source_name][source_key] = set()
if not source_res:
continue
with source_res:
# Iterate on a copy to avoid side effects (see #1622)
with source_res.to_copy() as source_res:
for row in source_res.row_stream: # type: ignore
cells = tuple(row.get(field_name) for field_name in source_key) # type: ignore
if set(cells) == {None}: # type: ignore
@@ -633,12 +634,15 @@ def from_petl(view: Any, **options: Any):

def to_petl(self, normalize: bool = False):
"""Export resource as a PETL table"""
resource = self.to_copy()
# Store a copy of self to avoid side effects (see #1622)
self_copy = self.to_copy()

# Define view
class ResourceView(platform.petl.Table): # type: ignore
def __iter__(self): # type: ignore
with resource:
# Iterate over a copy of the resource so that each instance of the iterator is independent (see #1622)
# If we didn't do this, then different iterators on the same table would interfere with each other.
with self_copy.to_copy() as resource:
if normalize:
yield resource.schema.field_names
yield from (row.to_list() for row in resource.row_stream)
5 changes: 3 additions & 2 deletions frictionless/steps/table/table_normalize.py
@@ -24,11 +24,12 @@ class table_normalize(Step):
# Transform

def transform_resource(self, resource: Resource):
current = resource.to_copy()
resource_copy = resource.to_copy()

# Data
def data(): # type: ignore
with current:
# Yield from a copy to avoid side effects (see #1622)
with resource_copy.to_copy() as current:
yield current.header.to_list() # type: ignore
for row in current.row_stream: # type: ignore
yield row.to_list() # type: ignore
32 changes: 20 additions & 12 deletions frictionless/validator/validator.py
@@ -94,10 +94,6 @@ def validate_resource(
errors: List[Error] = []
warnings: List[str] = []

# Prepare checklist
checklist = checklist or Checklist()
checks = checklist.connect(resource)

# Validate metadata
try:
resource.to_descriptor(validate=True)
@@ -119,13 +115,20 @@
try:
resource.open()
except FrictionlessException as exception:
resource.close()
return Report.from_validation_task(
resource, time=timer.time, errors=exception.to_errors()
)
finally:
# Always close the resource if we opened it to avoid side effects
resource.close()

# Validate row data
# Run the per-row validation against a copy of the resource to avoid side effects (see #1622)
with resource.to_copy() as resource_copy:
# Prepare checklist, and connect it to the resource copy
checklist = checklist or Checklist()
checks = checklist.connect(resource_copy)

# Validate data
with resource:
# Validate start
for index, check in enumerate(checks):
for error in check.validate_start():
@@ -135,20 +138,20 @@
errors.append(error)

# Validate file
if not isinstance(resource, platform.frictionless_resources.TableResource):
if resource.hash is not None or resource.bytes is not None:
helpers.pass_through(resource.byte_stream)
if not isinstance(resource_copy, platform.frictionless_resources.TableResource):
if resource_copy.hash is not None or resource_copy.bytes is not None:
helpers.pass_through(resource_copy.byte_stream)

# Validate table
else:
row_count = 0
labels = resource.labels
labels = resource_copy.labels
while True:
row_count += 1

# Emit row
try:
row = next(resource.row_stream) # type: ignore
row = next(resource_copy.row_stream) # type: ignore
except FrictionlessException as exception:
errors.append(exception.error)
continue
@@ -189,6 +192,11 @@ def validate_resource(
if checklist.match(error):
errors.append(error)

# Update the stats in the base resource with those from the copy
# Note that this mutation of the base resource is an expected result of the validation,
# but depending on what other code does with the resource, they may be overwritten.
resource.stats = resource_copy.stats

# Return report
return Report.from_validation_task(
resource, time=timer.time, labels=labels, errors=errors, warnings=warnings
102 changes: 102 additions & 0 deletions tests/resource/test_context_manager.py
@@ -0,0 +1,102 @@
from frictionless import Resource, FrictionlessException
import pytest

# Test that the context manager implementation works correctly

# As per PEP-343, the context manager should be a single-use object (like files)
# See https://peps.python.org/pep-0343/#caching-context-managers


def test_context_manager_opens_resource():
with Resource("data/table.csv") as resource:
assert resource.closed is False


def test_context_manager_closes_resource():
with Resource("data/table.csv") as resource:
pass
assert resource.closed is True


def test_context_manager_returns_same_resource():
resource = Resource("data/table.csv")
with resource as context_manager_return_value:
assert resource == context_manager_return_value


def test_nested_context_causes_exception():
with pytest.raises(FrictionlessException):
# Create nested with statements to test that we can't open
# the same resource twice via context managers
with Resource("data/table.csv") as resource:
with resource:
pass


def test_resource_copy_can_use_nested_context():
# Create nested with statements to test that we can still open
# the same resource twice via context if we copy the resource
# before the second `with`
with Resource("data/table.csv") as resource:
copy = resource.to_copy()
with copy:
assert (copy.closed is False)
assert (resource.closed is False)

# Check that the original resource is still open
assert (copy.closed is True)
assert (resource.closed is False)


def test_resource_can_use_repeated_non_nested_contexts():
# Repeat context allowed
resource = Resource("data/table.csv")
with resource:
assert (resource.closed is False)

assert (resource.closed is True)

with resource:
assert (resource.closed is False)
assert (resource.closed is True)


def test_resource_copy_can_use_repeated_context():
# Repeated context with a copy is allowed
resource = Resource("data/table.csv")
copy = resource.to_copy()
with resource:
assert (resource.closed is False)
assert (copy.closed is True)

with copy:
assert (resource.closed is True)
assert (copy.closed is False)


def test_context_manager_on_open_resource_throw_exception():
"""
Using the Resource in a `with` statement after it has been opened will unexpectedly close the resource
at the end of the context. So this is prevented by throwing an exception.
"""
resource = Resource("data/table.csv")
resource.open()
assert (resource.closed is False)
with pytest.raises(FrictionlessException):
with resource:
pass


def test_explicit_open_can_be_repeated():
# Explicit open can be nested
# Note that the first close() call will close the resource, so anyone
# using explicit open() calls must be aware of that.
resource = Resource("data/table.csv")
resource.open()
assert (resource.closed is False)
resource.open()
assert (resource.closed is False)
resource.close()
assert (resource.closed is True)
resource.close()
assert (resource.closed is True)