# frictionless transform unhandled exception on 200 line csv, but not 197 line csv #1622
I've been doing some debugging of this and have some ideas what it might be related to, but I'm relatively new to Frictionless so I may be confused about the architecture or the specifics.

### `resource.py` enter and exit are not re-entrant

There seems to be an issue where `Resource.__enter__` and `Resource.__exit__` are not re-entrant:

```python
# TODO: shall we guarantee here that it's at the beggining for the file?
# TODO: maybe it's possible to do type narrowing here?
def __enter__(self):
    if self.closed:
        self.open()
    return self

def __exit__(self, type, value, traceback):  # type: ignore
    self.close()
```

So if you do something like (pseudo-code):

```python
resource = Resource()
with resource as r1:
    # A1
    with resource as r2:
        # B1
    # A2
```

Then the inner `__enter__` at B1 is a no-op (the resource is already open), but the inner `__exit__` closes the resource, so at A2 the outer context is holding a closed resource. If you started calling a generator on the resource (e.g. `row_stream()`) at A1 and resume it at A2, it will eventually fail with "I/O operation on closed file".

### Why it eventually fails (and not immediately)

I believe the number of rows matters because the CSV parser takes a sample, and then iterates through those buffered sample rows before reading from the underlying file again. Rows served from the sample never touch the (now closed) file, so short files complete fine; only once iteration runs past the sample does the parser hit the closed file and throw.

### Other concerns

The comment `# TODO: shall we guarantee here that it's at the beggining for the file?` suggests the behaviour of re-opening a partially read file is already a known open question.

### Proposed solutions

I can see a number of ways to potentially fix this, but given my limited understanding of the architecture and the central nature of `Resource`, I'd like feedback before attempting a PR.

#### Option 1: Make enter and exit reference counting

We could add a reference count so that only the outermost `__exit__` actually calls `close()`. However, I then see other weird behaviour, like our code only starting to output line 297 of the input rather than line 1, which I suspect is due to this "re-open in the current state" question (e.g. the first 100 lines having previously been read into the sample).

#### Option 2: Make enter return a deep clone of the Resource with a reset parser/loader as needed

This would mean that in the above pseudo-code `r1` and `r2` would be independent copies over the same source, so closing one would not affect the other. With enough nesting or chaining, you could also presumably reach a point where you can't open any more connections to the file (particularly remote files).

#### Option 3: something else

Perhaps we are using it wrong, or I'm misunderstanding entirely what is going wrong, so there may be an entirely different and better way to fix it.
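To make the failure mode above concrete, here is a minimal self-contained illustration of the non-re-entrant enter/exit pair (`FakeResource` is an invented stand-in that mimics only the open/closed bookkeeping, not Frictionless code):

```python
class FakeResource:
    """Mimics only the open/closed bookkeeping of the quoted methods."""

    def __init__(self):
        self.closed = True

    def open(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        if self.closed:
            self.open()
        return self

    def __exit__(self, type, value, traceback):
        self.close()


resource = FakeResource()
with resource as r1:        # outer enter: opens the resource
    with resource as r2:    # inner enter: already open, so a no-op
        pass
    # the inner exit has closed the resource, even though the outer
    # context (A2 in the pseudo-code) expects it to still be open
    assert resource.closed
```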
### Further Investigation: Option 2 seems correct, but design goals confirmation needed

#### Summary of proposal(s)

See details below, but from further investigation I believe that Option 2 - returning a clone/copy - is the correct solution, though there are still some decisions on the right way to do it. Essentially we could:

- **Option 2a**: have `__enter__` open and return an independent copy of the Resource
- **Option 2b**: make Resources explicitly single-use and raise an exception on a nested `with`
- **Option 2c**: keep a caching context stack that saves state on `__enter__` and restores it on `__exit__`
I believe that Option 2a is the simplest and best solution, with Option 2b a close second, but I'm keen to get thoughts from more experienced users before making a PR.

#### Simplified error demo for discussion

As discussed in the previous comment, there is a relatively simple reproduction case. The pseudo-code for it is below, and I'm also attaching a working demo (remove the .txt extension, which is only there so GitHub will accept the upload): crash.py.txt

```python
resource = TableResource(file="data.csv")
with resource as r1:
    # A1
    # iterate r1.row_stream() 100 times and print out the row
    with resource as r2:
        # B1
        # iterate r2.row_stream() 50 times and print out the row
    # A2
    # iterate r1.row_stream() another 100 times and print out the row
```

With no changes this crashes at A2, where `r1` resumes iterating after the inner `__exit__` has closed the shared file.

#### Why not Option 1 (reference counting)?

By reference counting, I mean increment a counter whenever `__enter__` is called, and only actually `close()` when the outermost `__exit__` brings the counter back to zero - along the lines of the sketch below.
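For illustration, a reference-counted version of the quoted methods could look roughly like this (`_enter_count` is an invented attribute name; this sketches the idea rather than proposing exact code):

```python
def __enter__(self):
    if self.closed:
        self.open()
    # count how many `with` contexts are currently active on this object
    self._enter_count = getattr(self, "_enter_count", 0) + 1
    return self

def __exit__(self, type, value, traceback):  # type: ignore
    self._enter_count -= 1
    if self._enter_count == 0:
        # only the outermost exit actually closes the resource
        self.close()
```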
This prevents the crash, but the output of the above code would be:

- A1: rows 1-100
- B1: rows 101-150 (`r2` continues where `r1` left off, because both names share one open stream)
- A2: rows 151-250

While this might be understandable in the above file, it often happens deeply nested in the pipeline in ways that wouldn't be expected and give the wrong result - e.g. it happens when pipeline steps internally open the same Resource that the calling code has already opened. This means that if we implement reference counting and run the pipeline above, it does run to completion but skips the first 297 rows in the output (as those rows are consumed setting up the pipeline, and aren't available for yielding by the time the whole generator chain gets row_stream-ed to write the file).

#### Option 2 - cloning the resource - works as expected

##### Option 2a - simply return a copy

Change the `__enter__` method so that it opens and returns an independent copy of the Resource instead of `self`, along the lines of the sketch below.
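A sketch of Option 2a against the methods quoted earlier (`_open_copies` is an invented name for the bookkeeping, assumed to be initialised to an empty list in `__init__`; a real change would also need to handle parser/loader state):

```python
def __enter__(self):
    # each `with` gets an independent copy over the same source
    copy = self.to_copy()
    copy.open()
    # remember the copy so the matching __exit__ can close it;
    # `with` blocks strictly nest, so LIFO pairing is safe
    self._open_copies.append(copy)
    return copy

def __exit__(self, type, value, traceback):  # type: ignore
    self._open_copies.pop().close()
```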
This also prevents the crash, and the output of the above code would be:

- A1: rows 1-100
- B1: rows 1-50 (`r2` is an independent copy, starting again from the top of the file)
- A2: rows 101-200

However, more globally this works if and only if you go through and change all usages of `with resource:` that keep using the original object inside the block. This can be resolved by going through the library and fixing all the cases to use the target bound by the `with ... as` statement. This would also be a breaking change for any external code that uses the original Resource inside a `with` block rather than the returned copy.

I have implemented this fix in a private copy and it does work for the pipeline given at the start, and does return the correct rows (starting at row 1 as expected). So I am confident this can be made to work in a relatively neat PR.

##### Option 2b - make Resources explicitly single-use

We could make Resources explicitly single-use by throwing an exception if `__enter__` is called while the Resource is already open. The benefit is that you get an explicit exception when you do the wrong thing. The downside is that you have to update every occurrence of a nested `with` to work on a copy instead. I have not implemented this, but it is really the same as 2a with the code changes in a different place.

##### Option 2c - caching context stack

You could in theory do something similar to 2a, but keep the copy on an internal stack. Every time `__enter__` is called the current state is pushed, and every `__exit__` pops and restores it, so the enclosing context resumes exactly where it left off. This would make either form of nested use work transparently. I have not implemented this, and I'm somewhat concerned about the scale and complexity of trying to implement an appropriate facade and context stack.

##### Option 3 - something I've missed?

I'd also be very happy to hear that there's something obvious I've missed, and there's a different and simpler way to fix it!

#### Discussion in relation to PEP 343 – The “with” Statement

These issues and possible solutions are somewhat discussed in [PEP 343](https://peps.python.org/pep-0343/), the PEP that defined the `with` statement. We could enforce Resources to be single-use objects - they are essentially generators, which the PEP suggests should be single-use - but that seems to go against the architecture, as there are numerous places where the same Resource is entered more than once. Equally, the PEP does suggest the use of caching contexts as per Option 2c, and something like this is implemented for Decimal precision (see example 8 in https://peps.python.org/pep-0343/#examples). That does however seem a lot harder in the Resource case due to the complexity of Resources.
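For reference, the save/restore pattern from the PEP's Decimal example - which is what Option 2c would be emulating - looks roughly like this (`local_precision` is an invented helper built on the real `decimal` module, shown only to illustrate the pattern):

```python
import decimal
from contextlib import contextmanager

@contextmanager
def local_precision(prec):
    saved = decimal.getcontext()   # cache the enclosing context
    ctx = saved.copy()
    ctx.prec = prec
    decimal.setcontext(ctx)
    try:
        yield ctx
    finally:
        decimal.setcontext(saved)  # restore it, so nesting unwinds cleanly

with local_precision(5):
    with local_precision(2):
        print(decimal.Decimal(1) / 7)  # 0.14
    print(decimal.Decimal(1) / 7)      # 0.14286
```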
# Summary

## Problem statement

The `Resource` class is also a [Context Manager](https://docs.python.org/3/reference/datamodel.html#context-managers). That is, it implements the `__enter__()` and `__exit__()` methods to allow the use of `with Resource(...)` statements.

Prior to this PR, there was no limit on nesting `with` statements on the same `Resource`, but this caused problems: while the second `__enter__()` allowed the `Resource` to already be open, the first `__exit__()` would `close()` the Resource while the higher-level context would expect it to still be open. This would cause errors like "ValueError: I/O operation on closed file", or the iterator would appear to start from part way through a file rather than at the start, and other similar behaviour depending on the exact locations of the nested functions.

This was made more complex because these `with` statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They could also behave differently depending on the number of rows read, the type of Resource (local file vs inline, etc.), the different steps in a pipeline, and so on. All this meant that the problem was rare, hard to reduce to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality.

## Solution

This PR prevents nested contexts being created by throwing an exception when the second, nested, `with` is attempted. This means that code that risks these issues can be quickly identified and resolved during development. The best way to resolve it is to use `Resource.to_copy()` so that the nested `with` is acting on an independent view of the same Resource, which is likely what is intended in most cases anyway.

This PR also updates a number of the internal uses of `with` to work on a copy of the Resource they are passed, so that they are independent of any external code and whatever it might have done with the Resource before the library methods were called.

## Breaking change

This is technically a breaking change, as any external code that was developed using nested `with` statements - possibly deliberately, but more likely unknowingly not falling into the error cases - will have to be updated to use `to_copy()` or similar. However, the library functions have all been updated in a way that doesn't change their signature or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and the unit tests added for the specific updated behaviour do not require any unusual constructs.

It is still possible that some undocumented and untested side-effect behaviours are different than before, and any code relying on those may also be affected (e.g. `to_petl()` iterators are now independent rather than causing changes in each other). So it is likely that very few actual impacts will occur in real-world code, and the exception thrown does its best to explain the issue and suggest resolutions.

# Tests

- All existing unit tests run and pass unchanged
- New unit tests were added to cover the updated behaviour
  - These unit tests were confirmed to fail without the updates in this PR (where appropriate).
  - These unit tests now pass with the updated code.
- The original script that identified the issue in frictionlessdata#1622 was run and now gives the correct result (all rows appropriately converted and saved to file)
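For example, after this change a nested `with` on one instance must become a `with` on a copy (a hypothetical snippet assuming the v5-style `TableResource` API):

```python
from frictionless.resources import TableResource

resource = TableResource(path="data.csv")

with resource as r1:
    # with resource as r2:  # after this PR this raises an exception
    #     ...               # instead of silently closing r1's file
    with resource.to_copy() as r2:  # independent view of the same data
        ...
    ...  # r1 is still open and unaffected here
```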
# Summary

The CI tests identified some issues that don't show up on a normal test run. This commit fixes those issues. It also highlighted that there were numerous areas that didn't have sufficient test coverage for the case where the caller had already opened the resource. The indexer has some notable changes, but the biggest area affected is the parsers when writing from an already-opened source. This commit adds unit tests for the indexer and all the parser formats for this case, and fixes the code to support the lack of nested contexts.

# Tests

- Set up the required databases for CI by copying the commands in the GitHub Actions
- Run `hatch run +py=3.11 ci:test` and ensure all tests pass and coverage remains sufficient
- Run `hatch run test` in case it is different, and ensure all tests pass and coverage remains sufficient

This also means that all linting etc. has been run too.
## Overview

I am finding a very strange error when doing a transform (either in Python code or via the command-line tool). Depending on the size of the input file, the transform either succeeds fine or throws an "I/O operation on closed file" exception. The number of lines required to trigger it seems to vary, even by execution environment.

On an M1 Mac Mini it currently crashes at 198 lines and passes at 197. On a Gitpod instance (Ubuntu) it was around the same yesterday, but today it is more like 150. In our own code it can take 10k+ lines. But there is always a size above which this fails (and a size far short of e.g. `settings.FIELD_SIZE_LIMIT`).

## Example Command Line
Pipeline.json
data.csv
## Sample files

- data.csv
- pipeline.json