Issues while updating a dataset (ingesting again updated data files) #80

Open
didierearith opened this issue Aug 4, 2015 · 1 comment
@didierearith

Hi AGDC Team,

While testing WOfS ingestion, I found an issue.

I have downloaded some WOfS files from http://dapds00.nci.org.au/thredds/catalog/fk4/wofs/current/extents into a directory on my machine.
Then I run the ingest command for the first time, e.g.:
agdc/ingest/wofs.py --source /home/adminprod/data1/rs0/tiles/wofs/

Ingestion of the data files completes successfully.

Then I want to test re-ingestion of data that already exists in the Data Cube (the source files have been updated and I want to update my Data Cube).

To do this, I change the modification date of the source files (with the Linux 'touch' command).

The modification time of the source files is now later than the datetime recorded for the dataset in the database.
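The comparison driving re-ingestion can be sketched as follows (a minimal illustration only; `needs_reingest` and the `recorded` timestamp are hypothetical names, not part of the AGDC API):

```python
import os
from datetime import datetime


def needs_reingest(source_path, recorded):
    """Return True if the source file was modified after the
    datetime recorded for the dataset in the database."""
    mtime = datetime.fromtimestamp(os.path.getmtime(source_path))
    return mtime > recorded
```

Running `touch` on a source file bumps its mtime to "now", so this check returns True and the file is picked up again on the next ingest run.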

I run agdc/ingest/wofs.py --source /home/adminprod/data1/rs0/tiles/wofs/ again and get the following exception:

2015-08-04 11:56:02,123 agdc.ingest.tile_contents INFO Tile already in place: '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115_-035_2011-01-10T01-59-19.155557.tif'
2015-08-04 11:56:02,217 agdc.ingest.core INFO Ingestion complete for dataset '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115-035_2011-01-10T01-59-19.155557.tif' in 0:00:00.197192.
Traceback (most recent call last):
File "/home/adminprod/agdc-develop/agdc/ingest/wofs.py", line 97, in
agdc.ingest.run_ingest(WofsIngester)
File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 586, in run_ingest
ingester.ingest(ingester.args.source_dir)
File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 186, in ingest
self.ingest_individual_dataset(dataset_path)
File "/home/adminprod/agdc-develop/agdc/ingest/core.py", line 207, in ingest_individual_dataset
self.tile(dataset_record, dataset)
File "/home/adminprod/agdc-develop/agdc/ingest/pretiled.py", line 312, in tile
dataset_record.store_tiles([tile_contents])
File "/home/adminprod/agdc-develop/agdc/ingest/dataset_record.py", line 238, in store_tiles
return [self.create_tile_record(tile_contents) for tile_contents in tile_list]
File "/home/adminprod/agdc-develop/agdc/ingest/dataset_record.py", line 320, in create_tile_record
size_mb=tile_contents.get_output_size_mb(),
File "/home/adminprod/agdc-develop/agdc/ingest/tile_contents.py", line 174, in get_output_size_mb
return get_file_size_mb(path)
File "/home/adminprod/agdc-develop/agdc/cube_util.py", line 109, in get_file_size_mb
return os.path.getsize(path) // (1024 * 1024)
File "/usr/lib/python2.7/genericpath.py", line 49, in getsize
return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115-035_2011-02-27T01-59-34.560472.tif'
2015-08-04 11:56:02,352 agdc.ingest.core ERROR Unexpected error during path '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115-035_2011-02-27T01-59-34.560472.tif'
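The last frames of the traceback show why the error surfaces: 'os.path.getsize' calls 'os.stat', which raises OSError (errno 2) when the tile file has already been deleted. A defensive version of the size helper might look like this (a sketch only; 'get_file_size_mb_safe' is a hypothetical name, not the project's actual fix):

```python
import os


def get_file_size_mb_safe(path):
    """Return the file size in whole megabytes, or None if the
    file no longer exists (e.g. removed by a concurrent cleanup)."""
    try:
        return os.path.getsize(path) // (1024 * 1024)
    except OSError:
        return None
```

This would avoid the crash, but it only masks the underlying problem of the tile file disappearing mid-ingest.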

After some investigation, I think the issue is due to the fact that the data file is removed in the '__commit' function of the 'collection.py' module, i.e.:

    # Remove tile files just after the commit, to avoid removing
    # tile files when the deletion of a tile record has been rolled
    # back. Again, tile files without records are possible if there
    # is an exception or crash just after the commit.
    #
    # The tile remove list is filtered against the tile create list
    # to avoid removing a file that has just been re-created. It is
    # a bad idea to overwrite a tile file in this way (in a single
    # transaction), because it will be overwritten just before the
    # commit (above) and the wrong file will be in place if the
    # transaction is rolled back.

    tile_create_set = {t.get_output_path()
                       for t in self.tile_create_list}
    for tile_pathname in self.tile_remove_list:
        if tile_pathname not in tile_create_set:
            if os.path.isfile(tile_pathname):
                os.remove(tile_pathname)

To be able to ingest the updated source data files again, I have commented out the 'os.remove' instruction above.
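Commenting out the removal works around the symptom but disables cleanup for all ingesters. A narrower guard would be to skip any path that lies under the source directory, so in-place source files are never deleted. A hypothetical sketch (the real '__commit' does not take a 'source_dir' parameter; this is only an illustration of the idea):

```python
import os


def remove_stale_tiles(tile_remove_list, tile_create_set, source_dir):
    """Remove tile files not in the create set, but never touch
    files that live under the (read-only) source directory."""
    source_dir = os.path.abspath(source_dir)
    for tile_pathname in tile_remove_list:
        if tile_pathname in tile_create_set:
            continue  # just re-created in this transaction
        if os.path.abspath(tile_pathname).startswith(source_dir + os.sep):
            continue  # in-place source file: leave it alone
        if os.path.isfile(tile_pathname):
            os.remove(tile_pathname)
```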

Note that if the source data has not been updated (i.e. the date of the source file equals the date of the database dataset), there is no issue.

Note that if I run the ingestion again, the issue doesn't always occur on the same file: sometimes it's the first file, sometimes the nth file.

@jeremyh jeremyh self-assigned this Aug 5, 2015
jeremyh commented Aug 5, 2015

Thanks Didier.

We hit this bug last week ourselves in the development code – the overlap cleaner identified the second tile as redundant, which for other ingesters implies tile removal, and this was incorrectly running during WOfS ingestion. The WOfS ingester should be runnable with read-only access to its inputs (which is how we're running it), so any file modification is a serious bug.

Try updating to the latest version of the develop branch and retesting.
