

Read time from filename #48

Open
wants to merge 3 commits into base: develop

Conversation

jasonmlkang
Some datasets lack a time variable and attributes and only indicate the date and/or time in the filename. In these cases, the ingester needs to be able to extract the time stamp from the filenames according to a new regular expression setting in the collections-config ConfigMap.
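As a rough sketch of what the description proposes, a regex-based extraction might look like the following. The pattern, group names, and function are illustrative assumptions, not the actual collections-config schema:

```python
import datetime
import re

# Hypothetical pattern for filenames like "A2011227...." where 2011 is the
# year and 227 is the day of year; the real pattern would come from the
# collections-config ConfigMap.
FILENAME_PATTERN = r"A(?P<year>\d{4})(?P<doy>\d{3})\."

def timestamp_from_filename(filename: str) -> int:
    match = re.search(FILENAME_PATTERN, filename)
    year = int(match.group("year"))
    doy = int(match.group("doy"))
    # Jan 1 is day 1, so offset by doy - 1; force UTC to keep the epoch
    # timestamp timezone-independent.
    dt = datetime.datetime(year, 1, 1, tzinfo=datetime.timezone.utc) \
        + datetime.timedelta(days=doy - 1)
    return int(dt.timestamp())

print(timestamp_from_filename("A2011227.L3m_CHL_chlor_a_4km.nc"))
```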

Contributor

@tloubrieu-jpl tloubrieu-jpl left a comment


Thanks Jason for the pull request.

I have a change request which is not mandatory but which would let me use this new feature immediately: making the datetime pattern configurable. I don't think it is too difficult to add.

Otherwise, I believe it is quite important to make the datetime timezone-aware and force it to UTC.

You'll see my comments in the pull request code changes.

days = int(days)

# datetime(year, 1, 1) already represents day 1 of the year, so subtract 1
# from the day-of-year offset to cancel it out
return int((datetime.datetime(year, 1, 1) + datetime.timedelta(days - 1)).timestamp())
Contributor


I think we want to be careful about the UTC timezone.

For example, with code:

import datetime
import pytz

aware = datetime.datetime(2011, 8, 15, 8, 15, 12, 0, pytz.UTC)

Or with the example in my comment above:

import datetime
import pytz

datetime.datetime.strptime(filename, 'A%Y%j.L3m_CHL_chlor_a_4km.nc').replace(tzinfo=pytz.utc)

That would be simpler than the current version of the code and would save a few lines.

I am not sure the hardcoded -1 days is something we want for all datasets. Maybe an additional optional configuration variable for the collection could give an offset for the min time and an offset for the max time.
In this case it would be -1 for both.
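The suggested optional offsets might be sketched like this; the `time-min-offset-days` / `time-max-offset-days` keys are hypothetical names, not an existing SDAP setting:

```python
import datetime

# Sketch: apply optional per-collection offsets (in days) to the datetime
# parsed from the filename, yielding a min and max time for the granule.
# Both config keys default to 0 when absent.
def apply_offsets(parsed: datetime.datetime, config: dict):
    min_offset = config.get("time-min-offset-days", 0)
    max_offset = config.get("time-max-offset-days", 0)
    time_min = parsed + datetime.timedelta(days=min_offset)
    time_max = parsed + datetime.timedelta(days=max_offset)
    return time_min, time_max
```

With `-1` for both keys, this reproduces the hardcoded one-day shift discussed above.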

I guess the most difficult change in my request is to read from the collection configmap.

Contributor


@jasonmlkang did you have a chance to look at my comments? We have a similar need for the AQACF project, but of course the file naming convention is different, so we would need to discuss how that could be configured.

Contributor


Examples of file names from AQACF project:
V5GL01.HybridPM25.NorthAmerica.200807-200807.nc
V4NA03_PM25_NA_200401_200412-RH35.nc
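For illustration only, the start/end months in these names could be parsed as follows; the pattern is an assumption about the naming convention, not project configuration:

```python
import datetime
import re

# Assumed convention: two YYYYMM tokens separated by "-" or "_" mark the
# start and end of the covered time range.
RANGE_PATTERN = re.compile(r"(\d{6})[-_](\d{6})")

def parse_time_range(filename: str):
    start_str, end_str = RANGE_PATTERN.search(filename).groups()
    parse = lambda s: datetime.datetime.strptime(s, "%Y%m").replace(
        tzinfo=datetime.timezone.utc)
    return parse(start_str), parse(end_str)

start, end = parse_time_range("V4NA03_PM25_NA_200401_200412-RH35.nc")
print(start.isoformat(), end.isoformat())
```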

@jasonmlkang
Author

@tloubrieu-jpl I have implemented a more generic way to parse the datetime from the filename; the commit is 92eb7dc. Please take a look at ParseDatetimeUtilsTests.py for examples of how to use the parser. Note that this is only a first take, not the complete work: there is additional work to fine-tune the time zone handling and to integrate this into the SDAP ingester.

@jasonmlkang
Author

@tloubrieu-jpl Updated the code to use strptime instead of regex
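A strptime-based, per-collection setup could look roughly like the following sketch; the key name `filename-datetime-pattern` is a hypothetical placeholder, not the actual collections-config schema:

```python
import datetime

# Sketch of a config-driven parser: each collection supplies a strptime
# pattern matching its filenames, and the result is forced to UTC as
# requested in the review.
def parse_filename_datetime(filename: str, collection_config: dict) -> datetime.datetime:
    pattern = collection_config["filename-datetime-pattern"]
    return datetime.datetime.strptime(filename, pattern).replace(
        tzinfo=datetime.timezone.utc)

dt = parse_filename_datetime(
    "A2011227.L3m_CHL_chlor_a_4km.nc",
    {"filename-datetime-pattern": "A%Y%j.L3m_CHL_chlor_a_4km.nc"},
)
```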

@tloubrieu-jpl
Contributor

Hi @jasonmlkang, are you still working on this pull request?
I think you were trying to deploy it on your laptop to fully test the new feature.

From reading the code, I believe something is missing around https://github.com/apache/incubator-sdap-ingester/blob/7ee17fdf16201c499f7bd35cf398844f2c70f046/granule_ingester/granule_ingester/processors/reading_processors/GridReadingProcessor.py#L43 so that the time read from the filename is used as a dimension in the dataset.

I am coming back to this pull request now because we need it for the AQACF project, for the GMU dataset.
