
As air quality expert, I would like to check if certain air quality data from the Province of Bolzano is correctly imported in the Open Data Hub #668

Open
rcavaliere opened this issue May 14, 2024 · 26 comments

@rcavaliere
Member

rcavaliere commented May 14, 2024

The BrennerLEC partners have noticed that specifically on the ML5 air quality station of the Province ("A22 Egna - A22, corsia sud km 103") we have frequent data holes, which should not appear.

Check for example: link

Generally, in the open data (period = 3600) the data seem to have timestamp UTC+2, while it should be UTC+1.

Affected Data Collectors to check:
https://github.com/noi-techpark/bdp-commons/tree/main/data-collectors/environment-appa-bz-tenminutes
https://github.com/noi-techpark/bdp-commons/tree/main/data-collectors/environment-appa-bz-opendata
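
For reference, a minimal sketch (hypothetical names, not part of the affected collectors) of how such hourly holes can be detected in a series of timestamps:

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.List;

public class HoleDetector {
    // Given hourly timestamps sorted ascending, report every gap larger than one hour.
    static List<String> findHoles(List<OffsetDateTime> timestamps) {
        List<String> holes = new ArrayList<>();
        for (int i = 1; i < timestamps.size(); i++) {
            Duration gap = Duration.between(timestamps.get(i - 1), timestamps.get(i));
            if (gap.toHours() > 1) {
                holes.add("hole between " + timestamps.get(i - 1) + " and " + timestamps.get(i));
            }
        }
        return holes;
    }
}
```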

@rcavaliere rcavaliere added the bug Something isn't working label May 14, 2024
@rcavaliere
Member Author

@dulvui today I got the information that the Data Provider (APPABZ) has fixed an issue; the data should now always be available as UTC+1. Can you please check?

@dulvui
Contributor

dulvui commented Jun 11, 2024

@rcavaliere For opendata I have now simplified the timestamp conversion, and it should be correct now. The sync triggers every morning at 10, so let's see tomorrow if the issue is solved.
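
As an illustration of what such a simplified conversion can look like (a hedged sketch with hypothetical names, not the actual collector code): relying on the explicit offset in the DATE field avoids any manual UTC+1/UTC+2 arithmetic.

```java
import java.time.OffsetDateTime;

public class TimestampConversion {
    // Convert the endpoint's ISO-8601 DATE string to epoch milliseconds.
    // The explicit offset in the string makes manual timezone math unnecessary.
    static long toEpochMillis(String date) {
        return OffsetDateTime.parse(date).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2024-06-23T12:00:00+01:00"));
    }
}
```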

@dulvui
Contributor

dulvui commented Jun 13, 2024

There is still missing data for some days. I will now set up some logging to better understand whether we lose the data or the API doesn't update it.

@dulvui
Contributor

dulvui commented Jun 24, 2024

@rcavaliere I just saw that there might be a problem with the data provider: the values for Egna are -1, and in that case the values are not valid and get ignored by the data collector.
Here is the URL for the station at Egna: http://dati.retecivica.bz.it/services/airquality/timeseries?station_code=ML5

Here is an example response with -1 values (copied here, since the live response changes if called on another day):

[{"DATE":"2024-06-23T12:00:00+01:00","SCODE":"ML5","MCODE":"CO","TYPE":"1","VALUE":-1},{"DATE":"2024-06-23T12:00:00+01:00","SCODE":"ML5","MCODE":"NO2","TYPE":"1","VALUE":-1}]

I have now added specific logging for the case that the value is -1, so I can see for which stations/data types this happens.
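
For illustration, the skip-and-log behaviour described above could look roughly like this (hypothetical names and record shape, not the actual collector code):

```java
import java.util.List;
import java.util.logging.Logger;
import java.util.stream.Collectors;

public class InvalidValueFilter {
    private static final Logger LOG = Logger.getLogger(InvalidValueFilter.class.getName());

    // Simplified record mirroring the endpoint's JSON fields
    record Measurement(String date, String scode, String mcode, double value) {}

    // -1 marks an invalid / not-yet-validated value and must not be stored
    static List<Measurement> dropInvalid(List<Measurement> measurements) {
        return measurements.stream()
            .peek(m -> {
                if (m.value() == -1) {
                    LOG.warning("Invalid value -1 for station " + m.scode()
                        + ", type " + m.mcode() + " at " + m.date());
                }
            })
            .filter(m -> m.value() != -1)
            .collect(Collectors.toList());
    }
}
```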

@clezag clezag removed their assignment Jul 2, 2024
@rcavaliere
Member Author

@dulvui is there any further news here?

@dulvui
Contributor

dulvui commented Aug 7, 2024

I have now analyzed the data again on analytics and saw that the data holes are nearly always at the same time. They always start at 8:00/9:00 in the morning and stop at 12:00, sometimes even spanning multiple days, but they always stop at 12:00.
I think something is going on with the sensors or the infrastructure sending the data during these periods.

https://analytics.opendatahub.com/#%7B%22active_tab%22:0,%22height%22:%22400px%22,%22auto_refresh%22:false,%22scale%22:%7B%22from%22:1719784800000,%22to%22:1722981600000%7D,%22graphs%22:%5B%7B%22category%22:%22Air%20quality%22,%22station%22:%22ML5%22,%22station_name%22:%22A22%20Egna%20-%20A22,%20corsia%20sud%20km%20103%22,%22data_type%22:%22NO2%20-%20Ossidi%20di%20azoto%22,%22unit%22:%22%5B%C2%B5g/m%C2%B3%5D%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:0%7D%5D%7D

Here I overlaid 2 different sensors, and we can see that there are similar outages, but at different timestamps for the other sensor. This means that the data collector works, and I'm pretty sure there is some problem with the sensors.
https://analytics.opendatahub.com/#%7B%22active_tab%22:0,%22height%22:%22400px%22,%22auto_refresh%22:false,%22scale%22:%7B%22from%22:1719784800000,%22to%22:1722981600000%7D,%22graphs%22:%5B%7B%22category%22:%22Air%20quality%22,%22station%22:%22ML5%22,%22station_name%22:%22A22%20Egna%20-%20A22,%20corsia%20sud%20km%20103%22,%22data_type%22:%22NO2%20-%20Ossidi%20di%20azoto%22,%22unit%22:%22%5B%C2%B5g/m%C2%B3%5D%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:0%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AB3%22,%22station_name%22:%22A22%20Nord-Bx%20sud%20-%20Depuratore,%20Bx%22,%22data_type%22:%22NO2%20-%20Ossidi%20di%20azoto%22,%22unit%22:%22%5B%C2%B5g/m%C2%B3%5D%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:1%7D%5D%7D

@rcavaliere
Member Author

@dulvui CISMA has noticed that this is not always the situation.
For example look at this: https://analytics.opendatahub.com/#%7B%22active_tab%22:0,%22height%22:%22400px%22,%22auto_refresh%22:false,%22scale%22:%7B%22from%22:1722808800000,%22to%22:1723413600000%7D,%22graphs%22:%5B%7B%22category%22:%22Air%20quality%22,%22station%22:%22ML5%22,%22station_name%22:%22A22%20Egna%20-%20A22,%20corsia%20sud%20km%20103%22,%22data_type%22:%22NO2%20-%20Ossidi%20di%20azoto%22,%22unit%22:%22%5B%C2%B5g/m%C2%B3%5D%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:0%7D%5D%7D

We don't have measurements for 10.8. On the other hand, if you look here: https://ambiente.provincia.bz.it/aria/misurazione-attuale-aria.asp?air_actn=4&air_station_code=ML5&air_type=2 the measurements are there! So for some reason we are missing valid data!

I have a doubt: could it be that for some reason we miss reading the endpoint http://dati.retecivica.bz.it/services/airquality/sensors? I think they publish only the latest data there, so if for some reason we don't make an API call, it could be that we miss data. Can you check this?

@dulvui
Contributor

dulvui commented Aug 14, 2024

@rcavaliere But this graph shows the data on a daily basis, and we are importing the hourly data. I think that makes a difference, and if I check here http://dati.retecivica.bz.it/services/airquality/sensors I can find many sensors with value -1.

A possible problem could be that we get the data too early, when it is not ready yet. I checked the cron job and it runs every day at 10:00 UTC, i.e. 12:00 local time at lunchtime, so it could be that the data is not ready at this time every day.
I will move the import to 12:30 in testing, so we can already see after lunch today whether there are differences between prod and test.
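
For illustration, a Spring-style trigger for such a schedule might look like the sketch below (class name is hypothetical, and the actual collector may be configured differently):

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class OpendataSyncJob {

    // 12:30 local time every day (assuming the job runs with Europe/Rome timezone);
    // previously the job ran at 10:00 UTC, i.e. 12:00 local time in summer.
    @Scheduled(cron = "0 30 12 * * *", zone = "Europe/Rome")
    public void sync() {
        // fetch and store the latest validated data here
    }
}
```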

We are using the following endpoints

@rcavaliere
Member Author

@dulvui I think that's the issue: we are making too few API calls, and I think the data is either not there yet or not there anymore. Can we simply set the frequency to 10 minutes? For us it won't change anything: if there is no new data, we don't store anything.

@dulvui
Contributor

dulvui commented Aug 14, 2024

@rcavaliere yes, I'll try that now.

@rcavaliere
Member Author

@dulvui did you manage to find out what happened in this example?

@dulvui
Contributor

dulvui commented Sep 12, 2024

@rcavaliere no, I still don't know why this happens.

@rcavaliere rcavaliere assigned clezag and unassigned dulvui Sep 20, 2024
@clezag
Member

clezag commented Sep 20, 2024

@rcavaliere I'm trying to look into this.
Long term, will this data source be maintained?
This data collector looks relatively straightforward; I think it could be migrated to the new infrastructure, which would also give us more insight into this issue (since we would have all the raw data).

But right now, to me too, it looks like the API is just posting garbage:
e.g. currently, 20.09 5PM https://dati.retecivica.bz.it/services/airquality/timeseries?station_code=ML5&meas_code=NO2&type=1

[
  {
    "DATE": "2024-09-19T16:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 32.504
  },
  {
    "DATE": "2024-09-19T17:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 34.3841
  },
  {
    "DATE": "2024-09-19T18:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 36.2961
  },
  {
    "DATE": "2024-09-19T19:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 38.1125
  },
  {
    "DATE": "2024-09-19T20:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 29.9547
  },
  {
    "DATE": "2024-09-19T21:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 37.5708
  },
  {
    "DATE": "2024-09-19T22:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 30.6239
  },
  {
    "DATE": "2024-09-19T23:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 28.6163
  },
  {
    "DATE": "2024-09-20T00:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 29.8591
  },
  {
    "DATE": "2024-09-20T01:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 21.2551
  },
  {
    "DATE": "2024-09-20T02:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 18.2277
  },
  {
    "DATE": "2024-09-20T03:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 22.2748
  },
  {
    "DATE": "2024-09-20T04:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 21.51
  },
  {
    "DATE": "2024-09-20T05:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 28.7119
  },
  {
    "DATE": "2024-09-20T06:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 32.6315
  },
  {
    "DATE": "2024-09-20T07:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": 32.8864
  },
  {
    "DATE": "2024-09-20T08:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T09:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T10:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T11:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T12:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T13:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T14:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T15:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  }
]

You can clearly see that at 8 AM (as Simon already found) it just stops posting data (the value is -1, which gets ignored).
Did we contact the maintainers about this issue?

@clezag
Member

clezag commented Sep 20, 2024

@rcavaliere
Member Author

@clezag this is because data can be (manually) invalidated by air quality experts, which is why it looks like this. But I remember that we have cases in which the data visualized there is present, while in the Open Data Hub it is not. I can find these examples again, but unfortunately we cannot check what happened in the past...

@clezag
Member

clezag commented Sep 21, 2024

@rcavaliere

[
  {
    "DATE": "2024-09-20T12:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T13:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T14:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T15:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T16:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T17:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T18:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T19:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T20:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T21:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T22:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-20T23:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T00:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T01:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T02:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T03:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T04:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T05:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T06:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T07:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T08:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T09:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T10:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  },
  {
    "DATE": "2024-09-21T11:00:00+01:00",
    "SCODE": "ML5",
    "MCODE": "NO2",
    "TYPE": "1",
    "VALUE": -1
  }
]


It's still like this.
Since we're always only getting the last 24 hours, even if they eventually fill in this hole, we will not get the missing data from yesterday.

@rcavaliere
Member Author

@clezag yes, this is not surprising. We are speaking here of validated data, which is done by humans. People don't work on weekends, so the next update for Friday's data will take place at the beginning of next week. I think it is because of these delays that we somehow miss data...

@clezag
Member

clezag commented Sep 23, 2024

@rcavaliere I think this is our issue.
If you look at analytics now, we've got the data hole, but on their site the data is there.

This is because the endpoint we are using only goes back 24h, and if they take longer than that to validate, we have a hole.

I don't think we can solve this on our end, other than with an additional endpoint or parameter that allows us to go further back than 24h.

@rcavaliere
Member Author

@clezag thanks for checking further! But what does "goes back 24h" mean? Shouldn't we consider the last available data record in the DB and then import the newly available data, sorted by timestamp?

@clezag
Member

clezag commented Sep 24, 2024

@rcavaliere The endpoint we are using always returns the data of the last 24h. If the validation window is longer than that, we never receive the data that is older than 24h.
Their documentation does not mention any parameter to get data further back:
https://data.civis.bz.it/dataset/situazione-dell-aria/resource/2e96c2a2-d5d8-4d6e-99a1-b6dc651d3b9f

@rcavaliere
Member Author

@clezag you are probably right; this could be the reason. Let me evaluate this with the Data Provider...

@sseppi sseppi added the blocked label Oct 4, 2024
@rcavaliere
Member Author

Update: waiting for APPABZ to explain the data inconsistencies found.

@rcavaliere rcavaliere removed the blocked label Oct 7, 2024
@rcavaliere
Member Author

@clezag APPABZ has now published the data of the last 10 days on their endpoint (see e.g. https://dati.retecivica.bz.it/services/airquality/timeseries?station_code=ML5&meas_code=NO2&type=1). Let's see if this will solve the issue!

@clezag
Member

clezag commented Oct 16, 2024

There are still some issues: our graphs in analytics do not reflect the 10-day JSON we get today.

The time series writer API only accepts records if their timestamp is newer than the last one stored. Depending on how the data gets validated, this could be the reason (e.g. if they first release the updated data of Sunday, and afterwards Saturday, we never store Saturday's data). I will look into it.
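
As a sketch of the concern (hypothetical names, not the actual writer API): records older than the newest stored timestamp are simply not accepted, which is why late-validated days can be lost.

```java
import java.time.OffsetDateTime;
import java.util.List;
import java.util.stream.Collectors;

public class WriterOrderingIllustration {
    record Measurement(OffsetDateTime date, double value) {}

    // Illustrates the writer's rule: a record is only accepted if its timestamp
    // is newer than the newest one already stored for that series. A Saturday
    // record released after Sunday's record is therefore never stored.
    static List<Measurement> acceptedBy(OffsetDateTime newestStored, List<Measurement> incoming) {
        return incoming.stream()
            .filter(m -> m.date().isAfter(newestStored))
            .collect(Collectors.toList());
    }
}
```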

@ohnewein ohnewein removed the bug Something isn't working label Oct 18, 2024
@clezag
Member

clezag commented Nov 15, 2024

I've set up a data collector that polls that endpoint (station ML5) every hour; having the full raw data history should help us find out what's going on.
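
A minimal sketch of such a raw poller (hypothetical, not the actual implementation): fetch the endpoint on a fixed interval and archive each response body for later comparison.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

public class RawPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://dati.retecivica.bz.it/services/airquality/timeseries?station_code=ML5"))
            .GET().build();
        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            // Archive each raw response together with its retrieval time
            Files.writeString(Path.of("ml5-" + Instant.now().getEpochSecond() + ".json"), body);
            Thread.sleep(60 * 60 * 1000L); // poll once per hour
        }
    }
}
```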

@clezag
Member

clezag commented Nov 20, 2024

With the raw data available, I checked again, and it turns out it is indeed a bug on our side.
To be more precise, the original problem was the retrospective validation of data, but now that they have increased the timeframe of the webservice from 1 day to 1 week, our data collector turns out to be implemented in a way that cannot handle more than 24h at a time.
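
A hedged sketch of the kind of fix implied here (hypothetical names, not the actual collector code): process the full returned window, filtered and sorted, instead of assuming a single day of data.

```java
import java.time.OffsetDateTime;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MultiDayImport {
    record Measurement(OffsetDateTime date, String scode, String mcode, double value) {}

    // Group the whole returned window (now several days) per station/type,
    // dropping invalid -1 values and sorting chronologically, so that nothing
    // beyond the first 24h is silently dropped.
    static Map<String, List<Measurement>> groupForImport(List<Measurement> all) {
        return all.stream()
            .filter(m -> m.value() != -1)
            .sorted(Comparator.comparing(Measurement::date))
            .collect(Collectors.groupingBy(m -> m.scode() + "/" + m.mcode()));
    }
}
```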
