Convert DP+ to a CKAN extension #98
Comments
DP+ 1.x should also be more interactive, i.e. users should be able to configure the DP+ job on a per-resource basis. 99% of the time, the default configuration should work fine. But when the user wants to tweak the DP+ configuration for a particular job (e.g., request PII screening with a particular PII screening configuration, add summary stats, check for dupes, etc.), there should be some DP+ knobs exposed in the Datastore tab to change DP+ settings just for that resource. |
Hi @jqnatividad, any idea when it will be ready for release? |
@sagargg it's almost done, any testing on the |
What is the current state of this migration? I am considering setting up datapusher-plus, but would like to avoid setting up an external service if future updates will require it to be a ckanext instead. Could I already use the dev-1.0 branch for this, and if so, how? |
@jhbruhn dev-1.0 can be used to set up DP+ as an extension, and you are more than welcome to give it a try. To use it, just check out that branch and follow the same installation steps as for any other extension. |
Thanks. I installed the extension in my CKAN 2.10 test instance and included it as the "datapusher" plugin (according to line 85 in 1cfba1c). I set CKAN_DATAPUSHER_URL, as CKAN would not start without it, to the value I was using with the original datapusher, but stopped that service.
But when I upload a file and try to push it to the datastore (this also happened automatically), I get this error in the web interface: |
You should start a job worker |
I did (now), but the result is still the same. Could the original, ckan internal datapusher be interfering with datapusher-plus here? Should I open another issue? |
If I provide my original datapusher configuration and service next to the activated datapusher-plus configuration, it uses the classic datapusher to push data to my resources. While a task in the worker queue for datapusher-plus seems to be created, it is not used, and nothing is happening on my |
@jhbruhn in the plugins it should be added as |
That leads to this exception at startup:
Perhaps due to this commit? 1cfba1c The plugin is definitely installed, as some of it is run when including the |
@jhbruhn my bad, I forgot that I had changed that. Are you using a dockerized environment for your setup? |
Yes, this is the CKAN Dockerfile I am using, based on the official image:
FROM ckan/ckan-base:2.10
### DCAT ###
ARG CKANEXT_DCAT_VERSION="1.5.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-dcat.git@v${CKANEXT_DCAT_VERSION}#egg=ckanext-dcat && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-dcat/v${CKANEXT_DCAT_VERSION}/requirements.txt
### LDAP with ignore referrals patch ###
ARG CKANEXT_LDAP_VERSION="add_ignore_referrals"
RUN apk add --no-cache \
    openldap-dev
RUN pip3 install -e git+https://github.com/jhbruhn/ckanext-ldap.git@${CKANEXT_LDAP_VERSION}#egg=ckanext-ldap
### PDFView
ARG CKANEXT_PDFVIEW_VERSION="0.0.8"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pdfview.git@${CKANEXT_PDFVIEW_VERSION}#egg=ckanext-pdfview
### Geoview
ARG CKANEXT_GEOVIEW_VERSION="0.1.0"
RUN pip3 install ckanext-geoview==${CKANEXT_GEOVIEW_VERSION}
### Spatial
ARG CKANEXT_SPATIAL_VERSION="2.1.1"
RUN apk add --no-cache \
    proj-util \
    proj-dev \
    geos-dev
RUN pip3 install -e git+https://github.com/ckan/ckanext-spatial.git@v${CKANEXT_SPATIAL_VERSION}#egg=ckanext-spatial && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-spatial/v${CKANEXT_SPATIAL_VERSION}/requirements.txt
### Hierarchy
ARG CKANEXT_HIERARCHY_VERSION="1.2.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-hierarchy.git@v${CKANEXT_HIERARCHY_VERSION}#egg=ckanext-hierarchy && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-hierarchy/v${CKANEXT_HIERARCHY_VERSION}/requirements.txt
### Pages
ARG CKANEXT_PAGES_VERSION="0.5.2"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pages.git@v${CKANEXT_PAGES_VERSION}#egg=ckanext-pages && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-pages/v${CKANEXT_PAGES_VERSION}/requirements.txt
### Datapusher Plus
ARG CKANEXT_DATAPUSHER_PLUS_VERSION="1cfba1c1373b6890f7338df1fa0f46a28297f0ef"
RUN pip3 install -e git+https://github.com/dathere/datapusher-plus.git@${CKANEXT_DATAPUSHER_PLUS_VERSION}#egg=datapusher-plus && \
    pip3 install -r https://raw.githubusercontent.com/dathere/datapusher-plus/${CKANEXT_DATAPUSHER_PLUS_VERSION}/requirements.txt
# Activate plugins
ENV CKAN__PLUGINS "stats text_view image_view datatables_view video_view audio_view ldap pdf_view dcat dcat_json_interface structured_data envvars datapusher_plus datastore resource_proxy geo_view geojson_view shp_view wmts_view spatial_metadata spatial_query hierarchy_display hierarchy_form hierarchy_group_form pages"
qsv is obviously still missing from that Dockerfile, but it doesn't even get far enough to call datapusher_plus; it is only calling the classic datapusher functions as far as I can see. |
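For anyone debugging a similar setup, here is a minimal sketch for checking which DataPusher-related plugins a CKAN__PLUGINS value actually enables. The helper name `datapusher_plugins` is hypothetical and not part of CKAN or DP+; it only inspects the string and does not tell you which plugin CKAN ends up calling.

```python
# Hypothetical helper: not part of CKAN or datapusher-plus.
DATAPUSHER_NAMES = ("datapusher", "datapusher_plus")


def datapusher_plugins(plugins_value):
    """Return the DataPusher-related names found in a space-separated plugin list."""
    return [name for name in plugins_value.split() if name in DATAPUSHER_NAMES]


# Shortened version of the plugin list from the Dockerfile above
plugins = "stats text_view envvars datapusher_plus datastore resource_proxy pages"

enabled = datapusher_plugins(plugins)
print("DataPusher-related plugins:", enabled)

if len(enabled) > 1:
    print("Both the classic and the plus plugin are listed; check which one handles uploads.")
if "datastore" not in plugins.split():
    print("The datastore plugin is not enabled.")
```

With the list above, only `datapusher_plus` is reported, which matches the Dockerfile and suggests the plugin list itself is not where the classic datapusher calls come from.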
Thanks for the input, I will try to locate the issue with the dockerized setup. Starting DP+ from a source setup was working, and it is being tested on one of our development environments. I hope I can fix this soon and start using DP+ as an extension more often. |
I've just looked into this again, and it seems that I am getting further than before, yay! I now get this error while executing a datapusher-plus job:
I have run the |
@jhbruhn thanks for testing DP+. Yes, there is an issue with the migration script; I think I have a fix for it. Currently it is on |
Thanks for your reply, that branch is indeed working quite well with my limited tests. |
@jqnatividad is there any roadmap for the I have been testing the Please let me know if there is anything else we can help with! |
@pdelboca , it's really on me. DP+ 0.x stalled a bit because the main implementation that was informing its build was paused for a few months. We started up again a month ago and we're currently writing specs for that implementation (we actually just submitted the "DP+ - Data Resource Upload First (DRUF) edition" specs for review by the customer today). The good thing is that @tino097 has been working diligently on DP+ 1.0 regardless, given that ckanext-serviceprovider has been deprecated, and I don't intend to keep working on 0.x. I'll generalize the spec and publish it here, as it will functionally serve as the DP+ 1.0 roadmap. We are breaking down DP+ DRUF development into three phases. Phase 1 will be primarily adapted to the customer's requirements and is targeted for a summer 2024 release. Phase 2 is still being scoped out, but its focus will be on applying DRUF metadata inferencing to spatial data formats; specifying formulas for computed metadata fields in the scheming YAML file; and using the iDataDictionaryForm to implement an expanded data dictionary that stores summary statistics and frequency data for each field. Phase 3 is even more aspirational, but it's looking to do "smarter analysis" using machine learning techniques (perhaps, leveraging @amercader's work on related datasets; and maybe using I'm also thinking of using qsv's
The generated JSON Schema file can be further fine-tuned by the Data Steward, if required, to apply some fairly complex validation rules, as the crate I'm using fully implements JSON Schema validation and is widely deployed and used in production systems in the Rust ecosystem. Currently, And it's quite performant - on benchmarks using a 1 million row sample of NYC's 311 data with a non-trivial JSON Schema, it validates the CSV in 1.3 seconds! And yeah, we would love your help! Perhaps we can carve out some time at CSVConf next month? |
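To make the idea of validating CSV rows against a schema concrete, here is a rough, stdlib-only sketch. It is not qsv and not the Rust crate mentioned above: it hand-checks only three JSON Schema keywords (`required`, `type: integer`, `minimum`), treats an empty cell as a missing value (an assumption made for CSV), and the column names are made up for illustration.

```python
import csv
import io

# Hypothetical schema fragment using a few JSON Schema keywords
SCHEMA = {
    "required": ["id", "complaint_type"],
    "properties": {
        "id": {"type": "integer", "minimum": 1},
        "complaint_type": {"type": "string"},
    },
}


def validate_row(row, schema):
    """Return a list of error messages for one CSV row (a dict of strings)."""
    errors = []
    for name in schema.get("required", []):
        if not row.get(name):  # assumption: an empty cell counts as missing
            errors.append(f"{name}: required value is missing")
    for name, rules in schema.get("properties", {}).items():
        value = row.get(name)
        if not value:
            continue  # absence is handled by the 'required' check above
        if rules.get("type") == "integer":
            try:
                number = int(value)
            except ValueError:
                errors.append(f"{name}: {value!r} is not an integer")
                continue
            if "minimum" in rules and number < rules["minimum"]:
                errors.append(f"{name}: {number} is below minimum {rules['minimum']}")
    return errors


def validate_csv(text, schema):
    """Map 1-based data row numbers to their errors; valid rows are omitted."""
    report = {}
    for index, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        errors = validate_row(row, schema)
        if errors:
            report[index] = errors
    return report


SAMPLE = "id,complaint_type\n1,Noise\n0,Heating\nabc,\n"
report = validate_csv(SAMPLE, SCHEMA)
for row_number, errors in report.items():
    print(row_number, errors)
```

A real deployment would rely on a complete JSON Schema implementation rather than this subset; the sketch only shows the row-by-row shape of the check.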
That's an exciting roadmap @jqnatividad ! Thanks for sharing, and let's definitely catch up at CSVConf :) As a suggestion/thought: in terms of releases, it would be nice to split architectural changes and features into separate releases. It will be easier to develop, easier to deploy, easier to maintain, and will give the community time to adapt. Mixing behavioral changes with architectural ones can be messy because new errors can come from two fronts :). In that sense, an initial release of DP+ v1.0 as a CKAN extension would be great. |
Since DP+ is now a CKAN extension and we no longer need the built-in
2024-07-02 08:49:09,763 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher-css>
2024-07-02 08:49:09,823 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher> |
@jqnatividad we need to consider merging |
DP+ was forked from DP which uses ckanserviceprovider.
ckanserviceprovider was meant to be a general library for CKAN for making web services, and was originally created in 2013.
Right now, DP and DP+ are really the only users of the library, and in the 10 years since, a lot has changed.
DP+ 0.x will continue to use ckanserviceprovider so older CKAN installations using DataPusher can still use it.
However, especially now that CKAN v2.9 uses Python 3.8+ which DP+ requires to call qsv using subprocess options only available in Python 3.7+, v1.x of DP+ will work as a CKAN extension.
This has the following benefits:
This is already WIP in the https://github.com/dathere/datapusher-plus/tree/dev-v1.0 branch and we hope to make the first release in summer 2023.
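As an illustration of the subprocess options referred to above: `capture_output` and `text` were added to `subprocess.run` in Python 3.7. The thread does not say which options DP+ actually uses, so the sketch below is an assumption, and the helper name `run_tool` is hypothetical. It runs the Python interpreter as a stand-in command so that it works without qsv installed.

```python
import subprocess
import sys


def run_tool(args):
    """Run a command and return (stdout, stderr) as text.

    Raises subprocess.CalledProcessError if the command exits with a nonzero status.
    """
    completed = subprocess.run(
        args,
        capture_output=True,  # added in Python 3.7
        text=True,            # added in Python 3.7
        check=True,
    )
    return completed.stdout, completed.stderr


# Stand-in command; with qsv installed, a call might look like
# run_tool(["qsv", "count", "data.csv"]), where data.csv is a hypothetical file.
out, err = run_tool([sys.executable, "-c", "print('rows: 3')"])
print(out.strip())
```

On older Python versions the same effect needs `stdout=subprocess.PIPE`, `stderr=subprocess.PIPE` and `universal_newlines=True`, which is more verbose but equivalent.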