No handling for encoded URLs #46

Open
KrzysztofMadejski opened this issue Jun 6, 2017 · 1 comment

KrzysztofMadejski commented Jun 6, 2017

I have run into this issue here: https://danepubliczne.gov.pl/dataset/informacja-kwartalna-o-stanie-finansow-publicznych/resource/86454cff-556a-4162-aa65-433158c133f4

The provider has linked the external resource as: http://www.mf.gov.pl/documents/764034/1002163/Informacja+kwartalna++III+kwarta%C5%82+2016+r.. To keep things simple, let's assume the filename is kwarta%C5%82+2016.

The file is saved to disk as is, i.e. as kwarta%C5%82+2016.
Apache then serves it with the percent signs escaped, as kwarta%25C5%2582+2016, while CKAN links to the archived version using the name from the original URL, kwarta%C5%82+2016. That leads to a 404 error on the archived link.
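
To illustrate the mismatch, here is a minimal Python 3 sketch using only the standard library. The variable names are made up for the example, and it only approximates what Apache and CKAN do; it is not taken from either code base.

```python
from urllib.parse import quote, unquote

# Filename written to disk verbatim (the percent signs are literal characters).
stored_name = "kwarta%C5%82+2016"

# A URL that addresses this exact file has to escape each literal "%" as "%25".
url_for_stored_file = quote(stored_name, safe="+")
print(url_for_stored_file)  # kwarta%25C5%2582+2016

# A server receiving the original link decodes it once and therefore looks
# for a file with a different name than the one stored on disk.
requested_file = unquote(stored_name)
print(requested_file)  # kwartał+2016
```

The archived link uses the first form of the name, while the file on disk can only be reached through the second, which is why the request ends in a 404.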

I think we should either decode all incoming URLs (as below) or strip all encoded characters. What do you think?

    # ckanext/archiver/tasks.py:556
    try:
        file_name = parsed_url.path.split('/')[-1] or 'resource'
        file_name = urllib.unquote(file_name) # DECODING ADDED HERE
        file_name = file_name.strip()  # trailing spaces cause problems
        file_name = file_name.encode('ascii', 'ignore')  # e.g. u'\xa3' signs
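
For readers on Python 3, the same steps can be sketched roughly as follows. The function name `proposed_file_name` and the example.org URL are hypothetical and exist only for this illustration; the snippet above remains the actual proposal.

```python
from urllib.parse import unquote, urlparse


def proposed_file_name(url):
    """Derive a filename from a URL using the steps proposed above (hypothetical helper)."""
    parsed_url = urlparse(url)
    file_name = parsed_url.path.split("/")[-1] or "resource"
    file_name = unquote(file_name)  # decode percent-encoded octets once
    file_name = file_name.strip()   # trailing spaces cause problems
    # Drop anything outside ASCII, then return to str for use as a filename.
    file_name = file_name.encode("ascii", "ignore").decode("ascii")
    return file_name


print(proposed_file_name("http://example.org/docs/kwarta%C5%82+2016"))  # kwarta+2016
```

Note that after decoding, the non-ASCII "ł" is dropped by the ASCII step, so the stored name contains neither percent signs nor non-ASCII characters.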
KrzysztofMadejski added a commit to DanePubliczneGovPl/ckanext-archiver that referenced this issue Jun 6, 2017
thorge commented Sep 4, 2023

The archiver extension in CKAN appears to be unintentionally double percent-encoding URLs that are already percent encoded. For instance, a URL path like kwarta%C5%82+2016 is already percent-encoded, but the archiver extension is converting it to kwarta%25C5%2582+2016, causing issues.

According to RFC 3986, section 2.4:

"Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

[...]

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string."

This means that your suggestion of always decoding incoming URLs does not comply with RFC 3986. Instead, the percent character ("%") should be used as an indicator to determine whether decoding needs to be performed.
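
One way to read that suggestion is sketched below in Python 3. The helper `decode_once_if_encoded` and the pattern name are assumptions made for this example and are not part of ckanext-archiver.

```python
import re
from urllib.parse import unquote

# A percent sign followed by two hex digits, i.e. one percent-encoded octet.
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def decode_once_if_encoded(segment):
    """Decode a path segment exactly once, and only if it looks percent-encoded."""
    if PCT_ENCODED.search(segment):
        return unquote(segment)
    return segment


print(decode_once_if_encoded("kwarta%C5%82+2016"))  # kwartał+2016
print(decode_once_if_encoded("plain.csv"))          # plain.csv
```

A check like this cannot tell an encoded octet from a name that legitimately contains text such as "%20" as data, so the result should be decoded once where the URL is first received and never decoded again later.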

It's also worth considering related discussions in issue #91 for additional context and potential solutions to this problem.
