In the "file naming" section, add a reference to the technical specifications #19

Open
fpetrone opened this issue Jun 29, 2017 · 0 comments

Comments

@fpetrone
Contributor

One of the organizations included in the PAD had named its files using spaces, e.g. "base_estadistica 2015.csv". This produced a download URL containing spaces, which the pydatajson tool rejects as invalid. Our Guide already recommends not using spaces in file names, but this specific case prompted a review of the standards that apply to URLs.
Given the above, I think we should add a reference to the technical specifications in the "file naming" section, either as a footnote or within the text. We could use all or part of @capitantoto's text:

RFC 1738 (1994):
https://www.ietf.org/rfc/rfc1738.txt
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
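
As an illustration of the rule quoted above, here is a minimal Python sketch that flags RFC 1738 unsafe characters in a file name and percent-encodes it for use as a URL path segment. The names `RFC1738_UNSAFE`, `find_unsafe_chars` and `encode_file_name` are hypothetical helpers written for this issue; they are not part of pydatajson. Note that `urllib.parse.quote` follows the later RFC 3986, so it leaves "~" unencoded even though RFC 1738 lists it as unsafe.

```python
from urllib.parse import quote

# Characters that RFC 1738 section 2.2 lists as unsafe (see the quote above).
RFC1738_UNSAFE = set(' <>"#%{}|\\^~[]`')


def find_unsafe_chars(file_name):
    """Return the unsafe characters found in file_name, in order of first appearance."""
    found = []
    for ch in file_name:
        if ch in RFC1738_UNSAFE and ch not in found:
            found.append(ch)
    return found


def encode_file_name(file_name):
    """Percent-encode one path segment; safe="" means "/" is encoded as well."""
    return quote(file_name, safe="")


name = "base_estadistica 2015.csv"
print(find_unsafe_chars(name))  # [' ']
print(encode_file_name(name))   # base_estadistica%202015.csv
```

A check like `find_unsafe_chars` could back the Guide's recommendation at publication time, while encoding is only a fallback for files that were already published with spaces in their names.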

RFC 3986 (2005)
Whitespace is definitely not part of a valid URL. It does not appear in the ABNF grammar collected in Appendix A, and the RFC suggests stripping it from any user input:
http://www.ietf.org/rfc/rfc3986.txt
In practice, URIs are delimited in a variety of ways, but usually within double-quotes "http://example.com/", angle brackets <http://example.com/>, or just by using whitespace. These wrappers do not form part of the URI.
[...]
In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may have to be added to break a long URI across lines. The whitespace should be ignored when the URI is extracted.
[...]
For robustness, software that accepts user-typed URI should attempt to recognize and strip both delimiters and embedded whitespace.
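
The robustness advice quoted above could be sketched in Python roughly as follows. `extract_uri` is a hypothetical helper, not an existing pydatajson function, and it assumes the input holds a single URI, optionally wrapped in double quotes or angle brackets.

```python
import re


def extract_uri(text):
    """Strip wrapping delimiters and embedded whitespace from user-typed text."""
    candidate = text.strip()
    # Remove one matching pair of wrappers: double quotes or angle brackets.
    if len(candidate) >= 2 and (candidate[0], candidate[-1]) in {('"', '"'), ("<", ">")}:
        candidate = candidate[1:-1]
    # Drop spaces, tabs and line breaks introduced while typing or wrapping the URI.
    return re.sub(r"\s+", "", candidate)


print(extract_uri("<http://example.com/>"))  # http://example.com/
```

Stripping is appropriate for whitespace added during transcription. If the file name itself contains a space, as in the case reported here, stripping would point the URL at a different resource, so renaming the file or percent-encoding the space is the safer fix.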
