Custom sphinx extension to generate docs from args spec #34
Conversation
Hey @davemfish, this isn't quite as ready as I hoped (still need some docstrings and more work on the README) but feel free to start looking at it if you want. If you check out my branch and run …
Putting this extension into use in #42, I realized it was just too tricky to intermix the recursively-generated documentation with other text. All the models I've looked at include some contextual info in the Data Needs section. That means that no model could be completely auto-documented (would still need to use the …). So I just took out the recursion. Please check out #42 for what this looks like in practice. Every arg and nested arg (column, field, etc.) gets its own …
Pros: …
Cons: …
I believe this is the best option for now, but in the future it could be interesting to try out different ways of associating extra contextual information with args, given #40 and #41.
I looked at a couple of the rendered chapters from #42 and made some comments here about formatting. Thanks for looking!
    units = spec['units']
elif spec['type'] == 'raster' and spec['bands'][1]['type'] == 'number':
    units = spec['bands'][1]['units']
if units:
I would vote to omit "units: none" from the documentation altogether when units are u.none. I think in those cases it's usually obvious that the number is unitless so I doubt people will wonder about units. The benefit is less text for the reader to parse.
Sure. I was concerned leaving it out that people would ask what the units are, but in most cases it is probably obvious.
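A minimal sketch of how that omission could look, assuming the spec exposes some unitless sentinel (written here as a stand-in constant rather than the real u.none):

UNITLESS = 'none'  # stand-in for whatever sentinel the ARGS_SPEC actually uses (u.none)

def units_phrase(units):
    # Return a 'units: ...' phrase, or None when the value is unitless so the
    # caller can leave units out of the parenthetical entirely.
    if units in (None, UNITLESS):
        return None
    return f'units: {units}'

print(units_phrase('m'))       # 'units: m'
print(units_phrase(UNITLESS))  # None -> omitted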
extensions/investspec/investspec.py
Outdated
if spec['type'] == 'vector':
    in_parentheses.append(
        f'accepted geometries: '
        f'{format_geometries_string(spec["geometries"])}')
Again in the interest of reducing the amount of text in the opening parenthetical, what about removing the 'accepted geometries' text and just printing the types? I think it's self-explanatory that having a geometry type listed means that input should be one of those types.
We could even move it over next to the vector label so the whole thing looks like:
(vector: point, required)
Sounds good. I moved it next to the vector label and separated multiple types with slashes:
(vector: point/multipoint, required)
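Roughly how that slash-separated string could be built (illustrative names only, not the exact investspec helpers):

# Geometry types sit right after the 'vector' label, separated by slashes,
# and required/optional stays elsewhere in the parenthetical.
geometries = ['point', 'multipoint']
type_string = 'vector: ' + '/'.join(geometries)
print(f'({type_string}, required)')  # (vector: point/multipoint, required)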
extensions/investspec/investspec.py
Outdated
if spec['type'] != 'boolean':
    # get() returns None if the key doesn't exist in the dictionary
    required_string = format_required_string(spec.get('required'))
    in_parentheses.append(required_string)
To make this metadata easier to skim & parse, what about using italics for one of the elements? required / optional seems like a good candidate, e.g.:
(number, required, units: m)
It might also make sense to keep "units" close to "number" (or raster) as they are directly related. Maybe required could always go first/last?
Maybe required could always go first/last?
I like the approach of putting it last myself.
I might be just used to how we have represented required strings in the UI historically, but I'd be inclined to have required be either first or last. On the one hand, whether something is required seems like a super important characteristic and maybe should be first. On the other hand, I could also see other information (like the datatype of the input) being more important and required being last.
In any case, I too think it'd be helpful to group the datatype and any related units together.
That sounds good to me. I modified it so that the units go right after number or raster, the units are bolded, and then required is always last:
(number, units: m, required)
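A minimal sketch of that final ordering (the real extension builds these pieces from the ARGS_SPEC; this just shows the agreed arrangement):

in_parentheses = ['number']            # the type always comes first
in_parentheses.append('units: **m**')  # units right after the type, bold in RST
in_parentheses.append('required')      # required/optional always last
print('(' + ', '.join(in_parentheses) + ')')
# (number, units: **m**, required)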
vector fields and geometries, option_string options, etc.
"""
type_string = format_type_string(spec['type'])
in_parentheses = [type_string]
I like how this renders as hypertext - aside from getting the link, it also adds a bit of visual hierarchy in a sea of text.
extensions/investspec/README.md
Outdated
## What is documented
Was this to be filled in later, or is the "What is not documented" section enough?
I just removed this header
extensions/investspec/investspec.py
Outdated
def capitalize(name):

def capitalize_word(word):
Missing docstrings?
Units
~~~~~
For this and other sub-category headers I had a hard time realizing they were sub-components. I think because they are capitalized and just as big as the parent header.
Yea unfortunately RST doesn't let you skip header levels. The way they appear is determined by our custom theme. I'd love to see a new theme for the UG to address this and generally look more modern, but that's probably pretty low priority. Even the default theme looks good to me.
Yeah, to piggyback on the subcomponent issue, I'm pretty sure RST decides whether a title is of a higher or lower priority based on the character used to underline and/or overline it; the catch is that the hierarchy depends on whichever characters happen to be used. So in this case, underlining with ~~~~~ makes a third-level title, but only because it's preceded by ***/*** and ---. In this case, I kind of prefer markdown's explicit title hierarchy: however many #s there are, that's what <H> level it is.
I agree, having the option to use any characters for the title hierarchy just makes it harder to read.
Hey this looks great Emily! I'm marking 'request changes' because I think there might be a couple things in Makefile/conf.py, and I had a couple other technical details in input_types.rst that I'd love to hear your thoughts on. Happy to talk through any of the above!
extensions/investspec/investspec.py
Outdated
def capitalize_word(word):
    if word in {'of', 'the'}:
        return word
    else:
        return word[0].upper() + word[1:]
I just learned of the string .capitalize() and .title() methods which might be useful here (certainly .capitalize() could take the place of line 137 if you wanted to).
Unfortunately .capitalize() and .title() also lowercase all the other letters:
>>> 'REDD land cover map'.capitalize()
'Redd land cover map'
>>> 'REDD land cover map'.title()
'Redd Land Cover Map'
The REDD case is an interesting one because it's a bit more subtle ... I hadn't considered an acronym at all, and you're right! Could you add a comment to this to clarify why what you have here is the correct solution?
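A sketch of what that comment might look like (illustrative wording, not necessarily what landed in investspec.py):

def capitalize_word(word):
    # Only upper-case the first letter and leave the rest of the word alone:
    # unlike str.capitalize()/str.title(), this preserves acronyms like 'REDD'.
    if word in {'of', 'the'}:  # keep small connector words lowercase
        return word
    return word[0].upper() + word[1:]

print(capitalize_word('REDD'))  # 'REDD', not 'Redd'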
source/conf.py
Outdated
if not os.path.exists('../invest-sample-data'):
    print('make get sampledata')
    subprocess.run(['make', '-C', '..', 'get_sampledata'])
    print('done')
I see that the make get_sampledata command goes through the full LFS fetching/filtering operation, which can be pretty time consuming. I had thought that we had moved the tables out of LFS, which, if so, would mean that we wouldn't need to use the LFS part of this at all ... am I forgetting a decision we made about tables in the sample data, or do we use some binary files as well? Sorry for my flaky memory!
@emlys it looks like LFS cloning can be skipped via GIT_LFS_SKIP_SMUDGE=1 git clone https://bitbucket.org/natcap/invest-sample-data.git
I bet you can trim down the clone even more by setting a checkout depth of 1 (or maybe it's 0?). In any case, only cloning the commit needed.
I could not find a way to clone only the commit needed, but with fetch it works:
git init
git remote add origin https://bitbucket.org/natcap/invest-sample-data.git
git fetch --depth 1 origin <commit>
GIT_LFS_SKIP_SMUDGE=1 git checkout <commit>
source/conf.py
Outdated
    print('make get sampledata')
    subprocess.run(['make', '-C', '..', 'get_sampledata'])
    print('done')
if not os.path.exists('invest-sample-data/pollination/landcover_biophysical_table_modified.csv'):
Because the make prep_sampledata target creates 4 tables (and will overwrite any existing tables), might we just always run make prep_sampledata? It seems like even if the tables don't need to be overwritten, overwriting them anyways should be cheap and shouldn't affect anything else. Doing so might also help us avoid some unnecessary build errors because we accidentally messed up some files and/or their locations.
Sure! I was trying to avoid running it twice, but you're right it is not a time consuming step.
Makefile
Outdated
@echo " get_sampledata to check out the invest-sample-data repo" | ||
@echo " prep_sampledata to create modified tables in invest-sample-data that display nicely" | ||
@echo " test_investspec to run unit tests for the custom Sphinx extension" | ||
@echo " demo_investspec to run a demo using the custom Sphinx extension" |
Could you also add these 4 build targets to the .PHONY list? Any build targets in .PHONY will always be executed when invoked, regardless of the state of the filesystem, so this is where we'd indicate that the make recipe is more of a convenient shorthand than a true file-based recipe. If, say, get_sampledata is not in .PHONY, then this is what would happen:
$ touch get_sampledata
$ make get_sampledata
make: `get_sampledata' is up to date.
👍 always more to learn about makefiles!
ratio
-----
A unitless proportion in the range 0 - 1, where 0 represents "none" and 1 represents "all".
Some ratio inputs may be less than 0 or greater than 1, while others are strictly limited to the 0-1 range.
Are there any cases where a negative ratio is allowed? And do we have any cases where a ratio greater than 1 is reasonable? Or would these necessarily be represented by a different input type?
Annual rate of price change in the carbon model can be negative or greater than 1. The use case for a negative value is discussed in the user's guide, and while more than 100% annual price increase is unlikely, it's still valid.
source/input_types.rst
Outdated
text
----
Freeform text. InVEST accepts any Unicode character. For best results, use Unicode character sets for non-Latin alphabets.
Could we specify UTF-8? "Unicode" is actually implemented by a number of different encodings including but not limited to UTF-8.
Maybe we can say both? I feel like "unicode" is more recognizable to a non-technical audience.
Personally, I'd be just fine with both. UTF-8 is probably the best-recognized (and probably synonymous to most people), but since GDAL only takes UTF-8 (and ASCII, a subset of UTF-8), clarifying that "when we say unicode, we really mean UTF-8" could be a prudent clarification.
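A tiny illustration of that distinction, if it helps (purely illustrative; only the "GDAL expects UTF-8" point comes from the comment above):

# "Unicode" text can be serialized with several different encodings;
# UTF-8 is the byte encoding GDAL expects on disk.
text = 'río Grande – 珠江'
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')
print(len(utf8_bytes), len(utf16_bytes))  # same text, different byte streams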
The **uint8** type is sufficient for most discrete data that InVEST uses (land use/land cover classes, soil groups, and so on) which have fewer than 256 possible values.

Here are all the standard raster data types and their ranges (ranges include the starting and ending values):

- **byte** (**uint8**): any integer from 0 to 255
Arc users in particular are a source of signed byte rasters, with values from -128 to 127, which GDAL does not always handle in the expected way since GDAL does not have a formal definition of an int8 type.
I think my recommendation here is to advise against using signed integer rasters in favor of unsigned byte rasters, but maybe we've fixed all those signed byte issues and my memory is out of date. Thoughts?
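A quick way to see the two ranges being discussed (just numpy's type info, nothing InVEST-specific):

import numpy

print(numpy.iinfo(numpy.uint8))  # unsigned byte: min = 0, max = 255
print(numpy.iinfo(numpy.int8))   # signed byte:   min = -128, max = 127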
I would love to know the history of why GDAL uses that PIXELTYPE=SIGNEDBYTE attribute instead of just having an int8 type! Seems so counterintuitive. I know new_raster_from_base does handle signed bytes, and that still tripped me up in the Stormwater model. We do not normally use signed byte rasters in our tests so I'm sure there are some problems lurking out there.
I'll add a note advising against signed byte rasters. If we ever get around to natcap/invest#536, testing everything with signed byte rasters would be a good use for it.
I would love to know the history of why GDAL uses that PIXELTYPE=SIGNEDBYTE attribute instead of just having an int8 type!
Likewise! The GDAL commit history on github is unfortunately not very helpful, so my best guess is that it was added somewhere in the CVS history, before GDAL development moved to SVN and then eventually git.
I'll add a note advising against signed byte rasters. If we ever get around to natcap/invest#536, testing everything with signed byte rasters would be a good use for it.
That sounds good to me! And yes, I agree: if/when we get to that extra testing, running everything against signed byte rasters would be a really good use for it.
source/input_types.rst
Outdated
2. Type (**float** or **int**)

Floating-point (float) types can store digits after the decimal point. There is no hard limit on how many decimal places they can store, but they are only accurate to a limited number of decimal places.
I'm not completely sure we need to get fully into this in this document, but there is also a limit to the accuracy of the integers the floating-point spec can represent, not just decimal precision. That's the tradeoff with IEEE754 ... it can approximate a massive range of numbers, but we lose precision the farther away from 0 we get.
I think the important tidbit about integers is that if you need integer precision > 2^24, folks should use uint32.
Oh that's wacky!
>>> numpy.array([16777217], dtype=numpy.int32)
array([16777217], dtype=int32)
>>> numpy.array([16777217], dtype=numpy.float32)
array([16777216.], dtype=float32)
So maybe the precision of 6 or 7 digits is really total digits, before or after the decimal point?
2^0 + 2^-23 = 1.0000001...
2^4 + 2^-19 = 16.000001...
2^7 + 2^-16 = 128.00001...
2^10 + 2^-13 = 1024.0001...
2^14 + 2^-9 = 16384.001...
2^17 + 2^-6 = 131072.01...
2^20 + 2^-3 = 1048576.1...
2^24 + 2^1 = 1677721x.
2^27 + 2^4 = 1342177xx.
2^30 + 2^7 = 1073741xxx.
I added a note to clarify this a little. I don't think we need to go into much detail.
Yep! The design of single-precision floating-point (float32) numbers is to have about 7 significant decimal digits. Double-precision floating-point (float64) numbers have about 16 significant decimal digits.
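If it's useful, numpy reports those digit counts directly (the ~7/~16 figures above are the usual rule of thumb; numpy's "precision" is the slightly more conservative guaranteed count):

import numpy

print(numpy.finfo(numpy.float32).precision)  # 6  -- approximate decimal digits of precision
print(numpy.finfo(numpy.float64).precision)  # 15
print(numpy.float32(16777217))               # 16777216.0 -- past 2**24, float32 loses integer exactness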
@emlys I just found one trivial thing to change in the Makefile help. Other than that I approve of these changes!
Makefile
Outdated
@echo " html to make standalone HTML files" | ||
@echo " changes to make an overview of all changed/added/deprecated items" | ||
@echo " linkcheck to check all external links for integrity" | ||
@echo " invest-sample-data to check out the invest-sample-data repo" |
It looks like this is no longer a target
It is, as $(GIT_SAMPLE_DATA_REPO_PATH), but maybe filepath targets don't belong in the help section, not sure.
You could clarify this by using @echo " $(GIT_SAMPLE_DATA_REPO_PATH) ... in make help ... I think that would render correctly!
Thanks for the clarification. James' idea seems like a good one
Good idea! It does kind of throw off the spacing in the Makefile, since you have to adjust for the length difference between $(GIT_SAMPLE_DATA_REPO_PATH) and invest-sample-data, but that's not a problem.
Hey @emlys, I don't think I have anything else that Dave or James hasn't touched on. So I'll add my approval!
This looks good to me! I'll add my approval here, but I'll leave it unmerged in case you'd like to make the one trivial change I suggested in Makefile. If not, merge away!
mkdir $(GIT_SAMPLE_DATA_REPO_PATH) && cd $(GIT_SAMPLE_DATA_REPO_PATH)
git -C $(GIT_SAMPLE_DATA_REPO_PATH) init
git -C $(GIT_SAMPLE_DATA_REPO_PATH) remote add origin $(GIT_SAMPLE_DATA_REPO)
git -C $(GIT_SAMPLE_DATA_REPO_PATH) fetch --depth 1 origin $(GIT_SAMPLE_DATA_REPO_REV)
# GIT_LFS_SKIP_SMUDGE=1 prevents getting all the lfs files, we only need the CSVs
GIT_LFS_SKIP_SMUDGE=1 git -C $(GIT_SAMPLE_DATA_REPO_PATH) checkout $(GIT_SAMPLE_DATA_REPO_REV)
Nice!
# this is for the ReadTheDocs build, where conf.py is the only place we can
# run arbitrary commands such as checking out the sample data
subprocess.run(['make', '-C', '..', 'prep_sampledata'])
Ah, that's helpful context in the comment there.
I made that change to the Makefile and tests are passing so I'm gonna merge it!
Fixes #30
This PR introduces a custom Sphinx extension called investspec that generates documentation from a model's ARGS_SPEC. The new file extensions/investspec/README.md has a lot of details about it. Here are the sphinx docs for an overview of extensions. This code is heavily modified from their example custom extensions and this blog post.
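For orientation, a hypothetical conf.py excerpt showing how a custom extension like this is typically wired into a Sphinx build; the actual path and setup are documented in extensions/investspec/README.md:

import os
import sys

# Make the extension importable, then enable it. The path and extension name
# here are assumptions for illustration, not the exact configuration in this repo.
sys.path.insert(0, os.path.abspath('../extensions/investspec'))
extensions = ['investspec']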
The generated docs link to a description of each input type in the new file input_types.rst. I started writing some guidance in the raster section about nodata types etc. Going with our strategic priority of "expand the user base of InVEST to include users outside of research", I think that kind of info will have a place in our documentation. Writing that more fully, and a similar thing for vectors, could happen in a separate PR.