
Custom sphinx extension to generate docs from args spec #34

Merged
58 commits merged into natcap:release/3.10 on Sep 16, 2021

Conversation

@emlys emlys commented Jun 16, 2021

Fixes #30
This PR introduces a custom Sphinx extension called investspec that generates documentation from a model's ARGS_SPEC. The new file extensions/investspec/README.md has a lot of details about it.

Here are the Sphinx docs for an overview of extensions. This code is heavily modified from their example custom extensions and this blog post.
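
For readers new to Sphinx extensions, here is a minimal sketch of how a custom role gets wired up (a hypothetical simplification for illustration, not the actual investspec source; see extensions/investspec/ for the real implementation):

from docutils import nodes

def investspec_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    # Hypothetical stand-in: the real role looks up `text` in a model's
    # ARGS_SPEC and renders the arg's metadata as docutils nodes.
    node = nodes.emphasis(rawtext, f'(docs generated for {text})')
    return [node], []  # (nodes to insert, system messages)

def setup(app):
    # Registering the role makes :investspec:`...` usable in RST source.
    app.add_role('investspec', investspec_role)
    return {'version': '0.1', 'parallel_read_safe': True}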

The generated docs link to a description of each input type in the new file input_types.rst. I started writing some guidance in the raster section about nodata types, etc. Given our strategic priority of "expand the user base of InVEST to include users outside of research", I think that kind of info will have a place in our documentation. Writing that more fully, and a similar thing for vectors, could happen in a separate PR.

@emlys changed the title from "Example/generate docs" to "Custom sphinx extension to generate docs from args spec" on Jun 29, 2021
@emlys self-assigned this on Jun 29, 2021
emlys commented Jul 1, 2021

Hey @davemfish, this isn't quite as ready as I hoped (still need some docstrings and more work on the README) but feel free to start looking at it if you want. If you check out my branch and run make test, in addition to running the unit tests, it will generate a demo HTML file from extensions/investspec/test/index.rst that illustrates what's going on.

emlys commented Aug 24, 2021

Putting this extension into use in #42, I realized it was just too tricky to intermix the recursively-generated documentation with other text. All the models I've looked at include some contextual info in the Data Needs section, which means no model could be completely auto-documented: we'd still need to use the :investspec: role for each arg so that text can be inserted between args. In a few cases this applies to nested args as well (we want to add some text after the description of a particular column in a CSV).

So I just took out the recursion. Please check out #42 to see what this looks like in practice. Every arg and nested arg (column, field, etc.) gets its own :investspec: role, and formatting them into lists is not automated.

Pros:

  • simpler code
  • simpler usage of the :investspec: role
  • easy to intermix generated docs with plain text

Cons:

  • not completely automated (still need to update UG if we add/remove an arg)
  • still need to know something about the CSV structure, so that we can include a "Columns:" or "Rows:" heading above the list of columns/rows (this is no longer auto-generated).

I believe this is the best option for now, but in the future it could be interesting to try out different ways of associating extra contextual information with args, given #40 and #41.
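
To make the tradeoff concrete, here is a hypothetical sketch of what the non-recursive approach amounts to (describe_arg is illustrative, not the extension's actual API; the real code's format_required_string also handles conditional requirements):

def describe_arg(spec):
    # Document exactly one arg (or one nested column/field). Callers
    # arrange these into RST lists by hand, which is what lets plain
    # contextual text be inserted between args.
    parts = [spec['type']]
    if spec['type'] != 'boolean':
        parts.append('required' if spec.get('required') else 'optional')
    return f"({', '.join(parts)}) {spec.get('about', '')}"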

@davemfish (Contributor) left a comment

I looked at a couple of the rendered chapters from #42 and made some comments here about formatting. Thanks for looking!

    units = spec['units']
elif spec['type'] == 'raster' and spec['bands'][1]['type'] == 'number':
    units = spec['bands'][1]['units']
if units:
Contributor:

I would vote to omit "units: none" from the documentation altogether when units are u.none. I think in those cases it's usually obvious that the number is unitless so I doubt people will wonder about units. The benefit is less text for the reader to parse.

@emlys (Member Author):

Sure. I was concerned that if we left it out, people would ask what the units are, but in most cases it's probably obvious.
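
A minimal sketch of the agreed behavior, assuming InVEST's pint-based unit registry where u.none is the unitless sentinel (the helper name is hypothetical):

def format_units_part(units, u):
    # Per the review: omit the 'units: ...' text entirely for unitless
    # numbers instead of printing 'units: none'.
    if units and units != u.none:
        return f'units: {units}'
    return None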

if spec['type'] == 'vector':
    in_parentheses.append(
        f'accepted geometries: '
        f'{format_geometries_string(spec["geometries"])}')
Contributor:

Again in the interest of reducing the amount of text in the opening parenthetical, what about removing the 'accepted geometries' text and just printing the types? I think it's self-explanatory that having a geometry type listed means that input should be one of those types.

We could even move it over next to the vector label so the whole thing looks like:

(vector: point, required)

@emlys (Member Author):

Sounds good. I moved it next to the vector label and separated multiple types with slashes:

(vector: point/multipoint, required)
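
A sketch of that formatting, assuming the spec's geometries arrive as a set of uppercase names like {'POINT', 'MULTIPOINT'}:

def format_geometries_string(geometries):
    # Render e.g. {'POINT', 'MULTIPOINT'} as 'point/multipoint', so the
    # whole parenthetical reads '(vector: point/multipoint, required)'.
    return '/'.join(sorted(geometry.lower() for geometry in geometries))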

if spec['type'] != 'boolean':
    # get() returns None if the key doesn't exist in the dictionary
    required_string = format_required_string(spec.get('required'))
    in_parentheses.append(required_string)
Contributor:

To make this metadata easier to skim & parse, what about using italics for one of the elements? required / optional seems like a good candidate.

(number, *required*, units: m)

It might also make sense to keep "units" close to "number" (or raster) as they are directly related. Maybe required could always go first/last?

Member:

> Maybe required could always go first/last?

I like the approach of putting it last myself.

Member:

I might be just used to how we have represented required strings in the UI historically, but I'd be inclined to have required be either first or last. On the one hand, whether something is required seems like a super important characteristic and maybe should be first. On the other hand, I could also see other information (like the datatype of the input) being more important and required being last.

In any case, I too think it'd be helpful to group the datatype and any related units together.

@emlys (Member Author):

That sounds good to me. I modified it so that the units go right after number or raster, the units are bolded, and then required is always last:

(number, units: m, required)
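
Putting the agreed ordering together, a hypothetical sketch of assembling the parenthetical (bold markup omitted for brevity):

def build_parenthetical(type_string, units_string, required_string):
    parts = [type_string]
    if units_string:
        # units stay grouped with the datatype they describe
        parts.append(f'units: {units_string}')
    # 'required'/'optional' always goes last
    parts.append(required_string)
    return f"({', '.join(parts)})"

# build_parenthetical('number', 'm', 'required') -> '(number, units: m, required)'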

    vector fields and geometries, option_string options, etc.
    """
    type_string = format_type_string(spec['type'])
    in_parentheses = [type_string]
Contributor:

I like how this renders as hypertext - aside from getting the link, it also adds a bit of visual hierarchy in a sea of text.

Comment on lines 65 to 66
## What is documented

Member:

Was this to be filled in later, or is the "What is not documented" section enough?

@emlys (Member Author):

I just removed this header

Comment on lines 131 to 133
def capitalize(name):

    def capitalize_word(word):
Member:

Missing docstrings?

Comment on lines +13 to +14
Units
~~~~~
Member:

For this and other sub-category headers, I had a hard time realizing they were subcomponents. I think it's because they are capitalized and just as big as the parent header.

@emlys (Member Author):

Yeah, unfortunately RST doesn't let you skip header levels. The way they appear is determined by our custom theme. I'd love to see a new theme for the UG that addresses this and generally looks more modern, but that's probably pretty low priority. Even the default theme looks good to me.

Member:

Yeah, to piggyback on the subcomponent issue: I'm pretty sure RST decides whether a title is higher or lower in the hierarchy based on the character used to underline and/or overline it, in order of first appearance, whatever those characters happen to be. So in this case, underlining with ~~~~~ makes a third-level title, but only because it's preceded by *** and ---. I kind of prefer Markdown's explicit title hierarchy: however many #s there are, that's the <H> level.

@emlys (Member Author):

I agree, having the option to use any characters for the title hierarchy just makes it harder to read.

@dcdenu4 (Member) left a comment

Hey @emlys, I think everything looks great from a first-pass level. I think some of my comments will be duplicated between this and #42. Sorry about that.

@phargogh (Member) left a comment

Hey, this looks great, Emily! I'm marking 'request changes' because I think there might be a couple of things in Makefile/conf.py, and I had a couple of other technical details in input_types.rst that I'd love to hear your thoughts on. Happy to talk through any of the above!

Comment on lines 133 to 137
def capitalize_word(word):
    if word in {'of', 'the'}:
        return word
    else:
        return word[0].upper() + word[1:]
Member:

I just learned of the string .capitalize() and .title() methods, which might be useful here (certainly .capitalize() could take the place of line 137 if you wanted to).

@emlys (Member Author):

Unfortunately, .capitalize() and .title() also lowercase all the other letters:

>>> 'REDD land cover map'.capitalize()
'Redd land cover map'
>>> 'REDD land cover map'.title()
'Redd Land Cover Map'

Member:

The REDD case is an interesting one because it's a bit more subtle ... I hadn't considered an acronym at all, and you're right! Could you add a comment to this to clarify why what you have here is the correct solution?
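
Something like this commented version would capture the reasoning (same logic as the PR's capitalize_word, with comments added here):

def capitalize_word(word):
    # Leave short articles/prepositions lowercase, title-case style.
    if word in {'of', 'the'}:
        return word
    # Upper-case only the first letter and leave the rest untouched:
    # str.capitalize() and str.title() would also lowercase the remaining
    # letters, mangling acronyms like 'REDD' into 'Redd'.
    return word[0].upper() + word[1:]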


source/conf.py Outdated
Comment on lines 11 to 14
if not os.path.exists('../invest-sample-data'):
print('make get sampledata')
subprocess.run(['make', '-C', '..', 'get_sampledata'])
print('done')
Member:

I see that the make get_sampledata command goes through the full LFS fetching/filtering operation, which can be pretty time-consuming. I had thought that we had moved the tables out of LFS, which, if so, would mean that we wouldn't need to use the LFS part of this at all ... am I forgetting a decision we made about tables in the sample data, or do we use some binary files as well? Sorry for my flaky memory!

Member:

@emlys it looks like LFS cloning can be skipped via GIT_LFS_SKIP_SMUDGE=1 git clone https://bitbucket.org/natcap/invest-sample-data.git

@phargogh (Member), Sep 2, 2021:

I bet you can trim down the clone even more by setting a checkout depth of 1 (or maybe it's 0?). In any case, only cloning the commit needed.

@emlys (Member Author):

I could not find a way to clone only the commit needed but with fetch it works:

git init
git remote add origin https://bitbucket.org/natcap/invest-sample-data.git
git fetch --depth 1 origin <commit>
GIT_LFS_SKIP_SMUDGE=1 git checkout <commit>

source/conf.py Outdated
    print('make get sampledata')
    subprocess.run(['make', '-C', '..', 'get_sampledata'])
    print('done')
if not os.path.exists('invest-sample-data/pollination/landcover_biophysical_table_modified.csv'):
Member:

Because the make prep_sampledata target creates 4 tables (and will overwrite any existing tables), might we just always run make prep_sampledata? Even if the tables don't need to be overwritten, overwriting them anyway should be cheap and shouldn't affect anything else. Doing so might also help us avoid some unnecessary build errors from accidentally messing up some files and/or their locations.

@emlys (Member Author):

Sure! I was trying to avoid running it twice, but you're right, it's not a time-consuming step.

Makefile Outdated
Comment on lines 23 to 26
@echo " get_sampledata to check out the invest-sample-data repo"
@echo " prep_sampledata to create modified tables in invest-sample-data that display nicely"
@echo " test_investspec to run unit tests for the custom Sphinx extension"
@echo " demo_investspec to run a demo using the custom Sphinx extension"
Member:

Could you also add these 4 build targets to the .PHONY list?

Any build targets in .PHONY will always be executed when invoked, regardless of the state of the filesystem, so this is where we'd indicate that a make recipe is more of a convenient shorthand than a true file-based recipe.

If, say, get_sampledata is not in .PHONY, then this is what would happen:

$ touch get_sampledata
$ make get_sampledata
make: `get_sampledata' is up to date.

@emlys (Member Author):

👍 always more to learn about makefiles!


Comment on lines +24 to +27
ratio
-----
A unitless proportion in the range 0 - 1, where 0 represents "none" and 1 represents "all".
Some ratio inputs may be less than 0 or greater than 1, while others are strictly limited to the 0-1 range.
Member:

Are there any cases where a negative ratio is allowed? And do we have any cases where a ratio greater than 1 is reasonable? Or would these necessarily be represented by a different input type?

@emlys (Member Author):

Annual rate of price change in the carbon model can be negative or greater than 1. The use case for a negative value is discussed in the user's guide, and while a more-than-100% annual price increase is unlikely, it's still valid.


text
----
Freeform text. InVEST accepts any Unicode character. For best results, use Unicode character sets for non-Latin alphabets.
Member:

Could we specify UTF-8? "Unicode" is actually implemented by a number of different encodings including but not limited to UTF-8.

@emlys (Member Author):

Maybe we can say both? I feel like "Unicode" is more recognizable to a non-technical audience.

Member:

Personally, I'd be just fine with both. UTF-8 is probably the best-recognized (and probably synonymous with Unicode to most people), but since GDAL only takes UTF-8 (and ASCII, a subset of UTF-8), clarifying that "when we say Unicode, we really mean UTF-8" could be prudent.
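
To illustrate the encoding/character-set distinction (a quick Python check, not from the thread), the same character maps to different bytes under different Unicode encodings:

>>> 'λ'.encode('utf-8')
b'\xce\xbb'
>>> 'λ'.encode('utf-16-le')
b'\xbb\x03'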

Comment on lines +113 to +117
The **uint8** type is sufficient for most discrete data that InVEST uses (land use/land cover classes, soil groups, and so on) which have fewer than 256 possible values.

Here are all the standard raster data types and their ranges (ranges include the starting and ending values):

- **byte** (**uint8**): any integer from 0 to 255
Member:

Arc users in particular are a source of signed byte rasters, with values from -128 to 127, and which GDAL does not always handle in a way that is expected since GDAL does not have a formal definition of an int8 type.

I think my recommendation here is to advise against using signed integer rasters in favor of unsigned byte rasters, but maybe we've fixed all those signed byte issues and my memory is out of date. Thoughts?

@emlys (Member Author):

I would love to know the history of why GDAL uses that PIXELTYPE=SIGNEDBYTE attribute instead of just having an int8 type! It seems so counterintuitive. I know new_raster_from_base does handle signed bytes, and that still tripped me up in the Stormwater model. We don't normally use signed byte rasters in our tests, so I'm sure there are some problems lurking out there.
I'll add a note advising against signed byte rasters. If we ever get around to natcap/invest#536, testing everything with signed byte rasters would be a good use for it.
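
For context, a minimal sketch of how signed-byte rasters look through GDAL's Python bindings in versions without a native int8 type (assumes osgeo is importable; pre-GDAL-3.7 behavior):

from osgeo import gdal

# There is no gdal.GDT_Int8 in these versions; 'int8' data is stored as
# GDT_Byte plus a PIXELTYPE=SIGNEDBYTE flag that downstream code
# (e.g. new_raster_from_base) must remember to check and propagate.
driver = gdal.GetDriverByName('GTiff')
raster = driver.Create('signed.tif', 10, 10, 1, gdal.GDT_Byte,
                       options=['PIXELTYPE=SIGNEDBYTE'])
band = raster.GetRasterBand(1)
print(band.GetMetadataItem('PIXELTYPE', 'IMAGE_STRUCTURE'))  # SIGNEDBYTE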

Member:

> I would love to know the history of why GDAL uses that PIXELTYPE=SIGNEDBYTE attribute instead of just having an int8 type!

Likewise! The GDAL commit history on github is unfortunately not very helpful, so my best guess is that it was added somewhere in the CVS history, before GDAL development moved to SVN and then eventually git.

> I'll add a note advising against signed byte rasters. If we ever get around to natcap/invest#536, testing everything with signed byte rasters would be a good use for it.

That sounds good to me! And yes, I agree that testing against signed byte rasters would be a really good use for that extra testing, if/when we get to it.


2. Type (**float** or **int**)

Floating-point (float) types can store digits after the decimal point. There is no hard limit on how many decimal places they can store, but they are only accurate to a limited number of decimal places.
Member:

I'm not completely sure we need to get fully into this in this document, but there is also a limit to the accuracy of the integers the floating-point spec can represent, not just decimal precision. That's the tradeoff with IEEE 754: it can approximate a massive range of numbers, but we lose precision the farther we get from 0.

I think the important tidbit about integers is that if you need integer precision > 2^24, folks should use uint32.

@emlys (Member Author):

Oh that's wacky!

>>> numpy.array([16777217], dtype=numpy.int32)
array([16777217], dtype=int32)
>>> numpy.array([16777217], dtype=numpy.float32)
array([16777216.], dtype=float32)

So maybe the precision of 6 or 7 digits is really total significant digits, whether before or after the decimal point?

2^0 + 2^-23  =          1.0000001...
2^4 + 2^-19  =         16.000001...
2^7 + 2^-16  =        128.00001...
2^10 + 2^-13 =       1024.0001...
2^14 + 2^-9  =      16384.001...
2^17 + 2^-6  =     131072.01...
2^20 + 2^-3  =    1048576.1...
2^24 + 2^1   =   1677721x.
2^27 + 2^4   =  1342177xx.
2^30 + 2^7   = 1073741xxx.

@emlys (Member Author):

I added a note to clarify this a little. I don't think we need to go into much detail.

@phargogh (Member), Sep 2, 2021:

Yep! The design of single-precision floating-point (float32) numbers is to have about 7 significant decimal digits. Double-precision floating-point (float64) numbers have about 16 significant decimal digits.
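
For comparison (a quick check, not from the thread): 2^24 + 1 = 16777217 is the smallest positive integer float32 cannot represent exactly, while float64 handles it fine:

>>> import numpy
>>> print(numpy.float32(16777217))
16777216.0
>>> print(numpy.float64(16777217))
16777217.0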

@davemfish (Contributor) left a comment

@emlys I just found one trivial thing to change in the Makefile help. Other than that I approve of these changes!

Makefile Outdated
@echo " html to make standalone HTML files"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " linkcheck to check all external links for integrity"
@echo " invest-sample-data to check out the invest-sample-data repo"
Contributor:

It looks like this is no longer a target

@emlys (Member Author):

It is, as $(GIT_SAMPLE_DATA_REPO_PATH), but maybe filepath targets don't belong in the help section? Not sure.

Member:

You could clarify this by using @echo " $(GIT_SAMPLE_DATA_REPO_PATH) ... in make help ... I think that would render correctly!

Contributor:

Thanks for the clarification. James' idea seems like a good one

@emlys (Member Author):

Good idea! It does kind of throw off the spacing in the Makefile, since you have to adjust for the length difference between $(GIT_SAMPLE_DATA_REPO_PATH) and invest-sample-data, but that's not a problem.

@dcdenu4 (Member) left a comment

Hey @emlys, I don't think I have anything else that Dave or James hasn't touched on. So I'll add my approval!

@phargogh (Member) left a comment

This looks good to me! I'll add my approval here, but I'll leave it unmerged in case you'd like to make the one trivial change I suggested in Makefile. If not, merge away!


Comment on lines +66 to +71
mkdir $(GIT_SAMPLE_DATA_REPO_PATH) && cd $(GIT_SAMPLE_DATA_REPO_PATH)
git -C $(GIT_SAMPLE_DATA_REPO_PATH) init
git -C $(GIT_SAMPLE_DATA_REPO_PATH) remote add origin $(GIT_SAMPLE_DATA_REPO)
git -C $(GIT_SAMPLE_DATA_REPO_PATH) fetch --depth 1 origin $(GIT_SAMPLE_DATA_REPO_REV)
# GIT_LFS_SKIP_SMUDGE=1 prevents getting all the lfs files, we only need the CSVs
GIT_LFS_SKIP_SMUDGE=1 git -C $(GIT_SAMPLE_DATA_REPO_PATH) checkout $(GIT_SAMPLE_DATA_REPO_REV)
Member:

Nice!

Comment on lines +9 to +11
# this is for the ReadTheDocs build, where conf.py is the only place we can
# run arbitrary commands such as checking out the sample data
subprocess.run(['make', '-C', '..', 'prep_sampledata'])
Member:

Ah, that's helpful context in the comment there.

emlys commented Sep 16, 2021

I made that change to the Makefile and tests are passing so I'm gonna merge it!

@emlys emlys merged commit fe31914 into natcap:release/3.10 Sep 16, 2021