
Custom sphinx extension to generate docs from args spec #34

Merged
58 commits merged into natcap:release/3.10 on Sep 16, 2021

Conversation

@emlys emlys commented Jun 16, 2021

Fixes #30
This PR introduces a custom Sphinx extension called investspec that generates documentation from a model's ARGS_SPEC. The new file extensions/investspec/README.md has a lot of details about it.

Here are the Sphinx docs for an overview of extensions. This code is heavily modified from their example custom extensions and this blog post.
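
For readers new to Sphinx extensions, here is a minimal sketch of how a custom role gets wired up (a hypothetical simplification for illustration, not the actual investspec source; see extensions/investspec/ for the real implementation):

from docutils import nodes

def investspec_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    # Hypothetical stand-in: the real role looks up `text` in a model's
    # ARGS_SPEC and renders the arg's metadata as docutils nodes.
    node = nodes.emphasis(rawtext, f'(docs generated for {text})')
    return [node], []  # (nodes to insert, system messages)

def setup(app):
    # Registering the role makes :investspec:`...` usable in RST source.
    app.add_role('investspec', investspec_role)
    return {'version': '0.1', 'parallel_read_safe': True}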

The generated docs link to a description of each input type in the new file input_types.rst. I started writing some guidance in the raster section about nodata types, etc. Given our strategic priority of "expand the user base of InVEST to include users outside of research", I think that kind of info will have a place in our documentation. Writing that more fully, and a similar thing for vectors, could happen in a separate PR.

@emlys changed the title from "Example/generate docs" to "Custom sphinx extension to generate docs from args spec" on Jun 29, 2021
@emlys self-assigned this on Jun 29, 2021
emlys commented Jul 1, 2021

Hey @davemfish, this isn't quite as ready as I hoped (still need some docstrings and more work on the README) but feel free to start looking at it if you want. If you check out my branch and run make test, in addition to running the unit tests, it will generate a demo HTML file from extensions/investspec/test/index.rst that illustrates what's going on.

emlys commented Aug 24, 2021

Putting this extension into use in #42, I realized it was just too tricky to intermix the recursively-generated documentation with other text. All the models I've looked at include some contextual info in the Data Needs section, which means no model could be completely auto-documented: we'd still need to use the :investspec: role for each arg so that text can be inserted between args. In a few cases this applies to nested args as well (we want to add some text after the description of a particular column in a CSV).

So I just took out the recursion. Please check out #42 to see what this looks like in practice. Every arg and nested arg (column, field, etc.) gets its own :investspec: role, and formatting them into lists is not automated.

Pros:

  • simpler code
  • simpler usage of the :investspec: role
  • easy to intermix generated docs with plain text

Cons:

  • not completely automated (still need to update UG if we add/remove an arg)
  • still need to know something about the CSV structure, so that we can include a "Columns:" or "Rows:" heading above the list of columns/rows (this is no longer auto-generated).

I believe this is the best option for now, but in the future it could be interesting to try out different ways of associating extra contextual information with args, given #40 and #41.
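
To make the tradeoff concrete, here is a hypothetical sketch of what the non-recursive approach amounts to (describe_arg is illustrative, not the extension's actual API; the real code's format_required_string also handles conditional requirements):

def describe_arg(spec):
    # Document exactly one arg (or one nested column/field). Callers
    # arrange these into RST lists by hand, which is what lets plain
    # contextual text be inserted between args.
    parts = [spec['type']]
    if spec['type'] != 'boolean':
        parts.append('required' if spec.get('required') else 'optional')
    return f"({', '.join(parts)}) {spec.get('about', '')}"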

@davemfish (Contributor) left a comment

I looked at a couple of the rendered chapters from #42 and made some comments here about formatting. Thanks for looking!

    units = spec['units']
elif spec['type'] == 'raster' and spec['bands'][1]['type'] == 'number':
    units = spec['bands'][1]['units']
if units:
Contributor:

I would vote to omit "units: none" from the documentation altogether when units are u.none. I think in those cases it's usually obvious that the number is unitless so I doubt people will wonder about units. The benefit is less text for the reader to parse.

@emlys (Member Author):

Sure. I was concerned that if we left it out, people would ask what the units are, but in most cases it's probably obvious.
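
A minimal sketch of the agreed behavior, assuming InVEST's pint-based unit registry where u.none is the unitless sentinel (the helper name is hypothetical):

def format_units_part(units, u):
    # Per the review: omit the 'units: ...' text entirely for unitless
    # numbers instead of printing 'units: none'.
    if units and units != u.none:
        return f'units: {units}'
    return None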

if spec['type'] == 'vector':
    in_parentheses.append(
        f'accepted geometries: '
        f'{format_geometries_string(spec["geometries"])}')
Contributor:

Again in the interest of reducing the amount of text in the opening parenthetical, what about removing the 'accepted geometries' text and just printing the types? I think it's self-explanatory that having a geometry type listed means that input should be one of those types.

We could even move it over next to the vector label so the whole thing looks like:

(vector: point, required)

@emlys (Member Author):

Sounds good. I moved it next to the vector label and separated multiple types with slashes:

(vector: point/multipoint, required)
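
A sketch of that formatting, assuming the spec's geometries arrive as a set of uppercase names like {'POINT', 'MULTIPOINT'}:

def format_geometries_string(geometries):
    # Render e.g. {'POINT', 'MULTIPOINT'} as 'point/multipoint', so the
    # whole parenthetical reads '(vector: point/multipoint, required)'.
    return '/'.join(sorted(geometry.lower() for geometry in geometries))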

if spec['type'] != 'boolean':
    # get() returns None if the key doesn't exist in the dictionary
    required_string = format_required_string(spec.get('required'))
    in_parentheses.append(required_string)
Contributor:

To make this metadata easier to skim & parse, what about using italics for one of the elements? required / optional seems like a good candidate.

(number, *required*, units: m)

It might also make sense to keep "units" close to "number" (or raster) as they are directly related. Maybe required could always go first/last?

Member:

> Maybe required could always go first/last?

I like the approach of putting it last myself.

Member:

I might be just used to how we have represented required strings in the UI historically, but I'd be inclined to have required be either first or last. On the one hand, whether something is required seems like a super important characteristic and maybe should be first. On the other hand, I could also see other information (like the datatype of the input) being more important and required being last.

In any case, I too think it'd be helpful to group the datatype and any related units together.

@emlys (Member Author):

That sounds good to me. I modified it so that the units go right after number or raster, the units are bolded, and then required is always last:

(number, units: m, required)
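
Putting the agreed ordering together, a hypothetical sketch of assembling the parenthetical (bold markup omitted for brevity):

def build_parenthetical(type_string, units_string, required_string):
    parts = [type_string]
    if units_string:
        # units stay grouped with the datatype they describe
        parts.append(f'units: {units_string}')
    # 'required'/'optional' always goes last
    parts.append(required_string)
    return f"({', '.join(parts)})"

# build_parenthetical('number', 'm', 'required') -> '(number, units: m, required)'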

    vector fields and geometries, option_string options, etc.
    """
    type_string = format_type_string(spec['type'])
    in_parentheses = [type_string]
Contributor:

I like how this renders as hypertext - aside from getting the link, it also adds a bit of visual hierarchy in a sea of text.

Comment on lines 65 to 66
## What is documented

Member:

Was this to be filled in later, or is the "What is not documented" section enough?

@emlys (Member Author):

I just removed this header

Comment on lines 131 to 133
def capitalize(name):

    def capitalize_word(word):
Member:

Missing docstrings?

Comment on lines +13 to +14
Units
~~~~~
Member:

For this and other sub-category headers, I had a hard time realizing they were subcomponents. I think it's because they are capitalized and just as big as the parent header.

@emlys (Member Author):

Yeah, unfortunately RST doesn't let you skip header levels. The way they appear is determined by our custom theme. I'd love to see a new theme for the UG that addresses this and generally looks more modern, but that's probably pretty low priority. Even the default theme looks good to me.

Member:

Yeah, to piggyback on the subcomponent issue: I'm pretty sure RST decides whether a title is higher or lower in the hierarchy based on the character used to underline and/or overline it, in order of first appearance, whatever those characters happen to be. So in this case, underlining with ~~~~~ makes a third-level title, but only because it's preceded by *** and ---. I kind of prefer Markdown's explicit title hierarchy: however many #s there are, that's the <H> level.

@emlys (Member Author):

I agree, having the option to use any characters for the title hierarchy just makes it harder to read.

@dcdenu4 (Member) left a comment

Hey @emlys, I think everything looks great from a first-pass level. I think some of my comments will be duplicated between this and #42. Sorry about that.

@phargogh (Member) left a comment

Hey, this looks great, Emily! I'm marking 'request changes' because I think there might be a couple of things in Makefile/conf.py, and I had a couple of other technical details in input_types.rst that I'd love to hear your thoughts on. Happy to talk through any of the above!

Comment on lines 133 to 137
def capitalize_word(word):
    if word in {'of', 'the'}:
        return word
    else:
        return word[0].upper() + word[1:]
Member:

I just learned of the string .capitalize() and .title() methods, which might be useful here (certainly .capitalize() could take the place of line 137 if you wanted to).

@emlys (Member Author):

Unfortunately, .capitalize() and .title() also lowercase all the other letters:

>>> 'REDD land cover map'.capitalize()
'Redd land cover map'
>>> 'REDD land cover map'.title()
'Redd Land Cover Map'

Member:

The REDD case is an interesting one because it's a bit more subtle ... I hadn't considered an acronym at all, and you're right! Could you add a comment to this to clarify why what you have here is the correct solution?
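
Something like this commented version would capture the reasoning (same logic as the PR's capitalize_word, with comments added here):

def capitalize_word(word):
    # Leave short articles/prepositions lowercase, title-case style.
    if word in {'of', 'the'}:
        return word
    # Upper-case only the first letter and leave the rest untouched:
    # str.capitalize() and str.title() would also lowercase the remaining
    # letters, mangling acronyms like 'REDD' into 'Redd'.
    return word[0].upper() + word[1:]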


source/conf.py Outdated
Comment on lines 11 to 14
if not os.path.exists('../invest-sample-data'):
print('make get sampledata')
subprocess.run(['make', '-C', '..', 'get_sampledata'])
print('done')
Member:

I see that the make get_sampledata command goes through the full LFS fetching/filtering operation, which can be pretty time-consuming. I had thought that we had moved the tables out of LFS, which, if so, would mean that we wouldn't need to use the LFS part of this at all ... am I forgetting a decision we made about tables in the sample data, or do we use some binary files as well? Sorry for my flaky memory!

Member:

@emlys it looks like LFS cloning can be skipped via GIT_LFS_SKIP_SMUDGE=1 git clone https://bitbucket.org/natcap/invest-sample-data.git

@phargogh (Member), Sep 2, 2021:

I bet you can trim down the clone even more by setting a checkout depth of 1 (or maybe it's 0?). In any case, only cloning the commit needed.

@emlys (Member Author):

I could not find a way to clone only the commit needed but with fetch it works:

git init
git remote add origin https://bitbucket.org/natcap/invest-sample-data.git
git fetch --depth 1 origin <commit>
GIT_LFS_SKIP_SMUDGE=1 git checkout <commit>

source/conf.py Outdated
    print('make get sampledata')
    subprocess.run(['make', '-C', '..', 'get_sampledata'])
    print('done')
if not os.path.exists('invest-sample-data/pollination/landcover_biophysical_table_modified.csv'):
Member:

Because the make prep_sampledata target creates 4 tables (and will overwrite any existing tables), might we just always run make prep_sampledata? Even if the tables don't need to be overwritten, overwriting them anyway should be cheap and shouldn't affect anything else. Doing so might also help us avoid some unnecessary build errors from accidentally messing up some files and/or their locations.

@emlys (Member Author):

Sure! I was trying to avoid running it twice, but you're right, it's not a time-consuming step.

Makefile Outdated
Comment on lines 23 to 26
@echo " get_sampledata to check out the invest-sample-data repo"
@echo " prep_sampledata to create modified tables in invest-sample-data that display nicely"
@echo " test_investspec to run unit tests for the custom Sphinx extension"
@echo " demo_investspec to run a demo using the custom Sphinx extension"
Member:

Could you also add these 4 build targets to the .PHONY list?

Any build targets in .PHONY will always be executed when invoked, regardless of the state of the filesystem, so this is where we'd indicate that a make recipe is more of a convenient shorthand than a true file-based recipe.

If, say, get_sampledata is not in .PHONY, then this is what would happen:

$ touch get_sampledata
$ make get_sampledata
make: `get_sampledata' is up to date.

@emlys (Member Author):

👍 always more to learn about makefiles!


Comment on lines +24 to +27
ratio
-----
A unitless proportion in the range 0 - 1, where 0 represents "none" and 1 represents "all".
Some ratio inputs may be less than 0 or greater than 1, while others are strictly limited to the 0-1 range.
Member:

Are there any cases where a negative ratio is allowed? And do we have any cases where a ratio greater than 1 is reasonable? Or would these necessarily be represented by a different input type?

@emlys (Member Author):

Annual rate of price change in the carbon model can be negative or greater than 1. The use case for a negative value is discussed in the user's guide, and while a more-than-100% annual price increase is unlikely, it's still valid.


text
----
Freeform text. InVEST accepts any Unicode character. For best results, use Unicode character sets for non-Latin alphabets.
Member:

Could we specify UTF-8? "Unicode" is actually implemented by a number of different encodings including but not limited to UTF-8.

@emlys (Member Author):

Maybe we can say both? I feel like "Unicode" is more recognizable to a non-technical audience.

Member:

Personally, I'd be just fine with both. UTF-8 is probably the best-recognized (and probably synonymous with Unicode to most people), but since GDAL only takes UTF-8 (and ASCII, a subset of UTF-8), clarifying that "when we say Unicode, we really mean UTF-8" could be prudent.
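
To illustrate the encoding/character-set distinction (a quick Python check, not from the thread), the same character maps to different bytes under different Unicode encodings:

>>> 'λ'.encode('utf-8')
b'\xce\xbb'
>>> 'λ'.encode('utf-16-le')
b'\xbb\x03'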

Comment on lines +113 to +117
The **uint8** type is sufficient for most discrete data that InVEST uses (land use/land cover classes, soil groups, and so on) which have fewer than 256 possible values.

Here are all the standard raster data types and their ranges (ranges include the starting and ending values):

- **byte** (**uint8**): any integer from 0 to 255
Member:

Arc users in particular are a source of signed byte rasters, with values from -128 to 127, and which GDAL does not always handle in a way that is expected since GDAL does not have a formal definition of an int8 type.

I think my recommendation here is to advise against using signed integer rasters in favor of unsigned byte rasters, but maybe we've fixed all those signed byte issues and my memory is out of date. Thoughts?

@emlys (Member Author):

I would love to know the history of why GDAL uses that PIXELTYPE=SIGNEDBYTE attribute instead of just having an int8 type! It seems so counterintuitive. I know new_raster_from_base does handle signed bytes, and that still tripped me up in the Stormwater model. We don't normally use signed byte rasters in our tests, so I'm sure there are some problems lurking out there.
I'll add a note advising against signed byte rasters. If we ever get around to natcap/invest#536, testing everything with signed byte rasters would be a good use for it.
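
For context, a minimal sketch of how signed-byte rasters look through GDAL's Python bindings in versions without a native int8 type (assumes osgeo is importable; pre-GDAL-3.7 behavior):

from osgeo import gdal

# There is no gdal.GDT_Int8 in these versions; 'int8' data is stored as
# GDT_Byte plus a PIXELTYPE=SIGNEDBYTE flag that downstream code
# (e.g. new_raster_from_base) must remember to check and propagate.
driver = gdal.GetDriverByName('GTiff')
raster = driver.Create('signed.tif', 10, 10, 1, gdal.GDT_Byte,
                       options=['PIXELTYPE=SIGNEDBYTE'])
band = raster.GetRasterBand(1)
print(band.GetMetadataItem('PIXELTYPE', 'IMAGE_STRUCTURE'))  # SIGNEDBYTE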

Member:

> I would love to know the history of why GDAL uses that PIXELTYPE=SIGNEDBYTE attribute instead of just having an int8 type!

Likewise! The GDAL commit history on github is unfortunately not very helpful, so my best guess is that it was added somewhere in the CVS history, before GDAL development moved to SVN and then eventually git.

> I'll add a note advising against signed byte rasters. If we ever get around to natcap/invest#536, testing everything with signed byte rasters would be a good use for it.

That sounds good to me! And yes, I agree that testing against signed byte rasters would be a really good use for that extra testing, if/when we get to it.


2. Type (**float** or **int**)

Floating-point (float) types can store digits after the decimal point. There is no hard limit on how many decimal places they can store, but they are only accurate to a limited number of decimal places.
Member:

I'm not completely sure we need to get fully into this in this document, but there is also a limit to the accuracy of the integers the floating-point spec can represent, not just decimal precision. That's the tradeoff with IEEE 754: it can approximate a massive range of numbers, but we lose precision the farther we get from 0.

I think the important tidbit about integers is that if you need integer precision > 2^24, folks should use uint32.

@emlys (Member Author):

Oh that's wacky!

>>> numpy.array([16777217], dtype=numpy.int32)
array([16777217], dtype=int32)
>>> numpy.array([16777217], dtype=numpy.float32)
array([16777216.], dtype=float32)

So maybe the precision of 6 or 7 digits is really total significant digits, whether before or after the decimal point?

2^0 + 2^-23  =          1.0000001...
2^4 + 2^-19  =         16.000001...
2^7 + 2^-16  =        128.00001...
2^10 + 2^-13 =       1024.0001...
2^14 + 2^-9  =      16384.001...
2^17 + 2^-6  =     131072.01...
2^20 + 2^-3  =    1048576.1...
2^24 + 2^1   =   1677721x.
2^27 + 2^4   =  1342177xx.
2^30 + 2^7   = 1073741xxx.

@emlys (Member Author):

I added a note to clarify this a little. I don't think we need to go into much detail.

@phargogh (Member), Sep 2, 2021:

Yep! The design of single-precision floating-point (float32) numbers is to have about 7 significant decimal digits. Double-precision floating-point (float64) numbers have about 16 significant decimal digits.
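
For comparison (a quick check, not from the thread): 2^24 + 1 = 16777217 is the smallest positive integer float32 cannot represent exactly, while float64 handles it fine:

>>> import numpy
>>> print(numpy.float32(16777217))
16777216.0
>>> print(numpy.float64(16777217))
16777217.0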

@davemfish (Contributor) left a comment

@emlys I just found one trivial thing to change in the Makefile help. Other than that I approve of these changes!

Makefile Outdated
@echo " html to make standalone HTML files"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " linkcheck to check all external links for integrity"
@echo " invest-sample-data to check out the invest-sample-data repo"
Contributor:

It looks like this is no longer a target

@emlys (Member Author):

It is, as $(GIT_SAMPLE_DATA_REPO_PATH), but maybe filepath targets don't belong in the help section? Not sure.

Member:

You could clarify this by using @echo " $(GIT_SAMPLE_DATA_REPO_PATH) ... in make help ... I think that would render correctly!

Contributor:

Thanks for the clarification. James' idea seems like a good one

@emlys (Member Author):

Good idea! It does kind of throw off the spacing in the Makefile, since you have to adjust for the length difference between $(GIT_SAMPLE_DATA_REPO_PATH) and invest-sample-data, but that's not a problem.

@dcdenu4 (Member) left a comment

Hey @emlys, I don't think I have anything else that Dave or James hasn't touched on. So I'll add my approval!

@phargogh (Member) left a comment

This looks good to me! I'll add my approval here, but I'll leave it unmerged in case you'd like to make the one trivial change I suggested in Makefile. If not, merge away!


Comment on lines +66 to +71
mkdir $(GIT_SAMPLE_DATA_REPO_PATH) && cd $(GIT_SAMPLE_DATA_REPO_PATH)
git -C $(GIT_SAMPLE_DATA_REPO_PATH) init
git -C $(GIT_SAMPLE_DATA_REPO_PATH) remote add origin $(GIT_SAMPLE_DATA_REPO)
git -C $(GIT_SAMPLE_DATA_REPO_PATH) fetch --depth 1 origin $(GIT_SAMPLE_DATA_REPO_REV)
# GIT_LFS_SKIP_SMUDGE=1 prevents getting all the lfs files, we only need the CSVs
GIT_LFS_SKIP_SMUDGE=1 git -C $(GIT_SAMPLE_DATA_REPO_PATH) checkout $(GIT_SAMPLE_DATA_REPO_REV)
Member:

Nice!

Comment on lines +9 to +11
# this is for the ReadTheDocs build, where conf.py is the only place we can
# run arbitrary commands such as checking out the sample data
subprocess.run(['make', '-C', '..', 'prep_sampledata'])
Member:

Ah, that's helpful context in the comment there.

emlys commented Sep 16, 2021

I made that change to the Makefile and tests are passing so I'm gonna merge it!

@emlys emlys merged commit fe31914 into natcap:release/3.10 Sep 16, 2021