Creating a parser to transform the Matlab documentation into a structure #20

gaspardcereza · 2020-06-26T14:33:31Z

Context

We are currently trying to define how to document our functions within the Matlab code. We also want to gather all these descriptions on our website.
Closes #19.

Problem

What looks good in the Matlab code isn't always visually satisfying once on the website (and vice-versa).

Solution

Parse the Matlab code to gather all the useful information (description, syntax, inputs, outputs...) into a structure that will then be used to create the .md files from which the website is generated (and displayed as we want).
Playing with the parser will also help us define the best convention for documenting our functions (related to this issue).

What has been done so far

I implemented a parser that reads a Matlab function and returns all the information contained in its documentation section as a structure. I started from test.m as a template for the documentation (though I think the final convention won't look exactly like this).

Here's what the Matlab documentation looks like:

%TEST Computes output1 and output 2 from arg1 and arg2.
%
% SYNTAX
%
%   output1 = test(arg1, arg2)
%   [output1, output2] = test(arg1, arg2)
%
% DESCRIPTION
%
%   Computes output1 as the sum of arg1 and arg2 and output2 as the
%   difference between arg1 and arg2.
%
% INPUTS
%
%   arg1
%     Scalar. This line is very long because i wanted to test if the input
%     description would be correctly parsed if it is longer than one line.
%
%   arg2
%     Scalar
%
% OUTPUTS
%
%   output1
%     Sum of arg1 and arg2  
%
%   output2
%     Difference between arg1 and arg2
%
% NOTES
%
% That function is destined to test the parsing of the function
% documentation (done by parse_doc.m).

And the returned structure contains the fields:

|.summary (string)
|
|.syntax (array of strings of size 1*nSyntaxes)
|
|.description (string)
|
|.inputs (struct)
|           \______ .names (array of strings of size 1*nInputs)
|            \______ .description (array of strings of size 1*nInputs)
|
|.outputs (struct)
|           \______ .names (array of strings of size 1*nOutputs)
|            \______ .description (array of strings of size 1*nOutputs)
|
|.notes (string)
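
For illustration, here is roughly what that structure would hold for the test.m documentation above. This is a hand-written sketch based on the field list, not output copied from parse_doc; the use of string arrays and the way multi-line text is joined are assumptions.

```matlab
% Hand-built example of the structure described above.
% The values are assumed; they are not actual parse_doc output.
docStruct = struct();
docStruct.summary = "Computes output1 and output 2 from arg1 and arg2.";
docStruct.syntax = ["output1 = test(arg1, arg2)", ...
                    "[output1, output2] = test(arg1, arg2)"];
docStruct.description = "Computes output1 as the sum of arg1 and arg2 and output2 as the difference between arg1 and arg2.";
docStruct.inputs.names = ["arg1", "arg2"];
docStruct.inputs.description = ["Scalar. This line is very long because ...", "Scalar"];
docStruct.outputs.names = ["output1", "output2"];
docStruct.outputs.description = ["Sum of arg1 and arg2", "Difference between arg1 and arg2"];
docStruct.notes = "That function is destined to test the parsing of the function documentation (done by parse_doc.m).";

% Names and descriptions are paired by index.
for k = 1:numel(docStruct.inputs.names)
    fprintf('%s: %s\n', docStruct.inputs.names(k), docStruct.inputs.description(k));
end
```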

I'm still not satisfied with the parser, though, for 2 main reasons:

The way it is implemented right now requires strict adherence to the template (e.g. no "forgotten" spaces),
It won't work with field names other than SYNTAX, DESCRIPTION, etc...

I think there are many other flaws in the code, but it might be a good start, and it uses some useful Matlab parsing features that I found.

The next step is to find a way to reorganize that structure as a .md file.
Feel free to comment if you have ideas, advice or anything...

jcohenadad · 2020-06-26T16:37:49Z

The way it is implemented right now requires a total respect of the template (e.g no "forgotten" spaces),

ok for now

All the fields (SYNTAX, DESCRIPTION, etc...) must be provided.

that's a problem-- but should easily be overcome

The next step is to find a way to reorganize that structure as a .md file.

indeed

@gaspardcereza: in general, the first message of the PR thread is supposed to be "final", without needing many modifications afterwards. If you wish to report progress (i.e. the section "What has been done so far"), i suggest doing it in subsequent posts, so that:

you don't have to edit your main post once this is implemented
we keep a history of the progress in this PR thread.

gaspardcereza · 2020-07-02T14:15:17Z

@jcohenadad I think the parser is ready to be merged now that it can handle the absence of some sections. The .md generator could be the subject of another PR as part of your "bit by bit" policy on GitHub.

jcohenadad · 2020-07-02T15:11:49Z

parser/test.m

@@ -0,0 +1,40 @@
+function [output1, output2] = test(arg1, arg2)


isn't the convention to name test functions after the function they test, i.e.: test_parse_doc.m?

didn't know that. I'll change it

i don't know either-- i'm basing it just from python's pytest-- obviously things might be different in matlab-- please inform yourself how those tests are run-- @po09i @gab-berest @rtopfer might know

I'd say that test names (as for unit tests, if I understand correctly) should contain the name of the tested function AND the expected result. With more complex functions you may need to test not only that the function does the correct thing, but also its exceptions. For example:
test_parse_header_complete_pass
test_parse_header_missing_field_pass
test_parse_header_wrong_field_fail
In this case, the first one checks that a full header gives a good call and a successful return, the second checks that a missing field still returns ok, and the third checks that a wrong input fails as expected.

It makes big function names, but when testing it will be clearer when something fails.
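
As a rough sketch of that naming scheme, the three tests could be written with Matlab's function-based unit test framework (`functiontests`, `verifyTrue`, `verifyError`). The file name, the error identifier and `parse_header_stub` are hypothetical: the stub only stands in for the real parser so that the file runs on its own.

```matlab
% File: test_parse_header.m (hypothetical name)
function tests = test_parse_header
    % Collect the local functions whose names start or end with "test".
    tests = functiontests(localfunctions);
end

function test_parse_header_complete_pass(testCase)
    doc = parse_header_stub({'% SYNTAX', '%   out = f(in)', '% NOTES', '%   Some note'});
    verifyTrue(testCase, isfield(doc, 'syntax'));
    verifyTrue(testCase, isfield(doc, 'notes'));
end

function test_parse_header_missing_field_pass(testCase)
    % A header without NOTES should still be parsed.
    doc = parse_header_stub({'% SYNTAX', '%   out = f(in)'});
    verifyTrue(testCase, isfield(doc, 'syntax'));
    verifyFalse(testCase, isfield(doc, 'notes'));
end

function test_parse_header_wrong_field_fail(testCase)
    % A wrong input type must raise the expected error.
    verifyError(testCase, @() parse_header_stub(42), 'parser:invalidInput');
end

function doc = parse_header_stub(lines)
    % Hypothetical stand-in for the parser: records which section
    % headers (upper-case words alone on a comment line) are present.
    if ~iscellstr(lines)
        error('parser:invalidInput', 'Expected a cell array of character vectors.');
    end
    doc = struct();
    for k = 1:numel(lines)
        name = regexp(lines{k}, '^%\s*([A-Z]+)\s*$', 'tokens', 'once');
        if ~isempty(name)
            doc.(lower(name{1})) = '';
        end
    end
end
```

With the file on the Matlab path, `results = runtests('test_parse_header')` runs the three tests and reports each one by name.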

parser/test.m

jcohenadad

is test.m actually tested by the CI?

jcohenadad · 2020-07-02T15:13:45Z

parser/test.m

+output1 = arg1+arg2;
+% We don't want that kind of comments to appear in the parsed documentation.
+output2 = arg1-arg2; % Neither this kind of comments
+


what are the assertions for this test? how can it fail?

That test function is just a dummy function that will be parsed (it could even be deleted, and we could test the parsing directly on parse_doc). It is not a unit test. I would like to write a general unit test that includes the .md part once everything is good to go.

bits-by-bits-- good practice is: for every code that is written, a test should go along it.

never "postpone" writing tests

things that we postpone might take a while to be implemented and then we forget about it

so the philosophy is: do it now, not tomorrow

also: a test that never fails is not a test

Ok I'll create a unit test right now then. Is it ok if I create a test folder at the root of the repo ?

@po09i is currently looking into this and can let us know what to do

jcohenadad · 2020-07-02T15:20:53Z

@jcohenadad I think the parser is ready to be merged now that it can handle the absence of some sections. The .md generator could be the subject of another PR as part of your "bit by bit" policy on GitHub.

in that case, pls open the pending issues that will not be addressed by this PR, and reference those new issues in this PR-- otherwise it will be forgotten

po09i · 2020-07-02T19:15:09Z

@gaspardcereza @jcohenadad
Answering all the tests comments :
Based on what Mathieu did for his tests, we should have a /tests folder where our azure-pipeline.yml file will go as well as a script to launch tests manually. Then, we should have /tests/helpDocMd/parser/parse_doc_test.m where parse_doc_test.m is a class with this format.

CI is currently not set up on helpDocMd but will be implemented soon. You can also run the tests locally by calling runTestSuite('Unit') once the file to run tests manually is created (first link in this post).

You can refer to shimming-toolbox : /tests to have a general view.

gab-berest

I said a lot of things in this review. some are more relevant to the way the documentation is done to make it more flexible.

gab-berest · 2020-07-02T22:41:36Z

parser/parse_doc.m

+
+    switch header
+        case 'SYNTAX'
+            docStruct.syntax = erase(functionDoc(sectionStart:sectionEnd),'   ');


The code is really whitespace dependent, and the problem is that a lot of characters count as whitespace (tab, space, null, ...). As soon as someone uses a different number of spaces, everything will break. A good way of reducing this problem and making the documentation more flexible is adding a symbol in front of specific information ('->' in front of inputs, '<-' in front of outputs, '@' in front of notes, '_' in front of types). This way you know that everything after a given symbol is of a certain type until you reach the next symbol.

Yes you're right but is it really easier to make sure everyone adds the right symbol in the right place or that they just respect the number of white spaces ? That's something I've been asking myself when I was coding but I don't have a clear preference for any of these options so I'm open to discussion.
Also, the way the code is implemented right now, there's no need for a symbol in description and notes because the code will just take everything that's in these sections and store it as a paragraph. I also don't see a genuine utility in having different symbols for the inputs and outputs as they are in different sections and thus can't be mixed up. But it might however be more visual to have these different arrows.

The problem with the whitespaces is actually the implementation (tabs vs spaces vs null characters, etc.). There's a lot of ways to add space in a text editor. However with a symbol it's easy to know which one.
The question of someone putting the right character is not relevant since it is the same for every way we find (number of spaces, etc.).
you are right for the description and notes, but I'm talking more about the input/outputs.
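
A minimal sketch of what marker-based parsing could look like for inputs and outputs, using `regexp`. The documentation lines are invented for the example, and the '->' and '<-' markers are the ones proposed in this thread, not an adopted convention.

```matlab
% Hypothetical documentation lines using the proposed markers.
% The third line uses a tab and no space after the arrow on purpose.
docLines = { ...
    '%   -> arg1', ...
    '%        Scalar.', ...
    sprintf('%%\t->arg2'), ...
    '% <- output1'};

inputNames = {};
outputNames = {};
for k = 1:numel(docLines)
    % \s matches spaces and tabs, so the amount and kind of
    % indentation no longer matters.
    inTok = regexp(docLines{k}, '^%\s*->\s*(\w+)', 'tokens', 'once');
    outTok = regexp(docLines{k}, '^%\s*<-\s*(\w+)', 'tokens', 'once');
    if ~isempty(inTok)
        inputNames{end+1} = inTok{1};
    elseif ~isempty(outTok)
        outputNames{end+1} = outTok{1};
    end
end

disp(inputNames);
disp(outputNames);
```

Lines without a marker (such as the description of arg1) are simply skipped here; a fuller version would attach them to the most recent name.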

gab-berest · 2020-07-02T22:44:00Z

parser/parse_doc.m

@@ -0,0 +1,88 @@
+function docStruct = parse_doc(functionPath)
+%PARSE_DOC Generates a structure from the input function's documentation.


Maybe it's something that was agreed previously, but is it necessary to have a struct? Isn't it a better idea to have a dictionary (map) so you can input any information needed in the header like examples, etc. and make a test at the end of the parse to check that minimal sections are present?

I decided to go with a struct mainly because it was something I knew how to use. 😉 I'm not familiar with dictionaries but if you think that might be a better solution I'd be glad to discuss about it ! Is this what you are referring to ?

Yes, that's exactly what I'm thinking about. I think maps or set depending on what we want could be more versatile.
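
To make the suggestion concrete, here is a small sketch with `containers.Map`, including the check for minimal sections at the end of parsing. Which sections count as required, and the error identifier, are assumptions made for the example.

```matlab
% Section names are not fixed in advance, so any header can be stored.
docMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
docMap('SYNTAX') = ["output1 = test(arg1, arg2)", ...
                    "[output1, output2] = test(arg1, arg2)"];
docMap('DESCRIPTION') = 'Computes output1 as the sum of arg1 and arg2.';
docMap('EXAMPLES') = 'test(1, 2)';   % extra section, no parser change needed

% At the end of parsing, check that the minimal sections are present.
requiredSections = {'SYNTAX', 'DESCRIPTION'};   % assumed minimal set
isPresent = isKey(docMap, requiredSections);
missingSections = requiredSections(~isPresent);
if ~isempty(missingSections)
    error('parse_doc:missingSection', 'Missing section(s): %s', ...
        strjoin(missingSections, ', '));
end
```

Compared with a struct, the keys do not have to be valid Matlab identifiers, so a header such as 'SEE ALSO' could be stored unchanged.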

gab-berest · 2020-07-02T22:48:14Z

parser/parse_doc.m

+% Initialize a cell that will receive the lines in the description
+functionDoc = [];
+
+textLine = fgetl(functionTxt);


This line is duplicated with the one at line 55. Would it be possible to call fgetl at the beginning of the while loop so you have it only once?

The condition for my while loop is length(textLine) >= 1 so I need to call it before entering the loop...

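I see. I just really don't like duplicated code. You could also test in an if statement with a break command:
while true
    textLine = fgetl(functionTxt);
    % Leave the loop on the first empty line, or at end of file
    % (where fgetl returns -1 instead of text)
    if ~ischar(textLine) || isempty(textLine)
        break;
    end
    ...
end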

gab-berest · 2020-07-02T22:50:15Z

tests/test_parse_doc.m

@@ -0,0 +1,19 @@
+function test_parse_doc


In my opinion, there should be more tests. This test coverage is really minimal. When we merge, we should be certain that every way to use (or misuse) the function is tested and working (especially for parsing functions).

I just wanted to draft something here to understand how unit testing works (never did it before). But I totally agree that a maximum of things have to be covered during the test. 👍

gab-berest · 2020-07-02T22:54:49Z

tests/test_parse_doc.m

+%
+% That function makes sure that parse_doc is working properly and returns
+% the expected structure for the documentation.
+cd('../parser')


There will be an error if the function is run from a directory other than the one tested (and in Matlab, it is possible to run a script or function from a different directory). Maybe you should check with the pwd function and work out the correct way to get to the place you want. You could also check this link to learn more about finding an item's location in Matlab:
https://www.mathworks.com/help/matlab/ref/which.html
This way your code could be more robust.

I think you should consider removing this line entirely and make the code work whether this is called from anywhere.

Yes, that's something I was trying to do, but I could only find the path to the function using which if I was already in the right folder (not really useful in our case...). One solution would be to add helpDocMd/parser/ to the Matlab path, but that would also require knowing the full path to it. @po09i do you know how this is usually done with unit testing ?

The which command in Matlab does this. It returns the absolute path of the file you asked. See the link I added in my previous comment.

There should be a startup.m script like the one in shimming-toolbox which adds all the source files to the Matlab path. This script would be launched when running the tests. So my point of view is you should assume all of helpDocMd is on the Matlab path. If it is not yet created in this repo, I think you should create it.
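
A minimal sketch of such a startup script, assuming it is saved as a file at the root of the repository and that the sources live in a parser/ subfolder (both are assumptions; the script in shimming-toolbox may well differ):

```matlab
% startup.m (hypothetical), saved at the root of the repository.
% mfilename('fullpath') gives the location of this file, so the result
% does not depend on the current working directory.
repoRoot = fileparts(mfilename('fullpath'));
parserDir = fullfile(repoRoot, 'parser');

% Only touch the path if the folder is really there.
if exist(parserDir, 'dir') == 7
    addpath(genpath(parserDir));
end
```

Once the folder is on the path, the test no longer needs `cd('../parser')`, and `which('parse_doc')` returns the absolute location of the file from any directory.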

gab-berest

I added some reviews.

gab-berest · 2020-07-02T23:33:50Z

parser/parse_doc.m

+            docStruct.description = strjoin(strip(functionDoc(sectionStart:sectionEnd)),' ');
+        case 'INPUTS'
+            section = functionDoc(sectionStart:sectionEnd);
+            docStruct.inputs.names = strip(section(cellfun('isempty', strfind(section,'     '))));


Try to be careful with embedding functions in other functions. A lot of parentheses can make the code less readable. Sometimes it is better to split the expression over multiple lines.
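
For instance, the names line could be split along these lines. This is only a sketch: it assumes the section is a string array with the leading '%' already removed, and it uses `contains` in place of the `cellfun`/`strfind` combination.

```matlab
% Example INPUTS section lines (assumed format, '%' already removed).
section = ["   arg1", "     Scalar", "   arg2", "     Scalar"];

% Description lines are indented by at least five spaces.
isDescriptionLine = contains(section, "     ");

% The remaining lines hold the argument names.
nameLines = section(~isDescriptionLine);
inputNames = strip(nameLines);

disp(inputNames);
```

Each intermediate variable names one step, which also makes it easier to inspect in the debugger.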

gaspardcereza added 4 commits June 25, 2020 19:46

created a script to parse the Matlab doc

fc3d0a9

deleted .asv files

8ff1554

documented parse_doc.m

4687abb

added spaces in the doc template

9d2cc67

gaspardcereza added the enhancement New feature or request label Jun 26, 2020

gaspardcereza self-assigned this Jun 26, 2020

Added capacity to handle uncomplete documentations

bc33759

rtopfer mentioned this pull request Jul 1, 2020

Established convention for docstrings shimming-toolbox/shimming-toolbox-matlab#145

Closed

gaspardcereza marked this pull request as ready for review July 2, 2020 14:13

jcohenadad reviewed Jul 2, 2020

jcohenadad reviewed Jul 2, 2020

jcohenadad mentioned this pull request Jul 2, 2020

Implement a function that generate the .md documentation file from the parsed structure. #21

Open

created unit test function for parse_doc

adf9958

gab-berest self-assigned this Jul 2, 2020

gab-berest reviewed Jul 2, 2020

gab-berest removed their assignment Jul 2, 2020

gaspardcereza added 3 commits July 3, 2020 09:42

changed the variable name functionTxt to functionFile

333a304

Added error messages in unit test

e6e3348

added variable type in string array declaration

005394e
