Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAOC 2023: replace libdparse with dmd #589

Draft
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

snprajwal
Copy link

This PR tracks the commits for replacing libdparse with dmd. Unlike with other projects like D-Scanner, dfmt generates the AST only once (in the main function), and uses it across all 5 files. Since these files are tightly coupled together, making this change across multiple PRs is not possible as it will lead to a broken master branch. Instead, this PR will track every change in independent commits, and will be marked ready once dfmt+dmd works exactly how dfmt+libdparse previously did.

NOTE: This PR is work in progress, and will most likely be broken for the foreseeable future.

cc @RazvanN7

dub.json Outdated Show resolved Hide resolved
@snprajwal snprajwal force-pushed the saoc-2023 branch 2 times, most recently from c3a276f to 14a2326 Compare November 6, 2023 18:12
Copy link
Member

@WebFreak001 WebFreak001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to keep the changes of formatter.d in there and not rename it to ast.d, so the diff can keep all the indentation related stuff if possible?

@snprajwal
Copy link
Author

is it possible to keep the changes of formatter.d in there and not rename it to ast.d

I've not renamed formatter.d, I've just removed the code that will not be used anymore. ast.d has been introduced as a new file since it deals with walking the AST and doesn't fit into any of the existing files. formatter.d currently only contains the driver code for invoking the AST walk.

so the diff can keep all the indentation related stuff if possible?

I'm not sure I understand what you're referring to as "indentation related"...

Signed-off-by: Prajwal S N <[email protected]>
- dfmt_space_before_function_parameters
- dfmt_space_after_cast
- dfmt_align_switch_statements
- dfmt_space_before_aa_colon

Signed-off-by: Prajwal S N <[email protected]>
@WebFreak001
Copy link
Member

ah sorry I didn't actually check the diff too much there and was thinking about the layout inside libdparse, where there is also a formatter that works through the visitor pattern like this one does.

I think in the end the changes here will be a useful tool, being accurate with the compiler library, but I don't think the new code should go into this dfmt repository. I think it would be good to have this as a second repository in parallel that could support the same settings and reuse code such as the editorconfig parsing from dfmt, however the differences in how the two formatters are formatting code are too big to upgrade to with old code bases. Additionally new language features will require upgrade and maintenance each time, since otherwise code will be completely unable to parse/format, which with dfmt often just results in new features not looking good, but the rest still working as expected, so in the future it might still be a desired stable formatter that just doesn't get the latest and greatest features like this dmd replacement does.

Instead I think this new dmd-based formatter is something we can introduce slowly to new projects once it's ready. It fundamentally works differently to dfmt since it's actually rewriting the code based on the AST instead of just inserting/removing whitespace like what dfmt does. This of course also adds new opportunities, like being able to reorder imports alphabetically while formatting and adding/removing braces / shortened methods to be consistent with project settings - features which would never make it into dfmt with its current architecture.

For now I think it's good to keep it in this PR here, just for people to also look at it, but eventually we will probably want a new repo for it, for example calling it dmdfmt.

@RazvanN7
Copy link
Contributor

RazvanN7 commented Nov 15, 2023

@WebFreak001 Our aim is to have the same functionality as the current dfmt, so there shouldn't be any breaking changes once this project is finished. However, I don't really understand how using dmd as a lib requires more upgrade & maintenance time as opposed to the status quo. Let's say a new feature is added in dmd. If you want to upgrade dfmt right now, you need to add some code to libdparse and then potentially add some code to dfmt to support the new case. With dmd as a lib, you will most likely need to update to a newer version and I suspect that in most cases you won't need to add anything new to the formatter or you would just need to update the dfmt code that uses the visitor classes a bit. All in all, I expect the changes to require less time than what you need to do now.

Now, regarding old code bases: dfmt gets updates from time to time. There is nothing that forces you as a user to use the latest version. If you have an old code base, then it makes sense to use an old compiler and an old version of the formatter. If you are using new features, then you need a newer compiler and a newer formatter. The versioning system in dub (or manual versioning) should take care of this.

My personal opinion is that we should just upgrade dfmt and make sure that people can select the proper version of the formatter that is in sync with the compiler version that they are using. However, if you fear breaking changes, one other way we can do it is to just create a branch with dmddfmt that goes in parallel with the old formatter. That way, people can select the branch that they are interested in. Having 2 separate repos that essentially do the same thing is not something that I am attracted to - we will need to always backport dfmt-frontend additions just to keep in sync.

@WebFreak001
Copy link
Member

However, I don't really understand how using dmd as a lib requires more upgrade & maintenance time as opposed to the status quo

currently dfmt needs almost no changes in case of a compiler change, since it mostly doesn't use AST information for formatting. (although it does need it to recognize some things like locations of struct initializers and AAs, where libdparse updates can help in case of new syntax preceding them)

With the new code every AST node seems to have a format function, so every introduction or change to the code how it can be represented as AST needs to be covered by the format functions here.

With old dfmt, new syntax would at worst break through whitespaces, but you could disable it with //dfmt off/on and the rest would still work. With this new approach we need to release a new formatter for every syntax change added, since it seems it needs to implement an AST -> formatted code output visitor method. Also this means that if we allow things like omitting parentheses, e.g. the new @int syntax, it might always be formatted into @(int) until new things are supported.

Since the formatting algorithm and procedure itself is really so different from what dfmt currently does, I would really say it would be better to have it a separate project because there are going to be a bunch of people who will want the more stable old token-based approach instead of something that literally rewrites the AST each time. Maintaining old versions is something we can't do just because of missing maintainers and we don't have the branches setup for it - no user of old dfmt wanting to change how it works will try to touch the old dfmt code branch if it's not master.

@maxhaton
Copy link
Contributor

maxhaton commented Dec 1, 2023

Is this working well enough to start getting stuff into the main tree?

@maxhaton
Copy link
Contributor

maxhaton commented Dec 1, 2023

@RazvanN7 I'd like to try getting the tools into the dmd repo at some point.

@snprajwal
Copy link
Author

Is this working well enough to start getting stuff into the main tree?

It largely works, but there's still some work left. It does not retain comments at the moment, and also does not allow all the configurations that dfmt supports. I'm working on hitting feature parity. I guess it's @RazvanN7's call to make about how this will be merged (into a different branch/a new repo/something else)

@WebFreak001
Copy link
Member

I think retaining comments is a very important issue to solve first before getting it to users, extending the feature support can be done later as well

@snprajwal
Copy link
Author

I think retaining comments is a very important issue to solve first before getting it to users, extending the feature support can be done later as well

Agreed, I've been working on it across the last couple of days. It requires some changes in DMD that I'll have to make before adding support into dmdfmt for it, so it'll take some time.

@maxhaton
Copy link
Contributor

maxhaton commented Dec 2, 2023

I think retaining comments is a very important issue to solve first before getting it to users, extending the feature support can be done later as well

Agreed, I've been working on it across the last couple of days. It requires some changes in DMD that I'll have to make before adding support into dmdfmt for it, so it'll take some time.

Can you put your findings either here or in the D slack. I'm pretty sure lots of people can help you (including me) do this so make sure you get in touch.

Feels like deja Vu, it's come up in the past.

@snprajwal
Copy link
Author

Here's the situation right now:

  • DMD lexes comments, but does not store normal comments. It creates a TOK.comment token and moves on with the lexing (see this line. It looks like it doesn't even create the token if we're using DMD as a library. It only stores the comment text if it is a doc comment (///, /** **/, or /++ ++/) -> see this line.
  • The doc comments stored in the previous step get added to the respective lang item nodes in the AST during parsing in ~78 places in parse.d

One way to store comments is to just change the lexing logic to store all types of comments rather than just doc comments. I don't think this is a feasible solution since it'll require a lot of changes across the rest of the compiler, especially in the DDoc generation logic. It also bloats the size of the AST during parsing with information that is effectively useless to the compiler.

Another way is to introduce a new field into struct Token, something like rawComment. It will store the raw comment strings that the lexer encounters, including the prefix and delimiter (//, /* */, etc). This will again require some change to be made to the parser to propagate this info into the AST nodes, but should not break anything that is currently present. This will allow us to simply dump the raw comment string into the output stream during formatting. I'm still not sure if this is the way to go, or if there is an even easier solution that I'm not able to think of.

The objectively "correct" way to solve this would be to introduce a new stage in the parsing phase where we build a concrete syntax tree (CST). This is the AST enriched with additional information including whitespaces, line breaks, comments, offset in the source file, etc. This CST can be used in the formatter to generate better context-aware output. The compiler will just have another pass that strips all of this info from the CST and reduces it to the current AST (before the semantic passes are run). Again, I'm not sure if this is the way to go, owing the the complexity of implementing this, potential of breaking the existing compiler pipeline, and adding additional overhead into the parsing phase.

@WebFreak001
Copy link
Member

in libdparse we solved this by Token having leading and trailing Trivia[] attached to it. Trivia is just tokens of type whitespace and comment (and some other skipped things)

in libdparse each AST node has an array of tokens it was constructed from, so you can eventually go all the way down to the trivia.

This is copied from Roslyn's design and makes it possibly to do a 1:1 reconstruction of the input code from the AST, even after modifying parts of it.

@maxhaton
Copy link
Contributor

maxhaton commented Dec 4, 2023

in libdparse we solved this by Token having leading and trailing Trivia[] attached to it. Trivia is just tokens of type whitespace and comment (and some other skipped things)

in libdparse each AST node has an array of tokens it was constructed from, so you can eventually go all the way down to the trivia.

This is copied from Roslyn's design and makes it possibly to do a 1:1 reconstruction of the input code from the AST, even after modifying parts of it.

The way Roslyn does this is the way this should be done indeed although I'm not sure if it's exactly the same as having some Trivia bolted on (or maybe it is in practice?)

@snprajwal
Copy link
Author

I discussed this with Razvan, and we both feel that the correct way for us to go ahead would be to store this data correctly during the parsing stage and provide it through the AST, similar to having a trivia[] field. I'm working on adding this into DMD, and will open a PR once I make sure it works with dfmt.

On a tangent, with regards to the question about merging the work in this PR, I don't think we've reached a consensus yet on whether it should go into a separate dmdfmt project repo, or a separate branch in this repo.

@WebFreak001
Copy link
Member

since the functionality is so drastically different from how dfmt works I would suggest creating a separate project. Or alternatively it might be better to just have both the libdparse and dmd based formatters in dfmt, but be able to switch between them using a CLI switch like --engine=dmd, since it going to support the dfmt formatting flags as well.

The non-conditional styles for `dfmt_template_constraint_style` are
supported for now. The conditional ones will require line length
tracking, which is yet to be implmented.

Signed-off-by: Prajwal S N <[email protected]>
@RazvanN7
Copy link
Contributor

@WebFreak001 I brought the issue of merging dmdlib with dfmt at our Foundation Monthly Meeting and the general consensus was to integrate this work into the mainline dfmt and have dfmt be part of the release archive. That way dfmt is going to be tied with a compiler version and everytime you update the compiler you also get the latest version of dfmt. How does that sound?

@maxhaton
Copy link
Contributor

maxhaton commented Dec 14, 2023 via email

@WebFreak001
Copy link
Member

as long as we keep both engines (libdparse and dmd) around for a while that sounds like a good idea, it's just that we will still want to support the old dfmt users for a while especially while early issues in the dmd issues are still being sorted out.

We don't want to just introduce completely different formatting behavior onto full projects that have relied on dfmt for a long time and are suddenly getting formatted differently (it will just make the git history of all these projects a mess as well)

src/dfmt/ast.d Outdated
import dmd.rootobject;
import dmd.target;
import dmd.root.string : toDString;
import dmd.optimize : optimize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is breaking CI for some reason.

src/dfmt/ast.d Outdated
import dmd.rootobject;
import dmd.target;
import dmd.root.string : toDString;
import dmd.optimize : optimize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import dmd.optimize : optimize;

looks like it can be zapped

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed for calling Expression.optimize, which was moved out of expression.d into optimize.d in a recent PR. It breaks CI because I don't think there's been a release with the change yet, I compile dmdfmt with the master branch of dmd locally so I'm sometimes ahead of the CI pipeline. I'll update dub.json to use v2.107.0-beta.1, that should fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you call .optimize specifically?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and where

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called here and here. It was introduced to fix https://issues.dlang.org/show_bug.cgi?id=7375, and was taken from hdrgen.d in DMD. Now that I look closer at it, it's a compile time check, so shouldn't be needed in dfmt. I'll get rid of it.

README.md Outdated
* Run ```git submodule update --init --recursive``` in the dfmt directory
* To compile with DMD, run ```make``` in the dfmt directory. To compile with
LDC, run ```make ldc``` instead. The generated binary will be placed in ```dfmt/bin/```.
* Run `git submodule update --init --recursive` in the dfmt directory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move these changes into a different PR if this is intentional.

.editorconfig Outdated
insert_final_newline = true
max_line_length = 120
tab_width = 8
dfmt_single_template_constraint_indent = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these changes here? Intentional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the configs that were removed were redundant, because they're the defaults. The two lines that were added were to maintain consistency between multi-line formatting across the codebase (the default behaviour uses two tabs to indent, which is somewhat unidiomatic, imo). There's also a discussion on the forum about whether to keep this knob or not on dmdfmt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its good practice (for good reason) to make PRs as small as possible. If you want to change this it should be a separate pull request for that reason.

Comment on lines +1 to +27
DMD_ROOT_SRC := \
$(shell find dmd/compiler/src/dmd/common -name "*.d")\
$(shell find dmd/compiler/src/dmd/root -name "*.d")
DMD_LEXER_SRC := \
dmd/compiler/src/dmd/console.d \
dmd/compiler/src/dmd/entity.d \
dmd/compiler/src/dmd/errors.d \
dmd/compiler/src/dmd/errorsink.d \
dmd/compiler/src/dmd/location.d \
dmd/compiler/src/dmd/file_manager.d \
dmd/compiler/src/dmd/globals.d \
dmd/compiler/src/dmd/id.d \
dmd/compiler/src/dmd/identifier.d \
dmd/compiler/src/dmd/lexer.d \
dmd/compiler/src/dmd/tokens.d \
dmd/compiler/src/dmd/utils.d \
$(DMD_ROOT_SRC)

DMD_PARSER_SRC := \
dmd/compiler/src/dmd/astbase.d \
dmd/compiler/src/dmd/parse.d \
dmd/compiler/src/dmd/parsetimevisitor.d \
dmd/compiler/src/dmd/transitivevisitor.d \
dmd/compiler/src/dmd/permissivevisitor.d \
dmd/compiler/src/dmd/strictvisitor.d \
dmd/compiler/src/dmd/astenums.d \
$(DMD_LEXER_SRC)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might dmd -i be useful here? This makes you quite brittle towards upstream changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm not really sure about keeping these changes, because they refused to work once I started using ASTCodegen. I tried a bunch of things to fix it, but didn't manage to make it work. dmdfmt currently works with dub build, but the Makefile itself is pretty broken. Even with every single D file in DMD included, it would die during linking.

Comment on lines -9 to -12
import dparse.lexer;
import dparse.parser;
import dparse.rollback_allocator;
import dfmt.ast_info;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep libdparse around for now, abstract your additions into some kind of API.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean libdparse and dmd should be a switchable flag in dfmt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both for redundancy and because it will make your diff smaller. It's much easier to merge when you are adding rather than replacing.

Some breaking changes in upstream borked the AST walker, it's been fixed
in this commit.

Signed-off-by: Prajwal S N <[email protected]>
Signed-off-by: Prajwal S N <[email protected]>
Signed-off-by: Prajwal S N <[email protected]>
_sigh_ should've done this at the beginning, but better late than never.

Signed-off-by: Prajwal S N <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants