[proposal] Introducing TextMate grammar system for syntax highlighting ? #9087

skywind3000 · 2021-11-04T05:23:45Z

Current problem

The current syntax highlighting system is very slow, and there are noticeable lags when scrolling large C++ files which contain complex syntax elements.

Issues of tree-sitter

Previously, most people suggest something like nvim-treesitter which will analyze source code in a background treesitter process and render keywords in the foreground with text-property.

But is it a good idea? I don't really think so. There are at least 4 disadvantages to treesitter solutions:

power consumption: an extra background job is required, which means shorter battery life and more carbon dioxide.
buffer sync: code must be written very carefully to guarantee that the source code in the two processes (vim/treesitter) is the same. coc/nvim-treesitter sends the whole buffer to the background every time changetick increases to prevent mismatches, which is a little flaky.
reliability: an external program installed by the user is not reliable enough. There are plenty of issues with version compatibility and environment misconfiguration, and people need to make an extra effort to get syntax highlighting working.
poor parser quality: treesitter is a great project, but language parsers are contributed by people all over the world and their quality is not under control (performance issues or inaccurate results in certain languages).

Background syntax highlighter is still immature, there are still many other strange issues in nvim-treesitter:

https://github.com/nvim-treesitter/nvim-treesitter/issues

If we introduce something like this, we shall take all these issues into account.

TextMate grammar system

Syntax highlighting is the most important part of an editor, better not rely on any uncontrollable external programs.

We need something new that satisfies the goals below:

good performance
robust and reliable
accuracy
low power consumption
native and work in the same process (not require external programs)

TextMate's grammar engine is a really good candidate: it is widely used in many IDEs and editors, including vscode (see syntax-highlight-guide for details), sublime and many others.

VS Code uses TextMate grammars as the syntax tokenization engine. Invented for the TextMate editor, they have been adopted by many other editors and IDEs due to the large number of language bundles created and maintained by the Open Source community.

TextMate grammars rely on Oniguruma regular expressions and are typically written as a plist or JSON. You can find a good introduction to TextMate grammars here, and you can take a look at existing TextMate grammars to learn more about how they work.

A grammar can be defined in JSON, which means it can be translated into Vim script or kept as a plain JSON file.
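As a rough illustration of how a regex-driven grammar turns a line into scoped tokens, here is a minimal sketch in C. It is not how vscode-textmate works internally: it assumes POSIX `regex.h` (extended regular expressions) in place of Oniguruma, and the `Rule` table, the scope names and the `tokenize_line` function are all invented for this example. Real TextMate grammars also have begin/end rules, captures and nested scopes, which are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: a POSIX ERE whose first group is the token, plus a scope name. */
typedef struct {
    const char *pattern;
    const char *scope;
} Rule;

/* Illustrative rules only; earlier rules win, like an ordered pattern list. */
static const Rule RULES[] = {
    { "^(//.*)",                                  "comment.line"     },
    { "^(\"[^\"]*\")",                            "string.quoted"    }, /* no escapes */
    { "^(if|else|while|return)([^A-Za-z0-9_]|$)", "keyword.control"  },
    { "^([A-Za-z_][A-Za-z0-9_]*)",                "identifier"       },
    { "^([0-9]+)",                                "constant.numeric" },
};
enum { NRULES = sizeof RULES / sizeof RULES[0] };

typedef struct {
    size_t start;      /* byte offset of the token in the line */
    size_t len;        /* token length in bytes */
    const char *scope; /* scope name of the rule that matched */
} Token;

/* Tokenize one line (no trailing newline). Returns the number of tokens
 * written to `out` (at most max_tokens), or -1 if a pattern fails to compile.
 * Patterns are compiled on every call to keep the sketch short. */
int tokenize_line(const char *line, Token *out, int max_tokens)
{
    regex_t compiled[NRULES];
    int n = 0;

    assert(line != NULL && out != NULL);
    for (int i = 0; i < NRULES; i++) {
        if (regcomp(&compiled[i], RULES[i].pattern, REG_EXTENDED) != 0) {
            while (i-- > 0)
                regfree(&compiled[i]);
            return -1;
        }
    }

    size_t pos = 0, len = strlen(line);
    while (pos < len && n < max_tokens) {
        int matched = 0;
        for (int i = 0; i < NRULES; i++) {
            regmatch_t m[2];
            /* '^' anchors each rule at the current position. */
            if (regexec(&compiled[i], line + pos, 2, m, 0) == 0 && m[1].rm_eo > 0) {
                out[n].start = pos;
                out[n].len = (size_t)m[1].rm_eo;
                out[n].scope = RULES[i].scope;
                n++;
                pos += (size_t)m[1].rm_eo;
                matched = 1;
                break;
            }
        }
        if (!matched)
            pos++; /* whitespace or punctuation: no scope assigned */
    }

    for (int i = 0; i < NRULES; i++)
        regfree(&compiled[i]);
    return n;
}
```

With these rules, `tokenize_line("if (x1) return 42; // done", toks, 16)` produces five tokens: two keywords, an identifier, a number and a line comment. A JSON grammar would carry the same information as data (pattern plus scope name) rather than as a C table.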

Possible Solution

We can specify which grammar engine to use for the given buffer:

default engine: current vim's regex grammar
textmate engine: textmate grammar system.

And some new commands can be used to change the grammar engine:

:syntax grammar textmate
:syntax grammar default
:syntax load ~/.vim/syntax/cpp.json

For example, the snippet below can be included at the head of syntax files:

if has('textmate')
    syntax grammar textmate
    syntax load syntax/cpp.json
    finish
endif

....

And lots of existing vscode/textmate syntax files can be reused with minimal modification.

brammool · 2021-11-04T11:18:05Z

Thank you for starting this discussion. I had a vague plan to look into integrating treesitter, it is good to know it also has disadvantages. Vscode is widely used, thus if it uses TextMate then there must be something good about it.

Comments welcome.

clason · 2021-11-04T12:59:02Z

I'll just comment that I would take these comments about tree-sitter with a significant heap of salt.

bfredl · 2021-11-04T13:18:53Z

There might be some misunderstanding here. Tree-sitter in neovim doesn't use an external process like coc.nvim. The parser runtime is a C library embedded into the editor itself (in total not more LOC than syntax.c + highlight.c in vim itself). It parses the buffer in memory without a copy and produces a syntax tree that in-process plugins can use (for highlighting but also for other purposes like text objects).

imranZERO · 2021-11-04T17:09:01Z

Right now the biggest problem with syntax highlighting is how inconsistent and unpredictable it is among different languages. A unified interface will be more than worth the effort.

TextMate will probably be better for keeping the syntax system more integrated & backwards compatible than using something like treesitter. Also the modular and overengineered plugin architecture of treesitter would be a huge departure from the way it is done right now, so we should be a little cautious about how much functionality to reimplement.

bfrg · 2021-11-05T13:17:06Z

@bfredl How much longer does it take to load a larger file like src/evalfunc.c in Neovim when tree-sitter is enabled, compared to the default syntax highlighting? I'm assuming that default syntax highlighting is disabled for filetypes where tree-sitter is supported.

bfredl · 2021-11-05T13:59:24Z

@bfrg src/evalfunc.c from vim (10 000 lines) takes 80 ms more time with tree-sitter enabled for the initial parse (--startuptime 200ms compared to 120ms in my config)

mg979 · 2021-11-06T12:45:18Z

treesitter is more than just syntax highlighting, it's also useful for text objects for example.

TextMate system is old, Sublime Text has been mentioned but it left it years ago to use its own syntax engine. Does it make sense to adopt a system that is already waning? And how big is its library if it must be included?

Also when saying that a system is more performant, some source/benchmark should be provided. Is TextMate more performant than treesitter? Who says so?

skywind3000 · 2021-11-06T17:09:44Z

@bfredl , thanks for figuring it out, and I made a new revision:

list of tree-sitter disadvantages for syntax highlighting:

power consumption: yes, tree-sitter is powerful and generates an AST in real time, but I am just talking about syntax highlighting, not textobj or indentation. Do we really need an AST just for syntax highlighting? If someone really cares about semantic highlighting, they can use coc & lsp. AST generation has its price (see Horrible input lag on 2MB C header file. nvim-treesitter/nvim-treesitter#1292).
reliability: nvim-treesitter needs to load an external shared library as the parser for each language, and the shared library must be downloaded and compiled into .so files (I know :TSInstall can simplify these steps). The build process can break if gcc/clang is not installed, and the plugin or neovim itself may break due to any of the common dynamic link library problems, e.g. version incompatibility when the plugin has been updated but the parser .so files have not, or dependency conflicts when loading the shared library.
poor parser quality: tree-sitter is a great project, but language parsers are contributed by people all over the world and their quality is not under control (performance issues or inconsistent behavior in different languages).

The biggest risk is the horrible parser quality, with more than 100 open issues for parsers:

https://github.com/nvim-treesitter/nvim-treesitter/issues?q=is%3Aissue+is%3Aopen+parser

examples for quality issues:

examples for performance issues:

The parser quality problem is totally out of control, nearly impossible for us to fix all the parsers one by one.

It appears that the performance issues of traditional vim syntax highlighting are not fixed by tree-sitter, but more issues are introduced by it. Therefore, syntax highlighting should not rely on such unreliable parsers.

One more thing: parsers are not only hard to implement but also hard to extend or improve. Most of the time, only the author can do this. JSON syntax files are much easier for everyone.

One last thing, parser binaries are big in size:

The average size of each parser is around 500KB and vim 8.2.3582 has 644 syntax files, so you would need about 322MB of extra disk space (644 × 500KB) to store the parser binaries.

Vim is shipped everywhere, including systems that do not have gcc/clang compilers installed and systems with limited storage space. That means it is hard to build the parsers on demand or to ship pre-built binaries with vim itself.

skywind3000 · 2021-11-06T18:01:26Z

@mg979 For modern editors, regex based syntax highlighting is still the foundation, and semantic highlighting is just the decoration. I am talking about the foundation not the decoration.

Also when saying that a system is more performant, some source/benchmark should be provided. Is TextMate more performant than treesitter? Who says so?

I think I made my point in the previous post: #9087 (comment)

Everyone knows, but nobody dares to say, that NeoVim users are struggling in the mud of treesitter parsers right now; there is no need for vim users to go through the same thing. If someone really cares about semantic syntax highlighting, indentation or textobj, they can still use LSP, which does a good job at all of these. There are plenty of LSP solutions for both vim & neovim.

Let's get back to textmate. The core part of the textmate syntax system is oniguruma, which is open source and well maintained by the community.

known editors / ides supporting textmate grammar:

vscode
textmate itself
eclipse
jetbrains

Syntax rendering in the editors/IDEs above is really smooth and has been proven by both time and a massive user base, while treesitter is still in testing and can freeze nvim when parsing large files. (also check the performance issues in the previous post)

Monarch was initially built to support languages in VS Code. Then they decided to switch to TextMate as well, for the reasons listed above: microsoft/vscode#174 (comment) .

Some details:

VS Code's tokenization engine is powered by TextMate grammars. TextMate grammars are a structured collection of regular expressions and are written as a plist (XML) or JSON files. VS Code extensions can contribute grammars through the grammar contribution point.

The TextMate tokenization engine runs in the same process as the renderer and tokens are updated as the user types. Tokens are used for syntax highlighting, but also to classify the source code into areas of comments, strings, regex.

Starting with release 1.43, VS Code also allows extensions to provide tokenization through a Semantic Token Provider. Semantic providers are typically implemented by language servers that have a deeper understanding of the source file and can resolve symbols in the context of the project. For example, a constant variable name can be rendered using constant highlighting throughout the project, not just at the place of its declaration.

Highlighting based on semantic tokens is considered an addition to the TextMate-based syntax highlighting. Semantic highlighting goes on top of the syntax highlighting. And as language servers can take a while to load and analyze a project, semantic token highlighting may appear after a short delay.

it is easy to implement textmate syntax highlighting

The tokenizer of vscode/textmate is:

https://github.com/kkos/oniguruma

And here is the wrapper in javascript, it's neatly written and not hard to understand:

https://github.com/microsoft/vscode-textmate

All we need to do is rewrite the javascript wrapper in C:

No more than 4854 lines (including comments) in javascript/typescript

And thousands of textmate syntax files are ready to use.

lacygoill · 2021-11-06T19:03:26Z

No more than 4854 lines (including comments) in javascript/typescript

Tests excluded, it's 3779 lines of code (source: cloc(1)).
Tests included, it's 5074 lines of code.

lacygoill · 2021-11-06T19:05:07Z

Why not Sublime grammar instead of TextMate grammar? It seems more powerful, and easier to read.

I think .sublime-syntax is more easy to write and readable.

source

Sublime text 3 has implemented a new grammar format that seems much better than the traditional textmate grammar.

source

Is it because there have been fewer .sublime-syntax files written than .tmLanguage ones? Is there a licensing issue with these files?

skywind3000 · 2021-11-06T19:11:35Z

@lacygoill maybe textmate grammar is a little easier ? because there are reference implementations:

But sublime is closed source ? we need write everything from scratch ??

edit: if sublime 3 grammar is also based on oniguruma, maybe things can become a little easier ? anyway, both textmate/sublime solution are better than tree-sitter.

lacygoill · 2021-11-06T19:30:44Z

But sublime is closed source ? we need write everything from scratch ??

Good point. I forgot that sublime was closed source.

Is TextMate much better (readability, reliability, performance) than our current syntax highlighting mechanism?

Just for TypeScript alone, there have been 754 reported bugs, 41 remaining open currently.

Assuming we support TextMate, what would happen to our current issues related to syntax highlighting? Do we close them, and tell their authors to use the new syntax highlighting mechanism? If the users find issues in TextMate grammar files, do we accept their reports on this bug tracker? IOW, is it going to help reduce the number of remaining open issues here?

skywind3000 · 2021-11-06T19:47:49Z

Because TypeScript is a new language that evolves quickly ?

Oniguruma + json like config certainly has better performance and reliability than current vim's mechanism. People seldom encounter such issues in syntax highlighting when using textmate/sublime2/vscode/eclipse/jetbrains.

Sublime's grammar seems more readable and powerful than textmate, maybe oniguruma+config can achieve such thing ?

lacygoill · 2021-11-06T22:31:39Z

I remember an issue where Vim was very slow when adding/removing text properties on CursorMoved. It only occurred while the syntax highlighting was enabled. So, one might think that the latter was the culprit. It turns out that the syntax highlighting was fine; the issue was Vim redrawing the screen too much.

With regard to how people perceive the current syntax highlighting as being too slow, I wonder which part of the issue comes from the syntax highlighting itself, and which part from something else (like too much redrawing).

People seldom encounter such issues in syntax highlighting when using textmate/sublime2/vscode/eclipse/jetbrains.

That's interesting. I hope it's really thanks to their own syntax highlighting mechanism, and not some other optimizations (like multithreading).

mg979 · 2021-11-07T11:42:37Z

A couple of remarks:

with vim system it's really easy to add custom groups to extend current syntax in after/syntax, would it be possible to do that with TextMate as well?
as @lacygoill said, there could be other bottlenecks (too much redrawing), that would limit TextMate performance in the same way, isn't it better to investigate those first?
programs with GUI use multithreading and this surely helps them
sometimes a slow syntax highlighting depends on how (bad) the syntax script is written (for example default vimscript syntax is obscenely slow), and it would be faster with some changes in the script

I think performance of vim syntax highlighting could be improved before trying alternatives, for example:

there are known problems with folding, it would help to fix those
how much of the syntax is recalculated in insert mode? I think only the part of text from the insertion point up to the last visible line in the window should be recalculated, is this the case or does vim do a full update on every keystroke?
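To make the second point concrete, here is a minimal sketch in C of one way to bound the work after an edit. It says nothing about what Vim actually does today. The idea assumed here is to cache the lexer state at the end of every line, rescan from the edited line, and stop as soon as a line's recomputed end state equals the cached one or the last visible line has been passed. The names `LineState`, `scan_line` and `rehighlight` are hypothetical, and the only state tracked is "inside a block comment".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-line lexer state: 1 if the line ends inside a block comment. */
typedef int LineState;

/* Return the state at the end of `line`, given the state at its start.
 * Only C-style block comments are tracked in this sketch. */
static LineState scan_line(const char *line, LineState start)
{
    LineState s = start;
    for (size_t i = 0; line[i] != '\0'; i++) {
        if (!s && line[i] == '/' && line[i + 1] == '*') {
            s = 1;
            i++;
        } else if (s && line[i] == '*' && line[i + 1] == '/') {
            s = 0;
            i++;
        }
    }
    return s;
}

/* Rescan after line `edited` changed. end_state[i] caches the state after
 * line i and is updated in place. Scanning stops when a recomputed end state
 * matches the cache (later lines cannot change) or after `last_visible`.
 * *dirty_from receives the first line whose cached state may be stale,
 * or nlines if everything converged. Returns the number of lines rescanned. */
size_t rehighlight(const char *const *lines, size_t nlines, LineState *end_state,
                   size_t edited, size_t last_visible, size_t *dirty_from)
{
    assert(edited < nlines && dirty_from != NULL);

    size_t rescanned = 0;
    LineState s = edited > 0 ? end_state[edited - 1] : 0;

    *dirty_from = nlines;
    for (size_t i = edited; i < nlines; i++) {
        if (i > last_visible) {
            *dirty_from = i; /* off-screen lines are left for later */
            break;
        }
        s = scan_line(lines[i], s);
        rescanned++;
        if (s == end_state[i])
            break; /* same end state as before: the rest is unaffected */
        end_state[i] = s;
    }
    return rescanned;
}
```

In this sketch, typing `/*` at the start of line 0 of a four-line buffer whose third line ends in `*/` rescans three lines and then stops, while an edit that does not change a line's end state rescans only that line.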

theHamsta · 2021-11-09T00:11:51Z

I want to add that we currently have no safe-guards for tree-sitter that are applied for regex-based highlighting like limiting the line number or doing background parsing like Atom would do.

Background syntax highlighter is still immature

I think background syntax highlighting (if you refer to asynchronous or separate threads highlighting) is neither implemented for tree-sitter nor for traditional vim highlighting. The possibility to make a fast thread-safe copy of the parsing state for tree-sitter or any other kind of multithreading is not used at the moment in Neovim.

Many of the issues you cited complained about features missing due to missing :h syntax. It will always be difficult to transition from one syntax system to another especially when it is so widely supported like vim syntax/fold/indent files. Maybe it would be easier to maintain more compatibility with a system that works more similar.

About the quality of the grammars, you surely have different trade-offs. VS Code has significantly more users than Atom and nightly Neovim. Tree-sitter parses the whole document, which can help with complex syntax constructs and large-scale structure. However, it gets confused more easily when it sees something that cannot be handled by the language grammar (preproc constructs or non-standard language extensions), while regexes with a more local view are often still ok. The error recovery capabilities vary a lot depending on how the concrete grammar is written. Tree-sitter provides something in between regex highlighting and LSP-like semantic highlighting, so it might not be necessary if the two latter are available for a language. Distributing binaries is another challenge for tree-sitter. Arbitrary code execution through custom scanners enables the highest flexibility but may also pose a security risk if the parsers are not self-generated and the scanner code is not reviewed.

andmis · 2021-11-09T19:55:09Z

For those who haven't seen it, this is an excellent introduction to Tree-sitter, by the author: https://www.youtube.com/watch?v=Jes3bD6P0To&ab_channel=StrangeLoopConference

tl;dr: Tree-sitter is a (portable, dependency-free) C library which (conceptually) takes a grammar (expressed in JavaScript) and a source file, and returns a parse tree for the source file with respect to the grammar. The big selling point is that TS (claims that it) can handle syntax errors well (still return a reasonable parse tree) and that it is incremental (returns new parse trees efficiently/quickly given some code edits and previous trees).

Parsers for different languages are provided by the community and while I haven't seen this first-hand, I find it easy to believe that many of them are not great. But the project is much younger than TextMate, and GitHub uses it for its on-web syntax highlighting so there might be some corporate support there.

Personally, the thing I would be most excited about seeing is Vim exposing a representation of the syntax tree which can be used not just for syntax coloring but also for semantic editing (expand visual selection one AST node up, copy function body, etc.). IDK how well the Vim architecture supports this today. But in theory you could then plug in whatever parse-tree-generator you choose (Tree-sitter or TextMate).

If you are using an LSP language server, it's true that the LS can give you a parse tree (one which is even more accurate, esp. in the case of context-sensitive grammars like C++), but a language server will always be slower (it will do more than a parser, for example it will resolve cross-file deps and so on) and therefore will have to be async and higher-latency. So I think there is room for both a fast incremental parse system (like Tree-sitter) and LSP support (for things like go-to-definition and find usage).

See also this discussion in the VSCode repo: microsoft/vscode#50140

fcurts · 2021-11-19T19:13:39Z

As someone who has spent months writing and maintaining TextMate and tree-sitter grammars for real-world languages, let me tell you that the TextMate grammar system is totally broken, at least from a 2021 perspective. TextMate grammars are a nightmare to maintain and impossible to get right. Out of desperation, I even developed my own macro system (just like the authors of TypeScript's TextMate grammar), and it was still a nightmare.

tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).

Betting on TextMate grammars in 2021 would be an engineering crime.

imranZERO · 2021-11-20T12:18:34Z

As someone who has spent months writing and maintaining TextMate and tree-sitter grammars for real-world languages, let me tell you that the TextMate grammar system is totally broken, at least from a 2021 perspective. TextMate grammars are a nightmare to maintain and impossible to get right. Out of desperation, I even developed my own macro system (just like the authors of TypeScript's TextMate grammar), and it was still a nightmare.

tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).

Betting on TextMate grammars in 2021 would be an engineering crime.

I am not sure how much of your hyperbolic speech can be deemed accurate, but from what I can see one of the biggest problem with tree-sitter is the general low quality of parsers contributed by different people as pointed out by the OP. "Top-notch" is not the way I would describe it. Which certainly needs to be taken into account as it would require a vast amount of effort to deal with these issues Vim would inherit as a result of undertaking the HUGE project of integrating tree-sitter.

I can't speak for textmate grammar for lack of familiarity. Personally my biggest problem with tree-sitter (at least the way neovim does it) is its dependency on the environment (gcc/clang), large binary size and the do-it-all mentality which suits neovim but definitely does not feel like the "vim way".

brammool · 2021-11-20T12:22:57Z

As someone who has spent months writing and maintaining TextMate and tree-sitter grammars for real-world languages, let me tell you that the TextMate grammar system is totally broken, at least from a 2021 perspective. TextMate grammars are a nightmare to maintain and _impossible_ to get right. Out of desperation, I even developed my own macro system (just like the authors of TypeScript's TextMate grammar), and it was still a nightmare. tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode). Betting on TextMate grammars in 2021 would be an engineering crime.

Thanks for your opinion. Making it easier/simpler/better to write a parser is an important goal. So we should look at the best way to use tree-sitter. That it compiles each parser into an executable seems like a disadvantage. Perhaps this is OK for often used languages, but a way to add a parser at runtime would be really useful.

jgb · 2021-11-20T13:07:52Z

tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).

If tree-sitter is top-notch, how come a ubiquitous and highly popular language like python has been broken in it for quite a while?
When I tested neovim 0.5.1 with tree-sitter I ended up having to disable TS for python (which is the language I use the most) because the indenting and highlighting were unusable. Doesn't exactly inspire confidence.

clason · 2021-11-20T13:22:35Z

I think this discussion is devolving more and more from the purely technical and into prejudices. It is very important here to distinguish

tree-sitter (the engine, which I would agree with @fcurts is an excellent piece of software and fundamentally superior to other syntax engines);
Neovim's integration of tree-sitter, which is still marked "experimental" for a reason (and should be further separated into the fundamental integration and API in core -- which works rather well already -- and its use for syntax highlighting, folding, indentation etc. -- which is very much work in progress);
The individual language parsers (and queries), which are externally maintained.

I think Vim should at this stage focus on 1. to make a reasoned decision (while it of course makes good sense -- and would make me very happy -- to take Neovim's approach and decisions for 2. into account; admitting that the two projects have different needs).

And I find it highly disingenuous to point fingers at 3. while ignoring that the quality of TextMate grammars (and, indeed, Vim's bundled syntax files) varies wildly as well. It's clear that (just like Neovim) you cannot simply switch engines but have to support both (on a per-language basis) for some time until the replacement catches up.

fcurts · 2021-11-20T14:43:50Z

I was obviously talking about the engine, which is what matters in the long run. Regarding existing grammars, the difference is that tree-sitter grammars can be improved relatively easily because they can be reasoned about. On the other hand, improving real-world TextMate grammars is anywhere from difficult to impossible. (Often, fixing one problem causes an inexplicable problem somewhere else, which is only discovered later.)

I can't comment on integration aspects. I'm not even a Vim user. But as a language/tooling developer myself, I feel strongly that it's time to move past TextMate grammars, which is why I offered my insights. Good luck!

theHamsta · 2021-11-20T15:32:20Z

If tree-sitter is top-notch, how come a ubiquitous and highly popular language like python has been broken in it for quite a while?
When I tested neovim 0.5.1 with tree-sitter I ended up having to disable TS for python (which is the language I use the most) because the indenting and highlighting were unusable. Doesn't exactly inspire confidence.

@jgb Indentation has nothing to do with tree-sitter itself. There is a very ad-hoc implementation of using the parsed tree as indentexpr. Python indentation is not working because this implementation just considers the current syntax node you are currently on which is nothing in case of the Python parser because the relevant syntax node ended in the previous line when you start a new one. One would have to add a rule that respects this case or tune the general logic at this point.

You always have to write some system that translates your parsed representation to indents. The quality of this translation says nothing about the quality of the representation itself.

Isopod · 2021-12-23T15:13:28Z

As someone who recently spent some time writing a TreeSitter grammar, I have also become less enthusiastic about the project. I watched the author’s presentation a while ago and it sounded like the greatest invention since sliced bread, but in practice it doesn’t always work that well.

The biggest obstacle in my opinion is languages with preprocessors (e.g. C and C++). This isn’t something I had considered initially, but it is simply impossible to parse those languages with TreeSitter because you’re dealing with a language within a language. Now before someone mentions this: I know TreeSitter supports injections, e.g. JavaScript in HTML, but that’s not the same thing because, as I understand, each injection is essentially its own “program”. It’s fundamentally not possible to parse pre-processed languages with a context-free grammar. If you think about it, conditional compilation is as context-sensitive as it gets.

I’m talking about constructs like this:

#if FLAG
  if (foo) {
#endif

  bar;

#if FLAG
  }
#endif

Or this:

#define BEGIN_FUNC void func() {
#define END_FUNC }
BEGIN_FUNC
  bla;
END_FUNC

Or this:

#define RENAME(x) renamed_ ## x
void RENAME(my_func) {
  bla;
}

How is TreeSitter supposed to generate an AST for such code if it doesn’t interpret the macros? It’s simply impossible. And often this will result in parse errors. Now, TreeSitter is in theory “fault tolerant”, so it should be able to recover from errors, but I’ve found that it often recovers in a weird, unpredictable way that causes syntax highlighting to be messed up. It gets even worse when we’re talking about using it for features like syntax-aware selections, indentations and folds: Just forget about it.

All TreeSitter grammars for preprocessed languages contain hacks to work around this issue, but they never work 100%. They just handle a few special cases, but blow up in the general case.

The next problem is that parsing is incredibly slow. I benchmarked parsing a 4 MB file and it took over a second. Depending on where you are coming from, that might not sound too bad, but 4 MB a second really isn’t impressive when you consider that modern RAM can handle tens of gigabytes per second. Quite frankly, I’m not sure this “incremental parsing” approach is all that useful when the implementation is so slow in practice. I guarantee I could write a hand-rolled parser that would just reparse the entire file on every edit and it would still be orders of magnitude faster.

I’ve also found that syntactic highlighting doesn’t actually add that much value over a simple lexer, but it is significantly more complex. Semantic highlighting on the other hand is even more complex, but it also adds a lot of value. If I had to rate the cost-benefit relationship, I’d say: lexer > semantic > syntactic.

If I had to design a syntax highlighting system from scratch, I’d probably just go with a simple C API, something like this:

typedef enum {TOK_IDENT, TOK_STRING, TOK_OPERATOR, ...} Token;
void highlight_tokens(const char *buf, size_t len, Token *tokens, const void *input_state, void *output_state, size_t state_size);

You just pass a chunk of data to the parser and then it returns a buffer with a character class for each character (or maybe an array of ranges, see also LSP for a similar approach). This is the most general form, giving you the greatest amount of flexibility. You could hand-roll a parser, or build one based on regexes or TreeSitter grammars or whatever. It doesn’t restrict you to a particular system.
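A hand-rolled classifier behind an interface of this shape could look roughly like the sketch below. It is only an illustration under several assumptions: the names `CharClass`, `LexState` and `classify_chunk` are made up, the opaque `input_state`/`output_state` buffers are narrowed to a single "inside a block comment" flag, chunks are assumed to end at line breaks, and strings are assumed not to span chunks or contain escapes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical character classes; one is written per input byte. */
typedef enum { CLS_OTHER, CLS_IDENT, CLS_NUMBER, CLS_STRING, CLS_COMMENT } CharClass;

/* Hypothetical carry-over state between chunks. */
typedef struct {
    int in_block_comment;
} LexState;

/* Classify every byte of buf[0..len). `in` may be NULL for "start of file".
 * The state at the end of the chunk is written to `out`, so the caller can
 * feed it into the next chunk. */
void classify_chunk(const char *buf, size_t len, CharClass *classes,
                    const LexState *in, LexState *out)
{
    assert(out != NULL);

    int in_comment = in ? in->in_block_comment : 0;
    size_t i = 0;

    while (i < len) {
        unsigned char c = (unsigned char)buf[i];
        if (in_comment) {
            if (c == '*' && i + 1 < len && buf[i + 1] == '/') {
                classes[i] = classes[i + 1] = CLS_COMMENT;
                i += 2;
                in_comment = 0;
            } else {
                classes[i++] = CLS_COMMENT;
            }
        } else if (c == '/' && i + 1 < len && buf[i + 1] == '*') {
            classes[i] = classes[i + 1] = CLS_COMMENT;
            i += 2;
            in_comment = 1;
        } else if (c == '"') {
            classes[i++] = CLS_STRING;              /* opening quote */
            while (i < len && buf[i] != '"')
                classes[i++] = CLS_STRING;
            if (i < len)
                classes[i++] = CLS_STRING;          /* closing quote */
        } else if (isalpha(c) || c == '_') {
            while (i < len && (isalnum((unsigned char)buf[i]) || buf[i] == '_'))
                classes[i++] = CLS_IDENT;
        } else if (isdigit(c)) {
            while (i < len && isdigit((unsigned char)buf[i]))
                classes[i++] = CLS_NUMBER;
        } else {
            classes[i++] = CLS_OTHER;
        }
    }
    out->in_block_comment = in_comment;
}
```

Feeding the chunk `int x = 1; /* start` leaves the output state inside a comment, and passing that state along with the next chunk `still */ y` marks `still */` as a comment. Classifying the second chunk with a NULL input state marks `still` as an identifier instead, which is why some state has to be carried, or the whole file reparsed.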

I’d even consider getting rid of the state persistence stuff and just pass one large buffer containing the entire file and reparse the whole file every time. Because in the general case, you have to do it anyway. Consider putting a comment /* at the beginning of a very large file. No matter what you do, sometimes, you’ll have to reparse everything, so I’m not sure it is even worth adding complexity to save time for only some edits. Better work on making the parser really fast. Computers are fast, it shouldn’t take that long to parse even a 100 MB file. And source files are usually much smaller than this.

skywind3000 · 2021-12-23T19:09:22Z

Anyone who eagerly promotes tree-sitter here should answer my questions above first. Repeating its advantages a thousand times does not mean that these fatal problems will disappear.

Tree-sitter is not a new thing, no need to be so excited. Remember that Atom adopted tree-sitter back in 2018. Compared with excited neovim users, the Atom community is very calm about this "new" feature.

I don't need a better highlighter at the cost of performance and flexibility, because I am suffering performance issues right now and all I want is a fast & static regex-based highlighting system.

@lacygoill you claimed in this comment that the problem was caused by "drawing too much".

That's not true. I did a bisect investigation of this problem here:

Syntax highlighting is extremely slow when scrolling up in recent version (v8.0.1599) #2712

I found that there was a big performance regression after 8.0.643 and 8.0.647. You can simply compare the syntax highlighting speed of vim 7.4 and the latest vim 8.2.xxxx and you will find that this is by no means a simple "drawing too much" problem.

It is still unsolved today, see its latest comment!

girishji · 2023-05-31T06:08:08Z

And how does an end user find out what highlight group to use for an
item? Searching the files, what you can do with the current Vim syntax
plugins and a bit of guessing, seems too complicated.

Treesitter provides a tool called Playground. It displays the whole AST with highlight groups. A user has to open this window and place the cursor on the symbol in the source file so that the relevant part of the AST is highlighted; this reveals the highlight group and the range of text (in the source file) it encompasses.

On a general note, treesitter grammar bugs can be hard to fix. This gets harder if the author also used an external ad-hoc parser (written in C) in addition to treesitter's own parser. For languages that are not context-free, like Markdown, using an external parser is not uncommon.

ghost · 2024-02-26T14:28:46Z

The current highlight groups are somewhat of a blunt instrument. Setting Special, for example, will affect a wide and varied selection of language-specific highlight links and achieving better granularity requires making highlight configurations for each language of interest (which can be many for a public colorscheme)... at the very least, it would be convenient for the 'standardized' highlight groups to reach a higher level of specificity with respect to the most universal language constructs (for example, "Import"/"Export" for import- and export-like keywords).

clason · 2024-02-26T15:52:24Z

For the record, this is why Neovim has added a (larger) number of standard groups for tree-sitter highlighting (inspired by TextMate/Sublime scopes, which Helix uses as well): https://neovim.io/doc/user/treesitter.html#treesitter-highlight-groups

All official tree-sitter queries are enforced by CI to use these (only) for consistency (although the fallback mechanism makes it possible to gracefully handle further specialization and provides an automatic language-specialization for colorschemes to use). Your needs are of course different, but I'm pretty happy with it, and it might contain helpful inspiration.

ghost · 2024-02-26T17:31:29Z

I'll comment only once more so as not to further disturb the great number of people likely following this conversation: my suggestion of expanding the standardized highlight group set was partially inspired by tree-sitter. I use Vim 9.1 (and vim9 script) so tree-sitter isn't a solution for me although I have not once felt the power and performance of the current syntax highlighting system to be insufficient (however, I do recognize some of the regex improvements suggested in comments above). In fact, Vim 9.1 performs exceptionally.

I wholeheartedly believe in Vim's highly considered approach, addressing shortcomings of pioneer solutions ("doing it right," as Bram had said). With that, I assume a major change (behind a feature flag or otherwise) is not around the corner. If, however, the interface (highlight groups, in this case) can be determined with the level of rigor we've come to expect from changes in the Vim project, syntax maintainers would ideally eliminate reliance on bespoke highlight groups.

This would result in a superior experience for both syntax maintainers and end users without the need for immediate introduction of a new parsing system. In the event of deeper systemic changes (introduction of systems akin to TextMate or tree-sitter), the highlight group interface would remain the same and end users would benefit immediately from any performance- or matching-related improvements.

icedman · 2024-05-06T00:43:32Z

In the meantime... textmate highlighter:

https://github.com/icedman/vim-textmate

icedman · 2024-05-06T05:37:09Z

checking on treesitter status.

sqlite3.c (222k lines of code)

neovim > file loads after a couple of seconds; lags on editing - next to unusable
helix > looks like it doesn't enable treesitter for very large files (or may not be working)

tinywl.c (1000 lines of code)

no problem with either

my own thoughts if the aim of VIM is..

vim as a simple, lightning fast editor - avoid treesitter which adds a lot of dependencies, and a degraded performance for very large files (at least for now). for simple highlighting - treesitter is also overkill
vim as an IDE, with all the bells and whistles of tree parsing - maybe too late for this. there is already neovim :)

pedrohgmacedo · 2024-05-14T23:25:34Z

sqlite3.c (222k lines of code)

idk man, but if you are manually editing a file with 200k loc of c code, instead of programmatically generating it, feels like you gotta be doing something wrong imho. i tried opening the treesitter c parser, and yes, it gets a little laggy. i did :TSDisable and then it got fast. it has 600k loc.
on the other hand tree-sitter is pretty useful on the 95% of files you edit that are not 200k+ loc.

clason · 2024-05-15T07:02:33Z

Again, please evaluate the technology itself, not Neovim's (work in progress) implementation -- at least without educating yourself about the implementation (you can ask, you know!) In particular, we have not yet implemented a timeout such as the regex syntax engine has -- so you are comparing apples and oranges. It's on our todo list but not a very high priority since tree-sitter is so fast in general that you only really need it for "monster files" such as this. (Also make sure to always test the latest nightly version -- we provide appimages from our releases page -- since we make constant performance improvements.)

icedman · 2024-05-15T11:44:55Z

Just so you know, I do love treesitter and I use it with neovim. I've also experimented with it in vim:
https://github.com/icedman/vim-treesitter
My comment is not entirely off-hand or without thought, as you seem to suggest.

clason · 2024-05-15T11:52:00Z

I'm not accusing you of anything; but as you can see from the discussion here, many comments are made without thought. And I do think "checking on treesitter status", followed by these examples, is misleading (even if not deliberately). So it's important to provide context if you want to have a meaningful discussion. If you don't do it, I will (to the limit of my ability and interest).

jarkkojs · 2024-07-15T07:43:32Z

The main drawback still applies: reliance on an external program.

LSP is worse tho because it depends on an external program and a build configuration... and the toolchain... and the build system ;-) Especially with the Linux kernel tree this quickly becomes apparent and it starts to show its ugly face.

TS is not as bad but is still quite a faulty design, given that it is not well integrated into the source code of vim.

jarkkojs · 2024-07-15T08:24:16Z

This is not to compare to neovim, but it is fair to state that it does not have fully builtin TS support. You need a separate plugin, nvim-treesitter, to compile parsers and generally make it usable.

I use this only to point out that there is no proven example to this day of vim or any of its forks demonstrating fully integrated TS support.

Also, a lot of TS complexity comes from being a cross-tool solution. If anything, I'd love to see vim pursue a non-generic, designed-for-vim grammar solution, maybe something that would take advantage of the new vim9script.

errael · 2024-07-15T17:05:33Z

vim pursue a non-generic, designed-for-vim grammar solution, maybe something that would take advantage of the new vim9script.

Writing a c++ parser in vim9script?

girishji · 2024-07-16T05:53:03Z

This is not to compare to neovim, but it is fair to state that it does not have fully builtin TS support. You need a separate plugin, nvim-treesitter, to compile parsers and generally make it usable.

I use this only to point out that there is no proven example to this day of vim or any of its forks demonstrating fully integrated TS support.

Integrating Treesitter fully into Vim is not a good design choice. Treesitter uses a different parser for each language, which needs to be generated using a grammar.js file. Typically, this parser is compiled into a .so library and placed in the runtime path. "Full integration" implies compiling all these language parsers, even though most of them may not be used, and maintaining them as the grammar.js files change. Additionally, some languages have buggy "custom parsers" built on top of what Treesitter generates, exacerbating the issue.
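The loading step described above can be sketched in C roughly as follows. This is a hedged illustration, not Neovim's or tree-sitter's actual loader: load_parser and parser_symbol_name are hypothetical helper names, and the sketch assumes the usual convention that a compiled parser library exports a function named tree_sitter_<language> returning a const TSLanguage *. TSLanguage is declared as an opaque type here so that the snippet does not need the tree-sitter headers.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Opaque stand-in for the type defined by the tree-sitter headers. */
typedef struct TSLanguage TSLanguage;
typedef const TSLanguage *(*LanguageFn)(void);

/* Builds "tree_sitter_<lang>" into out.
 * Returns 0 on success, -1 if lang is empty or the name does not fit. */
int parser_symbol_name(const char *lang, char *out, size_t out_size)
{
    assert(lang != NULL && out != NULL);
    if (strlen(lang) == 0)
        return -1;
    int n = snprintf(out, out_size, "tree_sitter_%s", lang);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* Hypothetical loader: opens one shared object and looks up its entry point.
 * Returns NULL if the library or the symbol cannot be found. */
const TSLanguage *load_parser(const char *so_path, const char *lang)
{
    char sym[128];
    if (parser_symbol_name(lang, sym, sizeof sym) != 0)
        return NULL;

    void *handle = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return NULL;

    LanguageFn fn = (LanguageFn)dlsym(handle, sym);
    if (!fn) {
        dlclose(handle);
        return NULL;
    }
    /* The handle is deliberately left open: the returned language data
     * lives inside the loaded library. */
    return fn();
}
```

An editor that loads parsers this way only pays for the languages that are actually opened, but it also inherits one binary artifact per language that has to be rebuilt whenever the grammar changes.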

Treesitter seems to be an unsuitable solution for a lightweight editor like Vim. Its main selling point is the availability of an Abstract Syntax Tree (AST). While I wrote a small plugin to display devdoc and found that Treesitter could have made my job easier (I had to use Pandoc to generate an AST instead), it is still an overly complex solution if its primary role is to color syntax, a non-essential feature in my opinion. Other use cases, such as additional text objects or scaffolding for refactoring, are already covered by basic Vim functionalities.

On the other hand, TextMate also has its issues. Its grammar differs significantly from Vim's syntax grammar, which is closely aligned with Vim's regex grammar. Integrating TextMate into Vim feels like forcing a square peg into a round hole unless Vim's regex grammar artifacts are modified to reflect TextMate's regex grammar.

jarkkojs · 2024-07-16T14:26:34Z

vim pursue a non-generic, designed-for-vim grammar solution, maybe something that would take advantage of the new vim9script.

Writing a c++ parser in vim9script?

Why in the heck would you do that?

LunarWatcher · 2024-07-16T14:29:42Z

vim pursue a non-generic, designed-for-vim grammar solution, maybe something that would take advantage of the new vim9script.

Writing a c++ parser in vim9script?

Why in the heck would you do that?

For the memes, of course

jarkkojs · 2024-07-16T14:30:04Z

vim pursue a non-generic, designed-for-vim grammar solution, maybe something that would take advantage of the new vim9script.

Writing a c++ parser in vim9script?

Why in the heck would you do that?

For the memes, of course

Sorry bro, I ignore memes.

Obviously a parser generator (just like flex and bison) needs to be written in something other than vim9script. I just thought that vim9script might be usable for snippets of logic, but I could obviously be wrong too, since I'm not a contributor.

jarkkojs · 2024-07-16T14:36:19Z

This is not to compare to neovim, but it is fair to state that it does not have fully builtin TS support. You need a separate plugin, nvim-treesitter, to compile parsers and generally make it usable.
I use this only to point out that there is no proven example to this day of vim or any of its forks demonstrating fully integrated TS support.

Integrating Treesitter fully into Vim is not a good design choice. Treesitter uses a different parser for each language, which needs to be generated using a grammar.js file. Typically, this parser is compiled into a .so library and placed in the runtime path. "Full integration" implies compiling all these language parsers, even though most of them may not be used, and maintaining them as the grammar.js files change. Additionally, some languages have buggy "custom parsers" built on top of what Treesitter generates, exacerbating the issue.

Treesitter seems to be an unsuitable solution for a lightweight editor like Vim. Its main selling point is the availability of an Abstract Syntax Tree (AST). While I wrote a small plugin to display devdoc and found that Treesitter could have made my job easier (I had to use Pandoc to generate an AST instead), it is still an overly complex solution if its primary role is to color syntax, a non-essential feature in my opinion. Other use cases, such as additional text objects or scaffolding for refactoring, are already covered by basic Vim functionalities.

On the other hand, TextMate also has its issues. Its grammar differs significantly from Vim's syntax grammar, which is closely aligned with Vim's regex grammar. Integrating TextMate into Vim feels like forcing a square peg into a round hole unless Vim's regex grammar artifacts are modified to reflect TextMate's regex grammar.

Agree with you. Tree-sitter is made to work with multiple editors. Vim would benefit most from something with similar high-level ideas but made only for vim. A lot of the meat and cruft in tree-sitter comes from it being "too generic". Nice idea, but the implementation is not that great and adds a bunch of extra dependencies.

If I recall correctly, e.g. installing neovim pulled tree-sitter, which pulled whole nodejs as dep, which was just terrible. I mean, for a "ubiquitous" text editor used in the terminal you have to have nodejs installed on your machine. I'd like to keep my computer free of nodejs crap.

clason · 2024-07-16T14:37:41Z

If I recall correctly, e.g. installing neovim pulled tree-sitter, which pulled whole nodejs as dep, which was just terrible.

You don't. (Please, get your facts right before making a claim.)

jarkkojs · 2024-07-16T19:35:48Z

If I recall correctly, e.g. installing neovim pulled tree-sitter, which pulled whole nodejs as dep, which was just terrible.

You don't. (Please, get your facts right before making a claim.)

But some of its parsers do, AFAIK. You are correct about package deps. I.e. you don't need it, but to make tree-sitter usable, the system in practice needs to have npm, right? Or can you get the 100% user experience without npm?

Getting facts exactly right is pretty hard in the case of neovim because the tree-sitter implementation is half-broken, given that it is unusable without https://github.com/nvim-treesitter/nvim-treesitter. I guess one can count that plugin as part of the usable tree-sitter functionality.

clason · 2024-07-16T19:39:35Z

Yes, you can absolutely do without nodejs installed.

And you do not need that plugin, anymore than you need vim-polyglot in Vim.

(Source: Maintainer for both Neovim and nvim-treesitter.)

And I would appreciate you not throwing around words like "broken" without actual experience.

girishji · 2024-07-17T08:00:28Z

If I recall correctly, e.g. installing neovim pulled tree-sitter, which pulled whole nodejs as dep, which was just terrible.

Node.js is required to parse the JavaScript grammar file when creating a new parser, as noted in the dependencies section. However, end users only need the compiled object file or C source file containing the language-specific parser. Therefore, if my recollection is correct, end users don't need to install Node.js.

jarkkojs · 2024-07-17T10:57:55Z

Yes, you can absolutely do without nodejs installed.

And you do not need that plugin, anymore than you need vim-polyglot in Vim.

(Source: Maintainer for both Neovim and nvim-treesitter.)

And I would appreciate you not throwing around words like "broken" without actual experience.

I'm unfamiliar with what polyglot is. I only use a few plugins and most of them are from tpope:

  Plug 'catppuccin/vim', { 'as': 'catppuccin' }
  Plug 'kaarmu/typst.vim', { 'as': 'typst' }
  Plug 'tpope/vim-flagship', { 'as': 'flagship' }
  Plug 'tpope/vim-fugitive', { 'as': 'fugitive' }
  Plug 'tpope/vim-vinegar', { 'as': 'vinegar' }
  Plug 'vim-scripts/git_patch_tags.vim', { 'as': 'git_patch_tags' }

Just hoping that neither TS nor LSP will ever land as any sort of feature in Vim. I'm happy to let NeoVIM keep them. Since I'm not a developer I'll leave it here ;-) And IMHO Vim does not need to compete with NeoVIM anyhow. Many systems programmers (like me) appreciate it being as low-level as it is, because being ubiquitous is more essential than TS and LSP combined.

Shane-XB-Qian · 2024-07-17T15:45:42Z

Since I'm not a developer I'll leave it here ;-) And IMHO Vim does not need to compete with NeoVIM anyhow. Many systems programmers (like me) appreciate it being as low-level as it is, because being ubiquitous is more essential than TS and LSP combined.

You are welcome to continue using Vim's native syntax highlighting method. It is mostly strong and stable, though there may be some issues with specific terminals, etc.
BTW/FYI: there is semantic highlighting from LSP too. It is not a feature exclusive to any one Vim variant, and you can try it if you like. From my side, I'd say it is not so "beautiful" or "wonderful".
Official Vim welcomes you to try it, anyway. Good luck.

egberts · 2024-08-30T16:40:20Z

This gentleman took a crack at a plugin for TextMate and Vim.

Look what he has accomplished: https://github.com/icedman/vim-textmate

clason · 2024-08-30T16:41:18Z

#9087 (comment)

icedman · 2024-10-07T23:20:21Z

This gentleman took a crack at a plugin for TextMate and Vim.

Look what he has accomplished: https://github.com/icedman/vim-textmate

I also took a crack at nvim plugin for textmate.. https://github.com/icedman/nvim-textmate

70+ more from the nvim users :)

skywind3000 added the enhancement label Nov 4, 2021

skywind3000 changed the title ~~[proposal] Can we introduce TextMate grammar system for syntax highlighting~~ [proposal] Can we introduce TextMate grammar system for syntax highlighting ? Nov 4, 2021

skywind3000 changed the title ~~[proposal] Can we introduce TextMate grammar system for syntax highlighting ?~~ [proposal] Introducing TextMate grammar system for syntax highlighting ? Nov 4, 2021

chrisbra mentioned this issue Jun 9, 2023

tree-sitter syntax highlight in vim like what it is in neovim. #12508

Closed

AndrewRadev mentioned this issue Jul 31, 2024

Option to disable looping AndrewRadev/sideways.vim#56

Closed

linrongbin16 mentioned this issue Oct 2, 2024

Coloring System: TextMate rsvim/rfc#20

Open
