Feedback Wanted: On Parsers, Grammars, & Other Such Things

Hello Everyone,

In the year and a half (!) since Nova’s launch, there’s been a lot of discussion and questions around languages and grammars.

I’m here today to talk about this more in depth, discuss our goals for the future, and get some feedback from our community as we move forward. This directly concerns how both Panic and our extension developers create and maintain language extensions for Nova.

Note: This is a very long post. I am sorry it’s so long. If you do bear with me and read it, I do appreciate it.

First, Some Overly Verbose Backstory

Feel free to skip this section unless you’re interested in the How We Got Here.

Nova as a project originated from the codebase of Coda iOS (née Diet Coda, eventually Code Editor for iOS). Many of the prototypes for its more core features were started for a Coda iOS 3.0 (like indexing and Git), although they never ended up shipping as such before the focus shift to Nova and the former’s discontinuation. The biggest shared component, though, was the parse engine.

Backing up further—many of you may know that Coda 1 and Coda 2 used a parse engine licensed from SubEthaEdit, with modifications down the line made by Panic to support some more complex features. Its language grammars were directly based on those from this engine, which use an XML format. When it came time to build Diet Coda, though, we could not use it for several reasons (license being first, and restrictions of the iOS 5/6-ish SDK and porting the SubEthaEdit codebase being the second).

Thus, it was decided to write a new engine from scratch for this project, one that might eventually make its way back into a future Coda. It was also decided to not base our grammars on those from SubEthaEdit, for reasons that honestly are sort of lost to time (this was slightly before my time). It was also decided not to base the grammars on those of TextMate, as that format could not express all of the features that SubEthaEdit had (and we were trying to match in Diet Coda). So, the original Diet Coda 1.0 used a private property-list format for grammars, and only supported a few (five, I think?) at launch which slowly grew over time.

After Diet Coda 1.x is when I joined the project, which eventually led to me taking it over when its original lead left Panic. Part of my work then became moving towards a “Diet Coda 2.0.” In so doing, the parse engine was heavily rewritten and upgraded, and the property list format was changed to an XML format that could be more expressive (and less verbose, lol). That is from where the basis for Nova’s grammar format hails.

Eventually, this culminated in porting the engine back to the Mac and starting the Nova project. Since it was already a relatively mature engine and grammar set for Coda iOS, we decided to use it as the basis for Nova. As such, through Nova’s nearly four-year development cycle before release, the engine and grammar format were upgraded again and again until we ended up with what we have today.

Up until the point of launch, our primary focus was on maturing the built-in grammars in Nova and making it easiest for our team to build the features we wanted to build. This was the motivating factor for a custom in-house engine and custom in-house grammars. We could rapidly make changes to both the engine and its grammars as necessary to build cool things.

In the run-up to Nova’s launch, though, our focus began to shift. We immediately recognized the importance of our developer ecosystem and were honestly humbled by the response from the nascent community. To be perfectly frank, Coda never quite had the response from extension devs that Nova has been blessed with, and in hindsight we weren’t expecting it.

The Stress of Building and Maintaining Language Grammars

For our team, maintaining our built-in language grammars has proven to be a time-consuming task, especially when we aren’t nearly as versed in some languages we ship ([cough] TypeScript). While the grammars have come a long way, there are still many cases where we are playing catch-up, only compounded by how fast many of these languages evolve (TypeScript, Python, etc.).

Thus we come to the most common question we get from extension developers: “How do I make an extension for {X} language?”

Around this, there are a few glaringly obvious pain points:

  • Nova uses a custom grammar format.
    • While it is conceptually similar to both SubEthaEdit & TextMate formats, it is not directly compatible.
    • Attempts have been made to write “converters,” but they are only a start to developing a grammar, and not a solution.
  • The languages most requested by users likely already have mature grammars in other formats out there: Go, Rust, C++, Swift, etc.
    • Most of the time, our extension developers (you all) aren’t interested in writing one from scratch.
    • Developers just want to hook Language A up to Nova, maybe add some additional cool things on top, and have it all just work.

This alone has, we believe, led a good number of (potential and existing) extension developers to shy away from maintaining a custom language grammar, because once the effort is put in to actually build it, you then have to maintain it. This is the same issue we face internally. It’s just compounded for you all, as this isn’t your day job.

Identifying Possible Solutions

This brings me to the actual point of this post: solving this issue for both us and you. This issue has weighed heavily on my mind ever since Nova’s launch.

We have been listening, even if it might not have felt as such, as considering such a wide-reaching change as this in addition to the development work we do otherwise is… a lot. For each time we at Panic would sit down and discuss this, we didn’t feel like it was necessarily the time to make a decision. But it’s now time we do.

There have been many suggestions put forth by the community over the last year and a half. I’d like to recap the most common ones and address each one before moving forward.

Suggestion 1: Switch to TextMate (or VSCode) Grammars

Both VSCode and Atom (before tree sitter, bear with me) can trace their base language grammars to a fork of those from TextMate. While now in different text formats (JSON, CSON, and Property List respectively), they are ultimately the same grammar format, diverged by tweaks and improvements made along the way by each editor. Therefore, why not use one of those?

That is an excellent suggestion. It would solve one problem: It will make it easier to integrate existing grammars from another editor into Nova.

However, it fails to solve another. Since Nova’s current parser does not support TextMate grammars, we’d need to extend it to support them. The TextMate format is… roughly documented (v1 is much more documented than v2), but this documentation is just a specification. We’d need to ensure that the parser works in all of the subtly different ways the format needs to be parsed over what we have now, all the while ensuring that Nova parsed these new grammars in roughly the same way as the other editor(s) that use them. This is a task in and of itself.

Another aspect is that, ultimately in this author’s opinion, TextMate and other formats which rely entirely on regular expressions to parse are flawed. These grammars can parse complex languages (it’s not just regular expressions, it’s the combination with scoping that makes it work), but they’re still restricted by how much can be expressed in the format itself. This is one reason we originally built the Nova (née Coda iOS) format, to make a few parts of this type of parsing easier than TextMate’s format can express.

If we adopt the TextMate (or VSCode) format, it will be increasingly difficult for us to add to this format to resolve these limitations, as it’s also used by other editors, and we’re right back to where we started.

Atom, before GitHub’s acquisition by Microsoft, had also clearly realized this, and began migrating to a different conceptual method of parsing (tree sitter, which we’ll get into in the second suggestion). This went quite well for them and the developer response was overall positive.

The VSCode team, on the other hand, has been conflicted on tree sitter support or otherwise replacing their TextMate grammars. They see the Language Server Protocol’s newer Semantic Tokenization (see the third suggestion), which supplements but does not replace the grammars, as their path forward for the time being.

In any case, though, it would likely not be in our best interest to adopt TextMate grammars. If we do, and other editors move away from this style of parsing, we’re once again tethered to an aging format as well as maintaining our own custom parser to consume them. While the TextMate grammar space is mature, there is little area for improvement or advancement.

(Oh, and before someone asks “Why not take TextMate’s engine source?” — TextMate 2 is GPLv3, we can’t even legally look at it for reverse engineering.)

Suggestion 2: Switch to Tree Sitter

aka in which Logan learns to eat some crow.

Tree Sitter is a project developed independently from any specific editor over recent years which approaches text editor parsing from a different perspective. It does not use regular expressions, instead using an approach much closer to most compilers, albeit still in an abstracted, generalized way. I’ll not bore you with the strict details (feel free to read up on the project or watch their conference talks for more info).

The project probably gained the most attention when Atom began adopting it to replace most of its core language grammars several years ago. This was a long process that was done in steps as those grammars matured, starting with only a few and slowly supplanting many of the built-in languages, as well as growing new grammars in its third-party package ecosystem.

Ultimately, Atom’s development slowed after the GitHub acquisition, but tree-sitter remains as an independent project. It’s also experimentally supported by a NeoVim plugin. The project itself has healthy activity, and from what I can tell, is primarily led by one core developer and several regular contributors, supplemented by many contributors for each of the language grammars themselves.

So, what are the pros and cons here?

Well, the biggest pro is that we’d be replacing not just our language grammars, but also the underlying core parsing engine. This means that the engine using the grammars is the exact same as used in Atom and NeoVim. There is no room for interpretive difference as is the case with TextMate grammars where each editor implemented their own parser for them.

Second, there are many, many tree sitter grammars out there already. Maybe not as many as TextMate grammars, but it’s more than enough to cover the most popular languages of which we can think.

Third, Tree Sitter, by approaching parsing from a fundamentally different direction, has the capability to provide much richer syntax highlighting and parse tree logic than regular expression-based parsers. This could allow Nova to build much richer syntactic analysis tools on top of it going forward, in areas where more complex language servers are currently necessary.

Okay, cons: We’d be moving aside our current parsing engine to add Tree Sitter wholesale. We’d keep support for our existing grammars and parsing, just as Atom did as it transitioned. But going forward, we’d want to direct development toward Tree Sitter.

Further, Tree Sitter grammars have gained a small reputation for being overly complicated to build compared to the much more commonly understood regular expression style. While I understand this concern, it does seem that after some initial hesitation most developers working on them have adapted well.

Finally, we’d be giving up direct control of the parsing engine itself to the existing Tree Sitter community at large. This means we wouldn’t necessarily be able to make the wide-reaching changes that we might want to make rapidly. However, when all is said and done, it’s my opinion that anything we’d want to do in this area sits above the parse engine (in areas such as symbolication and syntactic analysis), and thus the risk is far outweighed by the rewards. Panic has also gotten very comfortable with making changes to and helping maintain open source projects (such as libssh2).

Suggestion 3: Adopt Language Server Protocol (LSP) Semantic Tokenization

We could take the same approach that VSCode has started to take: supplementing its parsing by using the new support in the Language Server Protocol for Semantic Tokenization.

To be clear: Semantic Tokens in LSP is not a replacement for parsing. It’s a supplement. It’s designed to make it so that certain tokens, such as class names, methods, etc. can be targeted for specific highlighting based on the knowledge that a specific language server has about them. It can do more, but most that implement this feature focus on supplementing the existing highlighting provided by TextMate-style parsing.
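To make the "supplement" concrete: an LSP server does not send a parse tree, only a flat list of integers, five per token (line delta, start-character delta, length, token type index, modifier bitset), with positions relative to the previous token. The following is a minimal sketch of that encoding in plain JavaScript. The legend contents, the sample tokens, and the helper name `encodeSemanticTokens` are assumptions made up for illustration; a real server announces its own legend.

```javascript
// Hypothetical legend; a real language server declares its own.
const legend = {
  tokenTypes: ["class", "method", "variable"],
  tokenModifiers: ["declaration", "readonly"],
};

// Turn absolutely positioned tokens into the relative, flat integer array
// carried in the `data` field of a semantic tokens response.
function encodeSemanticTokens(tokens, legend) {
  const sorted = [...tokens].sort(
    (a, b) => a.line - b.line || a.start - b.start
  );
  const data = [];
  let prevLine = 0;
  let prevStart = 0;
  for (const tok of sorted) {
    const deltaLine = tok.line - prevLine;
    // The start is relative to the previous token only when both tokens
    // sit on the same line; otherwise it is counted from the line start.
    const deltaStart = deltaLine === 0 ? tok.start - prevStart : tok.start;
    const typeIndex = legend.tokenTypes.indexOf(tok.type);
    if (typeIndex < 0) throw new Error(`unknown token type: ${tok.type}`);
    let modifierBits = 0;
    for (const modifier of tok.modifiers || []) {
      const bit = legend.tokenModifiers.indexOf(modifier);
      if (bit < 0) throw new Error(`unknown modifier: ${modifier}`);
      modifierBits |= 1 << bit;
    }
    data.push(deltaLine, deltaStart, tok.length, typeIndex, modifierBits);
    prevLine = tok.line;
    prevStart = tok.start;
  }
  return data;
}

const encoded = encodeSemanticTokens(
  [
    { line: 0, start: 6, length: 6, type: "class", modifiers: ["declaration"] },
    { line: 2, start: 2, length: 5, type: "method", modifiers: [] },
    { line: 2, start: 10, length: 3, type: "variable", modifiers: ["readonly"] },
  ],
  legend
);
console.log(encoded);
// [0, 6, 6, 0, 1,  2, 2, 5, 1, 0,  0, 8, 3, 2, 2]
```

Nothing in this payload describes nesting, folding, or structure, which is why the editor still needs its own parser underneath.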

If we were to adopt this, we’d still have the same core problem: our underlying custom parse engine.

Also, there’s nothing to prevent us from adopting semantic token highlighting on whatever engine with which we end up. Ultimately, I see this technology as icing to be considered later, not as a solution.

So, Where Do We Go From Here?

With all of this in mind, we at Panic are left with a decision on Nova’s future, one which we most definitely want to make with the feedback of our amazing extension developers.

To recap in short, the most practical options going forward, in no particular order, are:

  • Keep our custom parsing engine and format, perhaps investigate better ways to convert from other formats
  • Adopt the TextMate format, which would require extensive changes to our parsing engine, and while helping our extension developers would not necessarily help us as much in making advancements in the future
  • Transition to an entirely different parsing engine and grammar format, the foremost possibility being Tree Sitter, thus helping both us and our extension developers and potentially making room for advancement in the future

At this point, I am not committing us to any one path, as this is still all just exploratory. But, we are actively discussing these things internally. We also have no specific timeframe for any of this, as there are a lot of “ifs.”

Now, in my opinion and after laying it all bare in this way, Tree Sitter stands as the best path forward for both Nova’s team and our extension developers. Keep in mind, I know little about the actual performance of Tree Sitter against our current parser (both of which are marketed as very fast), and likely won’t until time is taken to actually try. There are more things to consider beyond the raw speed of its C source, as Nova’s features need to sit atop it and bridge to Swift, which may take time to get right. So, take all of that with a grain of salt.

If you made it this far, thank you. I’m here to listen. How do you all feel about all of this? Are there any major considerations I’ve missed or forgotten? I’d love your thoughts, as you are as much a part of the future of Nova as we are.

Well, that ended somewhat abruptly.

15 Likes

I’m so happy and excited about this!

I’m so excited for this! I’ve written language extensions for other editors (Atom, VSCode, et cetera) and when I tried to write one for Nova I really struggled. I think I probably started one five separate times and finally gave up completely.

Off-hand, it seems like the tree-sitter option makes the most sense, as it’s more agnostic and has a community/ecosystem around it.

From a developer perspective, I would love to be able to find an existing extension for another editor, fork/copy the theme file into a Nova extension, and have it Just Work™️.

Alternatively, if there was a system (like tree-sitter) that I could reference in order to write my own grammar/themes, that would also work.

Lastly, I think that no matter what is chosen (even doing nothing and sticking with the current custom grammar format), the real step change comes from how good the documentation is going to be. For example, I read the tree-sitter documentation and it feels pretty technical to me. After reading it, it’s not really obvious to me how I would go about making a theme for it. Then I tried to find existing tree-sitter themes to get a look at what the theme format actually looks like, and even that I sort of struggled with. Great guides/walkthroughs around what developers commonly will want to do (e.g. taking a theme file from another editor’s language extension and porting it to Nova) are what I think the secret sauce is to creating a vibrant developer community.

Thanks for working on this!

5 Likes

I’ve been struggling with my language extensions for Nova, and I’m glad to see this post.

From an extension developer’s view, the textmate grammar format sucks: it is far removed from the real syntax structure of a common programming language, and hence requires tremendous effort to make a not-that-bad syntax highlighting extension. It’s really hard to translate a language specification, which is usually written in a formal notation such as PEG, into the textmate format. Nova’s current XML-based syntax highlighting suffers from the same problem as textmate.

I also believe that LSP semantic tokenization is not the future of syntax highlighting in a code editor; it just works like a patch on the current poorly designed syntax highlighting system.

Compared to textmate, Treesitter generates the entire syntax tree, and its grammar format is more similar to the language specifications we commonly find. Treesitter is much more developer-friendly. I believe that Treesitter is the right way to go.

3 Likes

I don’t have a strong opinion on the path forward, but thought I’d share a few of my thoughts. I haven’t looked into Tree Sitter, so I don’t know much about it beyond your description above. It does sound technically interesting, and I’ll always applaud efforts to not let Microsoft run away with setting all the standards since it doesn’t have a good track record.

That said, Tree Sitter does seem to be the path most at odds with your The Stress of Building and Maintaining Language Grammars section above. If Nova is mostly alone as an actively-developed editor using Tree Sitter, then extensions would be less likely to benefit from the work of other editors in defining and maintaining language grammars. Many language extension devs could likely find themselves in the same situation as now of starting from scratch (though with better tools).

Speaking for myself, I wouldn’t mind the extra work of Tree Sitter if necessary. What’s most important to me is for Nova to work toward better LSP support, as language servers offer a lot of targeted help that would be difficult for me to replicate. I get a good number of issues inquiring why my extension lacks a particular LSP feature. If Tree Sitter is the path least likely to monopolize Panic’s focus, then that’s an important benefit in my opinion.

3 Likes

Hey Logan, thank you for this extensive insight into the state of things at Panic, and on your thinking. As someone who, for a year, has struggled to get a language extension for a deceivingly complex language off the ground, I would say that, in order of the options:

  1. The current language parsing engine of Nova is incredibly flexible and powerful. One of the reasons my extension still isn’t published is, in fact, that Nova’s engine allows me to get language parsing to a level of detail and correctness I haven’t encountered in published extensions yet. Sublime’s newest iteration of their engine maybe could enable this too, and a sufficiently complex Tree-sitter grammar no doubt could too, but other engines definitely cannot. I’ll get back to that in a ’mo. Now, as you clearly explain, this impressive technical feat comes with a lot of baggage re. maintenance and ecosystem integration. So, both despite and because of the huge amount of work I have sunk into creating a language grammar on par with “the best out there”, I think switching away from Nova’s current engine is warranted.

  2. However, both for the technical legacy reasons you name and for the fact they are technically incredibly, frustratingly limited in their ability to express non C-ish languages, Textmate grammars would be the worst choice. I know they are alluring due to VSCode’s continued support for them, but switching to such a legacy format would essentially downgrade Nova two or three notches in language support, from “best in class, small ecosystem”, to “lots of mostly crappy support”. When it comes to modern and off-beat languages, Textmate grammars simply sh*t the bed. It’s not their fault – age and neglect often come with incontinence –, but it is a fact.

  3. Which leaves Tree-sitter. Full disclosure: I like Tree-sitter, despite the fact the initial investment in creating a grammar is even higher than for Nova, and I like it particularly because you cannot create crappy “works for a handful of situations” parsers in it. Tree-sitter parsers are real parsers. In fact, I think I am on the record wishing Nova had used it off the bat [nope; seems I’m misremembering].

Tree-sitter has been quietly gathering steam since its inception for Atom, and its prognosed demise with the sunsetting of the latter has not come to pass:

BTW, for the latter: read Federico Viticci’s rave review on how Runestone handles huge, complex, faulty files without breaking stride. That is a non-dev noticing how well Tree-sitter implements its original goals of creating a workable AST out of anything thrown at it, correct or not. The number of grammars also has been quietly growing, including one dear to my heart.

If all of this reads like a pitch, you are not wrong: I think that if Panic decide to switch out Nova’s parser engine for something out there, Tree-sitter is the way to go. Compared to it, going TM grammars would be an admission of defeat (“we can’t support a good syntax engine, so here you get a crappy one you know and only love if you never looked at it in anger”).

[EDIT: backfilled links and corrected typos rather than fighting the Discourse app any longer; my heartfelt apologies to anybody whose notifications reflects my struggles to get this post out].

4 Likes

I am sure we can all agree how important the community/ecosystem is! I am not an LSP developer per se, but based on what has been discussed here and by @Logan, carrying on as is seems to be out of the question… it’s just an extra unnecessary step, patchwork, and a reinventing-the-wheel sort of problem. Very hard to maintain and keep up with. Just not efficient, and I am sure you guys would rather spend all that time working on other important features than playing catch-up games that, to be honest, never worked perfectly. Option one seems to be outdated and, from reading the comments here, not the favourite! Moving to tree sitter seems quite exciting. The fact that it is being actively developed and probably favoured/maintained by ex-Atom people (who hate Microsoft for ruining and lying about Atom) sounds promising. As long as a project is maintained, there is a “community” around it, and it is open source, I am sure you would be successful whatever parser/engine you choose, even if not tree sitter!

Future looks promising!
Good luck! :crossed_fingers:

1 Like

The Tree Sitter implementation sounds the best for me.

1 Like

For me, the main pain point about Nova’s custom engine is that it has bad debugging support. To this day I haven’t found a solution to my problem writing a Zig grammar. A proper grammar engine should recognize the possibility of infinite recursion in a grammar and make this an error. Nova’s apparently doesn’t. Also because it’s a custom, proprietary engine, I have no possibility of inspecting what the engine does when this happens. This makes me unable to finish that plugin without direct support from Panic.
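As a rough illustration of the kind of check meant here, the sketch below flags rules that can reach themselves in leftmost position without consuming any input, which is one common way a grammar ends up recursing forever. The grammar representation (an object mapping rule names to alternatives) is a toy format invented for this example and has nothing to do with Nova's XML format or its engine internals; it also assumes every terminal consumes input and no rule matches the empty string.

```javascript
// Toy grammar: rule name -> list of alternatives, each a list of symbols.
// A symbol that names a rule is a reference; anything else is a terminal.
function findLeftRecursion(grammar) {
  const isRule = (symbol) =>
    Object.prototype.hasOwnProperty.call(grammar, symbol);
  const cycles = [];
  for (const start of Object.keys(grammar)) {
    // Walk only the *first* symbol of each alternative: those are the
    // rules that would be entered before any input has been consumed.
    const stack = [[start, [start]]];
    const seen = new Set();
    while (stack.length > 0) {
      const [rule, path] = stack.pop();
      for (const alternative of grammar[rule]) {
        const first = alternative[0];
        if (!isRule(first)) continue; // terminal (or empty alternative)
        if (first === start) {
          cycles.push([...path, first]);
        } else if (!seen.has(first)) {
          seen.add(first);
          stack.push([first, [...path, first]]);
        }
      }
    }
  }
  return cycles;
}

const toyGrammar = {
  expression: [["term"], ["expression", "+", "term"]], // directly left-recursive
  term: [["NUMBER"], ["(", "expression", ")"]],        // fine: "(" is consumed first
  a: [["b", "x"]],                                     // a -> b -> a, indirectly
  b: [["a", "y"], ["z"]],
};

for (const cycle of findLeftRecursion(toyGrammar)) {
  console.log("left recursion: " + cycle.join(" -> "));
}
// left recursion: expression -> expression
// left recursion: a -> b -> a
// left recursion: b -> a -> b
```

Whether a given engine should reject such rules or handle them is a design choice; the point is only that the condition can be detected statically and reported, rather than surfacing as a hang.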

Here’s my first impression of Tree Sitter:

  • Tree Sitter says that it is robust, i.e. produces useful results even in case of errors. This is a huge improvement over the current engine, which requires grammar authors to manually add things like <cut-off> to handle errors gracefully.
  • Regarding symbolication which Tree Sitter doesn’t support, I recommend having a look at what JetBrains does in their IDEs: They have a Program Structure Interface on top of their parsers which provides such features. This also serves as an example that separating parsing from analysis works well in an IDE.
  • While the usage of JS for grammars (really?) is a questionable decision, the fact that these are compiled and not interpreted directly makes it quite likely that Tree Sitter will be faster than the current regex-based engine.
  • Tree Sitter grammars are not overly complicated. Their complexity is necessary for doing a good job. Sure we all would like to just throw some keywords into a magic tool that produces a readily usable syntax that does everything we want, but that’s impossible. Decades of research into parsers and grammars have provided knowledge of how to parse input quickly and handle error cases well, and the sad fact is that for the most part, this knowledge is ignored by overly simple syntax highlighting engines in editors. Those will get you quick adoption because writing plugins for additional languages is simple, but it will also impose a limit on features that your editor can support.

So yeah, I am wholly supporting the adoption of Tree Sitter.

4 Likes

I’ve worked on at least two different attempts at Clojure language support (syntax and navigation, more recently LSP), and Tree Sitter sounds like the best next step to me too. Thanks for this insight into the company’s thoughts and process and for gathering community feedback on this!

2 Likes

Hi Logan;

While I’ve not worked with Tree-Sitter, I’ve developed using the TextMate grammar as well as Nova, and while I’ve found Nova considerably more painful to develop for than TextMate (lack of complete documentation and lack of tooling to support grammar development being the biggest current pain points), it’s also considerably more flexible than TextMate ever was.

While implementing TextMate could provide some “free” language grammars, it’s also a significant limitation for the future, since, why develop a new grammar with the new features, when an existing TextMate grammar is right there?

I think what would be ideal here is to provide Tree-Sitter support and tooling to make it easier to port my grammar to Tree-Sitter, so that I’m not thrown straight into “I’d like new features but also doing Tree-Sitter right now is a bit too much.”

Looking forward to seeing what comes of this!

1 Like

I’m super excited by the idea of a parser that can benefit from other ecosystems and grow/evolve on its own. Obviously there are risks depending on someone else’s project, but as Logan and others have said, it seems the benefits outweigh the costs. Count me in on being excited about Tree-sitter, even if it means I need to rewrite a couple of my plugins.

Here are a couple other things I haven’t seen mentioned yet:

  • Parser combinators. Someone can tell me if this is exactly what tree sitter is and maybe it’s a moot point. But the idea is that you can build up extremely small rules (like parse a number, combined (combinator) with parse a symbol, to then parse a mathematical equation). It might be cool to have people build these parser combinators that run inside Nova to do the parsing.
  • EBNF. I’ve seen this format for describing a syntax a couple times. For example, here’s an EBNF for cooklang. Maybe there’s a tool out there to convert these into Tree-sitter or something else that makes language definitions more portable/approachable.
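For anyone who hasn't met parser combinators, here is a minimal sketch of the "number combined with symbol gives an equation" idea from the first bullet, in plain JavaScript. Every name in it (`regex`, `seq`, `many`, `map`, `evaluate`) is invented for this example; it isn't Tree-sitter's or Nova's API, and it only handles `+` and `-` on integers.

```javascript
// A parser is a function (input, pos) => { value, pos } or null on failure.

// Match a regular expression exactly at the current position.
const regex = (re) => (input, pos) => {
  const sticky = new RegExp(re.source, "y");
  sticky.lastIndex = pos;
  const match = sticky.exec(input);
  return match ? { value: match[0], pos: pos + match[0].length } : null;
};

// Run parsers one after another; fail if any of them fails.
const seq = (...parsers) => (input, pos) => {
  const values = [];
  for (const parser of parsers) {
    const result = parser(input, pos);
    if (!result) return null;
    values.push(result.value);
    pos = result.pos;
  }
  return { value: values, pos };
};

// Repeat a parser zero or more times.
const many = (parser) => (input, pos) => {
  const values = [];
  for (;;) {
    const result = parser(input, pos);
    if (!result || result.pos === pos) break; // stop on failure or no progress
    values.push(result.value);
    pos = result.pos;
  }
  return { value: values, pos };
};

// Transform the value a parser produces.
const map = (parser, fn) => (input, pos) => {
  const result = parser(input, pos);
  return result ? { value: fn(result.value), pos: result.pos } : null;
};

// Small rules...
const number = map(regex(/\s*\d+/), (text) => Number(text.trim()));
const symbol = map(regex(/\s*[+-]/), (text) => text.trim());

// ...combined into a bigger one: number (symbol number)*
const equation = map(seq(number, many(seq(symbol, number))), ([first, rest]) =>
  rest.reduce((acc, [op, n]) => (op === "+" ? acc + n : acc - n), first)
);

function evaluate(text) {
  const result = equation(text, 0);
  if (!result || text.slice(result.pos).trim() !== "") {
    throw new Error("parse error");
  }
  return result.value;
}

console.log(evaluate("1 + 2 - 3 + 10")); // 10
```

The appeal is that `equation` is built entirely out of the smaller parsers, so each piece can be tested and reused on its own.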

Anyway, not suggestions so much as other ideas in this space that I haven’t seen mentioned yet. Wonderful write-ups everyone. What a great and exciting discussion.

Hi there, I asked for tree-sitter over a year ago when I developed my Polis theme. I ran into many issues, like language-in-language theming (say a shell script containing JavaScript, or having Elixir in Markdown code blocks). The biggest issue to me was, and still is, tracking down variables and giving them the same color, making it easier to follow how data flows. The current engine doesn’t support that type of coloring.

// 'a' would always have the same color
// 'b' would also have the same color but different from 'a'
const getSmaller = (a, b) => {
    if (a <= b) {
        return a
    } else {
        return b
    }
}
// Nova currently renders 'a' and 'b' with the same color
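A minimal sketch of the coloring rule described in those comments, assuming the editor can already hand over the identifier occurrences of a scope in source order (which is what a real syntax tree would provide). The palette and the `assignColors` helper are hypothetical and not part of any Nova theme API.

```javascript
// Hypothetical palette; a real theme would supply its own colors.
const palette = ["orange", "teal", "violet", "gold"];

// Give each distinct identifier the next palette entry on first sight,
// so every later occurrence of the same name reuses the same color.
function assignColors(identifiers, palette) {
  const colors = new Map();
  for (const name of identifiers) {
    if (!colors.has(name)) {
      colors.set(name, palette[colors.size % palette.length]);
    }
  }
  return colors;
}

// Identifier occurrences from the function above, in source order.
const occurrences = ["a", "b", "a", "b", "a", "b"];
const colors = assignColors(occurrences, palette);
console.log(colors.get("a"), colors.get("b")); // orange teal
```

Names stay distinct only until the palette runs out, after which colors repeat; a real implementation would also need scope information so that unrelated variables with the same name are not tied together.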

Another issue for me, as some have mentioned, is that creating a new grammar is a pain. I wanted to make a vlang.io grammar, but I didn’t get far and gave up.

(+1) Tree-Sitter

Allow me to follow up on this: Seeing that TreeSitter defines grammar rules as JS functions that may refer to other functions, these are, in fact, parser combinators.

Concerning EBNF, yeah that’s been around a long time, particularly in computer science. I would argue that the format is subpar for this use-case because it misses a lot of convenience TreeSitter provides, e.g.

  • precedence
  • associativity
  • hiding rules in the syntax tree
  • naming subtrees
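
To show what the first two items control, here is a small precedence-climbing parser in plain JavaScript. It is a sketch of the concept only and deliberately not Tree-sitter's grammar DSL; the operator table and the `parse` helper are made up for this example. The same decisions would have to be encoded structurally, with one extra rule per precedence level, in a plain EBNF grammar.

```javascript
// Higher `prec` binds tighter; `assoc` decides how equal-precedence
// operators group.
const operators = {
  "+": { prec: 1, assoc: "left" },
  "-": { prec: 1, assoc: "left" },
  "*": { prec: 2, assoc: "left" },
  "^": { prec: 3, assoc: "right" },
};
const isOperator = (token) =>
  Object.prototype.hasOwnProperty.call(operators, token);

// Parse integers and the operators above; return a fully parenthesized
// string that makes the resulting tree shape visible.
function parse(text) {
  const tokens = text.match(/\d+|[-+*^]/g) || [];
  let i = 0;

  function parseExpression(minPrec) {
    let left = tokens[i++];
    if (left === undefined || isOperator(left)) {
      throw new Error("expected a number");
    }
    while (i < tokens.length) {
      if (!isOperator(tokens[i])) throw new Error("expected an operator");
      const op = operators[tokens[i]];
      if (op.prec < minPrec) break;
      const symbol = tokens[i++];
      // Left-associative: the right operand may only contain tighter
      // operators. Right-associative: it may contain the same operator.
      const nextMin = op.assoc === "left" ? op.prec + 1 : op.prec;
      const right = parseExpression(nextMin);
      left = `(${left} ${symbol} ${right})`;
    }
    return left;
  }

  return parseExpression(1);
}

console.log(parse("1 + 2 * 3")); // (1 + (2 * 3))
console.log(parse("1 - 2 - 3")); // ((1 - 2) - 3)
console.log(parse("2 ^ 3 ^ 2")); // (2 ^ (3 ^ 2))
```

Flipping a single `assoc` or `prec` value in the table changes the tree shape without touching the parsing rules, which is the convenience being pointed at.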

3 Likes

FYI, I posted up an extension for C using the tree-sitter grammar. I wrote that extension in about an hour, maybe less. Tree-Sitter is fantastic. Admittedly that extension only provides for syntax highlighting at the moment (I intend to add folding and symbolication as soon as I figure out the particular queries Nova needs), but still this is a huge leap forward with Nova 10.

I also have a D extension but I need to tweak it slightly before I post it.

One thing I can say, is that the captures that Nova uses are not particularly well suited for C family languages. I would like to see a few additional captures added that would help us with highlighting such languages and give more freedom to Theme authors. For example, there is no good way to tag primitive data types like “int”, and there isn’t particularly good support for highlighting include files or package imports. We wind up settling in those regards for a closest match.

3 Likes

Btw, if anyone comes across this – my C-Dragon extension is a lot more friendly for C, C++, and Objective-C. D-Velop does the same for D, and Go-Bee does it for Go. (All three were written by me.) There are still things I wish Nova would add to make highlighting / captures a bit better for these strongly typed languages though.

1 Like