Hello Everyone,
In the year and a half (!) since Nova’s launch, there’s been a lot of discussion and questions around languages and grammars.
I’m here today to talk about this more in depth, discuss our goals for the future, and get some feedback from our community as we move forward. This directly concerns how both Panic and our extension developers create and maintain language extensions for Nova.
Note: This is a very long post. I am sorry it’s so long. If you do bear with me and read it, I do appreciate it.
First, Some Overly Verbose Backstory
Feel free to skip this section unless you’re interested in the How We Got Here.
Nova as a project originated from the codebase of Coda iOS (née Diet Coda, eventually Code Editor for iOS). Many of the prototypes for its more core features were started for a Coda iOS 3.0 (like indexing and Git), although they never ended up shipping as such before the focus shift to Nova and the former’s discontinuation. The biggest shared component, though, was the parse engine.
Backing up further—many of you may know that Coda 1 and Coda 2 used a parse engine licensed from SubEthaEdit, with modifications down the line made by Panic to support some more complex features. Its language grammars were directly based on those from this engine, which use an XML format. When it came time to build Diet Coda, though, we could not use it for several reasons (license being first, and restrictions of the iOS 5/6-ish SDK and porting the SubEthaEdit codebase being the second).
Thus, it was decided to write a new engine from scratch for this project, one that might eventually make its way back into a future Coda. It was also decided to not base our grammars on those from SubEthaEdit, for reasons that honestly are sort of lost to time (this was slightly before my time). It was also decided not to base the grammars on those of TextMate, as that format could not express all of the features that SubEthaEdit had (and we were trying to match in Diet Coda). So, the original Diet Coda 1.0 used a private property-list format for grammars, and only supported a few (five, I think?) at launch which slowly grew over time.
After Diet Coda 1.x is when I joined the project, which eventually led to me taking it over when its original lead left Panic. Part of my work then became moving towards a “Diet Coda 2.0.” In so doing, the parse engine was heavily rewritten and upgraded, and the property list format was changed to an XML format that could be more expressive (and less verbose, lol). That is from where the basis for Nova’s grammar format hails.
Eventually, this culminated in porting the engine back to the Mac and starting the Nova project. Since it was already a relatively mature engine and grammar set for Coda iOS, we decided to use it as the basis for Nova. As such, through Nova’s nearly four-year development cycle before release, the engine and grammar format were upgraded again and again until we ended up with what we have today.
Up until the point of launch, our primary focus was on maturing the built-in grammars in Nova and making it easiest for our team to build the features we wanted to build. This was the motivating factor for a custom in-house engine and custom in-house grammars. We could rapidly make changes to both the engine and its grammars as necessary to build cool things.
In the run-up to Nova’s launch, though, our focus began to shift. We immediately recognized the importance of our developer ecosystem and were honestly humbled by the response from the nascent community. To be perfectly frank, Coda never quite had the response from extension devs that Nova has been blessed with, and in hindsight we weren’t expecting it.
The Stress of Building and Maintaining Language Grammars
For our team, maintaining our built-in language grammars has proven to be a time-consuming task, especially when we aren’t nearly as versed in some languages we ship ([cough] TypeScript). While the grammars have come a long way, there are still many cases where we are playing catch-up, only compounded by how fast many of these languages evolve (TypeScript, Python, etc.).
Thus we come to the most common question we get from extension developers: “How do I make an extension for {X} language?”
Around this, there are a few glaringly obvious pain points:
- Nova uses a custom grammar format.
- While it is conceptually similar to both SubEthaEdit & TextMate formats, it is not directly compatible.
- Attempts have been made to write “converters,” but they are only a start to developing a grammar, and not a solution.
- The languages most requested by users likely already have mature grammars in other formats out there: Go, Rust, C++, Swift, etc.
- Most of the time, our extension developers (you all) aren’t interested in writing one from scratch.
- Developers just want to hook Language A up to Nova, maybe add some additional cool things on top, and have it all just work.
This alone has led, we believe, a good number of (potential and existing) extension developers to shy away from maintaining a custom language grammar. Because once the effort is put in to actually build it, you then have to maintain it; this is the same issue we face internally. It’s just compounded for you all, as this isn’t your day job.
Identifying Possible Solutions
This brings me to the actual point of this post: solving this issue for both us and you. This issue has weighed heavily on my mind ever since Nova’s launch.
We have been listening, even if it might not have felt as such, as considering such a wide-reaching change as this in addition to the development work we do otherwise is… a lot. For each time we at Panic would sit down and discuss this, we didn’t feel like it was necessarily the time to make a decision. But it’s now time we do.
There have been many suggestions put forth by the community over the last year and a half. I’d like to recap the most common ones and address each one before moving forward.
Suggestion 1: Switch to TextMate (or VSCode) Grammars
Both VSCode and Atom (before tree sitter, bear with me) can trace their base language grammars to a fork of those from TextMate. While now in different text formats (JSON, CSON, and Property List respectively), they are ultimately the same grammar format, diverged by tweaks and improvements made along the way by each editor. Therefore, why not use one of those?
That is an excellent suggestion. It would solve one problem: It will make it easier to integrate existing grammars from another editor into Nova.
However, it fails to solve another. Since Nova’s current parser does not support TextMate grammars, we’d need to extend it to support them. The TextMate format is… roughly documented (v1 is much more documented than v2), but this documentation is just a specification. We’d need to ensure that the parser works in all of the subtly different ways the format needs to be parsed over what we have now, all the while ensuring that Nova parsed these new grammars in roughly the same way as the other editor(s) that use them. This is a task in and of itself.
Another aspect is that, ultimately in this author’s opinion, TextMate and other formats which rely entirely on regular expressions to parse are flawed. These grammars can parse complex languages (it’s not just regular expressions, it’s the combination with scoping that makes it work), but they’re still restricted by how much can be expressed in the format itself. This is one reason we originally built the Nova (née Coda iOS) format, to make a few parts of this type of parsing easier than TextMate’s format can express.
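For anyone who hasn’t worked with this style of grammar, here is a minimal sketch in Python of what “regular expressions plus scopes” boils down to. To be clear, this is not Nova’s engine or TextMate’s: the rule list, the scope names, and the `tokenize_line` function are all made up for illustration, and real grammars add nesting, captures, and begin/end rules on top of this.

```python
import re

# Hypothetical rules in the spirit of a regex-based grammar: each rule pairs a
# scope name with a regular expression. These names are illustrative only and
# are not taken from Nova, TextMate, or any shipping grammar.
RULES = [
    ("comment.line", re.compile(r"//.*")),
    ("string.quoted", re.compile(r'"(?:\\.|[^"\\])*"')),
    ("keyword.control", re.compile(r"\b(?:if|else|return)\b")),
    ("constant.numeric", re.compile(r"\b\d+\b")),
]


def tokenize_line(line):
    """Return (scope, text) pairs found on a single line, left to right."""
    tokens = []
    pos = 0
    while pos < len(line):
        for scope, pattern in RULES:
            match = pattern.match(line, pos)
            if match:
                tokens.append((scope, match.group()))
                pos = match.end()
                break
        else:
            pos += 1  # No rule matched at this position; skip one character.
    return tokens


if __name__ == "__main__":
    for scope, text in tokenize_line('if x return 42 // done'):
        print(scope, repr(text))
```

Note that the tokenizer only ever looks at one line and one position at a time. Anything that needs to know about surrounding structure has to be bolted on through extra rules and scope bookkeeping, which is where the limits of the format start to show.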
If we adopt the TextMate (or VSCode) format, it will be increasingly difficult for us to add to this format to resolve these limitations, as it’s also used by other editors, and we’re right back to where we started.
Atom, before GitHub’s acquisition by Microsoft, had also clearly realized this, and began migrating to a different conceptual method of parsing (tree sitter, which we’ll get into in the second suggestion). This went quite well for them and the developer response was overall positive.
The VSCode team, on the other hand, has been conflicted on tree sitter support or otherwise replacing their TextMate grammars. They see the Language Server Protocol’s newer Semantic Tokenization (see the third suggestion), which supplements but does not replace the grammars, as their path forward for the time being.
In any case, though, it would likely not be in our best interest to adopt TextMate grammars. If we do, and other editors move away from this style of parsing, we’re once again tethered to an aging format as well as maintaining our own custom parser to consume them. While the TextMate grammar space is mature, there is little area for improvement or advancement.
(Oh, and before someone asks “Why not take TextMate’s engine source?” — TextMate 2 is GPLv3, we can’t even legally look at it for reverse engineering.)
Suggestion 2: Switch to Tree Sitter
aka in which Logan learns to eat some crow.
Tree Sitter is a project developed independently from any specific editor over recent years which approaches text editor parsing from a different perspective. It does not use regular expressions, instead using an approach much closer to most compilers, albeit still in an abstracted, generalized way. I’ll not bore you with the strict details (feel free to read up on the project or watch their conference talks for more info).
The project probably gained the most attention when Atom began adopting it to replace most of its core language grammars several years ago. This was a long process that was done in steps as those grammars matured, starting with only a few and slowly supplanting many of the built-in languages, as well as growing new grammars in its third-party package ecosystem.
Ultimately, Atom’s development slowed after the GitHub acquisition, but tree-sitter remains as an independent project. It’s also experimentally supported by a NeoVim plugin. The project itself has healthy activity, and from what I can tell, is primarily led by one core developer and several regular contributors, supplemented by many contributors for each of the language grammars themselves.
So, what are the pros and cons here?
Well, the biggest pro is that we’d be replacing not just our language grammars, but also the underlying core parsing engine. This means that the engine using the grammars is the exact same as used in Atom and NeoVim. There is no room for interpretive difference as is the case with TextMate grammars where each editor implemented their own parser for them.
Second, there are many, many tree sitter grammars out there already. Maybe not as many as TextMate grammars, but it’s more than enough to cover the most popular languages of which we can think.
Third, Tree Sitter, by approaching parsing from a fundamentally different direction, has the capability to have much richer syntax highlighting and parse tree logic than regular expression-based parsers. This could allow Nova to build much richer syntactic analysis tools on top of it going forward, in ways for which more complex language servers are currently necessary.
Okay, cons: Even more so, we’d be moving aside our current parsing engine to add Tree Sitter wholesale. We’d keep support for our existing grammars and parsing, just as Atom did as it transitioned. But going forward, we’d want to direct development toward Tree Sitter.
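To make the “parse tree” idea a little more concrete, here is a toy sketch in Python. It is emphatically not Tree Sitter’s algorithm or API (Tree Sitter is a generated, incremental parser written in C); the `parse` and `innermost_node_at` functions and the node layout are hypothetical, and the “language” is nothing more than nested brackets. The point is only that once you have a tree, questions like “what structure is the cursor inside?” become a simple walk.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def parse(text):
    """Build a tree of nested bracket nodes; raise ValueError on a mismatch."""
    root = {"kind": "document", "start": 0, "end": len(text), "children": []}
    stack = [root]
    for index, char in enumerate(text):
        if char in PAIRS:
            # Opening bracket: start a child node under the current node.
            node = {"kind": char + PAIRS[char], "start": index,
                    "end": None, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif char in PAIRS.values():
            # Closing bracket: it must close the innermost open node.
            node = stack[-1]
            if node is root or PAIRS[node["kind"][0]] != char:
                raise ValueError(f"unexpected {char!r} at offset {index}")
            node["end"] = index + 1
            stack.pop()
    if len(stack) > 1:
        raise ValueError("unclosed bracket")
    return root


def innermost_node_at(node, offset):
    """Return the deepest node whose [start, end) range contains offset."""
    for child in node["children"]:
        if child["start"] <= offset < child["end"]:
            return innermost_node_at(child, offset)
    return node


if __name__ == "__main__":
    tree = parse("f(a[1], {b})")
    print(innermost_node_at(tree, 4)["kind"])
```

A real grammar would of course produce named nodes for functions, classes, strings, and so on, but the querying side looks much the same.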
Further, Tree Sitter grammars have gained a small reputation for being overly complicated to build compared to the much more commonly understood regular expression style. While I understand this concern, it does seem that after some initial hesitation most developers working on them have adapted well.
Finally, we’d be giving up direct control of the parsing engine itself to the existing Tree Sitter community at large. This means we wouldn’t necessarily be able to make the wide-reaching changes that we might want to make rapidly. However, when all is said and done, it’s my opinion that anything we’d want to do in this area sits above the parse engine (in areas such as symbolication and syntactic analysis), and thus the risk is far outweighed by the rewards. Panic has also gotten very comfortable with making changes to and helping maintain open source projects (such as libssh2).
Suggestion 3: Adopt Language Server Protocol (LSP) Semantic Tokenization
We could take the same approach that VSCode has started to take: supplementing its parsing by using the new support in the Language Server Protocol for Semantic Tokenization.
To be clear: Semantic Tokens in LSP is not a replacement for parsing. It’s a supplement. It’s designed to make it so that certain tokens, such as class names, methods, etc. can be targeted for specific highlighting based on the knowledge that a specific language server has about them. It can do more, but most that implement this feature focus on supplementing the existing highlighting provided by TextMate-style parsing.
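For the curious, here is roughly what a client receives. As I understand the LSP specification, a semantic tokens response is a flat array of integers, five per token: line delta, start-character delta, length, an index into the server’s token type legend, and a bitset over its modifier legend. The sketch below decodes that array in Python. The function name `decode_semantic_tokens` and the tuple layout are my own for illustration, not part of any Nova API, and the sample data is only a small example with a made-up legend.

```python
def decode_semantic_tokens(data, token_types, token_modifiers):
    """Decode a relative-encoded semantic token array into absolute tuples.

    Returns a list of (line, start_char, length, type_name, modifier_names).
    """
    if len(data) % 5 != 0:
        raise ValueError("token data must contain five integers per token")
    tokens = []
    line = 0
    start = 0
    for i in range(0, len(data), 5):
        delta_line, delta_start, length, type_index, modifier_bits = data[i:i + 5]
        line += delta_line
        # The start delta is relative to the previous token only when both
        # tokens are on the same line; otherwise it is an absolute column.
        start = delta_start if delta_line > 0 else start + delta_start
        modifiers = [name for bit, name in enumerate(token_modifiers)
                     if modifier_bits & (1 << bit)]
        tokens.append((line, start, length, token_types[type_index], modifiers))
    return tokens


if __name__ == "__main__":
    # Illustrative legend and data; a real server announces its own legend.
    types = ["property", "type", "class"]
    modifiers = ["private", "static"]
    data = [2, 5, 3, 0, 3, 0, 5, 4, 1, 0, 3, 2, 7, 2, 0]
    for token in decode_semantic_tokens(data, types, modifiers):
        print(token)
```

Notice there is nothing here about structure, nesting, or anything a parser would give you. It’s a list of ranges with labels, which is exactly why it works as a layer on top of existing highlighting rather than in place of it.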
If we were to adopt this, we’d still have the same core problem: our underlying custom parse engine.
There’s also nothing to prevent us from adopting semantic token highlighting on whatever engine with which we end up. Ultimately, I see this technology as icing to be considered later, not as a solution.
So, Where Do We Go From Here?
With all of this in mind, we at Panic are left with a decision on Nova’s future, one which we most definitely want to make with the feedback of our amazing extension developers.
To recap in short, the most practical options going forward, in no particular order, are:
- Keep our custom parsing engine and format, perhaps investigate better ways to convert from other formats
- Adopt the TextMate format, which would require extensive changes to our parsing engine, and while helping our extension developers would not necessarily help us as much in making advancements in the future
- Transition to an entirely different parsing engine and grammar format, the foremost possibility being Tree Sitter, thus helping both us and our extension developers and potentially making room for advancement in the future
At this point, I am not committing us to any one path, as this is still all just exploratory. But, we are actively discussing these things internally. We also have no specific timeframe for any of this, as there are a lot of “ifs.”
Now, in my opinion and after laying it all bare in this way, Tree Sitter stands as the best path forward for both Nova’s team and our extension developers. Keep in mind, I know little about the actual performance of Tree Sitter against our current parser (both of which are marketed as very fast), and likely won’t until time is taken to actually try. There are more things to consider beyond the raw speed of its C source, as Nova’s features need to sit atop it and bridge to Swift, which may take time to get right. So, take all of that with a grain of salt.
If you made it this far, thank you. I’m here to listen. How do you all feel about all of this? Are there any major considerations I’ve missed or forgotten? I’d love your thoughts, as you are as much a part of the future of Nova as we are.
Well, that ended somewhat abruptly.