443 lines
14 KiB
Text
443 lines
14 KiB
Text
![]() |
STARTER GUIDE ON WRITTING MAJOR MODE WITH TREE-SITTER -*- org -*-
|
|||
|
|
|||
|
This document guides you on adding tree-sitter support to a major
|
|||
|
mode.
|
|||
|
|
|||
|
TOC:
|
|||
|
|
|||
|
- Building Emacs with tree-sitter
|
|||
|
- Install language definitions
|
|||
|
- Setup
|
|||
|
- Font-lock
|
|||
|
- Indent
|
|||
|
- Imenu
|
|||
|
- Navigation
|
|||
|
- Which-func
|
|||
|
- More features?
|
|||
|
- Common tasks (code snippets)
|
|||
|
- Manual
|
|||
|
|
|||
|
* Building Emacs with tree-sitter
|
|||
|
|
|||
|
You can either install tree-sitter by your package manager, or from
|
|||
|
source:
|
|||
|
|
|||
|
git clone https://github.com/tree-sitter/tree-sitter.git
|
|||
|
cd tree-sitter
|
|||
|
make
|
|||
|
make install
|
|||
|
|
|||
|
Then pull the tree-sitter branch (or the master branch, if it has
|
|||
|
merged) and rebuild Emacs.
|
|||
|
|
|||
|
* Install language definitions
|
|||
|
|
|||
|
Tree-sitter by itself doesn’t know how to parse any particular
|
|||
|
language. We need to install language definitions (or “grammars”) for
|
|||
|
a language to be able to parse it. There are a couple of ways to get
|
|||
|
them.
|
|||
|
|
|||
|
You can use this script that I put together here:
|
|||
|
|
|||
|
https://github.com/casouri/tree-sitter-module
|
|||
|
|
|||
|
You can also find them under this directory in /build-modules.
|
|||
|
|
|||
|
This script automatically pulls and builds language definitions for C,
|
|||
|
C++, Rust, JSON, Go, HTML, Javascript, CSS, Python, Typescript,
|
|||
|
and C#. Better yet, I pre-built these language definitions for
|
|||
|
GNU/Linux and macOS, they can be downloaded here:
|
|||
|
|
|||
|
https://github.com/casouri/tree-sitter-module/releases/tag/v2.1
|
|||
|
|
|||
|
To build them yourself, run
|
|||
|
|
|||
|
git clone git@github.com:casouri/tree-sitter-module.git
|
|||
|
cd tree-sitter-module
|
|||
|
./batch.sh
|
|||
|
|
|||
|
and language definitions will be in the /dist directory. You can
|
|||
|
either copy them to standard dynamic library locations of your system,
|
|||
|
eg, /usr/local/lib, or leave them in /dist and later tell Emacs where
|
|||
|
to find language definitions by setting ‘treesit-extra-load-path’.
|
|||
|
|
|||
|
Language definition sources can be found on GitHub under
|
|||
|
tree-sitter/xxx, like tree-sitter/tree-sitter-python. The tree-sitter
|
|||
|
organization has all the "official" language definitions:
|
|||
|
|
|||
|
https://github.com/tree-sitter
|
|||
|
|
|||
|
* Setting up for adding major mode features
|
|||
|
|
|||
|
Start Emacs, and load tree-sitter with
|
|||
|
|
|||
|
(require 'treesit)
|
|||
|
|
|||
|
Now check if Emacs is built with tree-sitter library
|
|||
|
|
|||
|
(treesit-available-p)
|
|||
|
|
|||
|
For your major mode, first create a tree-sitter switch:
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(defcustom python-use-tree-sitter nil
|
|||
|
"If non-nil, `python-mode' tries to use tree-sitter.
|
|||
|
Currently `python-mode' can utilize tree-sitter for font-locking,
|
|||
|
imenu, and movement functions."
|
|||
|
:type 'boolean)
|
|||
|
#+end_src
|
|||
|
|
|||
|
Then in other places, we decide on whether to enable tree-sitter by
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(and python-use-tree-sitter
|
|||
|
(treesit-can-enable-p))
|
|||
|
#+end_src
|
|||
|
|
|||
|
* Font-lock
|
|||
|
|
|||
|
Tree-sitter works like this: You provide a query made of patterns and
|
|||
|
capture names, tree-sitter finds the nodes that match these patterns,
|
|||
|
tag the corresponding capture names onto the nodes and return them to
|
|||
|
you. The query function returns a list of (capture-name . node). For
|
|||
|
font-lock, we use face names as capture names. And the captured node
|
|||
|
will be fontified in their capture name. The capture name could also
|
|||
|
be a function, in which case (START END NODE) is passed to the
|
|||
|
function for font-lock. START and END is the start and end the
|
|||
|
captured NODE.
|
|||
|
|
|||
|
** Query syntax
|
|||
|
|
|||
|
There are two types of nodes, named, like (identifier),
|
|||
|
(function_definition), and anonymous, like "return", "def", "(",
|
|||
|
"}". Parent-child relationship is expressed as
|
|||
|
|
|||
|
(parent (child) (child) (child (grand_child)))
|
|||
|
|
|||
|
Eg, an argument list (1, "3", 1) could be:
|
|||
|
|
|||
|
(argument_list "(" (number) (string) (number) ")")
|
|||
|
|
|||
|
Children could have field names in its parent:
|
|||
|
|
|||
|
(function_definition name: (identifier) type: (identifier))
|
|||
|
|
|||
|
Match any of the list:
|
|||
|
|
|||
|
["true" "false" "none"]
|
|||
|
|
|||
|
Capture names can come after any node in the pattern:
|
|||
|
|
|||
|
(parent (child) @child) @parent
|
|||
|
|
|||
|
The query above captures both parent and child.
|
|||
|
|
|||
|
["return" "continue" "break"] @keyword
|
|||
|
|
|||
|
The query above captures all the keywords with capture name
|
|||
|
"keyword".
|
|||
|
|
|||
|
These are the common syntax, see all of them in the manual
|
|||
|
("Parsing Program Source" section).
|
|||
|
|
|||
|
** Query references
|
|||
|
|
|||
|
But how do one come up with the queries? Take python for an
|
|||
|
example, open any python source file, evaluate
|
|||
|
|
|||
|
(treesit-parser-create 'python)
|
|||
|
|
|||
|
so there is a parser available, then enable ‘treesit-inspect-mode’.
|
|||
|
Now you should see information of the node under point in
|
|||
|
mode-line. Move around and you should be able to get a good
|
|||
|
picture. Besides this, you can consult the grammar of the language
|
|||
|
definition. For example, Python’s grammar file is at
|
|||
|
|
|||
|
https://github.com/tree-sitter/tree-sitter-python/blob/master/grammar.js
|
|||
|
|
|||
|
Neovim also has a bunch of queries to reference:
|
|||
|
|
|||
|
https://github.com/nvim-treesitter/nvim-treesitter/tree/master/queries
|
|||
|
|
|||
|
The manual explains how to read grammar files in the bottom of section
|
|||
|
"Tree-sitter Language Definitions".
|
|||
|
|
|||
|
** Debugging queires
|
|||
|
|
|||
|
If your query has problems, it usually cannot compile. In that case
|
|||
|
use ‘treesit-query-validate’ to debug the query. It will pop a buffer
|
|||
|
containing the query (in text format) and mark the offending part in
|
|||
|
red.
|
|||
|
|
|||
|
** Code
|
|||
|
|
|||
|
To enable tree-sitter font-lock, set ‘treesit-font-lock-settings’
|
|||
|
buffer-locally and call ‘treesit-font-lock-enable’. For example, see
|
|||
|
‘python--treesit-settings’ in python.el. Below I paste a snippet of
|
|||
|
it.
|
|||
|
|
|||
|
Note that like the current font-lock, if the to-be-fontified region
|
|||
|
already has a face (ie, an earlier match fontified part/all of the
|
|||
|
region), the new face is discarded rather than applied. If you want
|
|||
|
later matches always override earlier matches, use the :override
|
|||
|
keyword.
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(defvar python--treesit-settings
|
|||
|
(treesit-font-lock-rules
|
|||
|
:language 'python
|
|||
|
:override t
|
|||
|
`(;; Queries for def and class.
|
|||
|
(function_definition
|
|||
|
name: (identifier) @font-lock-function-name-face)
|
|||
|
|
|||
|
(class_definition
|
|||
|
name: (identifier) @font-lock-type-face)
|
|||
|
|
|||
|
;; Comment and string.
|
|||
|
(comment) @font-lock-comment-face
|
|||
|
|
|||
|
...)))
|
|||
|
#+end_src
|
|||
|
|
|||
|
Then in ‘python-mode’, enable tree-sitter font-lock:
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(treesit-parser-create 'python)
|
|||
|
;; This turns off the syntax-based font-lock for comments and
|
|||
|
;; strings. So it doesn’t override tree-sitter’s fontification.
|
|||
|
(setq-local font-lock-keywords-only t)
|
|||
|
(setq-local treesit-font-lock-settings
|
|||
|
python--treesit-settings)
|
|||
|
(treesit-font-lock-enable)
|
|||
|
#+end_src
|
|||
|
|
|||
|
Concretely, something like this:
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(define-derived-mode python-mode prog-mode "Python"
|
|||
|
...
|
|||
|
|
|||
|
(treesit-parser-create 'python)
|
|||
|
|
|||
|
(if (and python-use-tree-sitter
|
|||
|
(treesit-can-enable-p))
|
|||
|
;; Tree-sitter.
|
|||
|
(progn
|
|||
|
(setq-local font-lock-keywords-only t)
|
|||
|
(setq-local treesit-font-lock-settings
|
|||
|
python--treesit-settings)
|
|||
|
(treesit-font-lock-enable))
|
|||
|
;; No tree-sitter
|
|||
|
(setq-local font-lock-defaults ...))
|
|||
|
|
|||
|
...)
|
|||
|
#+end_src
|
|||
|
|
|||
|
You’ll notice that tree-sitter’s font-lock doesn’t respect
|
|||
|
‘font-lock-maximum-decoration’, major modes are free to set
|
|||
|
‘treesit-font-lock-settings’ based on the value of
|
|||
|
‘font-lock-maximum-decoration’, or provide more fine-grained control
|
|||
|
through other mode-specific means.
|
|||
|
|
|||
|
* Indent
|
|||
|
|
|||
|
Indent works like this: We have a bunch of rules that look like this:
|
|||
|
|
|||
|
(MATCHER ANCHOR OFFSET)
|
|||
|
|
|||
|
At the beginning point is at the BOL of a line, we want to know which
|
|||
|
column to indent this line to. Let NODE be the node at point, we pass
|
|||
|
this node to the MATCHER of each rule, one of them will match the node
|
|||
|
("this node is a closing bracket!"). Then we pass the node to the
|
|||
|
ANCHOR, which returns a point, eg, the BOL of the previous line. We
|
|||
|
find the column number of that point (eg, 4), add OFFSET to it (eg,
|
|||
|
0), and that is the column we want to indent the current line to (4 +
|
|||
|
0 = 4).
|
|||
|
|
|||
|
For MATHCER we have
|
|||
|
|
|||
|
(parent-is TYPE)
|
|||
|
(node-is TYPE)
|
|||
|
(query QUERY) => matches if querying PARENT with QUERY
|
|||
|
captures NODE.
|
|||
|
|
|||
|
(match NODE-TYPE PARENT-TYPE NODE-FIELD
|
|||
|
NODE-INDEX-MIN NODE-INDEX-MAX)
|
|||
|
|
|||
|
=> checks everything. If an argument is nil, don’t match that. Eg,
|
|||
|
(match nil nil TYPE) is the same as (parent-is TYPE)
|
|||
|
|
|||
|
For ANCHOR we have
|
|||
|
|
|||
|
first-sibling => start of the first sibling
|
|||
|
parent => start of parent
|
|||
|
parent-bol => BOL of the line parent is on.
|
|||
|
prev-sibling
|
|||
|
no-indent => don’t indent
|
|||
|
prev-line => same indent as previous line
|
|||
|
|
|||
|
There is also a manual section for indent: "Parser-based Indentation".
|
|||
|
|
|||
|
When writing indent rules, you can use ‘treesit-check-indent’ to
|
|||
|
check if your indentation is correct. To debug what went wrong, set
|
|||
|
‘treesit--indent-verboase’ to non-nil. Then when you indent, Emacs
|
|||
|
tells you which rule is applied in the echo area.
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(defvar typescript-mode-indent-rules
|
|||
|
(let ((offset typescript-indent-offset))
|
|||
|
`((typescript
|
|||
|
;; This rule matches if node at point is "}", ANCHOR is the
|
|||
|
;; parent node’s BOL, and offset is 0.
|
|||
|
((node-is "}") parent-bol 0)
|
|||
|
((node-is ")") parent-bol 0)
|
|||
|
((node-is "]") parent-bol 0)
|
|||
|
((node-is ">") parent-bol 0)
|
|||
|
((node-is ".") parent-bol ,offset)
|
|||
|
((parent-is "ternary_expression") parent-bol ,offset)
|
|||
|
((parent-is "named_imports") parent-bol ,offset)
|
|||
|
((parent-is "statement_block") parent-bol ,offset)
|
|||
|
((parent-is "type_arguments") parent-bol ,offset)
|
|||
|
((parent-is "variable_declarator") parent-bol ,offset)
|
|||
|
((parent-is "arguments") parent-bol ,offset)
|
|||
|
((parent-is "array") parent-bol ,offset)
|
|||
|
((parent-is "formal_parameters") parent-bol ,offset)
|
|||
|
((parent-is "template_substitution") parent-bol ,offset)
|
|||
|
((parent-is "object_pattern") parent-bol ,offset)
|
|||
|
((parent-is "object") parent-bol ,offset)
|
|||
|
((parent-is "object_type") parent-bol ,offset)
|
|||
|
((parent-is "enum_body") parent-bol ,offset)
|
|||
|
((parent-is "arrow_function") parent-bol ,offset)
|
|||
|
((parent-is "parenthesized_expression") parent-bol ,offset)
|
|||
|
...))))
|
|||
|
#+end_src
|
|||
|
|
|||
|
Then you set ‘treesit-simple-indent-rules’ to your rules, and set
|
|||
|
‘indent-line-function’:
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(setq-local treesit-simple-indent-rules typescript-mode-indent-rules)
|
|||
|
(setq-local indent-line-function #'treesit-indent)
|
|||
|
#+end_src
|
|||
|
|
|||
|
* Imenu
|
|||
|
|
|||
|
Not much to say except for utilizing ‘treesit-induce-sparse-tree’.
|
|||
|
See ‘python--imenu-treesit-create-index-1’ in python.el for an
|
|||
|
example.
|
|||
|
|
|||
|
Once you have the index builder, set ‘imenu-create-index-function’.
|
|||
|
|
|||
|
* Navigation
|
|||
|
|
|||
|
Mainly ‘beginning-of-defun-function’ and ‘end-of-defun-function’.
|
|||
|
You can find the end of a defun with something like
|
|||
|
|
|||
|
(treesit-search-forward-goto "function_definition" 'end)
|
|||
|
|
|||
|
where "function_definition" matches the node type of a function
|
|||
|
definition node, and ’end means we want to go to the end of that
|
|||
|
node.
|
|||
|
|
|||
|
Something like this should suffice:
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(defun xxx-beginning-of-defun (&optional arg)
|
|||
|
(if (> arg 0)
|
|||
|
;; Go backward.
|
|||
|
(while (and (> arg 0)
|
|||
|
(treesit-search-forward-goto
|
|||
|
"function_definition" 'start nil t))
|
|||
|
(setq arg (1- arg)))
|
|||
|
;; Go forward.
|
|||
|
(while (and (< arg 0)
|
|||
|
(treesit-search-forward-goto
|
|||
|
"function_definition" 'start))
|
|||
|
(setq arg (1+ arg)))))
|
|||
|
|
|||
|
(setq-local beginning-of-defun-function #'xxx-beginning-of-defun)
|
|||
|
#+end_src
|
|||
|
|
|||
|
And the same for end-of-defun.
|
|||
|
|
|||
|
* Which-func
|
|||
|
|
|||
|
You can find the current function by going up the tree and looking for
|
|||
|
the function_definition node. See ‘python-info-treesit-current-defun’
|
|||
|
in python.el for an example. Since Python allows nested function
|
|||
|
definitions, that function keeps going until it reaches the root node,
|
|||
|
and records all the function names along the way.
|
|||
|
|
|||
|
#+begin_src elisp
|
|||
|
(defun python-info-treesit-current-defun (&optional include-type)
|
|||
|
"Identical to `python-info-current-defun' but use tree-sitter.
|
|||
|
For INCLUDE-TYPE see `python-info-current-defun'."
|
|||
|
(let ((node (treesit-node-at (point)))
|
|||
|
(name-list ())
|
|||
|
(type nil))
|
|||
|
(cl-loop while node
|
|||
|
if (pcase (treesit-node-type node)
|
|||
|
("function_definition"
|
|||
|
(setq type 'def))
|
|||
|
("class_definition"
|
|||
|
(setq type 'class))
|
|||
|
(_ nil))
|
|||
|
do (push (treesit-node-text
|
|||
|
(treesit-node-child-by-field-name node "name")
|
|||
|
t)
|
|||
|
name-list)
|
|||
|
do (setq node (treesit-node-parent node))
|
|||
|
finally return (concat (if include-type
|
|||
|
(format "%s " type)
|
|||
|
"")
|
|||
|
(string-join name-list ".")))))
|
|||
|
#+end_src
|
|||
|
|
|||
|
* More features?
|
|||
|
|
|||
|
Obviously this list is just a starting point, if there are features in
|
|||
|
the major mode that would benefit a parse tree, adding tree-sitter
|
|||
|
support for that would be great. But in the minimal case, just adding
|
|||
|
font-lock is awesome.
|
|||
|
|
|||
|
* Common tasks
|
|||
|
|
|||
|
How to...
|
|||
|
|
|||
|
** Get the buffer text corresponding to a node?
|
|||
|
|
|||
|
(treesit-node-text node)
|
|||
|
|
|||
|
BTW ‘treesit-node-string’ does different things.
|
|||
|
|
|||
|
** Scan the whole tree for stuff?
|
|||
|
|
|||
|
(treesit-search-subtree)
|
|||
|
(treesit-search-forward)
|
|||
|
(treesit-induce-sparse-tree)
|
|||
|
|
|||
|
** Move to next node that...?
|
|||
|
|
|||
|
(treesit-search-forward-goto)
|
|||
|
|
|||
|
** Get the root node?
|
|||
|
|
|||
|
(treesit-buffer-root-node)
|
|||
|
|
|||
|
** Get the node at point?
|
|||
|
|
|||
|
(treesit-node-at (point))
|
|||
|
|
|||
|
* Manual
|
|||
|
|
|||
|
I suggest you read the manual section for tree-sitter in Info. The
|
|||
|
section is Parsing Program Source. Typing
|
|||
|
|
|||
|
C-h i d m elisp RET g Parsing Program Source RET
|
|||
|
|
|||
|
will bring you to that section. You can also read the HTML version
|
|||
|
under /html-manual in this directory. I find the HTML version easier
|
|||
|
to read. You don’t need to read through every sentence, just read the
|
|||
|
text paragraphs and glance over function names.
|