The lexer used for syntax highlighting can be invoked incrementally to process only the changed part of a file, whereas lexers used in other contexts are always called to process an entire file, or a complete language construction embedded in a file in a different language.
A lexer that can be used incrementally may need to return its state, which means the context corresponding to each position in a file. For example, a Java lexer could have separate states for top level context, comment context and string literal context. An important requirement for a syntax highlighting lexer , required for incremental lexing, is that its state must be represented by a single integer number (returned from
Lexer.getState()). That state will be passed to the Lexer.start() method, along with the start offset of the fragment to process, when lexing is resumed from the middle of a file. Lexers used in other contexts can always return 0 from the getState() method if their state has a more complex internal representation.
The easiest way to create a lexer for a custom language plugin is to use JFlex. IDEA contains adapter classes (FlexLexer and FlexAdapter) that adapt JFlex lexers to the IDEA lexer API. The source code of IntelliJ IDEA Community Edition includes a patched version of JFlex 1.4.1 (tools/lexer/jflex-1.4) and lexer skeleton file (tools/lexer/idea-flex.skeleton) which can be used for creating lexers compatible with FlexAdapter. The patched version of JFlex provides a new command line option --charat which changes the JFlex generated code so that it works with the IDEA skeleton (which passes the source data for lexing as a CharSequence and not as an array of characters).
Note that lexers, and in particular JFlex-based lexers, need to be created in such a way that they always match the entire contents of the file, without any gaps between tokens, and generate special tokens for characters which are not valid at their location. Lexers must never abort prematurely because of an invalid character.
Types of tokens for lexers used in IDEA are defined by instances of IElementType. A number of token types common for all languages are defined in the TokenType interface; custom language plugins should reuse these token types wherever applicable. For all other token types, the plugin needs to create new IElementType instances and associate with the language in which the token type is used. The same IElementType instance should be returned every time a particular token type is encountered by the lexer.
An important feature which can be implemented at lexer level is mixing languages within a file (for example, embedding fragments of Java code in some template language). If a language supports embedding its fragments in another language, it needs to define the chameleon token types for different types of fragments which can be embedded, and these token types need to implement the ILazyParseableElementType interface. The lexer of the enclosing language needs to return the entire fragment of the embedded language as a single chameleon token, of the type defined by the embedded language. To parse the contents of the chameleon token, IDEA will call the parser of the embedded language through a call to ILazyParseableElementType.parseContents().
Parsing files in IDEA is a two-step process. First, an abstract syntax tree (AST) is built, defining the structure of the program. AST nodes are created internally by IDEA and are represented by instances of the ASTNode class. Each AST node has an associated element type (IElementType instance), and the element types are defined by the language plugin. The top-level node of the AST tree for a file needs to have a special element type, implementing the IFileElementType interface.
Second, a PSI (Program Structure Interface) tree is built on top of the AST, adding semantics and methods for manipulating specific language constructs. Nodes of the PSI tree are represented by classes implementing the PsiElement interface and are created by the language plugin in the ParserDefinition.createElement() method. The top-level node of the PSI tree for a file needs to implement the PsiFile interface, and is created in the ParserDefinition.createFile() method.The process of building the AST and PSI trees for a file is invoked on demand, when some component of IDEA tries to access the PSI for a file. The PSI is always built when a file is opened in the editor, and it can also be built when a file is affected by a multi-file operation (inspection, refactoring, batch reformat and so on). After a document has been changed, building a new PSI is initiated by
committing the document. Methods for committing a specific document or all documents are found in the PsiDocumentManager interface. The documents are committed by all IDEA components which need to access the PSI, and in particular by the thread which performs background highlightingExample: ParserDefinition for Properties language
The lifecycle of the PSI is described in more detail in IntelliJ IDEA Architectural Overview.
The base classes for the PSI implementation (PsiFileBase, the base implementation of PsiFile, and ASTWrapperPsiElement, the base implementation of PsiElement) are provided by IDEA. However, these classes are coupled to the internal implementation of IDEA and are located in idea.jar. Because of this, every custom language plugin needs to include idea.jar in its classpath. (If the plugin is built as a DevKit project, idea.jar must be if you're using IntelliJ IDEA 10.5 or an earlier version to develop the plugin, you need to make sure that idea.jar is added to the list of JARs in the classpath of the IDEA SDK, and not added as a separate module or project library. Otherwise, the plugin will not work correctly. (IntelliJ IDEA 11 adds idea.jar to the classpath automatically.)
IntelliJ IDEA currently does not provide a ready way to reuse existing language grammars (for example, from ANTLR) for creating custom language parsers. The parsers need to be coded manually, as a recursive descent implementation.
The language plugin provides the parser implementation as an implementation of the PsiParser interface, returned from ParserDefinition.createParser(). The parser receives an instance of the PsiBuilder class, which is used to get the stream of tokens from the lexer and to hold the intermediate state of the AST being built. The parser must process all tokens returned by the lexer up to the end of stream (until
PsiBuilder.getTokenType() returns null), even if the tokens are not valid according to the language syntax.
The parser works by setting pairs of markers (PsiBuilder.Marker instances) within the stream of tokens received from the lexer. Each pair of markers defines the range of lexer tokens for a single node in the AST tree. If a pair of markers is nested in another pair (starts after its start and ends before its end), it becomes the child node of the outer pair.
In order to better understand the process of building a PSI tree for a simple expression, you can refer to the attached following diagram.:
In general, there is no single right way to implement a PSI for a custom language, and the plugin author can choose the PSI structure and set of methods which are the most convenient for the code which uses the PSI (error analysis, refactorings and so on). However, there is one base interface which needs to be used by a custom language PSI implementation in order to support features like rename and find usages. Every element which can be renamed or referenced (a class definition, a method definition and so on) needs to implement the PsiNamedElement interface, with methods