The set of IntelliJ IDEA features which are supported for custom languages includes:
In addition, IDEA provides a powerful framework on which additional intelligence features, like refactorings and code analysis, can be implemented.
If you have any questions or comments related to the Language API or any other aspects of IntelliJ IDEA plugin development, feel free to ask them in the jetbrains.intellij.openapi newsgroup on the news.jetbrains.com news server, or in the corresponding Web forum. The newsgroup is monitored by JetBrains developers who will be able to help you with the development.
The information in this document has been updated to cover the API changes and new features of IntelliJ IDEA 8.0.
The first step in developing a custom language plugin is registering a file type the language will be associated with. IDEA determines the type of a file by looking at its file name. Thus, a custom language can only be associated with specific file names or extensions - it is not currently possible to create a language which will be applied to files with specific content, like, for example, a specific XML root namespace.
A custom language file type is a class derived from LanguageFileType, which passes a Language implementation class to its base class constructor. To register a file type, the plugin developer provides an implementation of the FileTypeFactory interface, which is registered via the com.intellij.fileTypeFactory extension point.
To verify that the file type is indeed registered correctly, you can implement the LanguageFileType.getIcon() method and verify that the correct icon is displayed for files which have the extension associated with your file type.
The lexer (lexical analyzer) defines how the contents of a file is broken into tokens. The lexer serves as a foundation for nearly all of the features of custom language plugins, from basic syntax highlighting to advanced code analysis features. The API for the lexer is defined by the Lexer interface.
IDEA invokes the lexer in three main contexts, and the plugin can provide different lexer implementations for these contexts:
The lexer used for syntax highlighting can be invoked incrementally to process only the changed part of a file, whereas lexers used in other contexts are always called to process an entire file, or a complete language construction embedded in a file in a different language. An important requirement for a syntax highlighting lexer, required for incremental lexing, is that its state must be represented by a single integer number (returned from Lexer.getState()). That state will be passed to the Lexer.start() method, along with the start offset of the fragment to process, when lexing is resumed from the middle of a file. Lexers used in other contexts can always return 0 from the getState() method if their state has a more complex internal representation.
The easiest way to create a lexer for a custom language plugin is to use JFlex. IDEA contains adapter classes (FlexLexer and FlexAdapter) that adapt JFlex lexers to the IDEA lexer API. The Plugin Development package includes a patched version of JFlex 1.4.1 (tools/jflex) and lexer skeleton file (tools/jflex/idea-flex.skeleton) which can be used for creating lexers compatible with FlexAdapter. The patched version of JFlex provides a new command line option --charat which changes the JFlex generated code so that it works with the IDEA skeleton (which passes the source data for lexing as a CharSequence and not as an array of characters).
For developing lexers using JFlex, the JFlex Support plugin can be useful. It provides syntax highlighting and other useful features for editing JFlex files.
Note that lexers, and in particular JFlex-based lexers, need to be created in such a way that they always match the entire contents of the file, without any gaps between tokens, and generate special tokens for characters which are not valid at their location. Lexers must never abort prematurely because of an invalid character.
Types of tokens for lexers used in IDEA are defined by instances of IElementType. A number of token types common for all languages are defined in the TokenType interface; custom language plugins should reuse these token types wherever applicable. For all other token types, the plugin needs to create new IElementType instances and associate with the language in which the token type is used. The same IElementType instance should be returned every time a particular token type is encountered by the lexer.
An important feature which can be implemented at lexer level is mixing languages within a file (for example, embedding fragments of Java code in some template language). If a language supports embedding its fragments in another language, it needs to define the chameleon token types for different types of fragments which can be embedded, and these token types need to implement the IChameleonElementType interface. The lexer of the enclosing language needs to return the entire fragment of the embedded language as a single chameleon token, of the type defined by the embedded language. To parse the contents of the chameleon token, IDEA will call the parser of the embedded language through a call to IChameleonElementType.parseContents().
Parsing files in IDEA is a two-step process. First, an abstract syntax tree (AST) is built, defining the structure of the program. AST nodes are created internally by IDEA and are represented by instances of the ASTNode class. Each AST node has an associated element type (IElementType instance), and the element types are defined by the language plugin. The top-level node of the AST tree for a file needs to have a special element type, implementing the IFileElementType interface.
The AST nodes have a direct mapping to text ranges in the underlying document (the bottom-most nodes of the AST match individual tokens returned by the lexer, and higher level nodes match multiple-token fragments). Operations performed on nodes of the AST tree (inserting, removing, reordering nodes and so on) are immediately reflected as changes to the text of the underlying document.
Second, a PSI (Program Structure Interface) tree is built on top of the AST, adding semantics and methods for manipulating specific language constructs. Nodes of the PSI tree are represented by classes implementing the PsiElement interface and are created by the language plugin in the ParserDefinition.createElement() method. The top-level node of the PSI tree for a file needs to implement the PsiFile interface, and is created in the ParserDefinition.createFile() method.
The process of building the AST and PSI trees for a file is invoked on demand, when some component of IDEA tries to access the PSI for a file. The PSI is always built when a file is opened in the editor, and it can also be built when a file is affected by a multi-file operation (inspection, refactoring, batch reformat and so on). After a document has been changed, building a new PSI is initiated by committing the document. Methods for committing a specific document or all documents are found in the PsiDocumentManager interface. The documents are committed by all IDEA components which need to access the PSI, and in particular by the thread which performs background highlighting.
The base classes for the PSI implementation (PsiFileBase, the base implementation of PsiFile, and ASTWrapperPsiElement, the base implementation of PsiElement) are provided by IDEA. However, these classes are coupled to the internal implementation of IDEA and are located in idea.jar. Because of this, every custom language plugin needs to include idea.jar in its classpath. (If the plugin is built as a DevKit project, idea.jar must be added to the list of JARs in the classpath of the IDEA SDK, and not added as a separate module or project library. Otherwise, the plugin will not work correctly.)
IDEA currently does not provide a ready way to reuse existing language grammars (for example, from ANTLR) for creating custom language parsers. The parsers need to be coded manually, as a recursive descent implementation.
The language plugin provides the parser implementation as an implementation of the PsiParser interface, returned from ParserDefinition.createParser(). The parser receives an instance of the PsiBuilder class, which is used to get the stream of tokens from the lexer and to hold the intermediate state of the AST being built. The parser must process all tokens returned by the lexer up to the end of stream (until PsiBuilder.getTokenType() returns null), even if the tokens are not valid according to the language syntax.
The parser works by setting pairs of markers (PsiBuilder.Marker instances) within the stream of tokens received from the lexer. Each pair of markers defines the range of lexer tokens for a single node in the AST tree. If a pair of markers is nested in another pair (starts after its start and ends before its end), it becomes the child node of the outer pair.
The element type for the marker pair (and for the AST node created from it) is specified when the end marker is set (by a call to PsiBuilder.Marker.done()). Also, it is possible to drop a start marker before its end marker has been set. The drop() method drops only a single start marker without affecting any markers added after it, and the rollbackTo() method drops the start marker and all markers added after it and reverts the lexer position to the start marker. These methods can be used to implement lookahead when parsing.
The method PsiBuilder.marker.precede() is useful for right-to-left parsing when you don't know how many markers you need at a certain position until you read more input. For example, a binary expression a+b+c needs to be parsed as
( (a+b) + c ). Thus, two start markers are needed at the position of the token 'a', but that is not known until the token 'c' is read. When the parser reaches the '+' token following 'b', it can call precede() to duplicate the start marker at 'a' position, and then put its matching end marker after 'c'.
An important feature of PsiBuilder is its handling of whitespace and comments. The types of tokens which are treated as whitespace or comments are defined by the methods getWhitespaceTokens() and getCommentTokens() in the ParserDefinition class. PsiBuilder automatically omits whitespace and comment tokens from the stream of tokens it passes to PsiParser, and adjusts the token ranges of AST nodes so that leading and trailing whitespace tokens are not included in the node.
The token set returned from
ParserDefinition.getCommentTokens() is also used to search for TO DO items.
In order to better understand the process of building a PSI tree for a simple expression, you can refer to the attached diagram.
In general, there is no single right way to implement a PSI for a custom language, and the plugin author can choose the PSI structure and set of methods which are the most convenient for the code which uses the PSI (error analysis, refactorings and so on). However, there is one base interface which needs to be used by a custom language PSI implementation in order to support features like rename and find usages. Every element which can be renamed or referenced (a class definition, a method definition and so on) needs to implement the PsiNamedElement interface, with methods
A number of functions which can be used for implementing and using the PSI can be found in the com.intellij.psi.util package, and in particular in the PsiUtil and PsiTreeUtil classes.
A very helpful tool for debugging the PSI implementation is the PsiViewer plugin. It can show you the structure of the PSI built by your plugin, the properties of every PSI element and highlight the text range of every PSI element.
The class used in IDEA to specify how a particular range of text should be highlighted is called TextAttributesKey. An instance of this class is created for every distinct type of item which should be highlighted (keyword, number, string and so on). The TextAttributesKey defines the default attributes which are applied to items of the corresponding type (for example, keywords are bold, numbers are blue, strings are bold and green). The mapping of the TextAttributesKey to specific attributes used in an editor is defined by the EditorColorsScheme class, and can be configured by the user if the plugin provides an appropriate configuration interface. Highlighting from multiple TextAttributeKey items can be overlaid - for example, one key may define an item's boldness and another its color.
The syntax and error highlighting is performed on multiple levels. The first level of syntax highlighting is based on the lexer output, and is provided through the SyntaxHighlighter interface. The syntax highligher returns the TextAttributeKey instances for each token type which needs special highlighting. For highlighting lexer errors, the standard TextAttributeKey for bad characters (HighligherColors.BAD_CHARACTER) can be used.
The second level of error highlighting happens during parsing. If a particular sequence of tokens is invalid according to the grammar of the language, the PsiBuilder.error() method can be used to highlight the invalid tokens and display an error message showing why they are not valid.
The third level of highlighting is performed through the Annotator interface. A plugin can register one or more annotators in the
com.intellij.annotator extension point, and these annotators are called during the background highlighting pass to process the elements in the PSI tree of the custom language. Annotators can analyze not only the syntax, but also the semantics of the text in the language, and thus can provide much more complex syntax and error highlighting logic. The annotator can also provide quick fixes to problems it detects.
When the file is changed, the annotator is called incrementally to process only changed elements in the PSI tree.
To highlight a region of text as a warning or error, the annotator calls
createWarningAnnotation() on the AnnotationHolder object passed to it, and optionally calls
registerFix() on the returned Annotation object to add a quick fix for the error or warning. To apply additional syntax highlighting, the annotator can call
AnnotationHolder.createInfoAnnotation() with an empty message and then call
Annotation.setTextAttributes() to specify the text attributes key for the highlighting.
Finally, if the custom language employs external tools for validating files in the language (for example, uses the Xerces library for XML schema validation), it can provide an implementation of the
ExternalAnnotator interface and register it in
com.intellij.externalAnnotator extension point. The ExternalAnnotator highlighting has the lowest priority and is invoked only after all other background processing has completed. It uses the same AnnotationHolder interface for converting the output of the external tool into editor highlighting.
The plugin can also provide a configuration interface to allow the user to configure the colors used for highlighting specific items. In order to do that, it should provide an implementation of
ColorSettingPage and register it in the
com.intellij.colorSettingsPage extension point.
The "Export to HTML" feature of IDEA uses the same syntax highlighting mechanism as the editor, so it will work automatically for custom languages which provide a syntax highlighter.
One of the most important and tricky parts in the implementation of a custom language PSI is resolving references. Resolving references means the ability to go from the usage of an element (access of a variable, call of a method and so on) to the declaration of the element (the variable definition, the method declaration and so on). This is obviously needed in order to support the IDEA "Go to Declaration" action (Ctrl-B and Ctrl-Click), and it is also a pre-requisite for the Find Usages action, the Rename refactoring and the code completion.
All PSI elements which work as references (for which the Go to Declaration action applies) need to implement the PsiElement.getReference() method and to return a PsiReference implementation from that method. The PsiReference interface can be implemented by the same class as PsiElement, or by a different class. An element can also contain multiple references (for example, a string literal can contain multiple substrings which are valid full-qualified class names), in which case it can implement PsiElement.getReferences() and return the references as an array.
The main method of the PsiReference interface is resolve(), which returns the element to which the reference points, or null if it was not possible to resolve the reference to a valid element (for example, it points to an undefined class). A counterpart to this method is isReferenceTo(), which checks if the reference resolves to the specified element. The latter method can be implemented by calling resolve() and comparing the result with the passed PSI element, but additional optimizations (for example, performing the tree walk only if the text of the element is equal to the text of the reference) are possible.
IDEA provides a set of interfaces which can be used as a base for implementing resolve support, namely the PsiScopeProcessor interface and the PsiElement.processDeclarations() method. These interfaces have a number of extra complexities which are not necessary for most custom languages (like support for substituting Java generics types), but they are required if the custom language can have references to Java code. If Java interoperability is not required, the plugin can forgo the standard interfaces and provide its own, different implementation of resolve.
The implementation of resolve based on the standard IDEA helper classes contains of the following components:
FilenameIndex.getFilesByName()(if the file name is known) or by iterating through all custom language files in the project (iterateContent() in the FileIndex interface obtained from ProjectRootManager.getFileIndex()).
An extension of the PsiReference interface, which allows a reference to resolve to multiple targets, is the PsiPolyVariantReference interface. The targets to which the reference resolves are returned from the multiResolve() method. The Go to Declaration action for such references allows the user to choose the target to navigate to. The implementation of multiResolve can be also based on PsiScopeProcessor, and can collect all valid targets for the reference instead of stopping when the first valid target is found.
IDEA's "Quick Definition Lookup" action is based on the same mechanism as "Go to Declaration", so it becomes automatically available for all references which can be resolved by the language plugin.
There are two main types of code completion that can be provided by custom language plugins: reference completion and contributor-based completion.
Reference completion is easier to implement, but supports only the basic completion action. Contributor-based completion provides more features, supports all three completion types (basic, smart and class name) and can be used, for example, to implement keyword completion.
To fill the completion list, IDEA calls
PsiReference.getVariants() either on the reference at the caret location or on a dummy reference that would be placed at the caret. This method needs to return an array of objects containing either strings, PsiElement instances or instances of the
LookupElement class. If a PsiElement instance is returned in the array, the completion list shows the icon for the element.
The most common way to implement getVariants() is to use the same function for walking up the tree as in PsiReference.resolve(), and a different implementation of PsiScopeProcessor which collects all declarations passed to its processDeclarations() method and returns them as an array for filling the completion list.
The Find Usages action in IDEA is a multi-step process, and each step of the process requires involvement from the custom language plugin. The language plugin participates in the Find Usages process by registering an implementation of FindUsagesProvider in the
com.intellij.lang.findUsagesProvider extension point, and through the PSI implementation (PsiNamedElement and PsiReference interfaces).
The steps of the Find Usages action are the following:
FindUsagesProvider.getWordsScanner(), IDEA loads the contents of every file and passes it to the words scanner, along with the words consumer. The words scanner breaks the text into words, defines the context for each word (code, comments or literals) and passes the word to the consumer. The simplest way to implement the words scanner is to use the DefaultWordsScanner implementation, passing to it the sets of lexer token types which are treated as identifiers, literals and comments. The default words scanner will use the lexer to break the text into tokens, and will handle breaking the text of comment and literal tokens into individual words.
FindUsagesProvider.getDescriptiveName()to determine how the element should be presented to the user.
to be continued