Compiler Services: Using the F# tokenizer
This tutorial demonstrates how to call the F# language tokenizer. Given F# source code, the tokenizer generates a list of source code lines that contain information about the tokens on each line. For each token, you can get the type of the token, its exact location, and its color kind (keyword, identifier, number, operator, etc.).
NOTE: The FSharp.Compiler.Service API is subject to change when later versions of the NuGet package are published.
Creating the tokenizer
To use the tokenizer, reference FSharp.Compiler.Service.dll and open the FSharp.Compiler.Tokenization namespace:
#r "FSharp.Compiler.Service.dll"
open FSharp.Compiler.Tokenization
Now you can create an instance of FSharpSourceTokenizer. The constructor takes four arguments: the list of defined symbols, the file name of the source code, an optional language version, and an optional strict-indentation flag. The defined symbols are required because the tokenizer handles #if directives. The file name is required only to specify locations in the source code (and the file does not have to exist):
let sourceTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)
Using the sourceTok object, we can now (repeatedly) tokenize lines of F# source code.
Tokenizing F# code
The tokenizer operates on individual lines rather than on the entire source file. After getting a token, the tokenizer also returns a new state (as an FSharpTokenizerLexState value). This can be used to tokenize F# code more efficiently. When source code changes, you do not need to re-tokenize the entire file, only the parts that have changed.
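As a preview of how the returned state can support partial re-tokenization, here is a minimal sketch that caches the state at the start of every line. The names cacheTok, endState and startStates are hypothetical helpers invented for this illustration, and the #r "nuget: ..." reference assumes the script runs in dotnet fsi; ScanToken and CreateLineTokenizer are introduced in the sections that follow.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

// Hypothetical tokenizer instance used only by this sketch
let cacheTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)

/// Scan one line to its end and return the state to use for the next line.
let endState (line: string) (state: FSharpTokenizerLexState) =
    let tokenizer = cacheTok.CreateLineTokenizer(line)
    let rec loop state =
        match tokenizer.ScanToken(state) with
        | Some _, state -> loop state
        | None, state -> state
    loop state

/// Element i of the result is the state in which line i must be tokenized;
/// the final element is the state after the last line.
let startStates (lines: string[]) =
    lines
    |> Array.scan (fun state line -> endState line state) FSharpTokenizerLexState.Initial

let source = [| "let x = 1"; "(* a comment"; "still a comment *)"; "let y = 2" |]
let states = startStates source

// If only line 3 is edited, scanning can resume from the cached state
// instead of starting again at line 0.
let newEnd = endState "let y = 3" states.[3]
```

An editor would typically keep such a cache next to the text buffer and refresh only the entries from the edited line onwards.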
Tokenizing single line
To tokenize a single line, we create a FSharpLineTokenizer by calling CreateLineTokenizer on the FSharpSourceTokenizer object that we created earlier:
let tokenizer = sourceTok.CreateLineTokenizer("let answer=42")
Now, we can write a simple recursive function that calls ScanToken on the tokenizer until it returns None (indicating the end of the line). When ScanToken succeeds, it returns an FSharpTokenInfo object with all the interesting details:
/// Tokenize a single line of F# code
let rec tokenizeLine (tokenizer: FSharpLineTokenizer) state =
    match tokenizer.ScanToken(state) with
    | Some tok, state ->
        // Print token name
        printf "%s " tok.TokenName
        // Tokenize the rest, in the new state
        tokenizeLine tokenizer state
    | None, state -> state
The function returns the new state, which is needed if you want to tokenize multiple lines and an earlier line ends inside a multi-line comment. As an initial state, we can use FSharpTokenizerLexState.Initial:
tokenizeLine tokenizer FSharpTokenizerLexState.Initial
The result is a sequence of tokens with names LET, WHITESPACE, IDENT, EQUALS and INT32. There are a number of interesting properties on FSharpTokenInfo, including:
CharClass and ColorClass return information about the token category that can be used for colorizing F# code.
LeftColumn and RightColumn return the location of the token inside the line.
TokenName is the name of the token (as defined in the F# lexer).
Note that the tokenizer is stateful: if you want to tokenize a single line multiple times, you need to call CreateLineTokenizer again.
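The properties above are easier to inspect when the tokens are collected rather than printed. The following is a minimal sketch, not part of the original tutorial: collectTokens and demoTok are hypothetical names, and the #r "nuget: ..." reference assumes the script runs in dotnet fsi. Because it creates a fresh line tokenizer on every call, it can be applied to the same line repeatedly.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

/// Collect all tokens on one line instead of printing them.
/// Returns the tokens in source order plus the state at the end of the line.
let collectTokens (sourceTok: FSharpSourceTokenizer) (line: string) state =
    // A fresh line tokenizer is created for every pass over a line
    let tokenizer = sourceTok.CreateLineTokenizer(line)
    let rec loop acc state =
        match tokenizer.ScanToken(state) with
        | Some tok, state -> loop (tok :: acc) state
        | None, state -> List.rev acc, state
    loop [] state

// Hypothetical tokenizer instance used only by this sketch
let demoTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)
let tokens, _ = collectTokens demoTok "let answer=42" FSharpTokenizerLexState.Initial

// Print name, color kind and column range of each token
for tok in tokens do
    printfn "%-10s %A cols %d-%d" tok.TokenName tok.ColorClass tok.LeftColumn tok.RightColumn
```

Returning the list together with the final state keeps the helper usable for multi-line input as well.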
Tokenizing sample code
To run the tokenizer on a longer code sample or an entire file, you need to read the sample input as a collection of string values:
let lines = """
  // Hello world
  let hello() =
     printfn "Hello world!" """.Split('\r','\n')
To tokenize multi-line input, we again need a recursive function that keeps the current state. The following function takes the lines as a list of strings (together with the line number and the current state). We create a new tokenizer for each line and call tokenizeLine using the state from the end of the previous line:
/// Print token names for multiple lines of code
let rec tokenizeLines state count lines =
    match lines with
    | line :: lines ->
        // Create tokenizer & tokenize single line
        printfn "\nLine %d" count
        let tokenizer = sourceTok.CreateLineTokenizer(line)
        let state = tokenizeLine tokenizer state
        // Tokenize the rest using new state
        tokenizeLines state (count + 1) lines
    | [] -> ()
The function simply calls tokenizeLine (defined earlier) to print the names of all the tokens on each line. We can call it on the previous input with FSharpTokenizerLexState.Initial as the initial state and 1 as the number of the first line:
lines |> List.ofSeq |> tokenizeLines FSharpTokenizerLexState.Initial 1
Ignoring some unimportant details (like whitespace at the beginning of each line and the first line, which is just whitespace), the code prints, for each input line, a "Line N" header followed by the names of the tokens on that line.
It is worth noting that the tokenizer yields multiple LINE_COMMENT tokens and multiple STRING_TEXT tokens for each single comment or string (roughly, one for each word), so if you want to get the entire text of a comment/string, you need to concatenate the tokens.
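One way to do that concatenation is to merge neighbouring tokens that share a ColorClass and then slice the corresponding text out of the line. This is a sketch under assumptions, not the tutorial's own method: colorRuns and demoTok are hypothetical names, the #r "nuget: ..." reference assumes dotnet fsi, and the slicing assumes that RightColumn is the inclusive index of a token's last character.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

/// Scan one line and return (color kind, text) runs, merging neighbouring
/// tokens that share a color kind (e.g. the pieces of one comment or string).
/// Assumption: RightColumn is the inclusive index of a token's last character.
let colorRuns (sourceTok: FSharpSourceTokenizer) (line: string) state =
    let tokenizer = sourceTok.CreateLineTokenizer(line)
    let rec loop runs state =
        match tokenizer.ScanToken(state) with
        | Some tok, state ->
            let runs =
                match runs with
                // Same kind as the previous run: extend it to this token's end
                | (kind, left, _) :: rest when kind = tok.ColorClass ->
                    (kind, left, tok.RightColumn) :: rest
                // Different kind: start a new run
                | _ -> (tok.ColorClass, tok.LeftColumn, tok.RightColumn) :: runs
            loop runs state
        | None, state ->
            let texts =
                runs
                |> List.rev
                |> List.map (fun (kind, left, right) ->
                    // Clamp the length so a wrong column assumption cannot overrun the line
                    let len = min (right - left + 1) (line.Length - left)
                    kind, line.Substring(left, len))
            texts, state
    loop [] state

// Hypothetical tokenizer instance used only by this sketch
let demoTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)
let runs, _ = colorRuns demoTok "printfn \"Hello world!\" // greet" FSharpTokenizerLexState.Initial

for (kind, text) in runs do
    printfn "%A: %s" kind text
```

Grouping by ColorClass rather than by token name keeps the helper short, at the cost of also merging other adjacent tokens of the same kind.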