“Clang” CFE Internals Manual

Introduction

This document describes some of the more important APIs and internal design decisions made in the Clang C front-end. The purpose of this document is to both capture some of this high level information and also describe some of the design decisions behind it. This is meant for people interested in hacking on Clang, not for end-users. The description below is categorized by libraries, and does not describe any of the clients of the libraries.

LLVM Support Library

The LLVM libSupport library provides many underlying libraries and data-structures, including command line option processing, various containers and a system abstraction layer, which is used for file system access.

The Clang “Basic” Library

This library certainly needs a better name. The “basic” library contains a number of low-level utilities for tracking and manipulating source buffers, locations within the source buffers, diagnostics, tokens, target abstraction, and information about the subset of the language being compiled for.

Part of this infrastructure is specific to C (such as the TargetInfo class), other parts could be reused for other non-C-based languages (SourceLocation, SourceManager, Diagnostics, FileManager). When and if there is future demand we can figure out if it makes sense to introduce a new library, move the general classes somewhere else, or introduce some other solution.

We describe the roles of these classes in order of their dependencies.

The Diagnostics Subsystem

The Clang Diagnostics subsystem is an important part of how the compiler communicates with the human. Diagnostics are the warnings and errors produced when the code is incorrect or dubious. In Clang, each diagnostic produced has (at the minimum) a unique ID, an English translation associated with it, a SourceLocation to “put the caret”, and a severity (e.g., WARNING or ERROR). They can also optionally include a number of arguments to the diagnostic (which fill in “%0”s in the string) as well as a number of source ranges that relate to the diagnostic.

In this section, we’ll be giving examples produced by the Clang command line driver, but diagnostics can be rendered in many different ways depending on how the DiagnosticConsumer interface is implemented. A representative example of a diagnostic is:

t.c:38:15: error: invalid operands to binary expression ('int *' and '_Complex float')
P = (P-42) + Gamma*4;
    ~~~~~~ ^ ~~~~~~~

In this example, you can see the English translation, the severity (error), the source location (the caret (“^”) and file/line/column info), the source ranges “~~~~”, and the arguments to the diagnostic (“int *” and “_Complex float”). You’ll have to believe me that there is a unique ID backing the diagnostic :).

Getting all of this to happen has several steps and involves many moving pieces. This section describes them and talks about best practices when adding a new diagnostic.

The Diagnostic*Kinds.td files

Diagnostics are created by adding an entry to one of the clang/Basic/Diagnostic*Kinds.td files, depending on what library will be using it. From this file, tblgen generates the unique ID of the diagnostic, the severity of the diagnostic and the English translation + format string.

There is little sanity with the naming of the unique IDs right now. Some start with err_, warn_, ext_ to encode the severity into the name. Since the enum is referenced in the C++ code that produces the diagnostic, it is somewhat useful for it to be reasonably short.

The severity of the diagnostic comes from the set {NOTE, REMARK, WARNING, EXTENSION, EXTWARN, ERROR}. The ERROR severity is used for diagnostics indicating the program is never acceptable under any circumstances. When an error is emitted, the AST for the input code may not be fully built. The EXTENSION and EXTWARN severities are used for extensions to the language that Clang accepts. This means that Clang fully understands and can represent them in the AST, but we produce diagnostics to tell the user their code is non-portable. The difference is that the former are ignored by default, and the latter warn by default. The WARNING severity is used for constructs that are valid in the currently selected source language but that are dubious in some way. The REMARK severity provides generic information about the compilation that is not necessarily related to any dubious code. The NOTE level is used to staple more information onto previous diagnostics.

These severities are mapped into a smaller set (the Diagnostic::Level enum, {Ignored, Note, Remark, Warning, Error, Fatal}) of output levels by the diagnostics subsystem based on various configuration options. Clang internally supports a fully fine grained mapping mechanism that allows you to map almost any diagnostic to the output level that you want. The only diagnostics that cannot be mapped are NOTEs, which always follow the severity of the previously emitted diagnostic, and ERRORs, which can only be mapped to Fatal (it is not possible to turn an error into a warning, for example).

Diagnostic mappings are used in many ways. For example, if the user specifies -pedantic, EXTENSION maps to Warning; if they specify -pedantic-errors, it turns into Error. This is used to implement options like -Wunused-macros, -Wundef etc.

Mapping to Fatal should only be used for diagnostics that are considered so severe that error recovery won’t be able to recover sensibly from them (thus spewing a ton of bogus errors). One example of this class of error is failure to #include a file.
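The mapping rules above can be sketched as a small function. This is a toy model, not Clang's implementation: the enum and struct names are hypothetical, only the -pedantic and -pedantic-errors behavior stated above is modeled, and every other severity simply passes through to the level of the same name.

```cpp
// Hypothetical names for illustration only; Clang's real mapping is far more
// configurable (per-diagnostic and per-group overrides, Fatal, and so on).
enum class Severity { Note, Remark, Warning, Extension, ExtWarn, Error };
enum class Level { Ignored, Note, Remark, Warning, Error, Fatal };

struct MappingOptions {
  bool Pedantic = false;       // models -pedantic
  bool PedanticErrors = false; // models -pedantic-errors
};

// Maps a .td severity to an output level. Previous is the level of the
// previously emitted diagnostic, which a NOTE follows.
Level mapSeverity(Severity S, const MappingOptions &Opts, Level Previous) {
  switch (S) {
  case Severity::Note:
    // A note is only shown if the diagnostic it is stapled onto was shown.
    return Previous == Level::Ignored ? Level::Ignored : Level::Note;
  case Severity::Remark:
    return Level::Remark;
  case Severity::Warning:
  case Severity::ExtWarn: // warns by default
    return Level::Warning;
  case Severity::Extension: // ignored by default
    if (Opts.PedanticErrors)
      return Level::Error;
    return Opts.Pedantic ? Level::Warning : Level::Ignored;
  case Severity::Error: // can never be downgraded to a warning
    return Level::Error;
  }
  return Level::Error; // not reached for valid enumerators
}
```

With default options an EXTENSION diagnostic is dropped, and so is any NOTE attached to it; turning on the pedantic option promotes both to visible output.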

Diagnostic Wording

The wording used for a diagnostic is critical because it is the only way for a user to know how to correct their code. Use the following suggestions when wording a diagnostic.

  • Diagnostics in Clang do not start with a capital letter and do not end with punctuation.

    • This does not apply to proper nouns like Clang or OpenMP, to acronyms like GCC or ARC, or to language standards like C23 or C++17.

    • A trailing question mark is allowed. e.g., unknown identifier %0; did you mean %1?.

  • Appropriately capitalize proper nouns like Clang, OpenCL, GCC, Objective-C, etc and language standard versions like C11 or C++11.

  • The wording should be succinct. If necessary, use a semicolon to combine sentence fragments instead of using complete sentences. e.g., prefer wording like '%0' is deprecated; it will be removed in a future release of Clang over wording like '%0' is deprecated. It will be removed in a future release of Clang.

  • The wording should be actionable and avoid using standards terms or grammar productions that a new user would not be familiar with. e.g., prefer wording like missing semicolon over wording like syntax error (which is not actionable) or expected unqualified-id (which uses standards terminology).

  • The wording should clearly explain what is wrong with the code rather than restating what the code does. e.g., prefer wording like type %0 requires a value in the range %1 to %2 over wording like %0 is invalid.

  • The wording should have enough contextual information to help the user identify the issue in a complex expression. e.g., prefer wording like both sides of the %0 binary operator are identical over wording like identical operands to binary operator.

  • Use single quotes to denote syntactic constructs or command line arguments named in a diagnostic message. e.g., prefer wording like 'this' pointer cannot be null in well-defined C++ code over wording like this pointer cannot be null in well-defined C++ code.

  • Prefer diagnostic wording without contractions whenever possible. The single quote in a contraction can be visually distracting due to its use with syntactic constructs and contractions can be harder to understand for non-native English speakers.

The Format String

The format string for the diagnostic is very simple, but it has some power. It takes the form of a string in English with markers that indicate where and how arguments to the diagnostic are inserted and formatted. For example, here are some simple format strings:

"binary integer literals are an extension"
"format string contains '\\0' within the string body"
"more '%%' conversions than data arguments"
"invalid operands to binary expression (%0 and %1)"
"overloaded '%0' must be a %select{unary|binary|unary or binary}2 operator"
" (has %1 parameter%s1)"

These examples show some important points of format strings. You can use any plain ASCII character in the diagnostic string except “%” without a problem, but these are C strings, so you have to use and be aware of all the C escape sequences (as in the second example). If you want to produce a “%” in the output, use the “%%” escape sequence, like the third diagnostic. Finally, Clang uses the “%...[digit]” sequences to specify where and how arguments to the diagnostic are formatted.

Arguments to the diagnostic are numbered according to how they are specified by the C++ code that produces them, and are referenced by %0 .. %9. If you have more than 10 arguments to your diagnostic, you are doing something wrong :). Unlike printf, there is no requirement that arguments to the diagnostic end up in the output in the same order as they are specified; you could have a format string with “%1 %0” that swaps them, for example. The text in between the percent and digit is the formatting instructions. If there are no instructions, the argument is just turned into a string and substituted in.
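To make the numbering concrete, here is a minimal sketch of positional substitution. It is not Clang's formatter: formatSimple is a hypothetical helper that understands only “%%” and a bare “%0” .. “%9”, and ignores every formatting instruction described later in this section.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy substitution of "%0".."%9" and "%%". Modifiers such as %select or %s
// are deliberately not handled here.
std::string formatSimple(const std::string &Fmt,
                         const std::vector<std::string> &Args) {
  std::string Out;
  for (std::size_t I = 0; I < Fmt.size(); ++I) {
    if (Fmt[I] != '%' || I + 1 == Fmt.size()) {
      Out += Fmt[I]; // ordinary character (or a stray trailing '%')
      continue;
    }
    char Next = Fmt[++I];
    if (Next == '%') {
      Out += '%'; // "%%" produces a literal percent sign
    } else if (std::isdigit(static_cast<unsigned char>(Next)) &&
               static_cast<std::size_t>(Next - '0') < Args.size()) {
      Out += Args[Next - '0']; // the argument index selects, not the position
    } else {
      // Unknown marker or missing argument: keep the text as written.
      Out += '%';
      Out += Next;
    }
  }
  return Out;
}
```

Because the digit selects the argument, formatSimple("%1 %0", {"a", "b"}) yields "b a", which is the swap described above.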

Here are some “best practices” for writing the English format string:

  • Keep the string short. It should ideally fit in the 80 column limit of the DiagnosticKinds.td file. This avoids the diagnostic wrapping when printed, and forces you to think about the important point you are conveying with the diagnostic.

  • Take advantage of location information. The user will be able to see the line and location of the caret, so you don’t need to tell them that the problem is with the 4th argument to the function: just point to it.

  • Do not capitalize the diagnostic string, and do not end it with a period.

  • If you need to quote something in the diagnostic string, use single quotes.

Diagnostics should never take random English strings as arguments: you shouldn’t use “you have a problem with %0” and pass in things like “your argument” or “your return value” as arguments. Doing this prevents translating the Clang diagnostics to other languages (because they’ll get random English words in their otherwise localized diagnostic). The exceptions to this are C/C++ language keywords (e.g., auto, const, mutable, etc) and C/C++ operators (/=). Note that things like “pointer” and “reference” are not keywords. On the other hand, you can include anything that comes from the user’s source code, including variable names, types, labels, etc. The “select” format can be used to achieve this sort of thing in a localizable way, see below.

Formatting a Diagnostic Argument

Arguments to diagnostics are fully typed internally, and come from a couple of different classes: integers, types, names, and random strings. Depending on the class of the argument, it can be optionally formatted in different ways. This gives the DiagnosticConsumer information about what the argument means without requiring it to use a specific presentation (consider this MVC for Clang :).

It is really easy to add format specifiers to the Clang diagnostics system, but they should be discussed before they are added. If you are creating a lot of repetitive diagnostics and/or have an idea for a useful formatter, please bring it up on the cfe-dev mailing list.

Here are the different diagnostic argument formats currently supported by Clang:

“s” format

Example:

"requires %0 parameter%s0"

Class:

Integers

Description:

This is a simple formatter for integers that is useful when producing English diagnostics. When the integer is 1, it prints as nothing. When the integer is not 1, it prints as “s”. This allows some simple grammatical forms to be handled correctly, and eliminates the need to use gross things like "requires %1 parameter(s)". Note, this only handles adding a simple “s” character, it will not handle situations where pluralization is more complicated such as turning fancy into fancies or mouse into mice. You can use the “plural” format specifier to handle such situations.

“select” format

Example:

"must be a %select{unary|binary|unary or binary}0 operator"

Class:

Integers

Description:

This format specifier is used to merge multiple related diagnostics together into one common one, without requiring the difference to be specified as an English string argument. Instead of specifying the string, the diagnostic gets an integer argument and the format string selects the numbered option. In this case, the “%0” value must be an integer in the range [0..2]. If it is 0, it prints “unary”; if it is 1, it prints “binary”; if it is 2, it prints “unary or binary”. This allows other language translations to substitute reasonable words (or entire phrases) based on the semantics of the diagnostic instead of having to do things textually. The selected string does undergo formatting.
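The behavior of the “s” and “select” formats can be sketched with two hypothetical helpers. They only mimic the observable result for flat option lists; nested format specifiers inside a select option, which the real formatter supports, are not handled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy "%s": prints nothing for 1 and "s" for every other value.
std::string pluralS(int N) { return N == 1 ? "" : "s"; }

// Toy "%select{a|b|c}": picks the option at Index from a '|'-separated list.
std::string selectOption(const std::string &Options, unsigned Index) {
  std::vector<std::string> Parts(1);
  for (char C : Options) {
    if (C == '|')
      Parts.emplace_back(); // start the next option
    else
      Parts.back() += C;
  }
  assert(Index < Parts.size() && "select index out of range");
  return Parts[Index];
}
```

For example, selectOption("unary|binary|unary or binary", 2) gives "unary or binary", and "parameter" + pluralS(2) gives "parameters". The call site passes only integers, which is what keeps the English text out of the C++ code.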

“enum_select” format

Example:

unknown frobbling of a %enum_select<FrobbleKind>{%VarDecl{variable declaration}|%FuncDecl{function declaration}}0 when blarging

Class:

Integers

Description:

This format specifier is used exactly like a select specifier, except it additionally generates a namespace, enumeration, and enumerator list based on the format string given. In the above case, a namespace is generated named FrobbleKind that has an unscoped enumeration with the enumerators VarDecl and FuncDecl which correspond to the values 0 and 1. This permits a clearer use of the Diag in source code, as the above could be called as: Diag(Loc, diag::frobble) << diag::FrobbleKind::VarDecl.

“plural” format

Example:

"you have %0 %plural{1:mouse|:mice}0 connected to your computer"

Class:

Integers

Description:

This is a formatter for complex plural forms. It is designed to handle even the requirements of languages with very complex plural forms, as many Baltic languages have. The argument consists of a series of expression/form pairs; the pairs are separated by “|” and each expression is separated from its form by “:”. The first form whose expression evaluates to true is the result of the modifier.

An expression can be empty, in which case it is always true. See the example at the top. Otherwise, it is a series of one or more numeric conditions, separated by “,”. If any condition matches, the expression matches. Each numeric condition can take one of three forms.

  • number: A simple decimal number matches if the argument is the same as the number. Example: "%plural{1:mouse|:mice}0"

  • range: A range in square brackets matches if the argument is within the range. The range is inclusive on both ends. Example: "%plural{0:none|1:one|[2,5]:some|:many}0"

  • modulo: A modulo operator is followed by a number, an equals sign and either a number or a range. The tests are the same as for plain numbers and ranges, but the argument is taken modulo the number first. Example: "%plural{%100=0:even hundred|%100=[1,50]:lower half|:everything else}1"

The parser is very unforgiving. A syntax error, even whitespace, will abort, as will a failure to match the argument against any expression.
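The three condition forms can be sketched as a small evaluator. This is a toy written for this explanation, not Clang's parser: all function names are hypothetical, the input is the text between the braces of %plural{...}, forms may not contain “|” or “:”, and a failed match returns false where the real parser aborts.

```cpp
#include <cctype>
#include <string>

// Reads a decimal number at Pos and advances Pos past it.
unsigned readNumber(const std::string &S, std::size_t &Pos) {
  unsigned Value = 0;
  while (Pos < S.size() && std::isdigit(static_cast<unsigned char>(S[Pos])))
    Value = Value * 10 + static_cast<unsigned>(S[Pos++] - '0');
  return Value;
}

// Tests one numeric condition: "N", "[Lo,Hi]", or "%M=" followed by either.
bool testCondition(const std::string &S, std::size_t &Pos, unsigned Arg) {
  if (Pos < S.size() && S[Pos] == '%') { // modulo form
    ++Pos;
    unsigned Mod = readNumber(S, Pos);
    ++Pos; // skip '='
    if (Mod != 0)
      Arg %= Mod;
  }
  if (Pos < S.size() && S[Pos] == '[') { // range form, inclusive on both ends
    ++Pos;
    unsigned Lo = readNumber(S, Pos);
    ++Pos; // skip ','
    unsigned Hi = readNumber(S, Pos);
    ++Pos; // skip ']'
    return Lo <= Arg && Arg <= Hi;
  }
  return readNumber(S, Pos) == Arg; // plain number form
}

// An empty expression is always true; otherwise any matching condition wins.
bool testExpression(const std::string &Expr, unsigned Arg) {
  if (Expr.empty())
    return true;
  std::size_t Pos = 0;
  while (Pos < Expr.size()) {
    if (testCondition(Expr, Pos, Arg))
      return true;
    if (Pos >= Expr.size() || Expr[Pos] != ',')
      break;
    ++Pos; // skip ',' between conditions
  }
  return false;
}

// Stores the first form whose expression matches Arg in Form.
bool evalPlural(const std::string &Body, unsigned Arg, std::string &Form) {
  std::size_t Start = 0;
  while (Start <= Body.size()) {
    std::size_t End = Body.find('|', Start);
    if (End == std::string::npos)
      End = Body.size();
    std::string Pair = Body.substr(Start, End - Start);
    std::size_t Colon = Pair.find(':');
    if (Colon == std::string::npos)
      return false; // malformed pair
    if (testExpression(Pair.substr(0, Colon), Arg)) {
      Form = Pair.substr(Colon + 1);
      return true;
    }
    Start = End + 1;
  }
  return false;
}
```

Evaluating "0:none|1:one|[2,5]:some|:many" with the argument 4 selects “some”, and with 7 falls through to the empty expression and selects “many”.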

“ordinal” format

Example:

"ambiguity in %ordinal0 argument"

Class:

Integers

Description:

This is a formatter which represents the argument number as an ordinal: the value 1 becomes 1st, 3 becomes 3rd, and so on. Values less than 1 are not supported. This formatter is currently hard-coded to use English ordinals.
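As a rough illustration of what “English ordinals” involves, the hypothetical helper below appends the usual suffixes, including the 11th/12th/13th exception of ordinary English. It is a sketch of the rule, not the code Clang uses.

```cpp
#include <cassert>
#include <string>

// Toy English ordinal: 1 -> "1st", 2 -> "2nd", 3 -> "3rd", 4 -> "4th", ...
std::string ordinal(unsigned N) {
  assert(N >= 1 && "values less than 1 are not supported");
  const char *Suffix = "th";
  unsigned LastTwo = N % 100;
  // 11, 12 and 13 take "th" even though they end in 1, 2 and 3.
  if (LastTwo < 11 || LastTwo > 13) {
    switch (N % 10) {
    case 1: Suffix = "st"; break;
    case 2: Suffix = "nd"; break;
    case 3: Suffix = "rd"; break;
    default: break;
    }
  }
  return std::to_string(N) + Suffix;
}
```

The suffix table is exactly the part that would have to change per language, which is why the text above calls the formatter hard-coded.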

“human” format

Example:

"total size is %human0 bytes"

Class:

Integers

Description:

This is a formatter which represents the argument number in a human readable format: the value 123 stays 123, 12345 becomes 12.34k, 6666666 becomes 6.67M, and so on for ‘G’ and ‘T’.

“objcclass” format

Example:

"method %objcclass0 not found"

Class:

DeclarationName

Description:

This is a simple formatter that indicates the DeclarationName corresponds to an Objective-C class method selector. As such, it prints the selector with a leading “+”.

“objcinstance” format

Example:

"method %objcinstance0 not found"

Class:

DeclarationName

Description:

This is a simple formatter that indicates the DeclarationName corresponds to an Objective-C instance method selector. As such, it prints the selector with a leading “-”.

“q” format

Example:

"candidate found by name lookup is %q0"

Class:

NamedDecl*

Description:

This formatter indicates that the fully-qualified name of the declaration should be printed, e.g., “std::vector” rather than “vector”.

“diff” format

Example:

"no known conversion %diff{from $ to $|from argument type to parameter type}1,2"

Class:

QualType

Description:

This formatter takes two QualTypes and attempts to print a template difference between the two. If tree printing is off, the text inside the braces before the pipe is printed, with the formatted text replacing the $. If tree printing is on, the text after the pipe is printed and a type tree is printed after the diagnostic message.

“sub” format

Example:

Given the following record definition of type TextSubstitution:

def select_ovl_candidate : TextSubstitution<
  "%select{function|constructor}0%select{| template| %2}1">;

which can be used as

def note_ovl_candidate : Note<
  "candidate %sub{select_ovl_candidate}3,2,1 not viable">;

and will act as if it was written "candidate %select{function|constructor}3%select{| template| %1}2 not viable".

Description:

This format specifier is used to avoid repeating strings verbatim in multiple diagnostics. The argument to %sub must name a TextSubstitution tblgen record. The substitution must specify all arguments used by the substitution, and the modifier indexes in the substitution are re-numbered accordingly. The substituted text must itself be a valid format string before substitution.

Producing the Diagnostic

Now that you’ve created the diagnostic in the Diagnostic*Kinds.td file, you need to write the code that detects the condition in question and emits the new diagnostic. Various components of Clang (e.g., the preprocessor, Sema, etc.) provide a helper function named “Diag”. It creates a diagnostic and accepts the arguments, ranges, and other information that goes along with it.

For example, the binary expression error comes from code like this:

if (various things that are bad)
  Diag(Loc, diag::err_typecheck_invalid_operands)
    << lex->getType() << rex->getType()
    << lex->getSourceRange() << rex->getSourceRange();

This shows the use of the Diag method: it takes a location (a SourceLocation object) and a diagnostic enum value (which matches the name from Diagnostic*Kinds.td). If the diagnostic takes arguments, they are specified with the << operator: the first argument becomes %0, the second becomes %1, etc. The diagnostic interface allows you to specify arguments of many different types, including int and unsigned for integer arguments, const char* and std::string for string arguments, DeclarationName and const IdentifierInfo* for names, QualType for types, etc. SourceRanges are also specified with the << operator, but do not have a specific ordering requirement.

As you can see, adding and producing a diagnostic is pretty straightforward. The hard part is deciding exactly what you need to say to help the user, picking a suitable wording, and providing the information needed to format it correctly. The good news is that the call site that issues a diagnostic should be completely independent of how the diagnostic is formatted and in what language it is rendered.

Fix-It Hints

In some cases, the front end emits diagnostics when it is clear that some small change to the source code would fix the problem. For example, a missing semicolon at the end of a statement or a use of deprecated syntax that is easily rewritten into a more modern form. Clang tries very hard to emit the diagnostic and recover gracefully in these and other cases.

However, for these cases where the fix is obvious, the diagnostic can be annotated with a hint (referred to as a “fix-it hint”) that describes how to change the code referenced by the diagnostic to fix the problem. For example, it might add the missing semicolon at the end of the statement or rewrite the use of a deprecated construct into something more palatable. Here is one such example from the C++ front end, where we warn about the right-shift operator changing meaning from C++98 to C++11:

test.cpp:3:7: warning: use of right-shift operator ('>>') in template argument
                       will require parentheses in C++11
A<100 >> 2> *a;
      ^
  (       )

Here, the fix-it hint is suggesting that parentheses be added, and showing exactly where those parentheses would be inserted into the source code. The fix-it hints themselves describe what changes to make to the source code in an abstract manner, which the text diagnostic printer renders as a line of “insertions” below the caret line. Other diagnostic clients might choose to render the code differently (e.g., as markup inline) or even give the user the ability to automatically fix the problem.

Fix-it hints on errors and warnings need to obey these rules:

  • Since they are automatically applied if -Xclang -fixit is passed to the driver, they should only be used when it’s very likely they match the user’s intent.

  • Clang must recover from errors as if the fix-it had been applied.

  • Fix-it hints on a warning must not change the meaning of the code. However, a hint may clarify the meaning as intentional, for example by adding parentheses when the precedence of operators isn’t obvious.

If a fix-it can’t obey these rules, put the fix-it on a note. Fix-its on notes are not applied automatically.

All fix-it hints are described by the FixItHint class, instances of which should be attached to the diagnostic using the << operator in the same way that highlighted source ranges and arguments are passed to the diagnostic. Fix-it hints can be created with one of three constructors:

  • FixItHint::CreateInsertion(Loc, Code)

    Specifies that the given Code (a string) should be inserted before the source location Loc.

  • FixItHint::CreateRemoval(Range)

    Specifies that the code in the given source Range should be removed.

  • FixItHint::CreateReplacement(Range, Code)

    Specifies that the code in the given source Range should be removed, and replaced with the given Code string.
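What a client that applies fix-its has to do can be sketched on a plain string. Everything below is hypothetical: ToyFixIt uses character offsets instead of SourceLocation and SourceRange, and the three helpers only mirror the shape of the constructors listed above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, offset-based stand-in for a fix-it hint: replace the
// half-open character range [Begin, End) with Code. An insertion has
// Begin == End; a removal has an empty Code.
struct ToyFixIt {
  std::size_t Begin;
  std::size_t End;
  std::string Code;
};

ToyFixIt createInsertion(std::size_t Loc, std::string Code) {
  return {Loc, Loc, std::move(Code)};
}
ToyFixIt createRemoval(std::size_t Begin, std::size_t End) {
  return {Begin, End, ""};
}
ToyFixIt createReplacement(std::size_t Begin, std::size_t End,
                           std::string Code) {
  return {Begin, End, std::move(Code)};
}

// Applies non-overlapping hints, last position first, so that the offsets of
// earlier hints stay valid while the text changes length.
std::string applyFixIts(std::string Source, std::vector<ToyFixIt> Hints) {
  std::sort(Hints.begin(), Hints.end(),
            [](const ToyFixIt &A, const ToyFixIt &B) {
              return A.Begin > B.Begin;
            });
  for (const ToyFixIt &H : Hints) {
    assert(H.Begin <= H.End && H.End <= Source.size());
    Source.replace(H.Begin, H.End - H.Begin, H.Code);
  }
  return Source;
}
```

Applying createInsertion(2, "(") and createInsertion(10, ")") to the line from the warning above turns "A<100 >> 2> *a;" into "A<(100 >> 2)> *a;", matching the columns where the text printer drew the parentheses.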

The DiagnosticConsumer Interface

Once code generates a diagnostic with all of the arguments and the rest of the relevant information, Clang needs to know what to do with it. As previously mentioned, the diagnostic machinery goes through some filtering to map a severity onto a diagnostic level, then (assuming the diagnostic is not mapped to “Ignore”) it invokes an object that implements the DiagnosticConsumer interface with the information.

It is possible to implement this interface in many different ways. For example, the normal Clang DiagnosticConsumer (named TextDiagnosticPrinter) turns the arguments into strings (according to the various formatting rules), prints out the file/line/column information and the string, then prints out the line of code, the source ranges, and the caret. However, this behavior isn’t required.

Another implementation of the DiagnosticConsumer interface is the TextDiagnosticBuffer class, which is used when Clang is in -verify mode. Instead of formatting and printing out the diagnostics, this implementation just captures and remembers the diagnostics as they fly by. Then -verify compares the list of produced diagnostics to the list of expected ones. If they disagree, it prints out its own output. Full documentation for the -verify mode can be found at Verifying Diagnostics.

There are many other possible implementations of this interface, and this is why we prefer diagnostics to pass down rich structured information in arguments. For example, an HTML output might want declaration names to be linkified to where they come from in the source. Another example is that a GUI might let you click on typedefs to expand them. This application would want to pass significantly more information about types through to the GUI than a simple flat string. The interface allows this to happen.

Adding Translations to Clang

Not possible yet! Diagnostic strings should be written in UTF-8; the client can translate to the relevant code page if needed. Each translation completely replaces the format string for the diagnostic.

The SourceLocation and SourceManager classes

Strangely enough, the SourceLocation class represents a location within the source code of the program. Important design points include:

  1. sizeof(SourceLocation) must be extremely small, as these are embedded into many AST nodes and are passed around often. Currently it is 32 bits.

  2. SourceLocation must be a simple value object that can be efficiently copied.

  3. We should be able to represent a source location for any byte of any input file. This includes in the middle of tokens, in whitespace, in trigraphs, etc.

  4. A SourceLocation must encode the current #include stack that was active when the location was processed. For example, if the location corresponds to a token, it should contain the set of #includes active when the token was lexed. This allows us to print the #include stack for a diagnostic.

  5. SourceLocation must be able to describe macro expansions, capturing both the ultimate instantiation point and the source of the original character data.

In practice, the SourceLocation works together with the SourceManager class to encode two pieces of information about a location: its spelling location and its expansion location. For most tokens, these will be the same. However, for a macro expansion (or tokens that came from a _Pragma directive) these will describe the location of the characters corresponding to the token and the location where the token was used (i.e., the macro expansion point or the location of the _Pragma itself).

The Clang front-end inherently depends on the location of a token being tracked correctly. If it is ever incorrect, the front-end may get confused and die. The reason for this is that the notion of the “spelling” of a Token in Clang depends on being able to find the original input characters for the token. This concept maps directly to the “spelling location” for the token.

SourceRange and CharSourceRange

Clang represents most source ranges by [first, last], where “first” and “last” each point to the beginning of their respective tokens. For example consider the SourceRange of the following statement:

x = foo + bar;
^first    ^last

To map from this representation to a character-based representation, the “last” location needs to be adjusted to point to (or past) the end of that token with either Lexer::MeasureTokenLength() or Lexer::getLocForEndOfToken(). For the rare cases where character-level source ranges information is needed we use the CharSourceRange class.
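The adjustment can be sketched on a plain string with offsets. The types and functions here are hypothetical stand-ins: measureTokenLength only understands identifiers and single-character tokens, whereas the real Lexer helpers re-lex the token at that location.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Toy token measurement: an identifier or number is measured in full, any
// other character counts as a one-character token.
std::size_t measureTokenLength(const std::string &Text, std::size_t Loc) {
  assert(Loc < Text.size());
  std::size_t End = Loc;
  while (End < Text.size() &&
         (std::isalnum(static_cast<unsigned char>(Text[End])) ||
          Text[End] == '_'))
    ++End;
  return End == Loc ? 1 : End - Loc;
}

struct TokenRange { std::size_t First, Last; }; // both point at token starts
struct CharRange { std::size_t Begin, End; };   // half-open [Begin, End)

// Moves the end of the range past the last token so that it covers every
// character of that token.
CharRange toCharRange(const std::string &Text, TokenRange R) {
  return {R.First, R.Last + measureTokenLength(Text, R.Last)};
}
```

For the statement above, first is offset 0 and last is offset 10 (the start of “bar”), so the character range becomes [0, 13) and covers “x = foo + bar”.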

The Driver Library

The clang Driver and library are documented here.

Precompiled Headers

Clang supports precompiled headers (PCH), which use a serialized representation of Clang’s internal data structures, encoded with the LLVM bitstream format.

The Frontend Library

The Frontend library contains functionality useful for building tools on top of the Clang libraries, for example several methods for outputting diagnostics.

Compiler Invocation

One of the classes provided by the Frontend library is CompilerInvocation, which holds information that describes the current invocation of the Clang -cc1 frontend. The information typically comes from the command line constructed by the Clang driver or from clients performing custom initialization. The data structure is split into logical units used by different parts of the compiler, for example PreprocessorOptions, LanguageOptions or CodeGenOptions.

Command Line Interface

The command line interface of the Clang -cc1 frontend is defined alongside the driver options in clang/Driver/Options.td. The information making up an option definition includes its prefix and name (for example -std=), form and position of the option value, help text, aliases and more. Each option may belong to a certain group and can be marked with zero or more flags. Options accepted by the -cc1 frontend are marked with the CC1Option flag.

Command Line Parsing

Option definitions are processed by the -gen-opt-parser-defs tablegen backend during early stages of the build. Options are then used for querying an instance of llvm::opt::ArgList, a wrapper around the command line arguments. This is done in the Clang driver to construct individual jobs based on the driver arguments and also in the CompilerInvocation::CreateFromArgs function that parses the -cc1 frontend arguments.

Command Line Generation

Any valid CompilerInvocation created from a -cc1 command line can also be serialized back into a semantically equivalent command line in a deterministic manner. This enables features such as implicitly discovered, explicitly built modules.
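The round trip can be illustrated with a miniature options struct. This is only a sketch of the idea: ToyOptions, parseArgs, generateArgs and the -toy-flag option are hypothetical, and -fpass-plugin= is borrowed from the running example used later in this section.

```cpp
#include <string>
#include <vector>

// Hypothetical miniature of an options class with two members.
struct ToyOptions {
  bool ToyFlag = false;                 // set by the made-up "-toy-flag"
  std::vector<std::string> PassPlugins; // set by "-fpass-plugin=<value>"
};

// Parses the arguments this toy understands and ignores everything else.
ToyOptions parseArgs(const std::vector<std::string> &Args) {
  const std::string Prefix = "-fpass-plugin=";
  ToyOptions Opts;
  for (const std::string &A : Args) {
    if (A == "-toy-flag")
      Opts.ToyFlag = true;
    else if (A.compare(0, Prefix.size(), Prefix) == 0)
      Opts.PassPlugins.push_back(A.substr(Prefix.size()));
  }
  return Opts;
}

// Emits the options in a fixed order, so equal option values always produce
// the same command line regardless of how the original was written.
std::vector<std::string> generateArgs(const ToyOptions &Opts) {
  std::vector<std::string> Args;
  if (Opts.ToyFlag)
    Args.push_back("-toy-flag");
  for (const std::string &Plugin : Opts.PassPlugins)
    Args.push_back("-fpass-plugin=" + Plugin);
  return Args;
}
```

Parsing "-fpass-plugin=a -toy-flag -fpass-plugin=b" and generating again yields "-toy-flag -fpass-plugin=a -fpass-plugin=b": the spelling order differs, but parsing the result gives the same option values, which is the sense of “semantically equivalent” above.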

Adding new Command Line Option

When adding a new command line option, the first place of interest is the header file declaring the corresponding options class (e.g. CodeGenOptions.h for a command line option that affects the code generation). Create a new member variable for the option value:

 class CodeGenOptions : public CodeGenOptionsBase {
+   /// List of dynamic shared object files to be loaded as pass plugins.
+   std::vector<std::string> PassPlugins;
 }

Next, declare the command line interface of the option in the tablegen file clang/include/clang/Driver/Options.td. This is done by instantiating the Option class (defined in llvm/include/llvm/Option/OptParser.td). The instance is typically created through one of the helper classes that encode the acceptable ways to specify the option value on the command line:

  • Flag - the option does not accept any value,

  • Joined - the value must immediately follow the option name within the same argument,

  • Separate - the value must follow the option name in the next command line argument,

  • JoinedOrSeparate - the value can be specified either as Joined or Separate,

  • CommaJoined - the values are comma-separated and must immediately follow the option name within the same argument (see Wl, for an example).

The helper classes take a list of acceptable prefixes of the option (e.g. "-", "--" or "/") and the option name:

 // Options.td
+ def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">;

Then, specify additional attributes via mix-ins:

  • HelpText holds the text that will be printed besides the option name when the user requests help (e.g. via clang --help).

  • Group specifies the “category” of options this option belongs to. This is used by various tools to categorize and sometimes filter options.

  • Flags may contain “tags” associated with the option. These may affect how the option is rendered, or if it’s hidden in some contexts.

  • Visibility should be used to specify the drivers in which a particular option would be available. This attribute will impact tool --help.

  • Alias denotes that the option is an alias of another option. This may be combined with AliasArgs that holds the implied value.

 // Options.td
 def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">,
+   Group<f_Group>, Visibility<[ClangOption, CC1Option]>,
+   HelpText<"Load pass plugin from a dynamic shared object file.">;

New options are recognized by the clang driver mode if Visibility is not specified or contains ClangOption. Options intended for clang -cc1 must be explicitly marked with the CC1Option flag. Flags that specify CC1Option but not ClangOption will only be accessible via -cc1. This is similar for other driver modes, such as clang-cl or flang.

Next, parse (or manufacture) the command line arguments in the Clang driver and use them to construct the -cc1 job:

 void Clang::ConstructJob(const ArgList &Args /*...*/) const {
   ArgStringList CmdArgs;
   // ...
+   for (const Arg *A : Args.filtered(OPT_fpass_plugin_EQ)) {
+     CmdArgs.push_back(Args.MakeArgString(Twine("-fpass-plugin=") + A->getValue()));
+     A->claim();
+   }
 }

The last step is implementing the -cc1 command line argument parsing/generation that initializes/serializes the option class (in our case CodeGenOptions) stored within CompilerInvocation. This can be done automatically by using the marshalling annotations on the option definition:

 // Options.td
 def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">,
   Group<f_Group>, Flags<[CC1Option]>,
   HelpText<"Load pass plugin from a dynamic shared object file.">,
+   MarshallingInfoStringVector<CodeGenOpts<"PassPlugins">>;

The inner workings of the system are introduced in the Option Marshalling Infrastructure section, and the available annotations are listed in the Option Marshalling Annotations section.

In case the marshalling infrastructure does not support the desired semantics, consider simplifying it to fit the existing model. This makes the command line more uniform and reduces the amount of custom, manually written code. Remember that the -cc1 command line interface is intended only for Clang developers, meaning it does not need to mirror the driver interface, maintain backward compatibility or be compatible with GCC.

If the option semantics cannot be encoded via marshalling annotations, you can resort to parsing/serializing the command line arguments manually:

  // CompilerInvocation.cpp

  static bool ParseCodeGenArgs(CodeGenOptions &Opts, ArgList &Args /*...*/) {
    // ...

+   Opts.PassPlugins = Args.getAllArgValues(OPT_fpass_plugin_EQ);
  }

  static void GenerateCodeGenArgs(const CodeGenOptions &Opts,
                                  SmallVectorImpl<const char *> &Args,
                                  CompilerInvocation::StringAllocator SA /*...*/) {
    // ...

+   for (const std::string &PassPlugin : Opts.PassPlugins)
+     GenerateArg(Args, OPT_fpass_plugin_EQ, PassPlugin, SA);
  }

Finally, you can specify the argument on the command line: clang -fpass-plugin=a -fpass-plugin=b and use the new member variable as desired.

  void EmitAssemblyHelper::EmitAssemblyWithNewPassManager(/*...*/) {
    // ...
+   for (auto &PluginFN : CodeGenOpts.PassPlugins)
+     if (auto PassPlugin = PassPlugin::Load(PluginFN))
+        PassPlugin->registerPassBuilderCallbacks(PB);
  }
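The manual parse/generate pair shown earlier must round-trip: arguments generated from an options object should parse back to the same state. The following is a minimal standalone sketch of that property, not Clang code. ToyCodeGenOptions, parseArgs and generateArgs are hypothetical names, and a plain std::vector<std::string> stands in for ArgList.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for CodeGenOptions; not the real Clang class.
struct ToyCodeGenOptions {
  std::vector<std::string> PassPlugins;
};

static const std::string PassPluginPrefix = "-fpass-plugin=";

// Collect the value of every "-fpass-plugin=<value>" argument, in order.
inline ToyCodeGenOptions parseArgs(const std::vector<std::string> &Args) {
  ToyCodeGenOptions Opts;
  for (const std::string &Arg : Args)
    if (Arg.compare(0, PassPluginPrefix.size(), PassPluginPrefix) == 0)
      Opts.PassPlugins.push_back(Arg.substr(PassPluginPrefix.size()));
  return Opts;
}

// Serialize the options back into arguments that parse to the same state.
inline std::vector<std::string> generateArgs(const ToyCodeGenOptions &Opts) {
  std::vector<std::string> Args;
  for (const std::string &Plugin : Opts.PassPlugins)
    Args.push_back(PassPluginPrefix + Plugin);
  return Args;
}
```

In this sketch, parsing the output of generateArgs yields the same PassPlugins vector, which is the invariant the real parse and generate functions are expected to maintain.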

Option Marshalling Infrastructure

The option marshalling infrastructure automates the parsing of the Clang -cc1 frontend command line arguments into CompilerInvocation and their generation from CompilerInvocation. The system replaces lots of repetitive C++ code with simple, declarative tablegen annotations and it’s being used for the majority of the -cc1 command line interface. This section provides an overview of the system.

Note: The marshalling infrastructure is not intended for driver-only options. Only options of the -cc1 frontend need to be marshalled to/from a CompilerInvocation instance.

To read and modify contents of CompilerInvocation, the marshalling system uses key paths, which are declared in two steps. First, a tablegen definition for the CompilerInvocation member is created by inheriting from KeyPathAndMacro:

// Options.td

class LangOpts<string field> : KeyPathAndMacro<"LangOpts->", field, "LANG_"> {}
//                   CompilerInvocation member  ^^^^^^^^^^
//                                    OPTION_WITH_MARSHALLING prefix ^^^^^

The first argument to the parent class is the beginning of the key path that references the CompilerInvocation member. This argument ends with -> if the member is a pointer type or with . if it’s a value type. The child class takes a single parameter field that is forwarded as the second argument to the base class. The child class can then be used like so: LangOpts<"IgnoreExceptions">, constructing a key path to the field LangOpts->IgnoreExceptions. The third argument passed to the parent class is a string that the tablegen backend uses as a prefix to the OPTION_WITH_MARSHALLING macro. Using the key path as a mix-in on an Option instance instructs the backend to generate the following code:

// Options.inc
#ifdef LANG_OPTION_WITH_MARSHALLING
LANG_OPTION_WITH_MARSHALLING([...], LangOpts->IgnoreExceptions, [...])
#endif // LANG_OPTION_WITH_MARSHALLING

Such a definition can be used in the functions for parsing and generating the command line:

// clang/lib/Frontend/CompilerInvocation.cpp

bool CompilerInvocation::ParseLangArgs(LangOptions *LangOpts, ArgList &Args,
                                       DiagnosticsEngine &Diags) {
  bool Success = true;

#define LANG_OPTION_WITH_MARSHALLING(                                          \
    PREFIX_TYPE, NAME, ID, KIND, GROUP, ALIAS, ALIASARGS, FLAGS, PARAM,        \
    HELPTEXT, METAVAR, VALUES, SPELLING, SHOULD_PARSE, ALWAYS_EMIT, KEYPATH,   \
    DEFAULT_VALUE, IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, DENORMALIZER,     \
    MERGER, EXTRACTOR, TABLE_INDEX)                                            \
  PARSE_OPTION_WITH_MARSHALLING(Args, Diags, Success, ID, FLAGS, PARAM,        \
                                SHOULD_PARSE, KEYPATH, DEFAULT_VALUE,          \
                                IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER,      \
                                MERGER, TABLE_INDEX)
#include "clang/Driver/Options.inc"
#undef LANG_OPTION_WITH_MARSHALLING

  // ...

  return Success;
}

void CompilerInvocation::GenerateLangArgs(LangOptions *LangOpts,
                                          SmallVectorImpl<const char *> &Args,
                                          StringAllocator SA) {
#define LANG_OPTION_WITH_MARSHALLING(                                          \
    PREFIX_TYPE, NAME, ID, KIND, GROUP, ALIAS, ALIASARGS, FLAGS, PARAM,        \
    HELPTEXT, METAVAR, VALUES, SPELLING, SHOULD_PARSE, ALWAYS_EMIT, KEYPATH,   \
    DEFAULT_VALUE, IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, DENORMALIZER,     \
    MERGER, EXTRACTOR, TABLE_INDEX)                                            \
  GENERATE_OPTION_WITH_MARSHALLING(                                            \
      Args, SA, KIND, FLAGS, SPELLING, ALWAYS_EMIT, KEYPATH, DEFAULT_VALUE,    \
      IMPLIED_CHECK, IMPLIED_VALUE, DENORMALIZER, EXTRACTOR, TABLE_INDEX)
#include "clang/Driver/Options.inc"
#undef LANG_OPTION_WITH_MARSHALLING

  // ...
}

The PARSE_OPTION_WITH_MARSHALLING and GENERATE_OPTION_WITH_MARSHALLING macros are defined in CompilerInvocation.cpp and they implement the generic algorithm for parsing and generating command line arguments.
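The technique at work here is the "X-macro" pattern: one list of option entries is expanded twice, once with a macro that parses and once with a macro that generates. The following standalone sketch illustrates the pattern under simplifying assumptions. TOY_LANG_OPTIONS plays the role of the generated Options.inc, every option is a positive flag, and all names (ToyLangOptions, SecondFlag, -ftoy-second-flag, the PARSE_/GENERATE_ macros) are hypothetical rather than Clang's.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the generated Options.inc: one entry per option,
// giving the flag spelling and the key path (here simply a member name).
#define TOY_LANG_OPTIONS(X)                                                    \
  X("-fignore-exceptions", IgnoreExceptions)                                   \
  X("-ftoy-second-flag", SecondFlag)

struct ToyLangOptions {
  bool IgnoreExceptions = false;
  bool SecondFlag = false;
};

// Parsing: each entry expands to "if the flag is present, set the key path".
inline void parseToyLangArgs(ToyLangOptions &Opts,
                             const std::vector<std::string> &Args) {
  for (const std::string &Arg : Args) {
#define PARSE_TOY_OPTION(SPELLING, KEYPATH)                                    \
  if (Arg == SPELLING)                                                         \
    Opts.KEYPATH = true;
    TOY_LANG_OPTIONS(PARSE_TOY_OPTION)
#undef PARSE_TOY_OPTION
  }
}

// Generation: each entry expands to "if the key path is set, emit the flag".
inline std::vector<std::string>
generateToyLangArgs(const ToyLangOptions &Opts) {
  std::vector<std::string> Args;
#define GENERATE_TOY_OPTION(SPELLING, KEYPATH)                                 \
  if (Opts.KEYPATH)                                                            \
    Args.push_back(SPELLING);
  TOY_LANG_OPTIONS(GENERATE_TOY_OPTION)
#undef GENERATE_TOY_OPTION
  return Args;
}
```

Adding an option to the list updates both directions at once, which is the main benefit of keeping the per-option data declarative.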

Option Marshalling Annotations

How does the tablegen backend know what to put in place of [...] in the generated Options.inc? This is specified by the Marshalling utilities described below. All of them take a key path argument and possibly other information required for parsing or generating the command line argument.

Note: The marshalling infrastructure is not intended for driver-only options. Only options of the -cc1 frontend need to be marshalled to/from a CompilerInvocation instance.

Positive Flag

The key path defaults to false and is set to true when the flag is present on the command line.

def fignore_exceptions : Flag<["-"], "fignore-exceptions">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoFlag<LangOpts<"IgnoreExceptions">>;

Negative Flag

The key path defaults to true and is set to false when the flag is present on the command line.

def fno_verbose_asm : Flag<["-"], "fno-verbose-asm">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoNegativeFlag<CodeGenOpts<"AsmVerbose">>;

Negative and Positive Flag

The key path defaults to the specified value (false, true or some boolean value that’s statically unknown in the tablegen file). Then, the key path is set to the value associated with the flag that appears last on the command line.

defm legacy_pass_manager : BoolOption<"f", "legacy-pass-manager",
  CodeGenOpts<"LegacyPassManager">, DefaultFalse,
  PosFlag<SetTrue, [], [], "Use the legacy pass manager in LLVM">,
  NegFlag<SetFalse, [], [], "Use the new pass manager in LLVM">,
  BothFlags<[], [ClangOption, CC1Option]>>;

With most such pairs of flags, the -cc1 frontend accepts only the flag that changes the default key path value. The Clang driver is responsible for accepting both and either forwarding the changing flag or discarding the flag that would just set the key path to its default.

The first argument to BoolOption is a prefix that is used to construct the full names of both flags. The positive flag would then be named flegacy-pass-manager and the negative fno-legacy-pass-manager. BoolOption also implies the - prefix for both flags. It’s also possible to use BoolFOption that implies the "f" prefix and Group<f_Group>. The PosFlag and NegFlag classes hold the associated boolean value, arrays of elements passed to the Flag and Visibility classes and the help text. The optional BothFlags class holds arrays of Flag and Visibility elements that are common for both the positive and negative flag and their common help text suffix.
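The "last flag wins" rule for a positive/negative pair can be sketched in a few lines. This is an illustration of the semantics only, assuming arguments arrive as plain strings. resolveBoolOption is a hypothetical helper, not part of Clang.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Resolve a positive/negative flag pair: start from the default and let each
// occurrence overwrite the value, so the flag that appears last wins.
inline bool resolveBoolOption(const std::vector<std::string> &Args,
                              const std::string &PosFlag,
                              const std::string &NegFlag, bool Default) {
  bool Value = Default;
  for (const std::string &Arg : Args) {
    if (Arg == PosFlag)
      Value = true;
    else if (Arg == NegFlag)
      Value = false;
  }
  return Value;
}
```

With a default of false, the sequence -fno-legacy-pass-manager -flegacy-pass-manager resolves to true, and the reverse order resolves to false.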

String

The key path defaults to the specified string, or an empty one, if omitted. When the option appears on the command line, the argument value is simply copied.

def isysroot : JoinedOrSeparate<["-"], "isysroot">,
  Visibility<[ClangOption, CC1Option, FlangOption]>,
  MarshallingInfoString<HeaderSearchOpts<"Sysroot">, [{"/"}]>;

List of Strings

The key path defaults to an empty std::vector<std::string>. Values specified with each appearance of the option on the command line are appended to the vector.

def frewrite_map_file : Separate<["-"], "frewrite-map-file">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoStringVector<CodeGenOpts<"RewriteMapFiles">>;

Integer

The key path defaults to the specified integer value, or 0 if omitted. When the option appears on the command line, its value gets parsed by llvm::APInt and the result is assigned to the key path on success.

def mstack_probe_size : Joined<["-"], "mstack-probe-size=">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoInt<CodeGenOpts<"StackProbeSize">, "4096">;

Enumeration

The key path defaults to the value specified in MarshallingInfoEnum prefixed by the contents of NormalizedValuesScope and ::. This ensures that a correct reference to an enum case is formed even if the enum resides in a different namespace or is an enum class. If the value present on the command line does not match any of the comma-separated values from Values, an error diagnostic is issued. Otherwise, the corresponding element from NormalizedValues at the same index is assigned to the key path (also correctly scoped). The number of comma-separated string values and elements of the array within NormalizedValues must match.

def mthread_model : Separate<["-"], "mthread-model">,
  Visibility<[ClangOption, CC1Option]>,
  Values<"posix,single">, NormalizedValues<["POSIX", "Single"]>,
  NormalizedValuesScope<"LangOptions::ThreadModelKind">,
  MarshallingInfoEnum<LangOpts<"ThreadModel">, "POSIX">;
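The index-based pairing of Values and NormalizedValues can be pictured with two parallel arrays. The sketch below is a hypothetical illustration, not the generated code: ThreadModelKind is redeclared locally, and normalizeThreadModel is an invented helper that reports an unknown value by returning false, which is the point where a diagnostic would be issued.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class ThreadModelKind { POSIX, Single };

// Map a command-line value to the normalized value at the same index.
// Returns false and leaves Result untouched when the value is not listed.
inline bool normalizeThreadModel(const std::string &Value,
                                 ThreadModelKind &Result) {
  static const char *const Values[] = {"posix", "single"};
  static const ThreadModelKind NormalizedValues[] = {ThreadModelKind::POSIX,
                                                     ThreadModelKind::Single};
  // Both arrays must have the same number of elements.
  static_assert(sizeof(Values) / sizeof(Values[0]) ==
                    sizeof(NormalizedValues) / sizeof(NormalizedValues[0]),
                "Values and NormalizedValues must match in length");
  for (std::size_t I = 0; I < sizeof(Values) / sizeof(Values[0]); ++I) {
    if (Value == Values[I]) {
      Result = NormalizedValues[I];
      return true;
    }
  }
  return false;
}
```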

It is also possible to define relationships between options.

Implication

The key path defaults to the default value from the primary Marshalling annotation. Then, if any of the elements of ImpliedByAnyOf evaluate to true, the key path value is changed to the specified value or true if missing. Finally, the command line is parsed according to the primary annotation.

def fms_extensions : Flag<["-"], "fms-extensions">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoFlag<LangOpts<"MicrosoftExt">>,
  ImpliedByAnyOf<[fms_compatibility.KeyPath], "true">;
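The three-step order described for implication (default, then implied value, then the option's own parse) can be sketched as follows. This is a simplified model rather than the generated code. ToyLangOptions, hasFlag and parseToyLangArgs are hypothetical, and the MSVCCompat member is an assumed name for the key path of fms_compatibility.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct ToyLangOptions {
  bool MSVCCompat = false;   // assumed key path of -fms-compatibility
  bool MicrosoftExt = false; // key path of -fms-extensions
};

inline bool hasFlag(const std::vector<std::string> &Args,
                    const std::string &Flag) {
  return std::find(Args.begin(), Args.end(), Flag) != Args.end();
}

inline ToyLangOptions parseToyLangArgs(const std::vector<std::string> &Args) {
  ToyLangOptions Opts;
  // The implying option is handled first so its key path is already final.
  Opts.MSVCCompat = hasFlag(Args, "-fms-compatibility");

  // Step 1: start from the default of the primary annotation.
  Opts.MicrosoftExt = false;
  // Step 2: apply the implied value if any implying key path is true.
  if (Opts.MSVCCompat)
    Opts.MicrosoftExt = true;
  // Step 3: parse the option itself according to the primary annotation.
  if (hasFlag(Args, "-fms-extensions"))
    Opts.MicrosoftExt = true;
  return Opts;
}
```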

Condition

The option is parsed only if the expression in ShouldParseIf evaluates to true.

def fopenmp_enable_irbuilder : Flag<["-"], "fopenmp-enable-irbuilder">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoFlag<LangOpts<"OpenMPIRBuilder">>,
  ShouldParseIf<fopenmp.KeyPath>;

The Lexer and Preprocessor Library

The Lexer library contains several tightly-connected classes that are involved with the nasty process of lexing and preprocessing C source code. The main interface to this library for outside clients is the large Preprocessor class. It contains the various pieces of state that are required to coherently read tokens out of a translation unit.

The core interface to the Preprocessor object (once it is set up) is the Preprocessor::Lex method, which returns the next Token from the preprocessor stream. There are two types of token providers that the preprocessor is capable of reading from: a buffer lexer (provided by the Lexer class) and a buffered token stream (provided by the TokenLexer class).

The Token class

The Token class is used to represent a single lexed token. Tokens are intended to be used by the lexer/preprocessor and parser libraries, but are not intended to live beyond them (for example, they should not live in the ASTs).

Tokens most often live on the stack (or some other location that is efficient to access) as the parser is running, but occasionally do get buffered up. For example, macro definitions are stored as a series of tokens, and the C++ front-end periodically needs to buffer tokens up for tentative parsing and various pieces of look-ahead. As such, the size of a Token matters. On a 32-bit system, sizeof(Token) is currently 16 bytes.

Tokens occur in two forms: annotation tokens and normal tokens. Normal tokens are those returned by the lexer; annotation tokens represent semantic information and are produced by the parser, replacing normal tokens in the token stream. Normal tokens contain the following information:

  • A SourceLocation: This indicates the location of the start of the token.

  • A length: This stores the length of the token as stored in the SourceBuffer. For tokens that include them, this length includes trigraphs and escaped newlines which are ignored by later phases of the compiler. By pointing into the original source buffer, it is always possible to get the original spelling of a token completely accurately.

  • IdentifierInfo: If a token takes the form of an identifier, and if identifier lookup was enabled when the token was lexed (e.g., the lexer was not reading in “raw” mode) this contains a pointer to the unique hash value for the identifier. Because the lookup happens before keyword identification, this field is set even for language keywords like “for”.

  • TokenKind: This indicates the kind of token as classified by the lexer. This includes things like tok::starequal (for the “*=” operator), tok::ampamp for the “&&” token, and keyword values (e.g., tok::kw_for) for identifiers that correspond to keywords. Note that some tokens can be spelled multiple ways. For example, C++ supports “operator keywords”, where things like “and” are treated exactly like the “&&” operator. In these cases, the kind value is set to tok::ampamp, which is good for the parser, which doesn’t have to consider both forms. For something that cares about which form is used (e.g., the preprocessor “stringize” operator) the spelling indicates the original form.

  • Flags: There are currently four flags tracked by the lexer/preprocessor system on a per-token basis:

    1. StartOfLine: This was the first token that occurred on its input source line.

    2. LeadingSpace: There was a space character either immediately before the token or transitively before the token as it was expanded through a macro. This flag is defined very closely by the stringizing requirements of the preprocessor.

    3. DisableExpand: This flag is used internally to the preprocessor to represent identifier tokens which have macro expansion disabled. This prevents them from being considered as candidates for macro expansion ever in the future.

    4. NeedsCleaning: This flag is set if the original spelling for the token includes a trigraph or escaped newline. Since this is uncommon, many pieces of code can fast-path on tokens that did not need cleaning.

One interesting (and somewhat unusual) aspect of normal tokens is that they don’t contain any semantic information about the lexed value. For example, if the token was a pp-number token, we do not represent the value of the number that was lexed (this is left for later pieces of code to decide). Additionally, the lexer library has no notion of typedef names vs variable names: both are returned as identifiers, and the parser is left to decide whether a specific identifier is a typedef or a variable (tracking this requires scope information among other things). The parser can do this translation by replacing tokens returned by the preprocessor with “Annotation Tokens”.
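To make the field list above concrete, here is a deliberately simplified token layout with the four per-token flags packed into a bitfield. It is a sketch only: ToyToken and ToyTokenFlags are hypothetical, the field widths are assumptions, and the real clang::Token differs in detail.

```cpp
#include <cassert>
#include <cstdint>

// The four per-token flags described above, one bit each.
enum ToyTokenFlags : std::uint8_t {
  StartOfLine = 1 << 0,   // first token on its source line
  LeadingSpace = 1 << 1,  // whitespace preceded the token
  DisableExpand = 1 << 2, // never macro-expand this identifier
  NeedsCleaning = 1 << 3, // spelling has a trigraph or escaped newline
};

// Hypothetical, simplified token; carries no semantic value of its own.
struct ToyToken {
  std::uint32_t Loc = 0;           // encoded start location
  std::uint32_t Length = 0;        // length in the source buffer
  const void *IdentInfo = nullptr; // identifier info, if any
  std::uint16_t Kind = 0;          // token kind as classified by the lexer
  std::uint8_t Flags = 0;          // bitwise OR of ToyTokenFlags

  void setFlag(ToyTokenFlags F) {
    Flags = static_cast<std::uint8_t>(Flags | F);
  }
  void clearFlag(ToyTokenFlags F) {
    Flags = static_cast<std::uint8_t>(Flags & ~F);
  }
  bool hasFlag(ToyTokenFlags F) const { return (Flags & F) != 0; }
};
```

Code that only cares about the common case can test a single bit, for example skipping the slow spelling path whenever hasFlag(NeedsCleaning) is false.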

Annotation Tokens

Annotation tokens are tokens that are synthesized by the parser and injected into the preprocessor’s token stream (replacing existing tokens) to record semantic information found by the parser. For example, if “foo” is found to be a typedef, the “foo” tok::identifier token is replaced with a tok::annot_typename. This is useful for a couple of reasons: 1) this makes it easy to handle qualified type names (e.g., “foo::bar::baz<42>::t”) in C++ as a single “token” in the parser. 2) if the parser backtracks, the reparse does not need to redo semantic analysis to determine whether a token sequence is a variable, type, template, etc.

Annotation tokens are created by the parser and reinjected into the parser’s token stream (when backtracking is enabled). Because they can only exist in tokens that the preprocessor-proper is done with, it doesn’t need to keep around flags like “start of line” that the preprocessor uses to do its job. Additionally, an annotation token may “cover” a sequence of preprocessor tokens (e.g., “a::b::c” is five preprocessor tokens). As such, the valid fields of an annotation token are different than the fields for a normal token (but they are multiplexed into the normal Token fields):

  • SourceLocation “Location”: The SourceLocation for the annotation token indicates the first token replaced by the annotation token. In the example above, it would be the location of the “a” identifier.

  • SourceLocation “AnnotationEndLoc”: This holds the location of the last token replaced with the annotation token. In the example above, it would be the location of the “c” identifier.

  • void* “AnnotationValue”: This contains an opaque object that the parser gets from Sema. The parser merely preserves the information for Sema to later interpret based on the annotation token kind.

  • TokenKind “Kind”: This indicates the kind of Annotation token this is. See below for the different valid kinds.

Annotation tokens currently come in three kinds:

  1. tok::annot_typename: This annotation token represents a resolved typename token that is potentially qualified. The AnnotationValue field contains the QualType returned by Sema::getTypeName(), possibly with source location information attached.

  2. tok::annot_cxxscope: This annotation token represents a C++ scope specifier, such as “A::B::”. This corresponds to the grammar productions “::” and “:: [opt] nested-name-specifier”. The AnnotationValue pointer is a NestedNameSpecifier* returned by the Sema::ActOnCXXGlobalScopeSpecifier and Sema::ActOnCXXNestedNameSpecifier callbacks.

  3. tok::annot_template_id: This annotation token represents a C++ template-id such as “foo<int, 4>”, where “foo” is the name of a template. The AnnotationValue pointer is a pointer to a malloc’d TemplateIdAnnotation object. Depending on the context, a parsed template-id that names a type might become a typename annotation token (if all we care about is the named type, e.g., because it occurs in a type specifier) or might remain a template-id token (if we want to retain more source location information or produce a new type, e.g., in a declaration of a class template specialization). template-id annotation tokens that refer to a type can be “upgraded” to typename annotation tokens by the parser.

As mentioned above, annotation tokens are not returned by the preprocessor; they are formed on demand by the parser. This means that the parser has to be aware of cases where an annotation could occur and form it where appropriate. This is somewhat similar to how the parser handles Translation Phase 6 of C99: String Concatenation (see C99 5.1.1.2). In the case of string concatenation, the preprocessor just returns distinct tok::string_literal and tok::wide_string_literal tokens and the parser eats a sequence of them wherever the grammar indicates that a string literal can occur.

In order to do this, whenever the parser expects a tok::identifier or tok::coloncolon, it should call the TryAnnotateTypeOrScopeToken or TryAnnotateCXXScopeToken methods to form the annotation token. These methods will maximally form the specified annotation tokens and replace the current token with them, if applicable. If the current token is not valid for an annotation token, it will remain an identifier or “::” token.
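The idea of "maximally forming" an annotation and replacing the covered tokens can be illustrated with a purely syntactic toy. The sketch below is an assumption-laden model, not the parser's algorithm: the real methods consult Sema to decide what a name refers to, whereas tryAnnotateScope (a hypothetical function over a hypothetical ToyToken) simply folds the longest run of "identifier ::" pairs into one token.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class ToyKind { Identifier, ColonColon, AnnotCXXScope };

struct ToyToken {
  ToyKind Kind;
  std::string Text; // spelling; for annotations, the text that was covered
};

// If the tokens at Pos begin with one or more "identifier ::" pairs, replace
// the longest such run with a single AnnotCXXScope token and return true.
// Otherwise leave the stream untouched and return false.
inline bool tryAnnotateScope(std::vector<ToyToken> &Toks, std::size_t Pos) {
  std::size_t End = Pos;
  std::string Covered;
  while (End + 1 < Toks.size() && Toks[End].Kind == ToyKind::Identifier &&
         Toks[End + 1].Kind == ToyKind::ColonColon) {
    Covered += Toks[End].Text + "::";
    End += 2;
  }
  if (End == Pos)
    return false; // not a scope specifier; the current token stays as it is
  Toks.erase(Toks.begin() + Pos, Toks.begin() + End);
  Toks.insert(Toks.begin() + Pos, ToyToken{ToyKind::AnnotCXXScope, Covered});
  return true;
}
```

For the five tokens of “a::b::c”, the sketch leaves two tokens behind: one annotation covering “a::b::” and the identifier “c”.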

The Lexer class

The Lexer class provides the mechanics of lexing tokens out of a source buffer and deciding what they mean. The Lexer is complicated by the fact that it operates on raw buffers that have not had spelling eliminated (this is a necessity to get decent performance), but this is countered with careful coding as well as standard performance techniques (for example, the comment handling code is vectorized on X86 and PowerPC hosts).

The lexer has a couple of interesting modal features:

  • The lexer can operate in “raw” mode. This mode has several features that make it possible to quickly lex the file (e.g., it stops identifier lookup, doesn’t specially handle preprocessor tokens, handles EOF differently, etc). This mode is used for lexing within an “#if 0” block, for example.

  • The lexer can capture and return comments as tokens. This is required to support the -C preprocessor mode, which passes comments through, and is used by the diagnostic checker to identify expect-error annotations.

  • The lexer can be in ParsingFilename mode, which happens when preprocessing after reading a #include directive. This mode changes the parsing of “<” to return an “angled string” instead of a bunch of tokens for each thing within the filename.

  • When parsing a preprocessor directive (after “#”) the ParsingPreprocessorDirective mode is entered. This changes the lexer to return EOD at a newline.

  • The Lexer uses a LangOptions object to know whether trigraphs are enabled, whether C++ or ObjC keywords are recognized, etc.

In addition to these modes, the lexer keeps track of a couple of other features that are local to a lexed buffer, which change as the buffer is lexed:

  • The Lexer uses BufferPtr to keep track of the current character being lexed.

  • The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token will start with its “start of line” bit set.

  • The Lexer keeps track of the current “#if” directives that are active (which can be nested).

  • The Lexer keeps track of a MultipleIncludeOpt object, which is used to detect whether the buffer uses the standard “#ifndef XX / #define XX” idiom to prevent multiple inclusion. If a buffer does, subsequent includes can be ignored if the “XX” macro is defined.

The TokenLexer class

The TokenLexer class is a token provider that returns tokens from a list of tokens that came from somewhere else. It is typically used for two things: 1) returning tokens from a macro definition as it is being expanded, and 2) returning tokens from an arbitrary buffer of tokens. The latter use is employed by _Pragma and will most likely be used to handle unbounded look-ahead for the C++ parser.

The MultipleIncludeOpt class

The MultipleIncludeOpt class implements a really simple little state machine that is used to detect the standard “#ifndef XX / #define XX” idiom that people typically use to prevent multiple inclusion of headers. If a buffer uses this idiom and is subsequently #include’d, the preprocessor can simply check to see whether the guarding condition is defined or not. If so, the preprocessor can completely ignore the include of the header.
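A state machine of this kind can be sketched over a pre-split list of lines. The version below is a hypothetical simplification and not the MultipleIncludeOpt implementation: it assumes one directive per line with no leading whitespace or comments, and detectIncludeGuard is an invented name. It accepts a buffer only when the first non-blank line is “#ifndef XX” and nothing but blank lines follows the matching “#endif”.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the guard macro name if Lines use the "#ifndef XX ... #endif" idiom
// around the entire buffer, or an empty string otherwise.
inline std::string detectIncludeGuard(const std::vector<std::string> &Lines) {
  std::string Guard;
  int Depth = 0;       // nesting depth of conditional directives
  bool Closed = false; // the guarding #ifndef has been closed
  for (const std::string &Line : Lines) {
    if (Line.empty())
      continue; // blank lines never disqualify a buffer
    bool IsIf = Line.compare(0, 3, "#if") == 0; // #if, #ifdef, #ifndef
    bool IsEndif = Line.compare(0, 6, "#endif") == 0;
    if (Depth == 0) {
      // At top level only the opening "#ifndef XX" is allowed, and only once.
      if (Closed || Line.compare(0, 8, "#ifndef ") != 0)
        return "";
      Guard = Line.substr(8);
      Depth = 1;
    } else if (IsIf) {
      ++Depth;
    } else if (IsEndif) {
      if (--Depth == 0)
        Closed = true;
    }
  }
  return Closed ? Guard : "";
}
```

Once a guard name is known for a buffer, a later include of the same file only needs a check of whether that macro is currently defined.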

The Parser Library

This library contains a recursive-descent parser that polls tokens from the preprocessor and notifies a client of the parsing progress.

Historically, the parser used to talk to an abstract Action interface that had virtual methods for parse events, for example ActOnBinOp(). When Clang grew C++ support, the parser stopped supporting general Action clients – it now always talks to the Sema library. However, the Parser still accesses AST objects only through opaque types like ExprResult and StmtResult. Only Sema looks at the AST node contents of these wrappers.

The AST Library

Design philosophy

Immutability

Clang AST nodes (types, declarations, statements, expressions, and so on) are generally designed to be immutable once created. This provides a number of key benefits:

  • Canonicalization of the “meaning” of nodes is possible as soon as the nodes are created, and is not invalidated by later addition of more information. For example, we canonicalize types, and use a canonicalized representation of expressions when determining whether two function template declarations involving dependent expressions declare the same entity.

  • AST nodes can be reused when they have the same meaning. For example, we reuse Type nodes when representing the same type (but maintain separate TypeLocs for each instance where a type is written), and we reuse non-dependent Stmt and Expr nodes across instantiations of a template.

  • Serialization and deserialization of the AST to/from AST files is simpler: we do not need to track modifications made to AST nodes imported from AST files and serialize separate “update records”.

There are unfortunately exceptions to this general approach, such as:

  • The first declaration of a redeclarable entity maintains a pointer to the most recent declaration of that entity, which naturally needs to change as more declarations are parsed.

  • Name lookup tables in declaration contexts change after the namespace declaration is formed.

  • We attempt to maintain only a single declaration for an instantiation of a template, rather than having distinct declarations for an instantiation of the declaration versus the definition, so template instantiation often updates parts of existing declarations.

  • Some parts of declarations are required to be instantiated separately (this includes default arguments and exception specifications), and such instantiations update the existing declaration.

These cases tend to be fragile; mutable AST state should be avoided where possible.

As a consequence of this design principle, we typically do not provide setters for AST state. (Some are provided for short-term modifications intended to be used immediately after an AST node is created and before it’s “published” as part of the complete AST, or where language semantics require after-the-fact updates.)

Faithfulness

The AST intends to provide a representation of the program that is faithful to the original source. We intend for it to be possible to write refactoring tools using only information stored in, or easily reconstructible from, the Clang AST. This means that the AST representation should either not desugar source-level constructs to simpler forms, or – where made necessary by language semantics or a clear engineering tradeoff – should desugar minimally and wrap the result in a construct representing the original source form.

For example, CXXForRangeStmt directly represents the syntactic form of a range-based for statement, but also holds a semantic representation of the range declaration and iterator declarations. It does not contain a fully-desugared ForStmt, however.

Some AST nodes (for example, ParenExpr) represent only syntax, and others (for example, ImplicitCastExpr) represent only semantics, but most nodes will represent a combination of syntax and associated semantics. Inheritance is typically used when representing different (but related) syntaxes for nodes with the same or similar semantics.

The Type class and its subclasses

The Type class (and its subclasses) is an important part of the AST. Types are accessed through the ASTContext class, which implicitly creates and uniques them as they are needed. Types have a couple of non-obvious features: 1) they do not capture type qualifiers like const or volatile (see QualType), and 2) they implicitly capture typedef information. Once created, types are immutable (unlike decls).

Typedefs in C make semantic analysis a bit more complex than it would be without them. The issue is that we want to capture typedef information and represent it in the AST perfectly, but the semantics of operations need to “see through” typedefs. For example, consider this code:

void func() {
  typedef int foo;
  foo X, *Y;
  typedef foo *bar;
  bar Z;
  *X; // error
  **Y; // error
  **Z; // error
}

The code above is illegal, and thus we expect there to be diagnostics emitted on the annotated lines. In this example, we expect to get:

test.c:6:1: error: indirection requires pointer operand ('foo' invalid)
  *X; // error
  ^~
test.c:7:1: error: indirection requires pointer operand ('foo' invalid)
  **Y; // error
  ^~~
test.c:8:1: error: indirection requires pointer operand ('foo' invalid)
  **Z; // error
  ^~~

While this example is somewhat silly, it illustrates the point: we want to retain typedef information where possible, so that we can emit errors about “std::string” instead of “std::basic_string<char,std:...”. Doing this requires properly keeping typedef information (for example, the type of X is “foo”, not “int”), and requires properly propagating it through the various operators (for example, the type of *Y is “foo”, not “int”). In order to retain this information, the type of these expressions is an instance of the TypedefType class, which indicates that the type of these expressions is a typedef for “foo”.

Representing types like this is great for diagnostics, because the user-specified type is always immediately available. There are two problems with this: first, various semantic checks need to make judgements about the actual structure of a type, ignoring typedefs. Second, we need an efficient way to query whether two types are structurally identical to each other, ignoring typedefs. The solution to both of these problems is the idea of canonical types.

Canonical Types

Every instance of the Type class contains a canonical type pointer. For simple types with no typedefs involved (e.g., “int”, “int*”, “int**”), the type just points to itself. For types that have a typedef somewhere in their structure (e.g., “foo”, “foo*”, “foo**”, “bar”), the canonical type pointer points to their structurally equivalent type without any typedefs (e.g., “int”, “int*”, “int**”, and “int*” respectively).

This design provides a constant time operation (dereferencing the canonical type pointer) that gives us access to the structure of types. For example, we can trivially tell that “bar” and “foo*” are the same type by dereferencing their canonical type pointers and doing a pointer comparison (they both point to the single “int*” type).
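The canonical-pointer scheme can be modelled with a tiny uniquing context. The sketch below is an illustration under stated assumptions, not Clang's data structures: ToyType, ToyContext and isSameType are hypothetical, only “int”, pointers and typedefs exist, and pointer types are uniqued per pointee.

```cpp
#include <cassert>
#include <deque>
#include <map>

struct ToyType {
  enum KindTy { Builtin, Pointer, Typedef };
  KindTy Kind;
  const ToyType *Inner;     // pointee or aliased type; null for a builtin
  const ToyType *Canonical; // structurally equivalent typedef-free type
};

class ToyContext {
  std::deque<ToyType> Storage; // push_back keeps element addresses stable
  std::map<const ToyType *, const ToyType *> PointerTypes; // uniquing table
  const ToyType *IntTy = nullptr;

  ToyType *create(ToyType::KindTy Kind, const ToyType *Inner) {
    Storage.push_back(ToyType{Kind, Inner, nullptr});
    return &Storage.back();
  }

public:
  const ToyType *getInt() {
    if (!IntTy) {
      ToyType *T = create(ToyType::Builtin, nullptr);
      T->Canonical = T; // no typedefs involved: points to itself
      IntTy = T;
    }
    return IntTy;
  }

  const ToyType *getPointer(const ToyType *Pointee) {
    auto It = PointerTypes.find(Pointee);
    if (It != PointerTypes.end())
      return It->second;
    // A pointer to a non-canonical pointee is canonically a pointer to the
    // pointee's canonical type.
    const ToyType *Canon = nullptr;
    if (Pointee->Canonical != Pointee)
      Canon = getPointer(Pointee->Canonical);
    ToyType *T = create(ToyType::Pointer, Pointee);
    T->Canonical = Canon ? Canon : T;
    PointerTypes[Pointee] = T;
    return T;
  }

  const ToyType *getTypedef(const ToyType *Underlying) {
    ToyType *T = create(ToyType::Typedef, Underlying);
    T->Canonical = Underlying->Canonical;
    return T;
  }
};

// Structural equality is one pointer comparison.
inline bool isSameType(const ToyType *A, const ToyType *B) {
  return A->Canonical == B->Canonical;
}
```

Building “foo” as a typedef of int and “bar” as a typedef of “foo*” in this model gives both “bar” and “foo*” the same canonical pointer, namely the unique “int*” node.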

Canonical types and typedef types bring up some complexities that must be carefully managed. Specifically, the isa/cast/dyn_cast operators generally shouldn’t be used in code that is inspecting the AST. For example, when type checking the indirection operator (unary “*” on a pointer), the type checker must verify that the operand has a pointer type. It would not be correct to check that with “isa<PointerType>(SubExpr->getType())”, because this predicate would fail if the subexpression had a typedef type.

The solution to this problem is a set of helper methods on Type, used to check their properties. In this case, it would be correct to use “SubExpr->getType()->isPointerType()” to do the check. This predicate will return true if the canonical type is a pointer, which is true any time the type is structurally a pointer type. The only hard part here is remembering not to use the isa/cast/dyn_cast operations.

The second problem we face is how to get access to the pointer type once we know it exists. To continue the example, the result type of the indirection operator is the pointee type of the subexpression. In order to determine the type, we need to get the instance of PointerType that best captures the typedef information in the program. If the type of the expression is literally a PointerType, we can return that, otherwise we have to dig through the typedefs to find the pointer type. For example, if the subexpression had type “foo*”, we could return that type as the result. If the subexpression had type “bar”, we want to return “foo*” (note that we do not want “int*”). In order to provide all of this, Type has a getAsPointerType() method that checks whether the type is structurally a PointerType and, if so, returns the best one. If not, it returns a null pointer.

This structure is somewhat mystical, but after meditating on it, it will make sense to you :).
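Both helpers can be sketched on a toy type node: the predicate looks only at the canonical type, while the accessor strips typedefs one level at a time so that as much sugar as possible is kept. ToyType, toyIsPointerType and toyGetAsPointerType are hypothetical names for illustration, and the node layout is an assumption rather than Clang's.

```cpp
#include <cassert>

struct ToyType {
  enum KindTy { Builtin, Pointer, Typedef };
  KindTy Kind;
  const ToyType *Inner;     // pointee or aliased type; null for a builtin
  const ToyType *Canonical; // structurally equivalent typedef-free type
};

// True whenever the type is structurally a pointer, typedefs or not.
inline bool toyIsPointerType(const ToyType *T) {
  return T->Canonical->Kind == ToyType::Pointer;
}

// Returns the outermost pointer node reachable by stripping typedefs, or
// null if the type is not structurally a pointer.
inline const ToyType *toyGetAsPointerType(const ToyType *T) {
  if (!toyIsPointerType(T))
    return nullptr;
  while (T->Kind == ToyType::Typedef)
    T = T->Inner; // peel one typedef; sugar below the pointer is preserved
  return T;
}
```

Checking Kind directly on a “bar” node would see a typedef and fail, which mirrors the isa<PointerType> pitfall described above. The accessor returns the “foo*” node for “bar”, not the canonical “int*”.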

TheQualType class

TheQualType class is designed as a trivial value class that is small,passed by-value and is efficient to query. The idea ofQualType is that itstores the type qualifiers (const,volatile,restrict, plus someextended qualifiers required by language extensions) separately from the typesthemselves.QualType is conceptually a pair of “Type*” and the bitsfor these type qualifiers.

By storing the type qualifiers as bits in the conceptual pair, it is extremely efficient to get the set of qualifiers on a QualType (just return the field of the pair), add a type qualifier (which is a trivial constant-time operation that sets a bit), and remove one or more type qualifiers (just return a QualType with the bitfield set to empty).

Further, because the bits are stored outside of the type itself, we do not need to create duplicates of types with different sets of qualifiers (i.e. there is only a single heap allocated “int” type: “const int” and “volatile const int” both point to the same heap allocated “int” type). This reduces the heap size used to represent bits and also means we do not have to consider qualifiers when uniquing types (Type does not even contain qualifiers).

In practice, the two most common type qualifiers (const and restrict) are stored in the low bits of the pointer to the Type object, together with a flag indicating whether extended qualifiers are present (which must be heap-allocated). This means that QualType is exactly the same size as a pointer.
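
The low-bit packing can be sketched as follows. This is a simplified model, not Clang's implementation: the Type struct, the flag names, and the three-bit mask are assumptions chosen to match the description above, and the heap-allocated storage for extended qualifiers is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a uniqued type node. The alignment guarantees
// that the low three bits of any Type* are zero and free for flags.
struct alignas(8) Type {
  const char *name;
};

class QualType {
  std::uintptr_t Value = 0; // Type* in the high bits, flags in the low bits
  static constexpr std::uintptr_t Mask = 0x7;

public:
  // Assumed flag layout, for illustration only.
  enum : std::uintptr_t { Const = 0x1, Restrict = 0x2, HasExtQuals = 0x4 };

  QualType() = default;
  QualType(const Type *T, std::uintptr_t Quals)
      : Value(reinterpret_cast<std::uintptr_t>(T) | (Quals & Mask)) {}

  const Type *getTypePtr() const {
    return reinterpret_cast<const Type *>(Value & ~Mask);
  }
  std::uintptr_t getQualifiers() const { return Value & Mask; }
  bool isConstQualified() const { return (Value & Const) != 0; }

  // Adding a qualifier sets one bit in a copy.
  QualType withConst() const {
    QualType R = *this;
    R.Value |= Const;
    return R;
  }
  // Removing all qualifiers clears the flag bits in a copy.
  QualType getUnqualifiedType() const {
    QualType R = *this;
    R.Value &= ~Mask;
    return R;
  }
};

static_assert(sizeof(QualType) == sizeof(void *),
              "the packed representation is pointer-sized");
```

Both the plain and the const-qualified value refer to the same Type object, so no second “int” node is ever needed.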

Declaration names

The DeclarationName class represents the name of a declaration in Clang. Declarations in the C family of languages can take several different forms. Most declarations are named by simple identifiers, e.g., “f” and “x” in the function declaration f(int x). In C++, declaration names can also name class constructors (“Class” in struct Class { Class(); }), class destructors (“~Class”), overloaded operator names (“operator+”), and conversion functions (“operator void const *”). In Objective-C, declaration names can refer to the names of Objective-C methods, which involve the method name and the parameters, collectively called a selector, e.g., “setWidth:height:”. Since all of these kinds of entities (variables, functions, Objective-C methods, C++ constructors, destructors, and operators) are represented as subclasses of Clang’s common NamedDecl class, DeclarationName is designed to efficiently represent any kind of name.

Given a DeclarationName N, N.getNameKind() will produce a value that describes what kind of name N stores. There are 10 options (all of the names are inside the DeclarationName class).

Identifier

The name is a simple identifier. Use N.getAsIdentifierInfo() to retrieve the corresponding IdentifierInfo* pointing to the actual identifier.

ObjCZeroArgSelector, ObjCOneArgSelector, ObjCMultiArgSelector

The name is an Objective-C selector, which can be retrieved as a Selector instance via N.getObjCSelector(). The three possible name kinds for Objective-C reflect an optimization within the DeclarationName class: both zero- and one-argument selectors are stored as a masked IdentifierInfo pointer, and therefore require very little space, since zero- and one-argument selectors are far more common than multi-argument selectors (which use a different structure).

CXXConstructorName

The name is a C++ constructor name. Use N.getCXXNameType() to retrieve the type that this constructor is meant to construct. The type is always the canonical type, since all constructors for a given type have the same name.

CXXDestructorName

The name is a C++ destructor name. Use N.getCXXNameType() to retrieve the type whose destructor is being named. This type is always a canonical type.

CXXConversionFunctionName

The name is a C++ conversion function. Conversion functions are named according to the type they convert to, e.g., “operator void const *”. Use N.getCXXNameType() to retrieve the type that this conversion function converts to. This type is always a canonical type.

CXXOperatorName

The name is a C++ overloaded operator name. Overloaded operators are named according to their spelling, e.g., “operator+” or “operator new[]”. Use N.getCXXOverloadedOperator() to retrieve the overloaded operator (a value of type OverloadedOperatorKind).

CXXLiteralOperatorName

The name is a C++11 user-defined literal operator. User-defined literal operators are named according to the suffix they define, e.g., “_foo” for “operator ""_foo”. Use N.getCXXLiteralIdentifier() to retrieve the corresponding IdentifierInfo* pointing to the identifier.

CXXUsingDirective

The name is a C++ using directive. Using directives are not really NamedDecls, in that they all have the same name, but they are implemented as such in order to store them in DeclContext effectively.

DeclarationNames are cheap to create, copy, and compare. They require only a single pointer’s worth of storage in the common cases (identifiers, zero- and one-argument Objective-C selectors) and use dense, uniqued storage for the other kinds of names. Two DeclarationNames can be compared for equality (==, !=) using a simple bitwise comparison, can be ordered with <, >, <=, and >= (which provide a lexicographical ordering for normal identifiers but an unspecified ordering for other kinds of names), and can be placed into LLVM DenseMaps and DenseSets.

DeclarationName instances can be created in different ways depending on what kind of name the instance will store. Normal identifiers (IdentifierInfo pointers) and Objective-C selectors (Selector) can be implicitly converted to DeclarationNames. Names for C++ constructors, destructors, conversion functions, and overloaded operators can be retrieved from the DeclarationNameTable, an instance of which is available as ASTContext::DeclarationNames. The member functions getCXXConstructorName, getCXXDestructorName, getCXXConversionFunctionName, and getCXXOperatorName, respectively, return DeclarationName instances for the four kinds of C++ special function names.

Declaration contexts

Every declaration in a program exists within some declaration context, such as a translation unit, namespace, class, or function. Declaration contexts in Clang are represented by the DeclContext class, from which the various declaration-context AST nodes (TranslationUnitDecl, NamespaceDecl, RecordDecl, FunctionDecl, etc.) will derive. The DeclContext class provides several facilities common to each declaration context:

Source-centric vs. Semantics-centric View of Declarations

DeclContext provides two views of the declarations stored within a declaration context. The source-centric view accurately represents the program source code as written, including multiple declarations of entities where present (see the section Redeclarations and Overloads), while the semantics-centric view represents the program semantics. The two views are kept synchronized by semantic analysis while the ASTs are being constructed.

Storage of declarations within that context

Every declaration context can contain some number of declarations. For example, a C++ class (represented by RecordDecl) contains various member functions, fields, nested types, and so on. All of these declarations will be stored within the DeclContext, and one can iterate over the declarations via [DeclContext::decls_begin(), DeclContext::decls_end()). This mechanism provides the source-centric view of declarations in the context.

Lookup of declarations within that context

The DeclContext structure provides efficient name lookup for names within that declaration context. For example, if N is a namespace we can look for the name N::f using DeclContext::lookup. The lookup itself is based on a lazily-constructed array (for declaration contexts with a small number of declarations) or hash table (for declaration contexts with more declarations). The lookup operation provides the semantics-centric view of the declarations in the context.

Ownership of declarations

The DeclContext owns all of the declarations that were declared within its declaration context, and is responsible for the management of their memory as well as their (de-)serialization.

All declarations are stored within a declaration context, and one can query information about the context in which each declaration lives. One can retrieve the DeclContext that contains a particular Decl using Decl::getDeclContext. However, see the section Lexical and Semantic Contexts for more information about how to interpret this context information.

Redeclarations and Overloads

Within a translation unit, it is common for an entity to be declared several times. For example, we might declare a function “f” and then later re-declare it as part of an inlined definition:

void f(int x, int y, int z = 1);

inline void f(int x, int y, int z) { /* ...  */ }

The representation of “f” differs in the source-centric and semantics-centric views of a declaration context. In the source-centric view, all redeclarations will be present, in the order they occurred in the source code, making this view suitable for clients that wish to see the structure of the source code. In the semantics-centric view, only the most recent “f” will be found by the lookup, since it effectively replaces the first declaration of “f”.

(Note that because f can be redeclared at block scope, or in a friend declaration, etc. it is possible that the declaration of f found by name lookup will not be the most recent one.)

In the semantics-centric view, overloading of functions is represented explicitly. For example, given two declarations of a function “g” that are overloaded, e.g.,

void g();
void g(int);

the DeclContext::lookup operation will return a DeclContext::lookup_result that contains a range of iterators over declarations of “g”. Clients that perform semantic analysis on a program that is not concerned with the actual source code will primarily use this semantics-centric view.
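
The two views can be modeled with a short self-contained sketch. The Decl and DeclContext types below are hypothetical miniatures, not Clang's classes. Redeclarations are detected by comparing a signature string, which is an assumption made to keep the example small, and the lookup table is built eagerly rather than lazily.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical miniature declaration: a name plus a signature string.
struct Decl {
  std::string Name;
  std::string Signature;
};

class DeclContext {
  std::vector<const Decl *> Decls;                         // source-centric view
  std::map<std::string, std::vector<const Decl *>> Lookup; // semantics-centric view

public:
  void addDecl(const Decl *D) {
    Decls.push_back(D); // every declaration is kept, in source order
    auto &Found = Lookup[D->Name];
    // A redeclaration (same name and signature) replaces the earlier entry;
    // a different signature is an overload and is kept alongside it.
    for (const Decl *&Old : Found)
      if (Old->Signature == D->Signature) {
        Old = D;
        return;
      }
    Found.push_back(D);
  }

  const std::vector<const Decl *> &decls() const { return Decls; }

  std::vector<const Decl *> lookup(const std::string &Name) const {
    auto It = Lookup.find(Name);
    return It == Lookup.end() ? std::vector<const Decl *>() : It->second;
  }
};
```

With two declarations of “f” and two overloads of “g” added in order, decls() reports four entries, lookup of “f” yields only the later declaration, and lookup of “g” yields both overloads.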

Lexical and Semantic Contexts

Each declaration has two potentially different declaration contexts: a lexical context, which corresponds to the source-centric view of the declaration context, and a semantic context, which corresponds to the semantics-centric view. The lexical context is accessible via Decl::getLexicalDeclContext while the semantic context is accessible via Decl::getDeclContext, both of which return DeclContext pointers. For most declarations, the two contexts are identical. For example:

class X {
public:
  void f(int x);
};

Here, the semantic and lexical contexts of X::f are the DeclContext associated with the class X (itself stored as a RecordDecl AST node). However, we can now define X::f out-of-line:

void X::f(int x = 17) { /* ...  */ }

This definition of “f” has different lexical and semantic contexts. The lexical context corresponds to the declaration context in which the actual declaration occurred in the source code, e.g., the translation unit containing X. Thus, this declaration of X::f can be found by traversing the declarations provided by [decls_begin(), decls_end()) in the translation unit.

The semantic context of X::f corresponds to the class X, since this member function is (semantically) a member of X. Lookup of the name f into the DeclContext associated with X will then return the definition of X::f (including information about the default argument).

Transparent Declaration Contexts

In C and C++, there are several contexts in which names that are logically declared inside another declaration will actually “leak” out into the enclosing scope from the perspective of name lookup. The most obvious instance of this behavior is in enumeration types, e.g.,

enum Color {
  Red,
  Green,
  Blue
};

Here, Color is an enumeration, which is a declaration context that contains the enumerators Red, Green, and Blue. Thus, traversing the list of declarations contained in the enumeration Color will yield Red, Green, and Blue. However, outside of the scope of Color one can name the enumerator Red without qualifying the name, e.g.,

Color c = Red;

There are other entities in C++ that provide similar behavior. For example, linkage specifications that use curly braces:

extern "C" {
  void f(int);
  void g(int);
}
// f and g are visible here

For source-level accuracy, we treat the linkage specification and enumeration type as a declaration context in which its enclosed declarations (“Red”, “Green”, and “Blue”; “f” and “g”) are declared. However, these declarations are visible outside of the scope of the declaration context.

These language features (and several others, described below) have roughly the same set of requirements: declarations are declared within a particular lexical context, but the declarations are also found via name lookup in scopes enclosing the declaration itself. This feature is implemented via transparent declaration contexts (see DeclContext::isTransparentContext()), whose declarations are visible in the nearest enclosing non-transparent declaration context. This means that the lexical context of the declaration (e.g., an enumerator) will be the transparent DeclContext itself, as will the semantic context, but the declaration will be visible in every outer context up to and including the first non-transparent declaration context (since transparent declaration contexts can be nested).

The transparent DeclContexts are:

  • Enumerations (but not C++11 “scoped enumerations”):

    enum Color {
      Red,
      Green,
      Blue
    };
    // Red, Green, and Blue are in scope

  • C++ linkage specifications:

    extern "C" {
      void f(int);
      void g(int);
    }
    // f and g are in scope

  • Anonymous unions and structs:

    struct LookupTable {
      bool IsVector;
      union {
        std::vector<Item> *Vector;
        std::set<Item> *Set;
      };
    };

    LookupTable LT;
    LT.Vector = 0; // Okay: finds Vector inside the unnamed union

  • C++11 inline namespaces:

    namespace mylib {
      inline namespace debug {
        class X;
      }
    }
    mylib::X *xp; // okay: mylib::X refers to mylib::debug::X
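
The visibility rule for transparent contexts can be sketched in a few lines. The Context struct and makeVisible function are hypothetical names used only for illustration; the sketch models just the rule that a declaration becomes visible in every enclosing context up to and including the first non-transparent one.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical miniature context; only the visibility rule is modeled.
struct Context {
  Context *Parent;
  bool Transparent;                   // e.g. an unscoped enum or extern "C" block
  std::map<std::string, int> Visible; // name -> some declaration id
};

// Makes a declaration visible in its own context and, while the current
// context is transparent, in each enclosing context up to and including
// the first non-transparent one.
inline void makeVisible(Context &DC, const std::string &Name, int Id) {
  Context *C = &DC;
  while (true) {
    C->Visible[Name] = Id;
    if (!C->Transparent || !C->Parent)
      break;
    C = C->Parent;
  }
}
```

For an enumeration nested in a linkage specification nested in a namespace, an enumerator becomes visible in all three contexts but not in the translation unit that encloses the namespace.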

Multiply-Defined Declaration Contexts

C++ namespaces have the interesting property that the namespace can be defined multiple times, and the declarations provided by each namespace definition are effectively merged (from the semantic point of view). For example, the following two code snippets are semantically indistinguishable:

// Snippet #1:
namespace N {
  void f();
}
namespace N {
  void f(int);
}

// Snippet #2:
namespace N {
  void f();
  void f(int);
}

In Clang’s representation, the source-centric view of declaration contexts will actually have two separate NamespaceDecl nodes in Snippet #1, each of which is a declaration context that contains a single declaration of “f”. However, the semantics-centric view provided by name lookup into the namespace N for “f” will return a DeclContext::lookup_result that contains a range of iterators over declarations of “f”.

DeclContext manages multiply-defined declaration contexts internally. The function DeclContext::getPrimaryContext retrieves the “primary” context for a given DeclContext instance, which is the DeclContext responsible for maintaining the lookup table used for the semantics-centric view. Given a DeclContext, one can obtain the set of declaration contexts that are semantically connected to this declaration context, in source order, including this context (which will be the only result, for non-namespace contexts) via DeclContext::collectAllContexts. Note that these functions are used internally within the lookup and insertion methods of the DeclContext, so the vast majority of clients can ignore them.

Because the same entity can be defined multiple times in different modules, it is also possible for there to be multiple definitions of (for instance) a CXXRecordDecl, all of which describe a definition of the same class. In such a case, only one of those “definitions” is considered by Clang to be the definition of the class, and the others are treated as non-defining declarations that happen to also contain member declarations. Corresponding members in each definition of such multiply-defined classes are identified either by redeclaration chains (if the members are Redeclarable) or by simply a pointer to the canonical declaration (if the declarations are not Redeclarable; in that case, a Mergeable base class is used instead).

Error Handling

Clang produces an AST even when the code contains errors. Clang won’t generate and optimize code for it, but it’s used as parsing continues to detect further errors in the input. Clang-based tools also depend on such ASTs, and IDEs in particular benefit from a high-quality AST for broken code.

In the presence of errors, clang uses a few error-recovery strategies to present the broken code in the AST:

  • correcting errors: in cases where clang is confident about the fix, it provides a FixIt attached to the error diagnostic and emits a corrected AST (reflecting the written code with FixIts applied). The advantage of that is to provide more accurate subsequent diagnostics. Typo correction is a typical example.

  • representing invalid node: the invalid node is preserved in the AST in some form, e.g. when the “declaration” part of the declaration contains semantic errors, the Decl node is marked as invalid.

  • dropping invalid node: this often happens for errors for which we don’t have graceful recovery. Prior to Recovery AST, a mismatched-argument function call expression was dropped though a CallExpr was created for semantic analysis.

With these strategies, clang surfaces better diagnostics, and provides AST consumers a rich AST reflecting the written source code as much as possible even for broken code.

Recovery AST

The idea of Recovery AST is to use recovery nodes which act as a placeholder to maintain the rough structure of the parsing tree, preserve locations and children but have no language semantics attached to them.

For example, consider the following mismatched function call:

int NoArg();
void test(int abc) {
  NoArg(abc); // oops, mismatched function arguments.
}

Without Recovery AST, the invalid function call expression (and its child expressions) would be dropped in the AST:

|-FunctionDecl <line:1:1, col:11> NoArg 'int ()'
`-FunctionDecl <line:2:1, line:4:1> test 'void (int)'
 |-ParmVarDecl <col:11, col:15> col:15 used abc 'int'
 `-CompoundStmt <col:20, line:4:1>

With Recovery AST, the AST looks like:

|-FunctionDecl <line:1:1, col:11> NoArg 'int ()'
`-FunctionDecl <line:2:1, line:4:1> test 'void (int)'
  |-ParmVarDecl <col:11, col:15> used abc 'int'
  `-CompoundStmt <col:20, line:4:1>
    `-RecoveryExpr <line:3:3, col:12> 'int' contains-errors
      |-UnresolvedLookupExpr <col:3> '<overloaded function type>' lvalue (ADL) = 'NoArg'
      `-DeclRefExpr <col:9> 'int' lvalue ParmVar 'abc' 'int'

An alternative is to use existing Exprs, e.g. CallExpr for the above example. This would capture more call details (e.g. locations of parentheses) and allow it to be treated uniformly with valid CallExprs. However, jamming the data we have into CallExpr forces us to weaken its invariants, e.g. arg count may be wrong. This would introduce a huge burden on consumers of the AST to handle such “impossible” cases. So when we’re representing (rather than correcting) errors, we use a distinct recovery node type with extremely weak invariants instead.

RecoveryExpr is the only recovery node so far. In practice, broken decls need more detailed semantics preserved (the current Invalid flag works fairly well), and completely broken statements with interesting internal structure are rare (so dropping the statements is OK).

Types and dependence

RecoveryExpr is an Expr, so it must have a type. In many cases the true type can’t really be known until the code is corrected (e.g. a call to a function that doesn’t exist). And it means that we can’t properly perform type checks on some containing constructs, such as return 42 + unknownFunction().

To model this, we generalize the concept of dependence from C++ templates to mean dependence on a template parameter or how an error is repaired. The RecoveryExpr unknownFunction() has the totally unknown type DependentTy, and this suppresses type-based analysis in the same way it would inside a template.

In cases where we are confident about the concrete type (e.g. the return type for a broken non-overloaded function call), the RecoveryExpr will have this type. This allows more code to be typechecked, and produces a better AST and more diagnostics. For example:

unknownFunction().size() // .size() is a CXXDependentScopeMemberExpr
std::string(42).size() // .size() is a resolved MemberExpr

Whether or not the RecoveryExpr has a dependent type, it is always considered value-dependent, because its value isn’t well-defined until the error is resolved. Among other things, this means that clang doesn’t emit more errors where a RecoveryExpr is used as a constant (e.g. array size), but also won’t try to evaluate it.

ContainsErrors bit

Beyond the template dependence bits, we add a new “ContainsErrors” bit to express the “does this expression or anything within it contain errors” semantic. This bit is always set for RecoveryExpr, and propagated to other related nodes. This provides a fast way to query whether any (recursive) child of an expression had an error, which is often used to improve diagnostics.

// C++
void recoveryExpr(int abc) {
  unknownFunction(); // type-dependent, value-dependent, contains-errors

  std::string(42).size(); // value-dependent, contains-errors,
                          // not type-dependent, as we know the type is std::string
}

// C
void recoveryExpr(int abc) {
  unknownVar + abc; // type-dependent, value-dependent, contains-errors
}
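
How the three properties relate can be sketched with a toy expression node. Everything here is hypothetical (the Expr struct, its fields, and makeRecovery are not Clang's API). The sketch recomputes contains-errors by walking the children on each query, whereas the text above describes a stored bit, and it deliberately does not model how type- and value-dependence propagate to parent nodes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical miniature expression node; not Clang's Expr.
struct Expr {
  bool TypeDependent = false;
  bool ValueDependent = false;
  bool Recovery = false; // true for the placeholder node itself
  std::vector<const Expr *> Children;

  // True if this node or any (recursive) child is a recovery node.
  bool containsErrors() const {
    if (Recovery)
      return true;
    for (const Expr *C : Children)
      if (C->containsErrors())
        return true;
    return false;
  }
};

// A recovery node is always value-dependent; it is type-dependent only when
// no concrete type could be determined for it.
inline Expr makeRecovery(bool TypeKnown,
                         std::vector<const Expr *> Children = {}) {
  Expr E;
  E.TypeDependent = !TypeKnown;
  E.ValueDependent = true;
  E.Recovery = true;
  E.Children = std::move(Children);
  return E;
}
```

In this model makeRecovery(false) corresponds to unknownFunction() and makeRecovery(true) to std::string(42), and any expression that has either of them as a descendant reports that it contains errors.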

The ASTImporter

The ASTImporter class imports nodes of an ASTContext into another ASTContext. Please refer to the document ASTImporter: Merging Clang ASTs for an introduction. Please also read through the high-level description of the import algorithm; this is essential for understanding further implementation details of the importer.

Abstract Syntax Graph

Despite the name, the Clang AST is not a tree. It is a directed graph with cycles. One example of a cycle is the connection between a ClassTemplateDecl and its “templated” CXXRecordDecl. The templated CXXRecordDecl represents all the fields and methods inside the class template, while the ClassTemplateDecl holds the information which is related to being a template, i.e. template arguments, etc. We can get the templated class (the CXXRecordDecl) of a ClassTemplateDecl with ClassTemplateDecl::getTemplatedDecl(). And we can get back a pointer of the “described” class template from the templated class: CXXRecordDecl::getDescribedTemplate(). So, this is a cycle between two nodes: between the templated and the described node. There may be various other kinds of cycles in the AST especially in case of declarations.

Structural Equivalency

Importing one AST node copies that node into the destination ASTContext. To copy one node means that we create a new node in the “to” context then we set its properties to be equal to the properties of the source node. Before the copy, we make sure that the source node is not structurally equivalent to any existing node in the destination context. If it happens to be equivalent then we skip the copy.

The informal definition of structural equivalency is the following: Two nodes are structurally equivalent if they are

  • builtin types and refer to the same type, e.g. int and int are structurally equivalent,

  • function types and all their parameters have structurally equivalent types,

  • record types and all their fields in order of their definition have the same identifier names and structurally equivalent types,

  • variable or function declarations and they have the same identifier name and their types are structurally equivalent.

In C, two types are structurally equivalent if they are compatible types. For a formal definition of compatible types, please refer to 6.2.7/1 in the C11 standard. However, there is no definition for compatible types in the C++ standard. Still, we extend the definition of structural equivalency to templates and their instantiations similarly: besides checking the previously mentioned properties, we have to check for equivalent template parameters/arguments, etc.

The structural equivalence check can be and is used independently from the ASTImporter, e.g. the clang::Sema class also uses it.

The equivalence of nodes may depend on the equivalency of other pairs of nodes. Thus, the check is implemented as a parallel graph traversal. We traverse through the nodes of both graphs at the same time. The actual implementation is similar to breadth-first-search. Let’s say we start the traverse with the <A,B> pair of nodes. Whenever the traversal reaches a pair <X,Y> then the following statements are true:

  • A and X are nodes from the same ASTContext.

  • B and Y are nodes from the same ASTContext.

  • A and B may or may not be from the same ASTContext.

  • if A == X and B == Y (pointer equivalency) then (there is a cycle during the traverse)

    • A and B are structurally equivalent if and only if

      • All dependent nodes on the path from <A,B> to <X,Y> are structurally equivalent.

When we compare two classes or enums and one of them is incomplete or has unloaded external lexical declarations then we cannot descend to compare their contained declarations. So in these cases they are considered equal if they have the same names. This is how we compare forward declarations with definitions.
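
The parallel traversal can be illustrated on a generic labeled graph. This is a minimal sketch under stated assumptions: Node and structurallyEquivalent are hypothetical names, nodes compare by a label and by the number and order of their edges, and a pair of nodes that has already been queued is not compared again, which is what lets the traversal terminate on cycles.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical graph node: a label plus outgoing edges (cycles are allowed).
struct Node {
  std::string Label;
  std::vector<const Node *> Edges;
};

// Walks both graphs in lockstep, breadth-first. A pair that was queued before
// is skipped, so a cycle back to an earlier pair does not cause a failure or
// an endless loop: the result depends only on the other pairs that are reached.
inline bool structurallyEquivalent(const Node *A, const Node *B) {
  using Pair = std::pair<const Node *, const Node *>;
  std::set<Pair> Seen;
  std::deque<Pair> Work;
  Seen.insert(Pair(A, B));
  Work.push_back(Pair(A, B));
  while (!Work.empty()) {
    Pair P = Work.front();
    Work.pop_front();
    const Node *X = P.first;
    const Node *Y = P.second;
    if (X->Label != Y->Label || X->Edges.size() != Y->Edges.size())
      return false;
    for (std::size_t I = 0; I < X->Edges.size(); ++I) {
      Pair Next(X->Edges[I], Y->Edges[I]);
      if (Seen.insert(Next).second) // first time this pair is reached
        Work.push_back(Next);
    }
  }
  return true;
}
```

Two self-referential records with matching members compare as equivalent, while changing the label of one member makes the comparison fail.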

Redeclaration Chains

The early version of the ASTImporter’s merge mechanism squashed the declarations, i.e. it aimed to have only one declaration instead of maintaining a whole redeclaration chain. This early approach simply skipped importing a function prototype, but it imported a definition. To demonstrate the problem with this approach let’s consider an empty “to” context and the following virtual function declarations of f in the “from” context:

struct B { virtual void f(); };
void B::f() {} // <-- let's import this definition

If we imported the definition with the “squashing” approach then we would end up having one declaration which is indeed a definition, but isVirtual() returns false for it. The reason is that the definition is indeed not virtual, it is the property of the prototype!

Consequently, we must either set the virtual flag for the definition (but then we create a malformed AST which the parser would never create), or we import the whole redeclaration chain of the function. The most recent version of the ASTImporter uses the latter mechanism. We do import all function declarations, regardless of whether they are definitions or prototypes, in the order in which they appear in the “from” context.

If we have an existing definition in the “to” context, then we cannot import another definition, we will use the existing definition. However, we can import prototype(s): we chain the newly imported prototype(s) to the existing definition. Whenever we import a new prototype from a third context, that will be added to the end of the redeclaration chain. This may result in long redeclaration chains in certain cases, e.g. if we import from several translation units which include the same header with the prototype.

To mitigate the problem of long redeclaration chains of free functions, we could compare prototypes to see if they have the same properties and if yes then we could merge these prototypes. The implementation of squashing of prototypes for free functions is future work.

Chaining functions this way ensures that we do copy all information from the source AST. Nonetheless, there is a problem with member functions: While we can have many prototypes for free functions, we must have only one prototype for a member function.

void f(); // OK
void f(); // OK

struct X {
  void f(); // OK
  void f(); // ERROR
};
void X::f() {} // OK

Thus, prototypes of member functions must be squashed, we cannot just simply attach a new prototype to the existing in-class prototype. Consider the following contexts:

// "to" context
struct X {
  void f(); // D0
};

// "from" context
struct X {
  void f(); // D1
};
void X::f() {} // D2

When we import the prototype and the definition of f from the “from” context, then the resulting redecl chain will look like this: D0 -> D2', where D2' is the copy of D2 in the “to” context.

Generally speaking, when we import declarations (like enums and classes) we do attach the newly imported declaration to the existing redeclaration chain (if there is structural equivalency). We do not import, however, the whole redeclaration chain as we do in case of functions. Up till now, we haven’t found any essential property of forward declarations which is similar to the case of the virtual flag in a member function prototype. In the future, this may change, though.

Traversal during the Import

The node specific import mechanisms are implemented in ASTNodeImporter::VisitNode() functions, e.g. VisitFunctionDecl(). When we import a declaration then first we import everything which is needed to call the constructor of that declaration node. Everything which can be set later is set after the node is created. For example, in case of a FunctionDecl we first import the declaration context in which the function is declared, then we create the FunctionDecl and only then we import the body of the function. This means there are implicit dependencies between AST nodes. These dependencies determine the order in which we visit nodes in the “from” context. As with the regular graph traversal algorithms like DFS, we keep track which nodes we have already visited in ASTImporter::ImportedDecls. Whenever we create a node then we immediately add that to the ImportedDecls. We must not start the import of any other declarations before we keep track of the newly created one. This is essential, otherwise, we would not be able to handle circular dependencies. To enforce this, we wrap all constructor calls of all AST nodes in GetImportedOrCreateDecl(). This wrapper ensures that all newly created declarations are immediately marked as imported; also, if a declaration is already marked as imported then we just return its counterpart in the “to” context. Consequently, calling a declaration’s ::Create() function directly would lead to errors, please don’t do that!

Even with the use of GetImportedOrCreateDecl() there is still a possibility of having an infinite import recursion if things are imported from each other in the wrong way. Imagine that during the import of A, the import of B is requested before we could create the node for A (the constructor needs a reference to B). And the same could be true for the import of B (A is requested to be imported before we could create the node for B). In the case of the templated-described swing we take extra care to break the cyclical dependency: we import and set the described template only after the CXXRecordDecl is created. As a best practice, before creating the node in the “to” context, avoid importing other nodes which are not needed for the constructor of node A.
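
The register-before-recursing discipline can be sketched with a toy importer. The Decl and Importer types and the getImportedOrCreate helper are hypothetical and only mirror the idea of the wrapper described above. The key point is that the new node is entered into the map before any of its dependencies are imported, so a cycle back to it finds the existing counterpart instead of recursing forever.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical miniature declaration with references to other declarations.
struct Decl {
  std::string Name;
  std::vector<Decl *> Refs; // cycles are allowed
};

class Importer {
  std::map<const Decl *, Decl *> ImportedDecls; // "from" node -> "to" node
  std::vector<std::unique_ptr<Decl>> ToContext; // owns the copies

  // Returns the existing counterpart, or creates one and registers it
  // immediately, before anything else is imported.
  Decl *getImportedOrCreate(const Decl *From, bool &Created) {
    auto It = ImportedDecls.find(From);
    if (It != ImportedDecls.end()) {
      Created = false;
      return It->second;
    }
    ToContext.push_back(std::make_unique<Decl>());
    Decl *To = ToContext.back().get();
    To->Name = From->Name;
    ImportedDecls[From] = To;
    Created = true;
    return To;
  }

public:
  Decl *import(const Decl *From) {
    bool Created = false;
    Decl *To = getImportedOrCreate(From, Created);
    if (!Created)
      return To; // already imported, or import in progress further up the stack
    // Everything that can be set later is imported after the node exists.
    for (const Decl *R : From->Refs)
      To->Refs.push_back(import(R));
    return To;
  }

  std::size_t size() const { return ToContext.size(); }
};
```

Importing one of two mutually referring declarations creates exactly two copies, and the copies refer to each other rather than to the originals.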

Error Handling

Every import function returns either an llvm::Error or an llvm::Expected<T> object. This forces callers to check the return value of the import functions. If there was an error during one import then we return with that error. (Exception: when we import the members of a class, we collect the individual errors with each member and we concatenate them in one Error object.) We cache these errors in cases of declarations. During the next import call if there is an existing error we just return with that. So, clients of the library receive an Error object, which they must check.

During import of a specific declaration, it may happen that some AST nodes had already been created before we recognize an error. In this case, we signal back the error to the caller, but the “to” context remains polluted with those nodes which had been created. Ideally, those nodes should not have been created, but at that time we did not know about the error; the error happened later. Since the AST is immutable (in most cases we can’t remove existing nodes) we choose to mark these nodes as erroneous.

We cache the errors associated with declarations in the “from” context in ASTImporter::ImportDeclErrors and the ones which are associated with the “to” context in ASTImporterSharedState::ImportErrors. Note that there may be several ASTImporter objects which import into the same “to” context but from different “from” contexts; in this case, they have to share the associated errors of the “to” context.

When an error happens, it propagates through the call stack, through all the dependent nodes. However, in case of dependency cycles, this is not enough, because we strive to mark the erroneous nodes so clients can act upon them. In those cases, we have to keep track of the errors for those nodes which are intermediate nodes of a cycle.

An import path is the list of the AST nodes which we visit during an Import call. If node A depends on node B then the path contains an A->B edge. From the call stack of the import functions, we can read the very same path.

Now imagine the following AST, where the -> represents dependency in terms of the import (all nodes are declarations).

A->B->C->D
   `->E

We would like to import A. The import behaves like a DFS, so we will visit the nodes in this order: ABCDE. During the visitation we will have the following import paths:

A
AB
ABC
ABCD
ABC
AB
ABE
AB
A

If there is an error during the visit of E, then we set an error for E, then, as the call stack shrinks, for B, then for A:

A
AB
ABC
ABCD
ABC
AB
ABE // Error! Set an error to E
AB  // Set an error to B
A   // Set an error to A

However, during the import we could import C and D without any error and they are independent of A, B and E. We must not set up an error for C and D. So, at the end of the import we have an entry in ImportDeclErrors for A, B, E but not for C, D.

Now, what happens if there is a cycle in the import path? Let’s consider this AST:

A->B->C->A
   `->E

During the visitation, we will have the import paths below, and if there is an error during the visit of E then we will set up an error for E, B, A. But what about C?

A
AB
ABC
ABCA
ABC
AB
ABE // Error! Set an error to E
AB  // Set an error to B
A   // Set an error to A

This time we know that both B and C are dependent on A. This means we must set up an error for C too. As the call stack unwinds we get back to A and we must set up an error for all nodes which depend on A (this includes C). But C is no longer on the import path; it was only on it previously. Such a situation can happen only if we had a cycle during the visitation. If we didn’t have any cycle, then the normal way of passing an Error object through the call stack could handle the situation. This is why we must track cycles during the import process for each visited declaration.
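The two scenarios above can be replayed with a small simulation. This is a simplified sketch and not the actual ASTImporter implementation: nodes are single characters, and ToyImporter, Failing and CycleDependents are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

struct ToyImporter {
  std::map<char, std::vector<char>> Deps;         // import dependencies of each node
  std::set<char> Failing;                         // nodes whose own import fails
  std::set<char> Errors;                          // toy analogue of ImportDeclErrors
  std::set<char> Done;                            // nodes imported without error
  std::vector<char> Path;                         // current import path
  std::map<char, std::set<char>> CycleDependents; // cycle head -> nodes that reached it

  // Returns true on success and false if an error was recorded for N.
  bool import(char N) {
    if (Errors.count(N))
      return false; // cached error
    if (Done.count(N))
      return true;

    // N is already on the import path: we closed a cycle. Remember every
    // node after N on the path, because each of them depends on N.
    auto It = std::find(Path.begin(), Path.end(), N);
    if (It != Path.end()) {
      for (auto J = It + 1; J != Path.end(); ++J)
        CycleDependents[N].insert(*J);
      return true; // N is still being imported higher up the call stack
    }

    Path.push_back(N);
    bool Ok = Failing.count(N) == 0;
    if (Ok)
      for (char D : Deps[N])
        if (!import(D)) {
          Ok = false;
          break;
        }
    Path.pop_back();

    if (Ok) {
      Done.insert(N);
      return true;
    }
    // Mark N and everything that depended on N through a cycle.
    Errors.insert(N);
    for (char D : CycleDependents[N]) {
      Errors.insert(D);
      Done.erase(D);
    }
    return false;
  }
};
```

With the acyclic graph and a failing E, the sketch records errors for A, B and E only. With the extra C->A edge, C is added as well when the error reaches A, even though C has already left the import path.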

Lookup Problems

When we import a declaration from the source context, we check whether we already have a structurally equivalent node with the same name in the “to” context. If the “from” node is a definition and the found one is also a definition, then we do not create a new node; instead, we mark the found node as the imported node. If the found definition and the one we want to import have the same name but are not structurally equivalent, then we have an ODR violation in the case of C++. If the “from” node is not a definition then we add it to the redeclaration chain of the found node. This behaviour is essential when we merge ASTs from different translation units which include the same header file(s). For example, we want to have only one definition for the class template std::vector, even if we included <vector> in several translation units.

To find a structurally equivalent node we can use the regular C/C++ lookup functions: DeclContext::noload_lookup() and DeclContext::localUncachedLookup(). These functions respect the C/C++ name hiding rules, thus you cannot find certain declarations in a given declaration context. For instance, unnamed declarations (anonymous structs), non-first friend declarations and template specializations are hidden. This is a problem, because if we use the regular C/C++ lookup then we create redundant AST nodes during the merge! Also, having two instances of the same node could result in false structural in-equivalencies of other nodes which depend on the duplicated node. For these reasons, we created a lookup class whose sole purpose is to register all declarations, so that they can later be looked up by subsequent import requests. This is the ASTImporterLookupTable class. This lookup table should be shared amongst the different ASTImporter instances if they happen to import to the very same “to” context. This is why we can use the importer-specific lookup only via the ASTImporterSharedState class.

ExternalASTSource

The ExternalASTSource is an abstract interface associated with the ASTContext class. It provides the ability to read the declarations stored within a declaration context either for iteration or for name lookup. A declaration context with an external AST source may load its declarations on-demand. This means that the list of declarations (represented as a linked list, the head is DeclContext::FirstDecl) could be empty. However, member functions like DeclContext::lookup() may initiate a load.

Usually, external sources are associated with precompiled headers. For example, when we load a class from a PCH, the members are loaded only if we want to look up something in the class’ context.

In case of LLDB, an implementation of the ExternalASTSource interface is attached to the AST context which is related to the parsed expression. This implementation of the ExternalASTSource interface is realized with the help of the ASTImporter class. This way, LLDB can reuse Clang’s parsing machinery while synthesizing the underlying AST from the debug data (e.g. from DWARF). From the view of the ASTImporter this means both the “to” and the “from” context may have declaration contexts with external lexical storage. If a DeclContext in the “to” AST context has external lexical storage then we must take extra care to work only with the already loaded declarations! Otherwise, we would end up with an uncontrolled import process. For instance, if we used the regular DeclContext::lookup() to find the existing declarations in the “to” context then the lookup() call itself would initiate a new import while we are in the middle of importing a declaration! (By the time we initiate the lookup we haven’t yet registered that we already started to import the node of the “from” context.) This is why we use DeclContext::noload_lookup() instead.

Class Template Instantiations

Different translation units may have class template instantiations with the same template arguments, but with a different set of instantiated MethodDecls and FieldDecls. Consider the following files:

// x.h
template <typename T>
struct X {
    int a{0}; // FieldDecl with InitListExpr
    X(char) : a(3) {}     // (1)
    X(int) {}             // (2)
};

// foo.cpp
void foo() {
    // ClassTemplateSpec with ctor (1): FieldDecl without InitlistExpr
    X<char> xc('c');
}

// bar.cpp
void bar() {
    // ClassTemplateSpec with ctor (2): FieldDecl WITH InitlistExpr
    X<char> xc(1);
}

In foo.cpp we use the constructor with number (1), which explicitly initializes the member a to 3, thus the InitListExpr {0} is not used here and the AST node is not instantiated. However, in the case of bar.cpp we use the constructor with number (2), which does not explicitly initialize the a member, so the default InitListExpr is needed and thus instantiated. When we merge the AST of foo.cpp and bar.cpp we must create an AST node for the class template instantiation of X<char> which has all the required nodes. Therefore, when we find an existing ClassTemplateSpecializationDecl then we merge the fields of the ClassTemplateSpecializationDecl in the “from” context in a way that the InitListExpr is copied if not existent yet. The same merge mechanism should be done in the cases of instantiated default arguments and exception specifications of functions.
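The "copy the initializer only if it does not exist yet" rule can be illustrated with a toy model. This is a sketch under strong simplifying assumptions: a specialization is reduced to a map from field name to an optional initializer string, and ToySpecialization and mergeFieldInitializers are hypothetical names, not Clang APIs.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy stand-in for an instantiated class: field name -> optional initializer.
using ToySpecialization = std::map<std::string, std::optional<std::string>>;

// Merge From into To: a missing field is added, and a missing initializer
// is filled in. An initializer that already exists in To is left untouched.
void mergeFieldInitializers(ToySpecialization &To,
                            const ToySpecialization &From) {
  for (const auto &Entry : From) {
    auto It = To.find(Entry.first);
    if (It == To.end())
      To.emplace(Entry.first, Entry.second); // field not known in "to" yet
    else if (!It->second && Entry.second)
      It->second = Entry.second;             // copy the missing initializer
  }
}
```

In terms of the example above, the "to" context built from foo.cpp has field a without an initializer, the "from" context built from bar.cpp has it with {0}, and after the merge the "to" side carries {0} as well.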

Visibility of Declarations

During import of a global variable with external visibility, the lookup will find variables (with the same name) but with static visibility (linkage). Clearly, we cannot put them into the same redeclaration chain. The same is true in the case of functions. Also, we have to take care of other kinds of declarations like enums, classes, etc. if they are in anonymous namespaces. Therefore, we filter the lookup results and consider only those which have the same visibility as the declaration we currently import.

We consider two declarations in two anonymous namespaces to have the same visibility only if they are imported from the same AST context.
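A minimal sketch of such a filter follows. It encodes only the two rules stated above, and ToyVisibility, ToyDecl, ContextId and filterByVisibility are hypothetical names; the real importer works on Decl objects and their linkage.

```cpp
#include <cassert>
#include <vector>

enum class ToyVisibility { External, Static, AnonymousNamespace };

struct ToyDecl {
  ToyVisibility Vis;
  int ContextId; // identifies the AST context the declaration came from
};

// Two declarations match only if their visibility is the same; for
// anonymous namespaces they must also come from the same AST context.
bool hasSameVisibility(const ToyDecl &Found, const ToyDecl &From) {
  if (Found.Vis != From.Vis)
    return false;
  if (From.Vis == ToyVisibility::AnonymousNamespace)
    return Found.ContextId == From.ContextId;
  return true;
}

// Keep only the lookup results that may join From's redeclaration chain.
std::vector<ToyDecl> filterByVisibility(const std::vector<ToyDecl> &Found,
                                        const ToyDecl &From) {
  std::vector<ToyDecl> Result;
  for (const ToyDecl &D : Found)
    if (hasSameVisibility(D, From))
      Result.push_back(D);
  return Result;
}
```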

Strategies to Handle Conflicting Names

During the import we look up existing declarations with the same name. We filter the lookup results based on their visibility. If any of the found declarations are not structurally equivalent then we have run into a name conflict error (ODR violation in C++). In this case, we return an Error and we set up the Error object for the declaration. However, some clients of the ASTImporter may require a different, perhaps less conservative and more liberal, error handling strategy.

E.g. static analysis clients may benefit if the node is created even if there is a name conflict. During the CTU analysis of certain projects, we recognized that there are global declarations which collide with declarations from other translation units, but they are not referenced outside of their translation unit. Ideally, these declarations should be in an unnamed namespace. If we treat these collisions liberally then CTU analysis can find more results. Note that the ability to choose between name conflict handling strategies is still a work in progress.

The CFG class

The CFG class is designed to represent a source-level control-flow graph for a single statement (Stmt*). Typically instances of CFG are constructed for function bodies (usually an instance of CompoundStmt), but can also be instantiated to represent the control-flow of any class that subclasses Stmt, which includes simple expressions. Control-flow graphs are especially useful for performing flow- or path-sensitive program analyses on a given function.

Basic Blocks

Concretely, an instance of CFG is a collection of basic blocks. Each basic block is an instance of CFGBlock, which simply contains an ordered sequence of Stmt* (each referring to statements in the AST). The ordering of statements within a block indicates unconditional flow of control from one statement to the next. Conditional control-flow is represented using edges between basic blocks. The statements within a given CFGBlock can be traversed using the CFGBlock::*iterator interface.

A CFG object owns the instances of CFGBlock within the control-flow graph it represents. Each CFGBlock within a CFG is also uniquely numbered (accessible via CFGBlock::getBlockID()). Currently the number is based on the order in which the blocks were created, but no assumptions should be made on how CFGBlocks are numbered other than that their numbers are unique and that they are numbered from 0..N-1 (where N is the number of basic blocks in the CFG).

Entry and Exit Blocks

Each instance of CFG contains two special blocks: an entry block (accessible via CFG::getEntry()), which has no incoming edges, and an exit block (accessible via CFG::getExit()), which has no outgoing edges. Neither block contains any statements, and they serve the role of providing a clear entrance and exit for a body of code such as a function body. The presence of these empty blocks greatly simplifies the implementation of many analyses built on top of CFGs.

Conditional Control-Flow

Conditional control-flow (such as that induced by if-statements and loops) is represented as edges between CFGBlocks. Because different C language constructs can induce control-flow, each CFGBlock also records an extra Stmt* that represents the terminator of the block. A terminator is simply the statement that caused the control-flow, and is used to identify the nature of the conditional control-flow between blocks. For example, in the case of an if-statement, the terminator refers to the IfStmt object in the AST that represented the given branch.

To illustrate, consider the following code example:

int foo(int x) {
  x = x + 1;
  if (x > 2)
    x++;
  else {
    x += 2;
    x *= 2;
  }

  return x;
}

After invoking the parser+semantic analyzer on this code fragment, the AST of the body of foo is referenced by a single Stmt*. We can then construct an instance of CFG representing the control-flow graph of this function body by a single call to a static class method:

Stmt *FooBody = ...
std::unique_ptr<CFG> FooCFG = CFG::buildCFG(FooBody);

Along with providing an interface to iterate over its CFGBlocks, the CFG class also provides methods that are useful for debugging and visualizing CFGs. For example, the method CFG::dump() dumps a pretty-printed version of the CFG to standard error. This is especially useful when one is using a debugger such as gdb. For example, here is the output of FooCFG->dump():

[ B5 (ENTRY) ]
   Predecessors (0):
   Successors (1): B4

[ B4 ]
   1: x = x + 1
   2: (x > 2)
   T: if [B4.2]
   Predecessors (1): B5
   Successors (2): B3 B2

[ B3 ]
   1: x++
   Predecessors (1): B4
   Successors (1): B1

[ B2 ]
   1: x += 2
   2: x *= 2
   Predecessors (1): B4
   Successors (1): B1

[ B1 ]
   1: return x;
   Predecessors (2): B2 B3
   Successors (1): B0

[ B0 (EXIT) ]
   Predecessors (1): B1
   Successors (0):

For each block, the pretty-printed output displays the number of predecessor blocks (blocks that have outgoing control-flow to the given block) and successor blocks (blocks that have incoming control-flow from the given block). We can also clearly see the special entry and exit blocks at the beginning and end of the pretty-printed output. For the entry block (block B5), the number of predecessor blocks is 0, while for the exit block (block B0) the number of successor blocks is 0.

The most interesting block here is B4, whose outgoing control-flow represents the branching caused by the sole if-statement in foo. Of particular interest is the second statement in the block, (x > 2), and the terminator, printed as if [B4.2]. The second statement represents the evaluation of the condition of the if-statement, which occurs before the actual branching of control-flow. Within the CFGBlock for B4, the Stmt* for the second statement refers to the actual expression in the AST for (x > 2). Thus pointers to subclasses of Expr can appear in the list of statements in a block, and not just subclasses of Stmt that refer to proper C statements.

The terminator of block B4 is a pointer to the IfStmt object in the AST. The pretty-printer outputs if [B4.2] because the condition expression of the if-statement has an actual place in the basic block, and thus the terminator is essentially referring to the expression that is the second statement of block B4 (i.e., B4.2). In this manner, conditions for control-flow (which also includes conditions for loops and switch statements) are hoisted into the actual basic block.
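The shape of this graph can be mirrored in a small stand-alone data structure. The sketch below does not use the Clang CFG API: ToyBlock, ToyCFG and buildFooCFG are hypothetical names, and statements are stored as plain strings rather than Stmt* pointers. It hard-codes the six blocks of the dump and derives the predecessor lists from the successor edges.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ToyBlock {
  unsigned ID = 0;
  std::vector<std::string> Stmts; // statements in execution order
  std::string Terminator;         // empty when the block has no terminator
  std::vector<unsigned> Succs;    // IDs of successor blocks
  std::vector<unsigned> Preds;    // IDs of predecessor blocks
};

struct ToyCFG {
  std::vector<ToyBlock> Blocks; // indexed by block ID, numbered 0..N-1
  unsigned Entry = 0;
  unsigned Exit = 0;
};

// Hand-built graph for foo(), following the pretty-printed dump.
ToyCFG buildFooCFG() {
  ToyCFG G;
  G.Blocks.resize(6);
  for (unsigned I = 0; I < G.Blocks.size(); ++I)
    G.Blocks[I].ID = I;
  G.Entry = 5;
  G.Exit = 0;

  G.Blocks[5].Succs = {4};
  G.Blocks[4].Stmts = {"x = x + 1", "(x > 2)"};
  G.Blocks[4].Terminator = "if [B4.2]"; // refers to the hoisted condition
  G.Blocks[4].Succs = {3, 2};
  G.Blocks[3].Stmts = {"x++"};
  G.Blocks[3].Succs = {1};
  G.Blocks[2].Stmts = {"x += 2", "x *= 2"};
  G.Blocks[2].Succs = {1};
  G.Blocks[1].Stmts = {"return x;"};
  G.Blocks[1].Succs = {0};

  // Every successor edge B -> S makes B a predecessor of S.
  for (unsigned I = 0; I < G.Blocks.size(); ++I)
    for (unsigned S : G.Blocks[I].Succs)
      G.Blocks[S].Preds.push_back(I);
  return G;
}
```

The entry and exit blocks carry no statements here either, so an analysis written against this structure can start at Entry and stop at Exit without special cases for the first or last real block.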

Constant Folding in the Clang AST

There are several places where constants and constant folding matter a lot to the Clang front-end. First, in general, we prefer the AST to retain the source code as close to how the user wrote it as possible. This means that if they wrote “5+4”, we want to keep the addition and two constants in the AST; we don’t want to fold to “9”. This means that constant folding in various ways turns into a tree walk that needs to handle the various cases.

However, there are places in both C and C++ that require constants to be folded. For example, the C standard defines what an “integer constant expression” (i-c-e) is with very precise and specific requirements. The language then requires i-c-e’s in a lot of places (for example, the size of a bitfield, the value for a case statement, etc). For these, we have to be able to constant fold the constants, to do semantic checks (e.g., verify bitfield size is non-negative and that case statements aren’t duplicated). We aim for Clang to be very pedantic about this, diagnosing cases when the code does not use an i-c-e where one is required, but accepting the code unless running with -pedantic-errors.

Things get a little bit more tricky when it comes to compatibility with real-world source code. Specifically, GCC has historically accepted a huge superset of expressions as i-c-e’s, and a lot of real world code depends on this unfortunate accident of history (including, e.g., the glibc system headers). GCC accepts anything its “fold” optimizer is capable of reducing to an integer constant, which means that the definition of what it accepts changes as its optimizer does. One example is that GCC accepts things like “case X-X:” even when X is a variable, because it can fold this to 0.

Another issue is how constants interact with the extensions we support, such as __builtin_constant_p, __builtin_inf, __extension__ and many others. C99 obviously does not specify the semantics of any of these extensions, and the definition of i-c-e does not include them. However, these extensions are often used in real code, and we have to have a way to reason about them.

Finally, this is not just a problem for semantic analysis. The code generator and other clients have to be able to fold constants (e.g., to initialize global variables) and have to handle a superset of what C99 allows. Further, these clients can benefit from extended information. For example, we know that “foo() || 1” always evaluates to true, but we can’t replace the expression with true because it has side effects.

Implementation Approach

After trying several different approaches, we’ve finally converged on a design (Note, at the time of this writing, not all of this has been implemented, consider this a design goal!). Our basic approach is to define a single recursive evaluation method (Expr::Evaluate), which is implemented in AST/ExprConstant.cpp. Given an expression with “scalar” type (integer, fp, complex, or pointer) this method returns the following information:

  • Whether the expression is an integer constant expression, a general constant that was folded but has no side effects, a general constant that was folded but that does have side effects, or an uncomputable/unfoldable value.

  • If the expression was computable in any way, this method returns the APValue for the result of the expression.

  • If the expression is not evaluatable at all, this method returns information on one of the problems with the expression. This includes a SourceLocation for where the problem is, and a diagnostic ID that explains the problem. The diagnostic should have ERROR type.

  • If the expression is not an integer constant expression, this method returns information on one of the problems with the expression. This includes a SourceLocation for where the problem is, and a diagnostic ID that explains the problem. The diagnostic should have EXTENSION type.

This information gives various clients the flexibility that they want, and we will eventually have some helper methods for various extensions. For example, Sema should have a Sema::VerifyIntegerConstantExpression method, which calls Evaluate on the expression. If the expression is not foldable, the error is emitted, and it would return true. If the expression is not an i-c-e, the EXTENSION diagnostic is emitted. Finally it would return false to indicate that the AST is OK.

Other clients can use the information in other ways, for example, codegen can just use expressions that are foldable in any way.
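The four-way classification can be made concrete with a toy evaluator. This is only a sketch of the idea and not Expr::Evaluate: ToyExpr, FoldKind, FoldResult and the helper constructors are hypothetical, the expression language is limited to literals, variables, calls, subtraction and logical or, and the "X - X" rule imitates the GCC-style fold mentioned earlier.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>

// Ordered from most to least constrained, so std::max picks the weaker class.
enum class FoldKind { ICE, Folded, FoldedWithSideEffects, Unfoldable };

struct FoldResult {
  FoldKind Kind;
  long Value;
};

struct ToyExpr {
  enum Op { Lit, Var, Call, Sub, LOr } O;
  long Value;                    // used by Lit
  std::string Name;              // used by Var and Call
  std::shared_ptr<ToyExpr> L, R; // used by Sub and LOr
};
using ExprPtr = std::shared_ptr<ToyExpr>;

ExprPtr lit(long V) { return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::Lit, V, "", nullptr, nullptr}); }
ExprPtr var(const std::string &N) { return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::Var, 0, N, nullptr, nullptr}); }
ExprPtr call(const std::string &N) { return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::Call, 0, N, nullptr, nullptr}); }
ExprPtr sub(ExprPtr A, ExprPtr B) { return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::Sub, 0, "", A, B}); }
ExprPtr lor(ExprPtr A, ExprPtr B) { return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::LOr, 0, "", A, B}); }

// In this toy model only calls have side effects.
bool hasSideEffects(const ToyExpr &E) {
  if (E.O == ToyExpr::Call)
    return true;
  return (E.L && hasSideEffects(*E.L)) || (E.R && hasSideEffects(*E.R));
}

FoldResult evaluate(const ToyExpr &E) {
  if (E.O == ToyExpr::Lit)
    return {FoldKind::ICE, E.Value};
  if (E.O == ToyExpr::Var || E.O == ToyExpr::Call)
    return {FoldKind::Unfoldable, 0};

  assert(E.L && E.R && "binary operators need two operands");
  if (E.O == ToyExpr::Sub && E.L->O == ToyExpr::Var &&
      E.R->O == ToyExpr::Var && E.L->Name == E.R->Name)
    return {FoldKind::Folded, 0}; // X - X folds to 0 but is not an i-c-e

  FoldResult A = evaluate(*E.L), B = evaluate(*E.R);
  bool AKnown = A.Kind != FoldKind::Unfoldable;
  bool BKnown = B.Kind != FoldKind::Unfoldable;
  if (E.O == ToyExpr::Sub) {
    if (!AKnown || !BKnown)
      return {FoldKind::Unfoldable, 0};
    return {std::max(A.Kind, B.Kind), A.Value - B.Value};
  }
  // Logical or.
  if (AKnown && A.Value != 0)
    return {A.Kind, 1}; // the right operand is never evaluated
  if (AKnown && BKnown)
    return {std::max(A.Kind, B.Kind), B.Value != 0 ? 1 : 0};
  if (BKnown && B.Value != 0) // e.g. foo() || 1: value known, left side must still run
    return {hasSideEffects(*E.L) ? FoldKind::FoldedWithSideEffects : FoldKind::Folded, 1};
  return {FoldKind::Unfoldable, 0};
}
```

A codegen-like client of this sketch would accept anything except Unfoldable, while a checker for integer constant expressions would accept only ICE and treat the two Folded classes as extensions.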

Extensions

This section describes how some of the various extensions Clang supports interact with constant evaluation:

  • __extension__: The expression form of this extension causes any evaluatable subexpression to be accepted as an integer constant expression.

  • __builtin_constant_p: This returns true (as an integer constant expression) if the operand evaluates to either a numeric value (that is, not a pointer cast to integral type) of integral, enumeration, floating or complex type, or if it evaluates to the address of the first character of a string literal (possibly cast to some other type). As a special case, if __builtin_constant_p is the (potentially parenthesized) condition of a conditional operator expression (“?:”), only the true side of the conditional operator is considered, and it is evaluated with full constant folding.

  • __builtin_choose_expr: The condition is required to be an integer constant expression, but we accept any constant as an “extension of an extension”. This only evaluates one operand depending on which way the condition evaluates.

  • __builtin_classify_type: This always returns an integer constant expression.

  • __builtin_inf, nan, ...: These are treated just like a floating-point literal.

  • __builtin_abs, copysign, ...: These are constant folded as general constant expressions.

  • __builtin_strlen and strlen: These are constant folded as integer constant expressions if the argument is a string literal.
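A few of these builtins can be observed from user code. The snippet below is a small illustration that assumes a GCC-compatible compiler such as Clang or GCC in C++11 mode or later; the function names classifyInt and classifyDouble are made up for the example, and nothing here shows how Clang implements the folding internally.

```cpp
#include <cassert>

// Each static_assert compiles only if its operand is folded at compile time.
static_assert(__builtin_constant_p(42), "a numeric literal is a constant");
static_assert(__builtin_strlen("abc") == 3,
              "strlen of a string literal is folded");

// __builtin_inf() behaves like a floating-point literal.
constexpr double Inf = __builtin_inf();
static_assert(Inf > 1e308, "infinity is greater than any finite double");

// __builtin_classify_type yields an integer that depends only on the
// type of its argument.
int classifyInt() { return __builtin_classify_type(0); }
int classifyDouble() { return __builtin_classify_type(0.0); }
```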

The Sema Library

This library is called by the Parser library during parsing to do semantic analysis of the input. For valid programs, Sema builds an AST for parsed constructs.

The CodeGen Library

CodeGen takes an AST as input and produces LLVM IR code from it.

How to change Clang

How to add an attribute

Attributes are a form of metadata that can be attached to a program construct, allowing the programmer to pass semantic information along to the compiler for various uses. For example, attributes may be used to alter the code generation for a program construct, or to provide extra semantic information for static analysis. This document explains how to add a custom attribute to Clang. Documentation on existing attributes can be found in the Clang attribute reference.

Attribute Basics

Attributes in Clang are handled in three stages: parsing into a parsed attribute representation, conversion from a parsed attribute into a semantic attribute, and then the semantic handling of the attribute.

Parsing of the attribute is determined by the various syntactic forms attributes can take, such as GNU, C++11, and Microsoft style attributes, as well as other information provided by the table definition of the attribute. Ultimately, the parsed representation of an attribute object is a ParsedAttr object. These parsed attributes chain together as a list of parsed attributes attached to a declarator or declaration specifier. The parsing of attributes is handled automatically by Clang, except for attributes spelled as so-called “custom” keywords. When implementing a custom keyword attribute, the parsing of the keyword and creation of the ParsedAttr object must be done manually.

Eventually, Sema::ProcessDeclAttributeList() is called with a Decl and a ParsedAttr, at which point the parsed attribute can be transformed into a semantic attribute. The process by which a parsed attribute is converted into a semantic attribute depends on the attribute definition and semantic requirements of the attribute. The end result, however, is that the semantic attribute object is attached to the Decl object, and can be obtained by a call to Decl::getAttr<T>(). Similarly, for statement attributes, Sema::ProcessStmtAttributes() is called with a Stmt and a list of ParsedAttr objects to be converted into a semantic attribute.

The structure of the semantic attribute is also governed by the attribute definition given in Attr.td. This definition is used to automatically generate functionality used for the implementation of the attribute, such as a class derived from clang::Attr, information for the parser to use, automated semantic checking for some attributes, etc.

include/clang/Basic/Attr.td

The first step to adding a new attribute to Clang is to add its definition to include/clang/Basic/Attr.td. This tablegen definition must derive from the Attr (tablegen, not semantic) type, or one of its derivatives. Most attributes will derive from the InheritableAttr type, which specifies that the attribute can be inherited by later redeclarations of the Decl it is associated with. InheritableParamAttr is similar to InheritableAttr, except that the attribute is written on a parameter instead of a declaration. If the attribute applies to statements, it should inherit from StmtAttr. If the attribute is intended to apply to a type instead of a declaration, such an attribute should derive from TypeAttr, and will generally not be given an AST representation. (Note that this document does not cover the creation of type attributes.) An attribute that inherits from IgnoredAttr is parsed, but will generate an ignored attribute diagnostic when used, which may be useful when an attribute is supported by another vendor but not supported by clang.

The definition will specify several key pieces of information, such as the semantic name of the attribute, the spellings the attribute supports, the arguments the attribute expects, and more. Most members of the Attr tablegen type do not require definitions in the derived definition as the defaults suffice. However, every attribute must specify at least a spelling list, a subject list, and a documentation list.

Spellings

All attributes are required to specify a spelling list that denotes the ways in which the attribute can be spelled. For instance, a single semantic attribute may have a keyword spelling, as well as a C++11 spelling and a GNU spelling. An empty spelling list is also permissible and may be useful for attributes which are created implicitly. The following spellings are accepted:

===============  =========================================================
Spelling         Description
===============  =========================================================
GNU              Spelled with a GNU-style __attribute__((attr))
                 syntax and placement.
CXX11            Spelled with a C++-style [[attr]] syntax with an
                 optional vendor-specific namespace.
C23              Spelled with a C-style [[attr]] syntax with an
                 optional vendor-specific namespace.
Declspec         Spelled with a Microsoft-style __declspec(attr)
                 syntax.
CustomKeyword    The attribute is spelled as a keyword, and requires
                 custom parsing.
RegularKeyword   The attribute is spelled as a keyword. It can be
                 used in exactly the places that the standard
                 [[attr]] syntax can be used, and appertains to
                 exactly the same thing that a standard attribute
                 would appertain to. Lexing and parsing of the keyword
                 are handled automatically.
GCC              Specifies two or three spellings: the first is a
                 GNU-style spelling, the second is a C++-style spelling
                 with the gnu namespace, and the third is an optional
                 C-style spelling with the gnu namespace. Attributes
                 should only specify this spelling for attributes
                 supported by GCC.
Clang            Specifies two or three spellings: the first is a
                 GNU-style spelling, the second is a C++-style spelling
                 with the clang namespace, and the third is an
                 optional C-style spelling with the clang namespace.
                 By default, a C-style spelling is provided.
Pragma           The attribute is spelled as a #pragma, and requires
                 custom processing within the preprocessor. If the
                 attribute is meant to be used by Clang, it should
                 set the namespace to "clang". Note that this
                 spelling is not used for declaration attributes.
===============  =========================================================

The C++ standard specifies that “any [non-standard attribute] that is not recognized by the implementation is ignored” ([dcl.attr.grammar]). The rule for C is similar. This makes CXX11 and C23 spellings unsuitable for attributes that affect the type system, that change the binary interface of the code, or that have other similar semantic meaning.

RegularKeyword provides an alternative way of spelling such attributes. It reuses the production rules for standard attributes, but it applies them to plain keywords rather than to [[…]] sequences. Compilers that don’t recognize the keyword are likely to report an error of some kind.

For example, the ArmStreaming function type attribute affects both the type system and the binary interface of the function. It cannot therefore be spelled [[arm::streaming]], since compilers that don’t understand arm::streaming would ignore it and miscompile the code. ArmStreaming is instead spelled __arm_streaming, but it can appear wherever a hypothetical [[arm::streaming]] could appear.
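From the user's side, the spellings of one semantic attribute look as follows. The sketch uses the GCC attribute const only as an illustration and assumes a compiler that accepts both the GNU-style spelling and the C++-style spelling with the gnu namespace (GCC and Clang do); the function names are made up for the example.

```cpp
#include <cassert>

// GNU-style spelling: __attribute__((attr)).
__attribute__((const)) int squareGNU(int X) { return X * X; }

// C++-style spelling of the same attribute with the vendor namespace gnu.
[[gnu::const]] int squareCXX11(int X) { return X * X; }
```

Both declarations carry the same semantic attribute; only the syntax used to write it differs. Because const is purely an optimization hint, it is a reasonable candidate for a [[...]] spelling: a compiler that ignored it would still produce correct code.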

Subjects

Attributes appertain to one or more subjects. If the attribute attempts to attach to a subject that is not in the subject list, a diagnostic is issued automatically. Whether the diagnostic is a warning or an error depends on how the attribute’s SubjectList is defined, but the default behavior is to warn. The diagnostics displayed to the user are automatically determined based on the subjects in the list, but a custom diagnostic parameter can also be specified in the SubjectList. If a previously unused Decl node is added to the SubjectList, the logic used to automatically determine the diagnostic parameter in utils/TableGen/ClangAttrEmitter.cpp may need to be updated.

By default, all subjects in the SubjectList must either be a Decl node defined in DeclNodes.td, or a statement node defined in StmtNodes.td. However, more complex subjects can be created by creating a SubsetSubject object. Each such object has a base subject which it appertains to (which must be a Decl or Stmt node, and not a SubsetSubject node), and some custom code which is called when determining whether an attribute appertains to the subject. For instance, a NonBitField SubsetSubject appertains to a FieldDecl, and tests that the given FieldDecl is not a bit field. When a SubsetSubject is specified in a SubjectList, a custom diagnostic parameter must also be provided.

Diagnostic checking for attribute subject lists for declaration and statement attributes is automated except when HasCustomParsing is set to 1.

Documentation

All attributes must have some form of documentation associated with them. Documentation is table generated on the public web server by a server-side process that runs daily. Generally, the documentation for an attribute is a stand-alone definition in include/clang/Basic/AttrDocs.td that is named after the attribute being documented.

If the attribute is not for public consumption, or is an implicitly-created attribute that has no visible spelling, the documentation list can specify the InternalOnly object. Otherwise, the attribute should have its documentation added to AttrDocs.td.

Documentation derives from the Documentation tablegen type. All derived types must specify a documentation category and the actual documentation itself. Additionally, it can specify a custom heading for the attribute, though a default heading will be chosen when possible.

There are four predefined documentation categories: DocCatFunction for attributes that appertain to function-like subjects, DocCatVariable for attributes that appertain to variable-like subjects, DocCatType for type attributes, and DocCatStmt for statement attributes. A custom documentation category should be used for groups of attributes with similar functionality. Custom categories are good for providing overview information for the attributes grouped under it. For instance, the consumed annotation attributes define a custom category, DocCatConsumed, that explains what consumed annotations are at a high level.

Documentation content (whether it is for an attribute or a category) is written using reStructuredText (RST) syntax.

After writing the documentation for the attribute, it should be locally tested to ensure that there are no issues generating the documentation on the server. Local testing requires a fresh build of clang-tblgen. To generate the attribute documentation, execute the following command:

clang-tblgen -gen-attr-docs -I /path/to/clang/include /path/to/clang/include/clang/Basic/Attr.td -o /path/to/clang/docs/AttributeReference.rst

When testing locally, do not commit changes to AttributeReference.rst. This file is generated by the server automatically, and any changes made to this file will be overwritten.

Arguments

Attributes may optionally specify a list of arguments that can be passed to the attribute. Attribute arguments specify both the parsed form and the semantic form of the attribute. For example, if Args is [StringArgument<"Arg1">, IntArgument<"Arg2">] then __attribute__((myattribute("Hello", 3))) will be a valid use; it requires two arguments while parsing, and the Attr subclass’ constructor for the semantic attribute will require a string and integer argument.

All arguments have a name and a flag that specifies whether the argument is optional. The associated C++ type of the argument is determined by the argument definition type. If the existing argument types are insufficient, new types can be created, but it requires modifying utils/TableGen/ClangAttrEmitter.cpp to properly support the type.

Other Properties

TheAttr definition has other members which control the behavior of theattribute. Many of them are special-purpose and beyond the scope of thisdocument, however a few deserve mention.

If the parsed form of the attribute is more complex, or differs from thesemantic form, theHasCustomParsing bit can be set to1 for the class,and the parsing code inParser::ParseGNUAttributeArgs()can be updated for the special case. Note that this only applies to argumentswith a GNU spelling – attributes with a __declspec spelling currently ignorethis flag and are handled byParser::ParseMicrosoftDeclSpec.

Note that setting this member to 1 will opt out of common attribute semantic handling, requiring extra implementation effort to ensure the attribute appertains to the appropriate subject, etc.

If the attribute should not be propagated from a template declaration to an instantiation of the template, set the Clone member to 0. By default, all attributes will be cloned to template instantiations.

Attributes that do not require an AST node should set the ASTNode field to 0 to avoid polluting the AST. Note that anything inheriting from TypeAttr or IgnoredAttr automatically does not generate an AST node. All other attributes generate an AST node by default. The AST node is the semantic representation of the attribute.

The LangOpts field specifies a list of language options required by the attribute. For instance, all of the CUDA-specific attributes specify [CUDA] for the LangOpts field, and when the CUDA language option is not enabled, an “attribute ignored” warning diagnostic is emitted. Since language options are not table-generated nodes, new language options must be created manually and should specify the spelling used by the LangOptions class.

Custom accessors can be generated for an attribute based on the spelling list for that attribute. For instance, if an attribute has two different spellings, ‘Foo’ and ‘Bar’, accessors can be created: [Accessor<"isFoo", [GNU<"Foo">]>, Accessor<"isBar", [GNU<"Bar">]>]. These accessors will be generated on the semantic form of the attribute, accepting no arguments and returning a bool.
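
As a rough illustration of what such accessors amount to, the sketch below models a semantic attribute class by hand. MyAttr, its Spelling enumeration, and the SpellingIndex member are hypothetical stand-ins, not the code that clang-tblgen actually emits; the only assumption is that each accessor reports whether the attribute was written with one of the spellings listed for it.

```cpp
#include <cassert>

// Toy stand-in for a generated semantic attribute class (hypothetical).
class MyAttr {
public:
  // Order mirrors the attribute's spelling list: GNU<"Foo">, GNU<"Bar">.
  enum Spelling { GNU_Foo = 0, GNU_Bar = 1 };

  explicit MyAttr(Spelling S) : SpellingIndex(S) {}

  // Each accessor takes no arguments and returns a bool telling whether the
  // attribute was written with one of the spellings listed for that accessor.
  bool isFoo() const { return SpellingIndex == GNU_Foo; }
  bool isBar() const { return SpellingIndex == GNU_Bar; }

private:
  unsigned SpellingIndex; // which entry of the spelling list was used
};
```

Client code can then branch on isFoo() or isBar() instead of comparing spellings by hand.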

Attributes that do not require custom semantic handling should set the SemaHandler field to 0. Note that anything inheriting from IgnoredAttr automatically does not get a semantic handler. All other attributes are assumed to use a semantic handler by default. Attributes without a semantic handler are not given a parsed attribute Kind enumerator.

“Simple” attributes that require no custom semantic processing aside from what is automatically provided should set the SimpleHandler field to 1.

Target-specific attributes may share a spelling with other attributes in different targets. For instance, the ARM and MSP430 targets both have an attribute spelled GNU<"interrupt">, but with different parsing and semantic requirements. To support this feature, an attribute inheriting from TargetSpecificAttribute may specify a ParseKind field. This field should have the same value across all attributes sharing a spelling, and corresponds to the parsed attribute’s Kind enumerator. This allows attributes to share a parsed attribute kind, but have distinct semantic attribute classes. For instance, ParsedAttr is the shared parsed attribute kind, but ARMInterruptAttr and MSP430InterruptAttr are the semantic attributes generated.

By default, attribute arguments are parsed in an evaluated context. If the arguments for an attribute should be parsed in an unevaluated context (akin to the way the argument to a sizeof expression is parsed), set ParseArgumentsAsUnevaluated to 1.

If additional functionality is desired for the semantic form of the attribute, the AdditionalMembers field specifies code to be copied verbatim into the semantic attribute class object, with public access.

If two or more attributes cannot be used in combination on the same declaration or statement, a MutualExclusions definition can be supplied to automatically generate diagnostic code. This will disallow the attribute combinations regardless of spellings used. Additionally, it will diagnose combinations within the same attribute list, across different attribute lists, and across redeclarations, as appropriate.

Boilerplate

All semantic processing of declaration attributes happens in lib/Sema/SemaDeclAttr.cpp, and generally starts in the ProcessDeclAttribute() function. If the attribute has the SimpleHandler field set to 1 then the function to process the attribute will be automatically generated, and nothing needs to be done here. Otherwise, write a new handleYourAttr() function, and add that to the switch statement. Please do not implement handling logic directly in the case for the attribute.

Unless otherwise specified by the attribute definition, common semantic checking of the parsed attribute is handled automatically. This includes diagnosing parsed attributes that do not appertain to the given Decl or Stmt, ensuring the correct minimum number of arguments is passed, etc.

If the attribute adds additional warnings, define a DiagGroup in include/clang/Basic/DiagnosticGroups.td named after the attribute’s Spelling with “_”s replaced by “-”s. If there is only a single diagnostic, it is permissible to use InGroup<DiagGroup<"your-attribute">> directly in DiagnosticSemaKinds.td.

All semantic diagnostics generated for your attribute, including automatically-generated ones (such as subjects and argument counts), should have a corresponding test case.

Semantic handling

Most attributes are implemented to have some effect on the compiler, for instance to modify the way code is generated or to add extra semantic checks for an analysis pass. Having added the attribute definition and conversion to the semantic representation for the attribute, what remains is to implement the custom logic requiring use of the attribute.

The clang::Decl object can be queried for the presence or absence of an attribute using hasAttr<T>(). To obtain a pointer to the semantic representation of the attribute, getAttr<T>() may be used.
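
The query pattern can be sketched with toy stand-ins. The Attr, DeprecatedAttr, UnusedAttr, and Decl types below are simplified models written for this example, not the real Clang classes (which, among other differences, do not rely on dynamic_cast); only the hasAttr<T>() and getAttr<T>() names come from the text above.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy attribute hierarchy standing in for clang::Attr and its subclasses.
struct Attr {
  virtual ~Attr() = default;
};
struct DeprecatedAttr : Attr {};
struct UnusedAttr : Attr {};

// Toy declaration that owns a list of attributes.
class Decl {
public:
  void addAttr(std::unique_ptr<Attr> A) { Attrs.push_back(std::move(A)); }

  // Returns the first attribute of type T, or nullptr if none is attached.
  template <typename T> const T *getAttr() const {
    for (const std::unique_ptr<Attr> &A : Attrs)
      if (const T *Typed = dynamic_cast<const T *>(A.get()))
        return Typed;
    return nullptr;
  }

  // Reports whether an attribute of type T is attached.
  template <typename T> bool hasAttr() const { return getAttr<T>() != nullptr; }

private:
  std::vector<std::unique_ptr<Attr>> Attrs;
};
```

In this model, hasAttr<T>() is the cheap yes/no query, while getAttr<T>() is the form to use when the attribute's own data is needed afterwards.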

The clang::AttributedStmt object can be queried for the presence or absence of an attribute by calling getAttrs() and looping over the list of attributes.

How to add an expression or statement

Expressions and statements are among the most fundamental constructs within a compiler, because they interact with many different parts of the AST, semantic analysis, and IR generation. Therefore, adding a new expression or statement kind into Clang requires some care. The following list details the various places in Clang where an expression or statement needs to be introduced, along with patterns to follow to ensure that the new expression or statement works well across all of the C languages. We focus on expressions, but statements are similar.

  1. Introduce parsing actions into the parser. Recursive-descent parsing is mostly self-explanatory, but there are a few things that are worth keeping in mind:

    • Keep as much source location information as possible! You’ll want it later to produce great diagnostics and support Clang’s various features that map between source code and the AST.

    • Write tests for all of the “bad” parsing cases, to make sure your recovery is good. If you have matched delimiters (e.g., parentheses, square brackets, etc.), use Parser::BalancedDelimiterTracker to give nice diagnostics when things go wrong.

  2. Introduce semantic analysis actions into Sema. Semantic analysis should always involve two functions: an ActOnXXX function that will be called directly from the parser, and a BuildXXX function that performs the actual semantic analysis and will (eventually!) build the AST node. It’s fairly common for the ActOnXXX function to do very little (often just some minor translation from the parser’s representation to Sema’s representation of the same thing), but the separation is still important: C++ template instantiation, for example, should always call the BuildXXX variant. Several notes on semantic analysis before we get into construction of the AST:

    • Your expression probably involves some types and some subexpressions. Make sure to fully check that those types, and the types of those subexpressions, meet your expectations. Add implicit conversions where necessary to make sure that all of the types line up exactly the way you want them. Write extensive tests to check that you’re getting good diagnostics for mistakes and that you can use various forms of subexpressions with your expression.

    • When type-checking a type or subexpression, make sure to first check whether the type is “dependent” (Type::isDependentType()) or whether a subexpression is type-dependent (Expr::isTypeDependent()). If any of these return true, then you’re inside a template and you can’t do much type-checking now. That’s normal, and your AST node (when you get there) will have to deal with this case. At this point, you can write tests that use your expression within templates, but don’t try to instantiate the templates.

    • For each subexpression, be sure to call Sema::CheckPlaceholderExpr() to deal with “weird” expressions that don’t behave well as subexpressions. Then, determine whether you need to perform lvalue-to-rvalue conversions (Sema::DefaultLvalueConversion) or the usual unary conversions (Sema::UsualUnaryConversions), for places where the subexpression is producing a value you intend to use.

    • Your BuildXXX function will probably just return ExprError() at this point, since you don’t have an AST. That’s perfectly fine, and shouldn’t impact your testing.

  3. Introduce an AST node for your new expression. This starts with declaring the node in include/Basic/StmtNodes.td and creating a new class for your expression in the appropriate include/AST/Expr*.h header. It’s best to look at the class for a similar expression to get ideas, and there are some specific things to watch for:

    • If you need to allocate memory, use the ASTContext allocator. Never use raw malloc or new, and never hold any resources in an AST node, because the destructor of an AST node is never called.

    • Make sure that getSourceRange() covers the exact source range of your expression. This is needed for diagnostics and for IDE support.

    • Make sure that children() visits all of the subexpressions. This is important for a number of features (e.g., IDE support, C++ variadic templates). If you have sub-types, you’ll also need to visit those sub-types in RecursiveASTVisitor.

    • Add printing support (StmtPrinter.cpp) for your expression.

    • Add profiling support (StmtProfile.cpp) for your AST node, noting the distinguishing (non-source location) characteristics of an instance of your expression. Omitting this step will lead to hard-to-diagnose failures regarding matching of template declarations.

    • Add serialization support (ASTReaderStmt.cpp, ASTWriterStmt.cpp) for your AST node.

  4. Teach semantic analysis to build your AST node. At this point, you can wire up your Sema::BuildXXX function to actually create your AST. A few things to check at this point:

    • If your expression can construct a new C++ class or return a new Objective-C object, be sure to update and then call Sema::MaybeBindToTemporary for your just-created AST node to be sure that the object gets properly destructed. An easy way to test this is to return a C++ class with a private destructor: semantic analysis should flag an error here with the attempt to call the destructor.

    • Inspect the generated AST by printing it using clang -cc1 -ast-print, to make sure you’re capturing all of the important information about how the AST was written.

    • Inspect the generated AST under clang -cc1 -ast-dump to verify that all of the types in the generated AST line up the way you want them. Remember that clients of the AST should never have to “think” to understand what’s going on. For example, all implicit conversions should show up explicitly in the AST.

    • Write tests that use your expression as a subexpression of other, well-known expressions. Can you call a function using your expression as an argument? Can you use the ternary operator?

  5. Teach code generation to create IR for your AST node. This step is the first (and only) one that requires knowledge of LLVM IR. There are several things to keep in mind:

    • Code generation is separated into scalar/aggregate/complex and lvalue/rvalue paths, depending on what kind of result your expression produces. On occasion, this requires some careful factoring of code to avoid duplication.

    • CodeGenFunction contains functions ConvertType and ConvertTypeForMem that convert Clang’s types (clang::Type* or clang::QualType) to LLVM types. Use the former for values, and the latter for memory locations: test with the C++ “bool” type to check this. If you find that you are having to use LLVM bitcasts to make the subexpressions of your expression have the type that your expression expects, STOP! Go fix semantic analysis and the AST so that you don’t need these bitcasts.

    • The CodeGenFunction class has a number of helper functions to make certain operations easy, such as generating code to produce an lvalue or an rvalue, or to initialize a memory location with a given value. Prefer to use these functions rather than directly writing loads and stores, because these functions take care of some of the tricky details for you (e.g., for exceptions).

    • If your expression requires some special behavior in the event of an exception, look at the push*Cleanup functions in CodeGenFunction to introduce a cleanup. You shouldn’t have to deal with exception-handling directly.

    • Testing is extremely important in IR generation. Use clang -cc1 -emit-llvm and FileCheck to verify that you’re generating the right IR.

  6. Teach template instantiation how to cope with your AST node, which requires some fairly simple code:

    • Make sure that your expression’s constructor properly computes the flags for type dependence (i.e., the type your expression produces can change from one instantiation to the next), value dependence (i.e., the constant value your expression produces can change from one instantiation to the next), instantiation dependence (i.e., a template parameter occurs anywhere in your expression), and whether your expression contains a parameter pack (for variadic templates). Often, computing these flags just means combining the results from the various types and subexpressions.

    • Add TransformXXX and RebuildXXX functions to the TreeTransform class template in Sema. TransformXXX should (recursively) transform all of the subexpressions and types within your expression, using getDerived().TransformYYY. If all of the subexpressions and types transform without error, it will then call the RebuildXXX function, which will in turn call getSema().BuildXXX to perform semantic analysis and build your expression.

    • To test template instantiation, take those tests you wrote to make sure that you were type checking with type-dependent expressions and dependent types (from step #2) and instantiate those templates with various types, some of which type-check and some that don’t, and test the error messages in each case.

  7. There are some “extras” that make other features work better. It’s worth handling these extras to give your expression complete integration into Clang:

    • Add code completion support for your expression in SemaCodeComplete.cpp.

    • If your expression has types in it, or has any “interesting” features other than subexpressions, extend libclang’s CursorVisitor to provide proper visitation for your expression, enabling various IDE features such as syntax highlighting, cross-referencing, and so on. The c-index-test helper program can be used to test these features.

Testing

All functional changes to Clang should come with test coverage demonstrating the change in behavior.

Verifying Diagnostics

Clang -cc1 supports the -verify command line option as a way to validate diagnostic behavior. This option will use special comments within the test file to verify that expected diagnostics appear in the correct source locations. If all of the expected diagnostics match the actual output of Clang, then the invocation will return normally. If there are discrepancies between the expected and actual output, Clang will emit detailed information about which expected diagnostics were not seen or which unexpected diagnostics were seen, etc. A complete example is:

// RUN: %clang_cc1 -verify %s
int A = B; // expected-error {{use of undeclared identifier 'B'}}

If the test is run and the expected error is emitted on the expected line, the diagnostic verifier will pass. However, if the expected error does not appear or appears in a different location than expected, or if additional diagnostics appear, the diagnostic verifier will fail and emit information as to why.

The -verify option optionally accepts a comma-delimited list of one or more verification prefixes that can be used to craft those special comments. Each prefix must start with a letter and contain only alphanumeric characters, hyphens, and underscores. -verify by itself is equivalent to -verify=expected, meaning that special comments will start with expected. Using different prefixes makes it easier to have separate RUN: lines in the same test file which result in differing diagnostic behavior. For example:

// RUN: %clang_cc1 -verify=foo,bar %s
int A = B; // foo-error {{use of undeclared identifier 'B'}}
int C = D; // bar-error {{use of undeclared identifier 'D'}}
int E = F; // expected-error {{use of undeclared identifier 'F'}}

The verifier will recognize foo-error and bar-error as special comments but will not recognize expected-error as one because the -verify line does not contain that as a prefix. Thus, this test would fail verification because an unexpected diagnostic would appear on the declaration of E.

Multiple occurrences accumulate prefixes. For example, -verify -verify=foo,bar -verify=baz is equivalent to -verify=expected,foo,bar,baz.
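
The accumulation rule can be modeled in a few lines. This is a sketch of the behavior described above, not Clang's option-parsing code; collectVerifyPrefixes is a hypothetical helper, and it assumes the prefixes are collected in the order the options appear.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the verification prefixes implied by a list
// of -verify options, in the order in which they appear.
std::vector<std::string>
collectVerifyPrefixes(const std::vector<std::string> &Args) {
  std::vector<std::string> Prefixes;
  for (const std::string &Arg : Args) {
    if (Arg == "-verify") {
      Prefixes.push_back("expected"); // bare -verify means -verify=expected
    } else if (Arg.rfind("-verify=", 0) == 0) {
      std::stringstream List(Arg.substr(8)); // text after "-verify="
      std::string Prefix;
      while (std::getline(List, Prefix, ','))
        Prefixes.push_back(Prefix);
    }
  }
  return Prefixes;
}
```

Arguments that are not -verify options are ignored by this model.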

Specifying Diagnostics

Indicating that a line expects an error or a warning is easy. Put a comment on the line that has the diagnostic, use expected-{error,warning,remark,note} to tag if it’s an expected error, warning, remark, or note (respectively), and place the expected text between {{ and }} markers. The full text doesn’t have to be included, only enough to ensure that the correct diagnostic was emitted. (Note: full text should be included in test cases unless there is a compelling reason to use truncated text instead.)

For a full description of the matching behavior, including more complex matching scenarios, see Matching Modes below.

Here’s an example of the most commonly used way to specify expected diagnostics:

int A = B; // expected-error {{use of undeclared identifier 'B'}}

You can place as many diagnostics on one line as you wish. To make the code more readable, you can use slash-newline to separate out the diagnostics.

Alternatively, it is possible to specify the line on which the diagnostic should appear by appending @<line> to expected-<type>, for example:

#warning some text
// expected-warning@10 {{some text}}

The line number may be absolute (as above), or relative to the current line by prefixing the number with either + or -.
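
To make the absolute and relative forms concrete, the sketch below resolves the text following @ to a line number. resolveDirectiveLine is a hypothetical helper written for this example; it models only plain line numbers, not file paths or markers.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: resolves the text after '@' in a directive ("10",
// "+1", "-2") to an absolute line number, given the line the directive is on.
int resolveDirectiveLine(const std::string &Spec, int DirectiveLine) {
  assert(!Spec.empty() && "expected a line specifier after '@'");
  if (Spec[0] == '+')
    return DirectiveLine + std::stoi(Spec.substr(1)); // lines below
  if (Spec[0] == '-')
    return DirectiveLine - std::stoi(Spec.substr(1)); // lines above
  return std::stoi(Spec); // absolute line number
}
```

Under this model, a directive written on line 7 as expected-warning@-2 refers to line 5, while expected-warning@10 refers to line 10 wherever the directive itself appears.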

If the diagnostic is generated in a separate file, for example in a shared header file, it may be beneficial to be able to declare the file in which the diagnostic will appear, rather than placing the expected-* directive in the actual file itself. This can be done using the following syntax:

// expected-error@path/include.h:15 {{error message}}

The path can be absolute or relative and the same search paths will be used as for #include directives. The line number in an external file may be substituted with * meaning that any line number will match (useful where the included file is, for example, a system header where the actual line number may change and is not critical).

As an alternative to specifying a fixed line number, the location of a diagnostic can instead be indicated by a marker of the form #<marker>. Markers are specified by including them in a comment, and then referenced by appending the marker to the diagnostic with @#<marker>, as with:

#warning some text // #1
// ... other code ...
// expected-warning@#1 {{some text}}

The name of a marker used in a directive must be unique within the compilation.

The simple syntax above allows each specification to match exactly one diagnostic. You can use the extended syntax to customize this. The extended syntax is expected-<type> <n> {{diag text}}, where <type> is one of error, warning, remark, or note, and <n> is a positive integer. This allows the diagnostic to appear as many times as specified. For example:

void f(); // expected-note 2 {{previous declaration is here}}

Where the diagnostic is expected to occur a minimum number of times, this can be specified by appending a + to the number. For example:

void f(); // expected-note 0+ {{previous declaration is here}}
void g(); // expected-note 1+ {{previous declaration is here}}

In the first example, the diagnostic becomes optional, i.e. it will be swallowed if it occurs, but will not generate an error if it does not occur. In the second example, the diagnostic must occur at least once. As a short-hand, “one or more” can be specified simply by +. For example:

void g(); // expected-note + {{previous declaration is here}}

A range can also be specified by <n>-<m>. For example:

void f(); // expected-note 0-1 {{previous declaration is here}}

In this example, the diagnostic may appear only once, if at all.
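
The count forms described above (none, <n>, <n>+, + and <n>-<m>) can be summarized as a minimum and maximum number of occurrences. The following is a minimal sketch of that interpretation; CountRange, parseCount, and countSatisfies are hypothetical names, and the code does not reflect how the verifier itself is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Inclusive bounds on how many times a diagnostic may appear.
struct CountRange {
  unsigned Min;
  unsigned Max;
};

// Hypothetical parser for the count field of a directive: "" (simple
// syntax), "<n>", "<n>+", "+", or "<n>-<m>".
CountRange parseCount(const std::string &Spec) {
  const unsigned Unbounded = std::numeric_limits<unsigned>::max();
  if (Spec.empty())
    return {1, 1}; // no count given: exactly one diagnostic
  if (Spec == "+")
    return {1, Unbounded}; // shorthand for "1+"
  std::size_t Pos = 0;
  const unsigned Min = static_cast<unsigned>(std::stoul(Spec, &Pos));
  if (Pos == Spec.size())
    return {Min, Min}; // "<n>": exactly n times
  if (Spec[Pos] == '+')
    return {Min, Unbounded}; // "<n>+": at least n times
  assert(Spec[Pos] == '-' && "expected '+' or '-' after the first number");
  const unsigned Max = static_cast<unsigned>(std::stoul(Spec.substr(Pos + 1)));
  return {Min, Max}; // "<n>-<m>": between n and m times
}

// Returns true if seeing the diagnostic Seen times satisfies Spec.
bool countSatisfies(const std::string &Spec, unsigned Seen) {
  const CountRange R = parseCount(Spec);
  return R.Min <= Seen && Seen <= R.Max;
}
```

For instance, countSatisfies("0-1", 2) is false in this model, matching the statement that the diagnostic may appear only once, if at all.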

Matching Modes

The default matching mode is simple string, which looks for the expected text that appears between the first {{ and }} pair of the comment. The string is interpreted just as-is, with one exception: the sequence \n is converted to a single newline character. This mode matches the emitted diagnostic when the text appears as a substring at any position of the emitted message.

To enable matching against desired strings that contain }} or {{, the string-mode parser accepts opening delimiters of more than two curly braces, like {{{. It then looks for a closing delimiter of equal “width” (i.e. }}}). For example:

// expected-note {{{evaluates to '{{2, 3, 4}} == {0, 3, 4}'}}}

The intent is to allow the delimiter to be wider than the longest { or } brace sequence in the content, so that if your expected text contains {{{ (three braces) it may be delimited with {{{{ (four braces), and so on.
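
The equal-width rule can be sketched as follows. extractExpectedText is a hypothetical helper that takes the directive text starting at its opening delimiter; it is a simplification that treats every leading { as part of the delimiter and stops at the first closing run of the same width, so it is not a faithful copy of the verifier's parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: given directive text that starts at the opening
// delimiter, returns the expected text between the delimiters.
std::string extractExpectedText(const std::string &Directive) {
  // The opening delimiter is the leading run of '{' characters.
  std::size_t Width = 0;
  while (Width < Directive.size() && Directive[Width] == '{')
    ++Width;
  assert(Width >= 2 && "delimiter must be at least two braces wide");

  // The closing delimiter has the same width as the opening one.
  const std::string Close(Width, '}');
  const std::size_t End = Directive.find(Close, Width);
  assert(End != std::string::npos && "missing closing delimiter");
  return Directive.substr(Width, End - Width);
}
```

With a three-brace delimiter, the two-brace sequences inside the expected text no longer terminate it.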

Regex matching mode may be selected by appending -re to the diagnostic type and including regexes wrapped in double curly braces ({{ and }}) in the directive, such as:

expected-error-re {{format specifies type 'wchar_t **' (aka '{{.+}}')}}

Examples matching error: “variable has incomplete type ‘struct s’”

// expected-error {{variable has incomplete type 'struct s'}}
// expected-error {{variable has incomplete type}}
// expected-error {{{variable has incomplete type}}}
// expected-error {{{{variable has incomplete type}}}}
// expected-error-re {{variable has incomplete type 'struct {{.}}'}}
// expected-error-re {{variable has incomplete type 'struct {{.*}}'}}
// expected-error-re {{variable has incomplete type 'struct {{(.*)}}'}}
// expected-error-re {{variable has incomplete type 'struct{{[[:space:]](.*)}}'}}
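
One way to picture regex mode is as a translation of the directive content into a single regular expression: text outside {{ }} is matched literally, and text inside is used as a regex. The sketch below does this with std::regex; escapeLiteral, directiveToRegex, and matchesDiagnostic are hypothetical helpers, and std::regex's default grammar is assumed to be close enough to the verifier's regex engine for these simple patterns.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Escapes regex metacharacters so that Text is matched literally.
std::string escapeLiteral(const std::string &Text) {
  static const std::string Meta = R"(\^$.|?*+()[]{})";
  std::string Out;
  for (char C : Text) {
    if (Meta.find(C) != std::string::npos)
      Out += '\\';
    Out += C;
  }
  return Out;
}

// Hypothetical translation of the content of an expected-*-re directive
// (the text between the outer delimiters) into one regular expression.
std::string directiveToRegex(const std::string &Content) {
  std::string Pattern;
  std::size_t Pos = 0;
  while (Pos < Content.size()) {
    const std::size_t Open = Content.find("{{", Pos);
    if (Open == std::string::npos) {
      Pattern += escapeLiteral(Content.substr(Pos)); // trailing literal text
      break;
    }
    Pattern += escapeLiteral(Content.substr(Pos, Open - Pos));
    const std::size_t Close = Content.find("}}", Open + 2);
    assert(Close != std::string::npos && "unterminated {{ in directive");
    // The text between {{ and }} is taken as a regex, wrapped in a group.
    Pattern += "(" + Content.substr(Open + 2, Close - Open - 2) + ")";
    Pos = Close + 2;
  }
  return Pattern;
}

// Returns true if the pattern matches anywhere in the emitted message.
bool matchesDiagnostic(const std::string &Content, const std::string &Message) {
  return std::regex_search(Message, std::regex(directiveToRegex(Content)));
}
```

Escaping the literal portions matters: in the wchar_t example above, the * and parentheses must match themselves instead of acting as regex operators.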

Feature Test Macros

Clang implements several ways to test whether a feature is supported or not. Some of these feature tests are standardized, like __has_cpp_attribute or __cpp_lambdas, while others are Clang extensions, like __has_builtin. The common theme among all the various feature tests is that they are a utility to tell users that we think a particular feature is complete. However, completeness is a difficult property to define because features may still have lingering bugs, may only work on some targets, etc. We use the following criteria when deciding whether to expose a feature test macro (or particular result value for the feature test):

  • Are there known issues where we reject valid code that should be accepted?

  • Are there known issues where we accept invalid code that should be rejected?

  • Are there known crashes, failed assertions, or miscompilations?

  • Are there known issues on a particular relevant target?

If the answer to any of these is “yes”, the feature test macro should either not be defined or there should be very strong rationale for why the issues should not prevent defining it. Note, it is acceptable to define the feature test macro on a per-target basis if needed.

When in doubt, being conservative is better than being aggressive. If we don’t claim support for the feature but it does useful things, users can still use it and provide us with useful feedback on what is missing. But if we claim support for a feature that has significant bugs, we’ve eliminated most of the utility of having a feature testing macro at all because users are then forced to test what compiler version is in use to get a more accurate answer.
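
For context, this is roughly what relying on a feature test looks like from the user's side. countSetBits is a hypothetical helper written for this example; __has_builtin and __builtin_popcount are existing compiler facilities. The code takes the builtin path only when the compiler claims support, which is why an inaccurate claim is costly.

```cpp
#include <cassert>

// Fallback so the code still compiles with compilers that lack __has_builtin.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Counts the set bits in Value, using the builtin only when the compiler
// reports that it is supported.
inline int countSetBits(unsigned Value) {
#if __has_builtin(__builtin_popcount)
  return __builtin_popcount(Value);
#else
  int Count = 0;
  for (; Value != 0; Value &= Value - 1) // clear the lowest set bit
    ++Count;
  return Count;
#endif
}
```

Both branches compute the same result, so the feature test only selects an implementation and never changes behavior.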

The status reported by the feature test macro should always be reflected in the language support page for the corresponding feature (C++, C) if applicable. This page can give more nuanced information to the user as well, such as claiming partial support for a feature and specifying details as to what remains to be done.