
The syntax of computer source code is the form in which the code must be structured and ordered according to the rules of a computer language. Like a natural language, a computer language (i.e. a programming language) defines the syntax that is valid for that language.[1] A syntax error occurs when syntactically invalid source code is processed by a tool such as a compiler or interpreter.
The most commonly used languages are text-based, with syntax based on strings. Alternatively, the syntax of a visual programming language is based on relationships between graphical elements.
When designing syntax, a language designer might start by writing down examples of both legal and illegal strings, before trying to figure out the general rules from those examples. The goal is a set of general rules under which every legal string is accepted and every illegal string is rejected with an error or warning.[2]
Computer language syntax is generally distinguished into three levels:
- Words – the lexical level, determining how characters form tokens;
- Phrases – the grammar level, narrowly speaking, determining how tokens form phrases;
- Context – determining what objects or variables names refer to, whether types are valid, etc.
Distinguishing in this way yields modularity, allowing each level to be described and processed separately and often independently.
First, a lexer turns the linear sequence of characters into a linear sequence of tokens; this is known as "lexical analysis" or "lexing".[3]
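As a rough sketch of this step, the following C program tokenizes an S-expression-like input into parentheses, numbers, and symbols; the token names and the next_token interface are illustrative inventions, not taken from any real tool:

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical token kinds for an S-expression-like language. */
    typedef enum { TOK_LPAREN, TOK_RPAREN, TOK_NUMBER, TOK_SYMBOL, TOK_EOF } TokenKind;

    typedef struct {
        TokenKind kind;
        const char *start;   /* first character of the lexeme */
        int length;          /* number of characters in the lexeme */
    } Token;

    /* Scan one token starting at *src, advancing *src past it. */
    Token next_token(const char **src) {
        const char *p = *src;
        while (isspace((unsigned char)*p)) p++;          /* skip whitespace */
        Token t = { TOK_EOF, p, 0 };
        if (*p == '(')      { t.kind = TOK_LPAREN; t.length = 1; }
        else if (*p == ')') { t.kind = TOK_RPAREN; t.length = 1; }
        else if (isdigit((unsigned char)*p)) {           /* number: digit run */
            t.kind = TOK_NUMBER;
            while (isdigit((unsigned char)p[t.length])) t.length++;
        }
        else if (isalnum((unsigned char)*p)) {           /* symbol: alnum run */
            t.kind = TOK_SYMBOL;
            while (isalnum((unsigned char)p[t.length])) t.length++;
        }
        /* any other character ends the scan in this simplified sketch */
        *src = p + t.length;
        return t;
    }

    int main(void) {
        const char *input = "(A B C232 (1))";
        for (Token t = next_token(&input); t.kind != TOK_EOF; t = next_token(&input))
            printf("token: %.*s\n", t.length, t.start);
        return 0;
    }

Run on the input (A B C232 (1)), this prints one line per token, in order, with the hierarchical structure still flat; recovering that structure is the parser's job.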
Second, the parser turns the linear sequence of tokens into a hierarchical syntax tree; this is known as "parsing" narrowly speaking. It ensures that the sequence of tokens conforms to the formal grammar of the programming language. The parsing stage itself can be divided into two parts: the parse tree, or "concrete syntax tree", which is determined by the grammar but is generally far too detailed for practical use, and the abstract syntax tree (AST), which simplifies this into a usable form. The AST and contextual-analysis steps can be considered a form of semantic analysis, as they add meaning and interpretation to the syntax, or alternatively as informal, manual implementations of syntactical rules that would be difficult or awkward to describe or implement formally.
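For instance, a hand-rolled AST for the S-expression syntax described later in this article might keep only atoms and lists, discarding the literal parentheses and whitespace that a concrete parse tree would record; the type names here are hypothetical:

    /* Hypothetical AST node shapes for an S-expression syntax. A concrete
       parse tree would keep a node for every grammar production, including
       the literal parentheses; this AST keeps only the meaningful structure. */
    typedef enum { AST_NUMBER, AST_SYMBOL, AST_LIST } AstKind;

    typedef struct Ast {
        AstKind kind;
        union {
            long number;            /* AST_NUMBER: the numeric value        */
            const char *symbol;     /* AST_SYMBOL: the identifier text      */
            struct {                /* AST_LIST: children, parens discarded */
                struct Ast **items;
                int count;
            } list;
        } as;
    } Ast;

Under these definitions the input (1) would become a single AST_LIST node with one AST_NUMBER child, whereas a concrete syntax tree would also carry nodes for the two parentheses.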
Third, contextual analysis resolves names and checks types. This modularity is sometimes possible, but in many real-world languages an earlier step depends on a later step – for example, the lexer hack in C exists because tokenization depends on context. Even in these cases, syntactical analysis is often seen as approximating this ideal model.
The levels generally correspond to levels in the Chomsky hierarchy. Words are in a regular language, specified in the lexical grammar, which is a Type-3 grammar, generally given as regular expressions. Phrases are in a context-free language (CFL), generally a deterministic context-free language (DCFL), specified in a phrase structure grammar, which is a Type-2 grammar, generally given as production rules in Backus–Naur form (BNF). Phrase grammars are often specified in much more constrained grammars than full context-free grammars, in order to make them easier to parse; while the LR parser can parse any DCFL in linear time, the simpler LALR parser and even simpler LL parser are more efficient, but can only parse grammars whose production rules are suitably constrained. In principle, contextual structure can be described by a context-sensitive grammar, and automatically analyzed by means such as attribute grammars, though, in general, this step is done manually, via name resolution rules and type checking, and implemented via a symbol table which stores names and types for each scope.
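A minimal sketch of such a symbol table, written here in C as a linked list of scopes (real compilers usually use a per-scope hash table; all names are hypothetical):

    #include <stdio.h>
    #include <string.h>

    /* One declared name with its type. */
    typedef struct Entry {
        const char *name;
        const char *type;            /* e.g. "int", "double" */
        struct Entry *next;
    } Entry;

    /* A scope is a list of entries plus a link to the enclosing scope. */
    typedef struct Scope {
        Entry *entries;
        struct Scope *parent;        /* NULL at the top level */
    } Scope;

    /* Resolve a name in the current scope, then in enclosing scopes. */
    const Entry *resolve(const Scope *s, const char *name) {
        for (; s != NULL; s = s->parent)
            for (const Entry *e = s->entries; e != NULL; e = e->next)
                if (strcmp(e->name, name) == 0)
                    return e;
        return NULL;                 /* contextual error: undeclared name */
    }

    int main(void) {
        Entry gx = { "x", "int", NULL };
        Scope global = { &gx, NULL };
        Entry ly = { "y", "double", NULL };
        Scope local = { &ly, &global };

        const Entry *e = resolve(&local, "x");   /* found in enclosing scope */
        printf("x: %s\n", e ? e->type : "undeclared");
        printf("z: %s\n", resolve(&local, "z") ? "found" : "undeclared");
        return 0;
    }

Name resolution walks outward from the innermost scope, which is what gives nested scopes their shadowing behavior; type checking then consults the type stored with each resolved entry.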
Tools have been written that automatically generate a lexer from a lexical specification written in regular expressions and a parser from the phrase grammar written in BNF: this allows one to use declarative programming, rather than procedural or functional programming. A notable example is the lex-yacc pair. These automatically produce a concrete syntax tree; the parser writer must then manually write code describing how this is converted to an abstract syntax tree. Contextual analysis is also generally implemented manually. Despite the existence of these automatic tools, parsing is often implemented manually, for various reasons – perhaps the phrase structure is not context-free, or an alternative implementation improves performance or error reporting, or allows the grammar to be changed more easily. Parsers are often written in functional programming languages, such as Haskell, or in scripting languages, such as Python or Perl, or in imperative programming languages such as C or C++.

The syntax of textual programming languages is usually defined using a combination of regular expressions (for lexical structure) and Backus–Naur form (a metalanguage for grammatical structure) to inductively specify syntactic categories (nonterminals) and terminal symbols.[4] Syntactic categories are defined by rules called productions, which specify the values that belong to a particular syntactic category.[1] Terminal symbols are the concrete characters or strings of characters (for example keywords such as define, if, let, or void) from which syntactically valid programs are constructed.
Syntax can be divided into context-free syntax and context-sensitive syntax.[4] Context-free syntax consists of rules expressed in the metalanguage of the programming language; such rules are not constrained by the context surrounding that part of the syntax, whereas context-sensitive rules are.
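For instance, the following C fragment matches the phrase grammar for declarations, expressions, and statements, yet is rejected by the compiler because y is used without having been declared – a context-sensitive constraint:

    int main(void) {
        int x = 1;
        return x + y;   /* well-formed at the phrase level, but y is
                           undeclared: a context-sensitive (contextual) error */
    }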
A language can have different equivalent grammars, such as equivalent regular expressions (at the lexical level), or different phrase rules that generate the same language. Using a broader category of grammars, such as LR grammars, can allow shorter or simpler grammars compared with more restricted categories, such as LL grammars, which may require longer grammars with more rules. Different but equivalent phrase grammars yield different parse trees, though the underlying language (set of valid documents) is the same.
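As a small illustration in the grammar notation used below, the following two phrase rules (hypothetical, not drawn from any particular language) generate the same language of one or more atoms, yet produce left-leaning and right-leaning parse trees respectively:

    items = atom | items, atom
    items = atom | atom, items

The first rule is left-recursive and suits bottom-up (LR) parsing; the second is right-recursive and suits top-down (LL) parsing, which cannot handle left recursion directly.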
Below is a simple grammar, defined using the notation of regular expressions and Extended Backus–Naur form. It describes the syntax of S-expressions, a data syntax of the programming language Lisp, which defines productions for the syntactic categories expression, atom, number, symbol, and list:
    expression = atom | list
    atom       = number | symbol
    number     = [+-]?['0'-'9']+
    symbol     = ['A'-'Z']['A'-'Z''0'-'9'].*
    list       = '(', expression*, ')'
This grammar specifies the following:
- an expression is either an atom or a list;
- an atom is either a number or a symbol;
- a number is an unbroken sequence of one or more decimal digits, optionally preceded by a plus or minus sign;
- a symbol is a letter followed by zero or more of any characters (excluding whitespace); and
- a list is a matched pair of parentheses, with zero or more expressions inside it.
Here the decimal digits, upper- and lower-case characters, and parentheses are terminal symbols.
The following are examples of well-formed token sequences in this grammar: '12345', '()', '(A B C232 (1))'
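As an illustration of how such a grammar can be turned into a hand-written parser, the following C recursive-descent recognizer accepts these kinds of strings; for brevity it works directly on characters rather than on a separate token stream, and it simplifies the symbol rule to an upper-case letter followed by upper-case letters or digits (all names are hypothetical):

    #include <ctype.h>
    #include <stdio.h>

    static const char *p;                 /* current position in the input */

    static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

    static int parse_expression(void);    /* forward declaration */

    static int parse_atom(void) {
        if (*p == '+' || *p == '-') p++;                 /* optional sign */
        if (isdigit((unsigned char)*p)) {                /* number = [+-]?[0-9]+ */
            while (isdigit((unsigned char)*p)) p++;
            return 1;
        }
        if (isupper((unsigned char)*p)) {                /* symbol, simplified */
            while (isupper((unsigned char)*p) || isdigit((unsigned char)*p)) p++;
            return 1;
        }
        return 0;
    }

    static int parse_list(void) {                        /* list = '(' expression* ')' */
        if (*p != '(') return 0;
        p++;                                             /* consume '(' */
        skip_ws();
        while (*p != ')' && *p != '\0') {
            if (!parse_expression()) return 0;
            skip_ws();
        }
        if (*p != ')') return 0;
        p++;                                             /* consume ')' */
        return 1;
    }

    static int parse_expression(void) {                  /* expression = atom | list */
        return (*p == '(') ? parse_list() : parse_atom();
    }

    int main(void) {
        const char *samples[] = { "12345", "()", "(A B C232 (1))" };
        for (int i = 0; i < 3; i++) {
            p = samples[i];
            skip_ws();
            int ok = parse_expression();
            skip_ws();
            printf("%-16s %s\n", samples[i], (ok && *p == '\0') ? "valid" : "invalid");
        }
        return 0;
    }

Each grammar production becomes one function, and the choice between the alternatives of expression is made by looking at the next character – the one-symbol lookahead characteristic of LL parsing.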
The grammar needed to specify a programming language can be classified by its position in the Chomsky hierarchy. The phrase grammar of most programming languages can be specified using a Type-2 grammar, i.e., they are context-free grammars,[5] though the overall syntax is context-sensitive (due to variable declarations and nested scopes), hence Type-1. However, there are exceptions, and for some languages the phrase grammar is Type-0 (Turing-complete).
In some languages like Perl and Lisp the specification (or implementation) of the language allows constructs that execute during the parsing phase. Furthermore, these languages have constructs that allow the programmer to alter the behavior of the parser. This combination effectively blurs the distinction between parsing and execution, and makes syntax analysis an undecidable problem in these languages, meaning that the parsing phase may not finish. For example, in Perl it is possible to execute code during parsing using a BEGIN statement, and Perl function prototypes may alter the syntactic interpretation, and possibly even the syntactic validity, of the remaining code.[6][7] Colloquially this is referred to as "only Perl can parse Perl" (because code must be executed during parsing, and can modify the grammar), or more strongly "even Perl cannot parse Perl" (because it is undecidable). Similarly, Lisp macros introduced by the defmacro syntax also execute during parsing, meaning that a Lisp compiler must have an entire Lisp run-time system present. In contrast, C macros are merely string replacements, and do not require code execution.[8][9]
The syntax of a language describes the form of a valid program, but does not provide any information about the meaning of the program or the results of executing that program. The meaning given to a combination of symbols is handled by semantics (either formal or hard-coded in a reference implementation). Valid syntax must be established before semantics can make meaning out of it.[4] Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed per the language's rules, and may (depending on the language specification and the soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit undefined behavior. Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it.
Using natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence, or the sentence may be false:
- "Colorless green ideas sleep furiously." is grammatically well formed but has no generally accepted meaning.
- "John is a married bachelor." is grammatically well formed but expresses a meaning that cannot be true.
The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because p is a null pointer, the operations p->real and p->im have no meaning):
    complex *p = NULL;
    complex abs_p = sqrt(p->real * p->real + p->im * p->im);
As a simpler example,
    int x;
    printf("%d", x);
is syntactically valid, but not semantically defined, as it uses an uninitialized variable. Even though compilers for some programming languages (e.g., Java and C#) would detect uninitialized variable errors of this kind, they should be regarded as semantic errors rather than syntax errors.[10][11]