MySQL grammar in ANTLR 4

The main purpose of a web application firewall is to analyze and filter traffic relevant to an application or a class of applications, such as web applications or database management systems (DBMS). A firewall needs to speak the language of the application it is protecting. For a relational DBMS, the language in question will be an SQL dialect.

Let us assume that the task is to build a firewall to protect a DBMS. In this case, the firewall must recognize and analyze SQL statements in order to determine whether they comply with the security policy. The depth of analysis depends on the task required (for example, detection of SQL injection attacks, access control, or correlation of SQL and HTTP requests). In any case, the firewall must perform lexical, syntactic, and semantic analysis of SQL statements.

Contents

Introduction


Formal grammar

With a formal grammar of a language, we can obtain a full picture of how the language is structured and analyze it. Formal grammars help to create statements and recognize them using a syntactic analyzer.

According to the Chomsky hierarchy, there are four types of languages and, consequently, four types of grammars. Grammars are differentiated by their derivations. MySQL is a context-sensitive language. However, a context-sensitive grammar alone cannot yield a large number of language patterns. Generally, a context-free grammar is sufficient for generating the language patterns used in practice. This article describes development of a context-free grammar for MySQL.

Key terms

A language is defined on the basis of its alphabet, that is to say a set of symbols. Letters of the alphabet are united into meaningful sequences called lexemes. There are different types of lexemes (for example, identifiers, strings, and keywords). A token is a tuple consisting of a lexeme and a type name. A phrase is a sequence of specifically arranged lexemes. Phrases can be used to create statements. A statement refers to a complete sequence of lexemes that has independent meaning in the context of a given language. The notion statement has applied significance only.

An application based on any given language makes use of statements in that language, such as by running or interpreting these statements. From an applied point of view, a phrase is an incomplete structure, which can be a part of a statement. However, phrases are more useful for generating a grammar. Phrases and statements are similar from the point of view of the grammar—they follow certain of its rules with nonterminals on their right side.

Use of a language implies formation or recognition of statements. Recognition refers to the capability, for any sequence of lexemes received as input, to provide an answer the question "Does this sequence constitute a set of valid statements in this language?" as output.

MySQL language

The MySQL language is an SQL dialect used to write queries for the MySQL DBMS. SQL dialects follow the ISO/IEC 9075 Information technology – Database languages – SQL standard (or, strictly speaking, series of standards).

The MySQL dialect is a specific implementation of this standard with particular limitations and additions. Most MySQL statements can be described using a context-free grammar, but some statements require a context-sensitive grammar for their description. Put simply, if a lexeme can affect the recognition of subsequent phrases, then the phrase must be described by a rule of a context-sensitive grammar.

Some MySQL expressions are formed using this approach. For example:

DELIMITER SOME_LITERAL

In this case it is required to memorize SOME_LITERAL, because this literal will be used in subsequent statements instead of ; to mark their ending.

In a procedural extension, loop and compound statements can be tagged with labels having the following structure:

label somephrases label

In this case, label identifiers should be identical. Such a statement can be built only using a context-sensitive grammar.

ANTLR

We selected ANTLR as the parser generator for developing a MySQL parser. ANTLR has the following advantages to recommend it:



ANTLR uses a two-step algorithm for generating recognition code. The first step is to describe the lexical structure of a language, i.e., determine what the tokens are. The second step is to describe the syntactic structure of the language by grouping the tokens from the previous step into statements. Lexical and syntactic structures in ANTLR are described by rules. The lexical structure is defined by the type (lexeme descriptor) and value. To describe the value, a language with regular expression elements, but supporting recursion, is used. The syntactic structure rule is composed of lexeme descriptors based on the statement composition rules in ANTLR 4, which allow defining the structure of lexeme arrangement in a statement or a phrase within a statement.

When creating rules, the core principle of lexical analysis (to which ANTLR is no exception) should be taken into account. A lexer starts by recognizing the longest sequence of symbols in the input stream that can be described by any lexical rule. If multiple matching rules exist, the one with highest precedence is applied.

Without using semantic predicates in ANTLR, only a context-free grammar can be built. An advantage in this case is that such a grammar does not depend on the runtime environment. The grammar proposed in this article for the MySQL dialect is built without using semantic predicates.

Lexer


Getting started

The first step in developing a grammar is to define a list of the lexeme types that occur in the language. The recognizer accepts alphabet symbols, from which it must form lexemes; symbols that are not used as a part of lexemes, such as spaces and comments, can be filtered out. Thanks to filtering, only meaningful lexemes of the language are set aside for further analysis. Spaces and comments can be filtered out as follows:

SPACE: [ \t\r\n]+ -> channel(HIDDEN); COMMENT_INPUT: '/*' .*? '*/' -> channel(HIDDEN); LINE_COMMENT: ('-- ' | '#') ~[\r\n]* ('\r'? '\n' | EOF) -> channel(HIDDEN);

Potential lexical errors can also be taken into account, and unknown symbols can be omitted:

ERROR_RECONGNIGION: . -> channel(ERRORCHANNEL);

If a symbol is not recognized by any lexical rule, it is recognized by the rule ERROR_RECOGNITION. This rule is placed at the end of the grammar, giving it the lowest priority.

Now we can start identifying lexemes. Lexemes can be classified under the following types:
  • Keywords
  • Identifiers
  • Literals
  • Special symbols

If there is no obvious (or implicit) intersection between these types of lexemes in the language, it is only required to describe all lexemes. However, if there are any intersections, they should be resolved. The situation becomes complicated because some lexemes require a regular grammar for recognition. In MySQL, this is an issue for identifiers with a dot (fully qualified name) and keywords that can be identifiers.

Identifiers with a dot

Recognition of such MySQL lexemes as identifiers starting with a numeral has certain difficulties: the "." symbol can occur in both full column names and real literals:

select table_name.column_name as full_column_name ...

select 1.e as real_number ...

Therefore, it is required to recognize correctly a full column name in the first case, and a real literal in the second case. An intersection here is caused by the fact that identifiers in MySQL can start with numerals.

MySQL sees the phrase

someTableName.1SomeColumn

as a sequence of three tokens:

(someTableName, identifier), (. , dot delimeter), (1SomeColumn, identifier)

In this example, it is quite natural to use the following rules for recognition:

DOT: .; ID_LITERAL: [0-9]*[a-zA-Z_$][a-zA-Z_$0-9]*;

and the following rule for numerals:

DECIMAL_LITERAL: [0-9]+;

Tokenization results in a sequence of four tokens:

(someTableName, identifier), (. , dot delimeter), (1, число), (SomeColumn, identifier)

To avoid ambiguity, an auxiliary structure can be introduced to recognize identifiers:

fragment ID_LITERAL: [0-9]*[a-zA-Z_$][a-zA-Z_$0-9]*;

and prioritized rules can be defined:

DOT_ID: '.' ID_LITERAL; ... ID: ID_LITERAL; ... DOT: '.'

Since ANTLR recognizes sequences of maximum length, "." will surely not be recognized as a separate symbol.

Strings

Using strings as an example, we can illustrate one more rule of lexical analysis in ANTLR. A string in MySQL is a sequence of almost any characters put between single or double quotation marks. Strings put in single quotation marks cannot contain a single backslash and a quotation mark, because a lexer would not be able to determine where the string ends. If such characters have to be used, a single quote is replaced by two consecutive single quotes to escape these characters. Moreover, an escape character inside a string cannot be unaccompanied, since there is supposed to be something to be escaped. Therefore, use of this character by itself without an accompanying character should also be prohibited. As a result, we obtain the following fragment of a lexical rule:

fragment SQUOTA_STRING: '\'' ('\\'. | '\'\'' | ~('\'' | '\\'))* '\'';
  • '\\'. allows a backslash and the symbol it escapes.
  • '\'\'' allows a sequence of two single quotation marks.
  • ~('\'' | '\\') prohibits a standalone single quotation mark or escape character.

Keywords

An ANTLR lexer, by contrast with a parser, applies rules in order of precedence. Rules that have been defined earlier have a higher priority than those described later. This approach gives a clear instruction for rule sorting: higher priority is given to specific rules (defining keywords and special symbols), followed by general rules (for recognizing literals, variables, identifiers, etc.).

Special comment type in MySQL

MySQL uses a special comment style that can span multiple lines. Such comments allow creating queries compatible with other DBMSs with no need to follow MySQL-specific requirements. When generating a query, MySQL will analyze the text in such comments. To recognize special MySQL comments, we can use the following rule:

SPEC_MYSQL_COMMENT: '/*!' .+? '*/' -> channel(MYSQLCOMMENT);

However, using this rule by itself is not enough for correctly parsing queries.

Assume that a query of the following kind is received:

select name, info /*!, secret_info */ from users;

Applying the above-mentioned rule, we obtain the following sequence of tokens:

(SELECT, 'select') (ID, 'name') (COMMA, ',') (ID, 'info') (SPEC_MYSQL_COMMENT, '/*!, secret_info */') (FROM, 'from') (ID, 'users') (SEMI, ';')

Whereas the standard MySQL lexer recognizes slightly different tokens:

(SELECT, 'select') (ID, 'name') (COMMA, ',') (ID, 'info') (COMMA, ',') (ID, 'secret_info') (FROM, 'from') (ID, 'users') (SEMI, ';')

That is why correct recognition of comments written in the unique MySQL style requires additional processing:
  1. Source text is recognized by a special lexer for preprocessing.
  2. Values are extracted from the SPEC_MYSQL_COMMENT tokens and a new text is created, which will be processed only by a MySQL server.
  3. The newly created text is processed using an ordinary parser and lexer.

A lexer for preprocessing splits the input stream into phrases that are part of:

  • Special comments (SPEC_MYSQL_COMMENT)
  • Main queries (TEXT)

The rules can be arranged in the following way:

lexer grammar mysqlPreprocessorLexer; channels { MYSQLCOMMENT } TEXT: ~'/'+; SPEC_MYSQL_COMMENT: '/*!' .+? '*/'; //-> channel(MYSQLCOMMENT);SLASH: '/' -> type(TEXT);

A pre-lexer splits query code into a sequence of the SPEC_MYSQL_COMMENT and TEXT tokens. If a MySQL statement is processed, values extracted from the SPEC_MYSQL_COMMENT tokens are combined with values of the TEXT tokens. Then the resulting text is processed by the standard MySQL lexer. If another SQL dialect is used, the SPEC_MYSQL_COMMENT tokens are simply removed or set aside.

Case insensitivity

Almost all lexemes in MySQL are case-insensitive, which means that the following two queries are identical:

select * from t; SelECT * fROm t;

Unfortunately, ANTLR does not support case-insensitive tokens. Therefore it is necessary to apply the following entry with fragment tokens, which are used to build actual tokens:

SELECT: S E L E C T; FROM: F R O M; fragment S: [sS]; fragment E: [eE];

This makes a grammar less readable. Moreover, a lexer has to select one of two variants for each symbol—upper or lower case—which has a negative impact on performance.

To make lexer code cleaner and improve performance, the input stream should be normalized, meaning that all symbols should be in the same case (upper or lower). ANTLR supports a special stream that disregards case during lexical analysis, but retains the cases of original token values. These tokens can be used in tree traversal.

Implementation of such a stream for various runtimes has been suggested by @KvanTTT. The implementation can be found in the DAGE project, a cross-platform editor of ANTLR 4 grammars.
As a result, all lexemes are written either in lower or in upper case. Because normally SQL keywords in queries are written in upper case, it was decided to use upper case for the grammar:

SELECT: "SELECT"; FROM: "FROM";

Parser

To describe the syntactic structure of a language, the sequence of the following components should be defined:
  • Statements in a text
  • Phrases in a statement
  • Lexemes and phrases inside larger phrases

Text structure in MySQL

There is an excellent MySQL grammar description, though it is spread across the whole reference guide. The arrangement of statements in a text is given in the section that describes the MySQL client/server messaging protocol. We can see that all statements, except possibly the last one, use a semicolon (;) as the delimiter. Moreover, there is a peculiarity about inline comments: the last sentence in a text can end with such a comment. As a result, it turns out that any valid sequence of statements in MySQL can be represented in the following form:

root : sqlStatements? MINUSMINUS? EOF ; sqlStatements : (sqlStatement MINUSMINUS? SEMI | emptyStatement)* (sqlStatement (MINUSMINUS? SEMI)? | emptyStatement) ;... MINUSMINUS:

A context-free grammar is not powerful enough to ensure full-fledged support of these rules, because a MySQL client can use the DELIMITER command to set the current delimiter. In this case, it is required to memorize and use the delimiter in other rules. Thus, if we use the DELIMITER directive, SQL statements that are written correctly will not be recognized by the grammar under discussion.

Types of MySQL statements

MySQL statements can be of the following types:
  • DDL statements
  • DML statements
  • Transaction statements
  • Replication statements
  • Prepared statements
  • Server administration statements
  • Utility statements
  • Procedural extension statements

The root rule for statements, based on the MySQL documentation, looks as follows:


sqlStatement : ddlStatement | dmlStatement | transactionStatement | replicationStatement | preparedStatement | administrationStatement | utilityStatement

There is also an empty statement that consists of a single semicolon:

empty_statement : SEMI ; SEMI: ';';

The subsequent chapters of the official documentation have been similarly transformed into ANTLR rules.

SELECT

The SELECT statement is probably the most interesting and wide-ranging statement in both MySQL and SQL in general. When writing the grammar, our main focus went towards the following tasks:
  • Description of tables
  • Description of expressions
  • Combination using UNION

Let us start with description of tables. MySQL has a rather elaborate description of what can be used in the FROM field of SELECT queries (which here we'll call "table references"). Careful study and testing on actively used versions reveals that table references have the following structure:

Table object 1, Table object 2, …, Table object N

where a "table object" is one of four structures:
  • A separate table
  • Joined tables
  • A subquery
  • Table references in parentheses

If we start from less general, we get a table object inductively recognized as a table or a table-based structure. The latter can be:

• Joined tables
• A subquery
• A sequence of table objects in parentheses

Then a sequence of table objects, consisting of at least one table object, is detected in the FROM field. Of course, the grammar describes additional structures, such as "connection conditions" and references to partitions (PARTITION), but the general structure is as follows:

tableSources : tableSource (',' tableSource)* ; tableSource : tableSourceItem joinPart* | '(' tableSourceItem joinPart* ')' ; tableSourceItem : tableName (PARTITION '(' uidList ')' )? (AS? alias=uid)? (indexHint (',' indexHint)* )? #atomTableItem | (subquery | '(' parenthesisSubquery=subquery ')') AS? alias=uid #subqueryTableItem | '(' tableSources ')' #tableSourcesItem ;

Expressions

Expressions are widely used in MySQL wherever it is required to evaluate a value (value vector). Inductively, an expression can be defined as follows:
  • An expression is any lexeme that is:
    • a constant (literal) value
    • a variable
    • an object identifier
  • An expression is a superposition of expressions that have been united by transformations.
Transformations include operations, operators (including set-theoretic and comparison operators), functions, queries, and parentheses.

UNION

Unlike other dialects, MySQL has only two set-theoretic operations on tables. The first one, JOIN, has already been considered. Empirically, we found that the description of UNION in the official documentation is incomplete. We built upon it in the following way:

selectStatement : querySpecification lockClause? #simpleSelect | queryExpression lockClause? #parenthesisSelect | querySpecificationNointo unionStatement+ ( UNION (ALL | DISTINCT)? (querySpecification | queryExpression) )? orderByClause? limitClause? lockClause? #unionSelect | queryExpressionNointo unionParenthesis+ ( UNION (ALL | DISTINCT)? queryExpression )? orderByClause? limitClause? lockClause? #unionParenthesisSelect ;

If UNION is used, individual queries can be enclosed in parentheses. Using parentheses is not essential, unless queries use ORDER BY and LIMIT. However, if the first query in UNION is in parentheses, they should be used for all subsequent queries as well.

Incorrect:

(select 1) union select 2; (select 1) union (select 2) union select 3;

Correct:

(((select 1))) union (select 2); (select 1) union ((select 2)) union (select 3);

Use of grammar

A grammar is written to solve tasks relevant to syntactic and lexical analysis. On the one hand, recognition is required to be performed as quickly as possible; on the other hand, any developed applications should be able to take advantage of lexer and parser code without compromising functionality and performance.

An application that uses a parser most likely applies either the Visitor or Observer design pattern. Both patterns involve analysis of a defined subset of the nodes of a parse tree. Nodes of a parse tree, other than leaf nodes, correspond to certain syntax rules. To analyze nodes of a parse tree, we must look at the child nodes, which could be either individual nodes or groups of nodes, that correspond to fragments of a parent rule.

A critical condition for developing a good grammar is the ability to gain "easy" access to any part of the rule. Intuitively, "easy" access can be described as the possibility to get any given part as an object without searching and iterating. This is implemented in ANTLR by means of such entities as alternative and element labels. Alternative labels allow splitting a complex rule into alternative phrases and, if the Visitor pattern is used, processing each of them using a separate method. For example, a table object in MySQL can be defined by the following rule:

tableSourceItem : tableName (PARTITION '(' uidList ')' )? (AS? alias=uid)? (indexHint (',' indexHint)* )? | (subquery | '(' parenthesisSubquery=subquery ')') AS? alias=uid | '(' tableSources ')' ;

We can see that a table object is defined as one of three possible variants:
  • A table
  • A subquery
  • A sequence of table objects in parentheses

Therefore, instead of processing the whole structure, alternative labels are used, creating the possibility to process each variant independently of the others:

tableSourceItem : tableName (PARTITION '(' uidList ')' )? (AS? alias=uid)? (indexHint (',' indexHint)* )? #atomTableItem | (subquery | '(' parenthesisSubquery=subquery ')') AS? alias=uid #subqueryTableItem | '(' tableSources ')' #tableSourcesItem ;

Element labels are used to label individual nonterminals and their sequences. These labels allow accessing the content of a rule context as a field with a certain name. Thus, instead of extracting a certain content element from some context, an element label would be sufficient. By contrast, extraction depends on the rule structure. The more elaborate a rule, the more complicated extraction becomes.

For example, the rule:

loadXmlStatement : LOAD XML (LOW_PRIORITY | CONCURRENT)? LOCAL? INFILE STRING_LITERAL (REPLACE | IGNORE)? INTO TABLE tableName (CHARACTER SET charsetName)? (ROWS IDENTIFIED BY '<' STRING_LITERAL '>')? ( IGNORE decimalLiteral (LINES | ROWS) )? ( '(' assignmentField (',' assignmentField)* ')' )? (SET updatedElement (',' updatedElement)*)? ;

requires extracting the tag name that identifies strings imported by the LOAD XML operator. There is also the need to identify the conditions that would determine the specific form of the LOAD XML operator:
  • Is any priority explicitly set for the operator? If yes, then what priority?
  • Which string append mode will be used by the operator?
  • Exactly what kind of syntax will be used if the syntax is used to ignore several initial strings during import?

To obtain the required values immediately in code without any extraction, element labels can be used:

loadXmlStatement : LOAD XML priority=(LOW_PRIORITY | CONCURRENT)? LOCAL? INFILE file=STRING_LITERAL violation=(REPLACE | IGNORE)? INTO TABLE tableName (CHARACTER SET charsetName)? (ROWS IDENTIFIED BY '<' tag=STRING_LITERAL '>')? ( IGNORE decimalLiteral linesFormat=(LINES | ROWS) )? ( '(' assignmentField (',' assignmentField)* ')' )? (SET updatedElement (',' updatedElement)*)? ;

Simplifying code of the target application simplifies the grammar as well, because the names of alternative labels improve readability.

Conclusion

Developing grammars for SQL languages is quite challenging, because they are case-insensitive and contain a large number of keywords, ambiguities, and context-sensitive structures. In particular, while developing our MySQL grammar, we implemented processing of special types of comments, developed a lexer able to differentiate between identifiers with a dot and real literals, and wrote the parser grammar, which covers the majority of the MySQL syntax described in the documentation. The MySQL grammar developed by us can be used to recognize queries generated by WordPress and Bitrix, as well as other applications that do not require exact processing of context-sensitive cases. The grammar source files are available in the official grammar repository under the MIT license.

Author: Ivan Khudyashov, Positive Technologies

Article Link: http://blog.ptsecurity.com/2018/01/mysql-grammar-in-antlr-4.html