Backus-Naur Form (BNF) is a notation for the formal description of computer languages. Simply stated, it is a syntax for describing syntax. BNF and its derivatives (EBNF, ABNF, ...) are commonly used to describe a wide range of programming language specifications.
BNF is based on a declarative syntax that allows the user to define language constructions via "production rules". Each rule, except for "terminals", is a conjunction of a series of more detailed rules or terminals:
production_rule ::= <rule_1>...<rule_n> | <rule_n_1>...<rule_m>;
For example:
<digit> ::= <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>
<number> ::= <digit> | <digit> <number>
which means that a number is either a single digit or a single digit followed by another number.
BNFlite implements an embedded domain-specific language (EDSL) approach to grammar specification: all "production rules" are constructed as instances of C++ classes concatenated by means of overloaded arithmetic operators.
The previous example can be expressed in C++ form:
#include "bnflite.h" // include lib
using namespace bnf;
Lexem Digit = Token("0") | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
Lexem Number;
Number = Digit | Digit + Number;
Execution of this C++ code produces an internal data representation for further parsing.
For the previous example the terminal symbol "Token" can be specified like this:
Token Digit("0123456789"); // more compact and optimal than above
Token Digit('0', '9'); // one more compact form
One more way to create a Token:
Token NotPoint(1,127); NotPoint.Remove('.'); // all chars except point
Lexical productions are introduced as "Lexem" objects:
Token Letter('A', 'Z'); // Declare as upper-case
Letter.Add('a', 'z'); // append lower-case to token
Lexem LetterOrDigit = Letter | Digit;
Lexem OptionalIdentifier;
OptionalIdentifier = LetterOrDigit + OptionalIdentifier | Null();
Lexem Identifier = Letter + OptionalIdentifier;
The C++ '=' operator is used as the BNF '::='. The '+' operator implements sequential concatenation. The '|' operator implements the 'or' relation. The Null() object returns a positive production of zero length. By default the parser tests all alternative productions and selects the one consuming the longest input. Note that an "Identifier" cannot start with a digit, but it may consist of a single letter.
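For illustration, a quick check of these rules (a sketch; Analyze, the parser entry point, is introduced later in this document):
int ok  = Analyze(Identifier, "a1b2"); // accepted: a letter followed by letters and digits
int bad = Analyze(Identifier, "1ab");  // rejected: an identifier cannot start with a digit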
Let's assume we need to support identifiers of at most 32 characters, so a 33-character identifier should be treated as an error.
The recursive OptionalIdentifier construction above cannot express such a restriction. However, it can be replaced by cyclic (repetition) constructs.
BNFlite offers the following functions:
Series(a, token, b);
Iterate(a, lexem, b);
Repeat(a, rule, b);
where a and b mean at least a and at most b occurrences.
So we can use:
Lexem Number = Series(1, Digit); // 1 means at least one digit
Lexem Identifier = Letter + Iterate(0, LetterOrDigit, 32); // 0 means zero or more, at most 32
Additionally, BNFlite supports a compact (ABNF-like) style, introducing several overloaded operators to express repetitions (the '!' prefix seen in later examples marks an optional element).
Lexem Number = 1*Digit; // 1 means at least one digit
Lexem Identifier = Letter + *LetterOrDigit; // * means zero or more
"Rule" object is used to define syntax productions:
Rule Array = Identifier + "[" + Number + "]";
Now we can call the parser:
int tst = Analyze(Identifier, "b0[16]");
The parser tries to find the correspondence between the "Array" production and parts of the parsed text:
Identifier + "[" + Number + "]"
<=> "b0" + "[" + "16" + "]"
In practice, the parser descends through the composed BNFlite objects down to "Token" objects. Each Token either fails or succeeds against the current character. In case of success the upper object continues parsing with the next lower object. Otherwise, the upper object moves to the next alternative (|), or fails if no more alternatives are left.
Let's consider another example:
char* end;
int tst = Analyze(Identifier, "b[16];", &end);
The parser decomposes the text into "b" + "[" + "16" + "]". The ";" character is left over because there is no rule for it, so Analyze returns a negative value. The "end" variable contains the pointer to the unrecognized ";".
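A quick sketch of reporting where the parser stopped (using the variables above):
if (tst < 0)
    printf("parsing stopped at: \"%s\"\n", end); // prints: parsing stopped at: ";"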
Let's assume we need to parse the text buf[16] as a C-style array.
We can define it as a lexem:
Lexem Identifier = Letter + OptionalIdentifier;
Lexem Array = Identifier + "[" + Number + "]";
Or as a production rule:
Rule Identifier = Letter + OptionalIdentifier;
Rule Array = Identifier + "[" + Number + "]";
The result of Analyze(Array, "buf0\t[ 16 ]") depends on the type of Array.
For the "Rule" case the input text is successfully parsed.
For the "Lexem" case any spaces or tabs in the input text is treated as error.
The "Rule" behavior to ignore some predefined constructions can be changed by the user.
In this case a custom handler const char* pre_parse(const char* ptr) should be introduced and passed to the call: Analyze(Array, "buf [ 16 /*17*/ ]", pre_parse);
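The library leaves the skipping policy to this handler; below is a minimal sketch, assuming the handler receives the current text position and returns the position where regular parsing should resume (here it skips whitespace and C-style /*...*/ comments, matching the "buf [ 16 /*17*/ ]" example):
const char* pre_parse(const char* ptr)
{
    for (;;) {
        while (*ptr == ' ' || *ptr == '\t' || *ptr == '\r' || *ptr == '\n')
            ++ptr;                            // skip whitespace
        if (ptr[0] == '/' && ptr[1] == '*') { // skip a C-style comment
            ptr += 2;
            while (*ptr && !(ptr[0] == '*' && ptr[1] == '/'))
                ++ptr;
            if (*ptr) ptr += 2;
            continue;
        }
        return ptr;                           // nothing more to skip
    }
}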
Intermediate parsing results can be obtained via callbacks. Two kinds of callbacks are supported.
- A function with the prototype bool fun(const char*, size_t) can be used as an expression element:
bool SizeNumber(const char* number_string, size_t length_of_number)
{
    int number;
    std::istringstream iss(std::string(number_string, length_of_number));
    iss >> number;
    if (!iss.fail() && number > 0 && number < MAX_NUMBER) {
        return true;
    } else {
        std::cout << "bad size of array: ";
        std::cout.write(number_string, length_of_number);
        return false;
    }
}
//...
Rule Array = Identifier + "[" + Number + SizeNumber + "]";
The user callback has two parameters: the pointer to the latest matched element and its length.
In this example the SizeNumber callback receives the string of a decimal number. The user callback can perform some semantic analysis and return true to continue parsing or false to reject the current rule.
- Each "Rule" can be bound to a callback that is called when the rule succeeds. First of all, the user needs to define a working type for his data. This type is used for specialization of the BNFlite Interface template class to pass data between Rules:
typedef Interface<user_type> Usr;
static Usr SizeNumber(std::vector<Usr>& usr) // user callback function
{
    printf("Size of Array : %.*s;\n", (int)usr[2].length, usr[2].text);
    return usr[2];
}
Bind(Array, SizeNumber); // connection between the rule Array and the user callback
The callback receives a vector of Interface objects from the lower Rules and returns a single Interface object as a result. The final root result is delivered by the Analyze call:
Usr usr; // results after parsing
int tst = bnf::Analyze(Array, "b[16];", usr);
root - top Rule for parsing
text - text to be parsed
pstop - pointer to where the parser stops (*pstop != '\0' means not enough rules or resources)
u - top variable of the Interface template structure (see the second kind of callbacks)
u.text - pointer to the text to be parsed (copy of text)
u.length - final length of parsed data, returned after the Analyze call
u.data - final user data, returned after the Analyze call
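Putting the second kind of callback together, here is a minimal self-contained sketch. It assumes the user data type is simply int, that Interface exposes the text/length/data members listed above, and it uses a hypothetical callback name GetSize; the exact signatures should be checked against bnflite.h:
#include "bnflite.h"
#include <cstdio>
#include <vector>
using namespace bnf;

typedef Interface<int> Usr;               // assumption: plain int as the user data type

static Usr GetSize(std::vector<Usr>& usr) // hypothetical callback name
{
    // usr[2] corresponds to Number, the third element of the Array production
    printf("Size of Array : %.*s;\n", (int)usr[2].length, usr[2].text);
    return usr[2];
}

int main()
{
    Token Letter('A', 'Z');
    Letter.Add('a', 'z');                 // upper- and lower-case letters
    Token Digit('0', '9');
    Lexem LetterOrDigit = Letter | Digit;
    Lexem Identifier = Letter + *LetterOrDigit;
    Lexem Number = 1*Digit;
    Rule Array = Identifier + "[" + Number + "]";

    Bind(Array, GetSize);                 // call GetSize whenever Array succeeds
    Usr usr;                              // receives the final result
    int tst = Analyze(Array, "b[16];", usr);
    return tst < 0;                       // a negative value signals a parsing error
}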
Analyze() returns a negative value in case of a parsing error. Bit fields of the returned value can provide more information about the parser behavior.
Generally, BNFlite utilizes a simple top-down parser with backtracking. Such a parser may not perform well for complex grammars. However, the user has several special rules to make parsing smarter.
Return() - choose the current production (should be the last element in a conjunction rule)
AcceptFirst() - choose the first appropriate production (should be the first element in a disjunction rule)
Skip() - accept the result but not the production itself (cannot be first)
In some cases the less optimal MemRule() can be used to memorize unsuccessful parsing attempts and reduce the known overhead of backtracking.
Writing a grammar with the EDSL is easier if the user has some understanding of the parser's behavior.
If the Analyze call returns an error, the user should always take into consideration both possibilities:
- syntax errors in the input text (incorrect text)
- grammar bugs (incorrect rules)
BNFlite provides several mechanisms to minimize debugging and testing overheads.
The return code of the Analyze call can contain flags related to the grammar:
eBadRule, eBadLexem - the rule tree is not properly built
eEof - "unexpected end of file"; in most cases this is OK, there is just not enough text for the applied rules
eSyntax - syntax error (controlled by the user)
eOver - too much data for cycle rules
eRest - not all text has been parsed
eNull - no result
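For example, assuming these flags are bit masks that can be tested against the returned value with bitwise AND (an assumption; check the actual definitions in bnflite.h):
int tst = Analyze(Array, "buf [ 16", pre_parse); // truncated input
if (tst < 0) {
    if (tst & eSyntax) printf("syntax error reported by a user rule\n");
    if (tst & eRest)   printf("not all text has been parsed\n");
    if (tst & eEof)    printf("input ended before the rules were satisfied\n");
}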
The user can assign an internal name to a Rule object via setName(). It can help to track the recursive descent parser by looking at Rule::_parse function calls. The debugger stack (history of function calls) can show which Rule was applied and when; the user just needs to watch the this->name variable. It is not as difficult as it seems at first glance.
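For example (a sketch; the exact setName() signature may differ, but giving a Rule the same name as its C++ variable makes debugger traces readable):
Rule Array = Identifier + "[" + Number + "]";
Array.setName("Array"); // this->name == "Array" inside Rule::_parse while stepping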
The user can divide a complex grammar into several parts to develop them independently. The Analyze function can be applied as a unit test to any Rule representing a subset of the grammar.
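For instance, each sub-rule can be exercised separately before the full grammar is assembled (a sketch reusing the definitions above):
// unit-test style checks of grammar subsets
int t1 = Analyze(Number, "123");       // expected to succeed
int t2 = Analyze(Identifier, "buf0");  // expected to succeed
int t3 = Analyze(Array, "buf0[16]");   // expected to succeed
int t4 = Analyze(Array, "buf0[16");    // expected to fail (missing ']')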
The first kind of callback, a function with the prototype bool foo(const char* lexem, size_t len), can be used in BNFlite rules to obtain intermediate results.
This function will print the parsed number:
static bool DebugNumber(const char* lexem, size_t len)
{ printf("The number is: %.*s;\n", (int)len, lexem); return true; }
/* … */
Token digit1_9('1', '9');
Token DIGIT("0123456789");
Lexem I_DIGIT = 1*DIGIT; // Series(1, DIGIT);
Lexem exp_ = Token("Ee") + !Token("+-") + I_DIGIT;
Lexem frac_ = "." + I_DIGIT;
Lexem int_ = "0" | digit1_9 + *DIGIT; //Series(0, DIGIT);
Lexem num_ = !Token("-") + int_ + !frac_ + !exp_;
Rule number = num_ + DebugNumber;
The function needs to return true because the result is correct and parsing should continue.
The first kind of callback can also be used in BNFlite rules to report questionable situations. Let's assume numbers with a leading 0 can be parsed but are not wanted:
static bool Check0Number(const char* lexem, size_t len)
{ printf("Warning: the number %.*s with leading zero found\n", (int)len, lexem); return true; }
/* … */
Lexem num_0 = "0" + digit1_9 + *DIGIT; // special rule for numbers with one leading zero
Rule number = num_ + DebugNumber | num_0 + Check0Number;
The function should still return true to allow the parser to accept such numbers.
C++11 constructions like the one below are also possible:
Rule number = num_ | num_0 + [](const char* lexem, size_t len)
    { return !!printf("Warning: the number %.*s with leading zero found\n", (int)len, lexem); };
In some cases we need to force the parser to stop because the input text is definitely erroneous. Let's assume that numbers with several leading zeros should be treated as errors:
static bool Error00Number(const char* lexem, size_t len)
{ printf("Error with leading zeros: %.*s\n", (int)len, lexem); return false; }
/* … */
Lexem num_00 = Series(2, "0") + digit1_9 + *DIGIT; // special rule for numbers with two or more leading zeros
Rule number = num_ + DebugNumber | num_00 + Error00Number + Catch() | num_0 + Check0Number;
The function should return false because the text must not be accepted. The Catch() special library statement forces the parser to stop and return the eSyntax error flag.
Note:
- The order num_, num_00, num_0 is important because num_0 is a subset of num_00.
Below is an extended example for hexadecimal digits:
Token hex_digit("abcdefABCDEF", DIGIT);
Lexem HexDigits = Token("0") + Token("Xx") + Series(1, hex_digit);
Let's assume the parser tries to apply the HexDigits rule to the 0.0 and 0xZ text elements. Neither element fits, but 0.0 can be appropriate for some other rule of the grammar, while 0xZ is a syntax error: we definitely know there is no rule for it. So we should use the Try() special statement; once the production prefix before Try() has matched, the rest must match as well, otherwise the parser stops with the eSyntax error.
In other words, Try() forces the eSyntax error if the input text cannot be matched after the Try() special rule.
Lexem HexDigits = Token("0") + Token("Xx") + Try() + Series(1, hex_digit);
If HexDigits fails after Try(), the internal catch_error handler is called and Analyze returns the eSyntax flag.