Parsek provides parser combinators (kombinators, in fact) for Kotlin. With the library you can easily implement parsers and lexers. It's a multiplatform library, so it can be used not only in JVM projects, but also in Kotlin/JS and Kotlin/Native.
To use the library, it's enough to include the dependency from Maven Central.

Gradle:

```kotlin
dependencies {
    implementation("io.github.kkarnauk:parsek:0.1")
}
```

Maven:

```xml
<dependency>
    <groupId>io.github.kkarnauk</groupId>
    <artifactId>parsek</artifactId>
    <version>0.1</version>
</dependency>
```
Let's start with an easy task. For instance, you want to parse an integer. There are several ways to do that. The first one:
```kotlin
object IntegerGrammar : Grammar<Int>() {
    val digits by chars { it.isDigit() } // define a token for a sequence of digits
    val minus by char('-') // define a token for a minus
    val positiveInteger by digits map { it.text.toInt() } // parser for a positive integer
    val integer by (-minus seq positiveInteger map { -it }) alt positiveInteger // parser for an integer

    override val parser by integer
}
```
The second one is much simpler:
```kotlin
object IntegerGrammar : Grammar<Int>() {
    private val integer by regex("-?\\d+") // define a token for an entire integer

    override val parser by integer map { it.text.toInt() } // map the text into an integer
}
```
And now, if you want to parse an integer in your program, you can just do something like this:

```kotlin
val integer = IntegerGrammar.parse("-42")
```
Yes, that's a rather simple example, isn't it? Let's try something bigger. For instance, you want to parse an arithmetic expression with addition, multiplication, exponentiation and parentheses. So:
```kotlin
object ArithmeticExpressionGrammar : Grammar<ArithmeticExpression>() {
    val num by regex("-?\\d+")
    val sum by char('+')
    val mul by char('*')
    val pow by text("**")
    val lpar by char('(')
    val rpar by char(')')
    val whitespaces by chars { it.isWhitespace() }.ignored()

    val primary by (num map { Value(it.text.toInt()) }) alt (-lpar seq ref(::parser) seq -rpar)
    val pows by rightAssociative(primary, pow) map { cur, res -> Power(cur, res) }
    val mults by leftAssociative(pows, mul) map { res, cur -> Mul(cur, res) }
    val sums by leftAssociative(mults, sum) map { res, cur -> Sum(cur, res) }

    override val parser by sums
}
```
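The grammar refers to AST classes that the snippet doesn't define. A minimal sketch of what they might look like (the exact shapes are assumptions; only the constructor calls above are given):

```kotlin
// Assumed AST for the grammar above; derived from Value(...), Sum(...),
// Mul(...) and Power(...) as they are used in the map blocks.
sealed interface ArithmeticExpression
data class Value(val value: Int) : ArithmeticExpression
data class Sum(val left: ArithmeticExpression, val right: ArithmeticExpression) : ArithmeticExpression
data class Mul(val left: ArithmeticExpression, val right: ArithmeticExpression) : ArithmeticExpression
data class Power(val base: ArithmeticExpression, val exponent: ArithmeticExpression) : ArithmeticExpression
```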
There are a few things here that may not be obvious, like the unary minuses. So let's go through them.
- First, we declared the main tokens. Note that we've used `ignored()` on the `whitespaces` token, so tokens of this type will not be passed to the parser; therefore we don't need to care about whitespace at all.
- After that, we've declared the `primary` rule. It actually consists of two different rules, separated by the `alt` combinator, which means "first try the first parser, and if it fails, try the other one".
- The `map` combinator transforms the result of a parser if it succeeds.
- The `seq` combinator means "to succeed, I need the first parser to succeed, and after that the other one". The unary minus says that we don't care about the result of that parser and it must be skipped if it succeeds. If we didn't use the unary minus, we would get a `Pair<T, S>` as the result of successful parsing (see the sketch after this list).
- The `ref` combinator is used to refer to not-yet-initialized parsers. We cannot use `parser` directly, because it isn't initialized yet.
- `rightAssociative` and `leftAssociative` perform `foldr` and `foldl` on the results of the consecutive parsers `item`, `separator`, `item` and so on. For example, `item = primary` and `separator = pow` here.
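To make the unary minus concrete, here is a small sketch reusing `minus` and `positiveInteger` from IntegerGrammar above (the exact result types are assumptions about the library's generics):

```kotlin
// Inside a Grammar: `seq` keeps both results, the unary minus drops one.
val signed by minus seq positiveInteger               // both results kept: a pair of (minus token, Int)
val negated by -minus seq positiveInteger map { -it } // minus is skipped: the result is just Int
```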
After it's all done, we can use the grammar in the following way:
```kotlin
val expression = ArithmeticExpressionGrammar.parse("1 + 2 ** 3 ** 4 * 12 + 3 * (1 + 2) ** 2 ** 3")
```
A token represents a matched part of an input. A parser doesn't consume the raw input; it consumes a sequence of tokens.
Each token can tell you:
- The type
- The input used to produce it
- The length of the matched part in the input
- The location where it was matched: the offset, row and column of the match in the input
- The substring of the input matched by the token
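As an illustration, here is a hypothetical way to inspect a token. Apart from `text` (used in the grammars above via `it.text`), the property names are assumptions based on this list:

```kotlin
val token = producer.nextToken()!! // producers are described below
println(token.type)   // the token type (assumed property name)
println(token.length) // the length of the matched part (assumed property name)
println(token.text)   // the substring matched by the token
```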
The important point is that you don't have to produce tokens on your own. This task is handled by producers.
A token type introduces a family of tokens. For example, you may have the input `Hello, my friend!` and a token type `any non-empty sequence of letters`. Then there are several tokens of that type: `Hello`, `my` and `friend`.
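With the char-predicate provider described below, such a token type can be declared in one line (the name `letters` is ours, for illustration):

```kotlin
val letters by chars { it.isLetter() } // matches "Hello", "my" and "friend" in the input above
```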
Each token type can tell you:
- The name (which can be helpful while debugging)
- Whether tokens of that type are ignored by a parser: if `true`, then such tokens are not passed to a parser (but they still consume the input!)
Each token type has a match method with the signature:

```kotlin
fun match(input: CharSequence, fromIndex: Int): Int
```

It starts matching `input` from `fromIndex` and returns the number of matched chars. The match is considered successful if and only if the result is not 0.
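For instance, here is a sketch of such a `match` function for a token type matching a run of digits (plugging a custom match in is what the general token types below are for):

```kotlin
// Matches the maximal run of digits starting at fromIndex; 0 means "no match".
fun match(input: CharSequence, fromIndex: Int): Int {
    var index = fromIndex
    while (index < input.length && input[index].isDigit()) {
        index++
    }
    return index - fromIndex
}
```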
The good news is that lots of token types are already implemented. Check them out:

- Text. Has parameters `text: String` and `ignoreCase: Boolean`, and matches only the strings that are equal to `text` up to `ignoreCase`.
- Char. Has parameters `char: Char` and `ignoreCase: Boolean`, and matches only the strings that are equal to the single `char` up to `ignoreCase`.
- Regex. Has parameters `regex: String` and `options: Set<RegexOption>`. Compiles `regex` with `options` and anchors it (with something like `^`) to make it match from the start.
- Char predicate. Accepts a lambda `(Char) -> Boolean` and matches chars while the lambda returns `true`.
- General token type. There are two very general token types:
  - The first one accepts a lambda `(String, Int) -> Int` and creates a token type whose match function is replaced with the given lambda.
  - The second one accepts a lambda `(String) -> Int`; inside `match`, it creates a view on the `input` starting at `fromIndex` and invokes the given lambda on it.
Also, there is even more good news: you don't have to name token types on your own. For convenience, there are token type providers!

The purpose of a TokenTypeProvider is simple: we don't want to write out extra information while creating new token types. When we create a new token type, we assign it to a property, so the name for the token type is already there!
```kotlin
val digits by chars { it.isDigit() }
```
In this example, we create a token type of the Char predicate kind (described above), and it now has the name `digits`.
Also, we've talked about tokens being ignored by parsers. If you want tokens of a specific type to be ignored by a parser, just use one of the following ways:
```kotlin
val digits by regex("\\d+").ignored()
val digits by regex("\\d+", ignored = true)
```
So the providers help you create token types in a more convenient way. Note: you must use delegation (`by`) here in order to pass the created token types into a tokenizer.
Now, let's map the described token types to their providers:

- Char:

  ```kotlin
  val letterA by char('a', ignoreCase = true)
  val plus by char('+') // case is not ignored by default
  ```

- Text:

  ```kotlin
  val myName by text("kirill", ignoreCase = true)
  val classKeyword by text("class") // case is not ignored by default
  ```

- Regex:

  ```kotlin
  val whitespaces by regex("\\s+", setOf(RegexOption.MULTILINE)).ignored()
  val name by regex("[A-Za-z_]+") // no regex options by default
  ```

- Char predicate:

  ```kotlin
  val digits by chars { it.isDigit() }
  ```

- General token type:

  ```kotlin
  // `char` and `str` are assumed to be defined elsewhere
  val beforeLastChar by tokenType { input -> input.substringBeforeLast(char).length }
  val matchesStr by tokenType { input, fromIndex ->
      if (input.startsWith(str, fromIndex)) str.length else 0
  }

  // You can also wrap an already existing token type:
  val myType: TokenType = getTokenType()
  val myTokenType by tokenType(myType)
  ```
Each provider takes the name of the created token type from the name of the property. On each of these providers you can invoke `.ignored()` or pass `ignored = true` in order to make the produced tokens ignored by parsers!
We've talked a lot about tokens and token types, but you still don't know how to convert a string into a collection of tokens.

There is an interface, TokenProducer, which is responsible for providing tokens. The only interesting method it has is `nextToken(): Token?`. It either returns a new `Token`, or `null` if nothing more can be produced.
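For example, a minimal sketch that drains a producer using only the interface described above (the helper function is ours, not part of the library):

```kotlin
// Collect every token a producer can emit; stops when nextToken() returns null.
fun collectTokens(producer: TokenProducer): List<Token> {
    val tokens = mutableListOf<Token>()
    while (true) {
        tokens += producer.nextToken() ?: break
    }
    return tokens
}
```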
But how do you get a producer? For now, there is only one way: a Tokenizer. The main purpose of tokenizers is to take a string and turn it into a producer of tokens.

Usually, a tokenizer is initialized with a list of token types. These token types are collected automatically as you define them; this is exactly why you must use delegation when creating a token type.
For now, there are two different implementations of a tokenizer:

- Longest match tokenizer: on each step, finds the token type that matches the largest number of characters, then turns that token type and the current location in the input into a new token and returns it.
- First match tokenizer: almost the same, but takes the first token type with a non-zero match result.

The default tokenizer is the longest match one. For example, with the token types `mul = char('*')` and `pow = text("**")` from the arithmetic grammar above, the longest match tokenizer turns the input `**` into a single `pow` token, while a first match tokenizer would produce two `mul` tokens if `mul` were declared first.
Note that parsers accept an IndexedTokenProducer, not a regular TokenProducer. The reason is that regular token producers are lazy, and there is no way to reuse the tokens they have already produced. Indexed producers memorize produced tokens and allow getting them by index. Anyway, you don't need to implement an indexed producer on your own: each producer can be effectively turned into an indexed one by calling `producer.indexed()`.
As already mentioned above, tokenizers pull information about token types when you delegate to token type providers. If a tokenizer were initialized before all the token types, it would not get all of that information. And it's not convenient to make users write `by lazy` or something similar on every override of `tokenizer`.

So there is one more abstraction: TokenizerProvider. You give it a list of token types, and it gives you a tokenizer. Now the method that gets a tokenizer for your Grammar is final and implemented lazily. If you want to change the tokenizer for your grammar, override the method that gets a tokenizer provider. There are implementations that provide the default tokenizers: longest match and first match.
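A purely hypothetical sketch of such an override (the names `tokenizerProvider` and `firstMatchTokenizerProvider` are assumptions for illustration, not the library's confirmed API):

```kotlin
object MyGrammar : Grammar<Int>() {
    val num by regex("\\d+")
    override val parser by num map { it.text.toInt() }

    // Hypothetical member names: swap the default longest-match tokenizer for
    // the first-match one by overriding the provider, not the tokenizer itself.
    override val tokenizerProvider: TokenizerProvider get() = firstMatchTokenizerProvider()
}
```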
The project is inspired by the interesting better-parse library. I really liked it and decided to implement parser combinators on my own. I'll try to make parsers even better :)