diff --git a/.github/workflows/oft.yaml b/.github/workflows/oft.yaml
new file mode 100644
index 00000000..9b19354a
--- /dev/null
+++ b/.github/workflows/oft.yaml
@@ -0,0 +1,24 @@
+name: OFT Report
+
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+
+jobs:
+  build:
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v4
+      - name: Run HTML Report
+        run: |
+          bash .github/workflows/scripts/run_oft.sh ./exaudfclient base -o html -f ./oft_report.html -t V2,_ || echo failed
+      - name: Run Plaintext Report
+        run: |
+          bash .github/workflows/scripts/run_oft.sh ./exaudfclient base -t V2,_
+      - uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: "oft_report.html"
+          path: oft_report.html
diff --git a/.github/workflows/scripts/run_oft.sh b/.github/workflows/scripts/run_oft.sh
new file mode 100644
index 00000000..10bdc7b7
--- /dev/null
+++ b/.github/workflows/scripts/run_oft.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+set -o errexit
+set -o nounset
+set -o pipefail
+
+oft_version="4.1.0"
+
+base_dir="$1"
+shift 1
+src_dir="$1"
+shift 1
+additional_options=$@
+readonly base_dir
+readonly oft_jar="$HOME/.m2/repository/org/itsallcode/openfasttrace/openfasttrace/$oft_version/openfasttrace-$oft_version.jar"
+
+if [ ! -f "$oft_jar" ]; then
+    echo "Downloading OpenFastTrace $oft_version"
+    mvn --batch-mode org.apache.maven.plugins:maven-dependency-plugin:3.3.0:get -Dartifact=org.itsallcode.openfasttrace:openfasttrace:$oft_version
+fi
+
+# Trace all
+java -jar "$oft_jar" trace \
+    $additional_options \
+    -a feat,req,dsn \
+    "$base_dir/docs" \
+    "$base_dir/$src_dir"
diff --git a/exaudfclient/docs/diagrams/CTPGParserHandler.drawio.png b/exaudfclient/docs/diagrams/CTPGParserHandler.drawio.png
new file mode 100644
index 00000000..1b4836e3
Binary files /dev/null and b/exaudfclient/docs/diagrams/CTPGParserHandler.drawio.png differ
diff --git a/exaudfclient/docs/diagrams/LegacyParserHandler.drawio.png b/exaudfclient/docs/diagrams/LegacyParserHandler.drawio.png
new file mode 100644
index 00000000..0907f6dc
Binary files /dev/null and b/exaudfclient/docs/diagrams/LegacyParserHandler.drawio.png differ
diff --git a/exaudfclient/docs/diagrams/OveralScriptOptionalsBuildingBlocks.drawio.png b/exaudfclient/docs/diagrams/OveralScriptOptionalsBuildingBlocks.drawio.png
new file mode 100644
index 00000000..b834931d
Binary files /dev/null and b/exaudfclient/docs/diagrams/OveralScriptOptionalsBuildingBlocks.drawio.png differ
diff --git a/exaudfclient/docs/diagrams/ScriptOptionsExtractorInterface.drawio.png b/exaudfclient/docs/diagrams/ScriptOptionsExtractorInterface.drawio.png
new file mode 100644
index 00000000..6583634b
Binary files /dev/null and b/exaudfclient/docs/diagrams/ScriptOptionsExtractorInterface.drawio.png differ
diff --git a/exaudfclient/docs/diagrams/ScriptOptionsParserHandlerSequence.drawio.png b/exaudfclient/docs/diagrams/ScriptOptionsParserHandlerSequence.drawio.png
new file mode 100644
index 00000000..dbaac3fd
Binary files /dev/null and b/exaudfclient/docs/diagrams/ScriptOptionsParserHandlerSequence.drawio.png differ
diff --git a/exaudfclient/docs/diagrams/V2ImportScriptFlow.drawio.png b/exaudfclient/docs/diagrams/V2ImportScriptFlow.drawio.png
new file mode 100644
index 00000000..9ddd77ed
Binary files /dev/null and b/exaudfclient/docs/diagrams/V2ImportScriptFlow.drawio.png differ
diff --git a/exaudfclient/docs/script_options_design.md b/exaudfclient/docs/script_options_design.md
new file mode 100644
index 00000000..7e43136c
--- /dev/null
+++ b/exaudfclient/docs/script_options_design.md
@@ -0,0 +1,483 @@
+# Design Document
+
+This document details the design aspects of the new Exasol UDF Client Script Options parser, based on the high-level requirements outlined in the System Requirement Specification.
+
+## Acknowledgments
+
+This document's section structure is derived from the "[arc42](https://arc42.org/)" architectural template by Dr. Gernot Starke and Dr. Peter Hruschka.
+
+## Constraints
+
+- The parser implementation must be in C++.
+- The selected parser should allow easy encapsulation in a custom C++ namespace (UDF client linker namespace constraint)
+- The selected parser should not depend on additional runtime dependencies
+- The selected parser should have minimal compile time dependencies, i.e. no additional shared libraries or tools to generate C++ code
+
+### Requirement Overview
+
+Please refer to the [System Requirement Specification](script_options_requirments.md) for user-level requirements.
+
+## Building Blocks
+
+### Overall Architecture
+
+#### Component Overview
+
+![Components](diagrams/OveralScriptOptionalsBuildingBlocks.drawio.png)
+
+At the highest level, the design distinguishes between the generic "Script Options Parser" module, which parses UDF script code and returns the script options it finds, and the "Script Options Parser Handler", which converts the Java UDF specific script options. Both modules contain separate implementations for the legacy parser and the new CTPG based parser. The legacy implementation must be kept alive because the new approach introduces breaking changes, especially in the new escape patterns: existing UDFs might not work with the new parser implementation.
+
+### Script Options Parser
+
+The parser component can be used to parse any script code (Java, Python, R) for any script options. It provides simple interfaces that accept the script code as input and return the found script option(s). The interfaces differ between the two versions because the parsers work in inherently different ways: While the legacy parser successively finds and removes script options by the given key, the new parser finds **all** script options at once, but does not remove them.
+
+#### Legacy Parser
+
+The legacy parser (V1) searches for one specific script option, removes the whole option from the script code, and returns the script option value.
+
+#### V2 Parser
+
+The new parser uses the [CTPG library](https://github.com/peter-winter/ctpg), which fulfills the technical constraints: It comes as a single C++ header file and does not require any runtime dependencies. The grammar and lexical rules can be defined in pure C++ and the parser is constructed at compile time, so constructing it adds no overhead at runtime.
+It is important to use a parser generator implementation which allows the definition of grammar and lexical rules, in order to meet the new requirements for recognizing escape sequences in the script option values. Also, the clear definition of those rules makes the implementation easier to understand.
+
+As the parser needs to find script options in any given script code, the generated parser must accept any strings which are not script options and ignore them. To achieve this, the lexer rules need to be as simple as possible, in order to avoid collisions.
+
+It is important to emphasize that, in contrast to the legacy parser, the caller is responsible for removing the script options from the script code, as the parser is agnostic to the actual script option keys.
+The interface provides a method which accepts the script code as input and returns a map with all script options found in the whole code. Each key in the map points to a list of all option values for this specific option key, plus the start and end position of each option.
+
+### Parser Handler
+
+The parser handler uses the Script Options parser to query for specific options which are part of [Exasol's Java UDF specification](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/java.htm):
+1. JVM Options
+2. JAR Options
+3. Import Script Options
+4. ScriptClass Option
+
+Because the new parser implementation parses all script options at once, and because some of the system requirements differ between the two versions, the Parser Handler implementations also differ considerably between the legacy and the CTPG based one. However, both implementations provide the same interface to the Java VM in the UDF Framework:
+
+![Extractor](diagrams/ScriptOptionsExtractorInterface.drawio.png)
+
+Note that the variable `script_code` is passed by reference. This is because the Parser Handler might modify the script code:
+1. Remove the script options
+2. Replace any found `import` script option with the respective scripts.
+
+The following sequence diagram shows how the Java VM implementation uses the Parser Handler to extract the script options.
+
+![ExtractSequenceDiagram](diagrams/ScriptOptionsParserHandlerSequence.drawio.png)
+
+#### Legacy Parser Handler
+
+![LegacyParserHandler](diagrams/LegacyParserHandler.drawio.png)
+
+The `ScriptOptionsLinesParserLegacy` class uses the Parser to search for Java specific script options and forwards the found options to the class `ConverterLegacy`, which uses a common implementation for the conversion of the options.
+Class `tLegacyExtractor` connects `ScriptOptionsLinesParserLegacy` to `ConverterLegacy` and then orchestrates the parsing sequence.
+
+`ScriptOptionsLinesParserLegacy` also implements the import of foreign scripts. The import script algorithm iteratively replaces foreign scripts. The algorithm is described in the following pseudocode snippet:
+```
+while True:
+    next_import_option, position = ScriptOptionsLegacyParser.parse(script_code, "%import")
+    if next_import_option.value != "":
+        foreign_script_code = resolve_foreign_script_somehow(next_import_option.value)
+        if not md5_hashset.has(foreign_script_code):
+            md5_hashset.add(foreign_script_code)
+            script_code.replaceAt(position, length(next_import_option), foreign_script_code)
+    else: // no further import option found
+        break
+```
+
+#### CTPG based Parser Handler
+
+![CTPGParserHandler](diagrams/CTPGParserHandler.drawio.png)
+
+The `ScriptOptionsLinesParserCTPG` class uses the new CTPG based Parser to search for **all** Java specific script options at once. Then it forwards the found options to class `ConverterV2`, which uses a common implementation for the conversion of the options. `ConverterV2` also implements the functions to convert JVM options and JAR options.
+Class `tExtractorV2` connects `ScriptOptionsLinesParserCTPG` to `ConverterV2` and then orchestrates the parsing sequence.
+
+##### CTPG based Script Import Algorithm
+`ScriptOptionsLinesParserCTPG` uses an instance of `ScriptImporter` to import foreign scripts. Because the new parser collects all script options at once, but backwards compatibility with existing UDF scripts must be ensured, there is an additional level of complexity in the import script algorithm. The algorithm is described in the following pseudocode snippet:
+```
+function import(script_code, options_map)
+    import_options = options_map.find("import")
+    if found:
+        sorted_import_options = sort(import_options) //import options according to their location in the script, increasing order
+        collectedScripts = list() //list of (script_code, position, length_of_script_option)
+        for each import_option in sorted_import_options:
+            import_script_code = resolve_foreign_script_somehow(import_option.value)
+            if not md5_hashset.has(import_script_code):
+                md5_hashset.add(import_script_code)
+                new_options_map = ScriptOptionsCTPGParser.parse(import_script_code)
+                new_script_code = import_script_code
+                import(new_script_code, new_options_map) //resolves nested imports in place
+                collectedScripts.add(new_script_code, import_option.position, import_option.length_of_script_option)
+        for foreign_script in reverse(collectedScripts):
+            script_code.replaceAt(foreign_script.position, foreign_script.length_of_script_option, foreign_script.script_code)
+```
+
+The scripts need to be replaced in reverse order because otherwise the locations of import options later in the script would get invalidated by the replacement.
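
The recursive import described above can be sketched in Python. This is only an illustration under stated assumptions, not the C++ `ScriptImporter` itself: the regular expression `IMPORT_RE` is a hypothetical stand-in for the CTPG parser, `resolve` is a hypothetical callback that returns the code of a foreign script, and an import of an already imported script is assumed to be replaced by an empty string.

```python
import hashlib
import re

# Hypothetical stand-in for the CTPG parser: matches "%import <name>;" at line start.
IMPORT_RE = re.compile(r"^[ \t]*%import[ \t]+(\S+?)[ \t]*;[ \t]*\n?", re.MULTILINE)


def find_imports(script_code):
    """Return (name, position, length) per import option, ordered by position."""
    return [(m.group(1), m.start(), m.end() - m.start())
            for m in IMPORT_RE.finditer(script_code)]


def import_scripts(script_code, resolve, seen_hashes):
    """Return script_code with every import option replaced by the foreign script."""
    collected = []  # (replacement, position, length_of_script_option)
    for name, position, length in find_imports(script_code):
        foreign = resolve(name)
        digest = hashlib.md5(foreign.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            # Assumption: a script that was already imported is simply dropped.
            replacement = ""
        else:
            seen_hashes.add(digest)
            # Resolve nested imports before the script is spliced in.
            replacement = import_scripts(foreign, resolve, seen_hashes)
        collected.append((replacement, position, length))
    # Replace back to front so that earlier positions stay valid.
    for replacement, position, length in reversed(collected):
        script_code = (script_code[:position] + replacement
                       + script_code[position + length:])
    return script_code


if __name__ == "__main__":
    scripts = {"other_script_B": "class OtherClassB {}\n"}
    print(import_scripts("%import other_script_B;\nclass Main {}\n",
                         scripts.__getitem__, set()))
```

Because all positions are computed on the unmodified script, replacing from the last import option to the first keeps every stored position valid without any offset bookkeeping.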
+
+The following example demonstrates the flow of the algorithm:
+
+_Main UDF_:
+```
+%import other_script_A;
+%import other_script_C;
+class JVMOPTION_TEST {
+    static void run(ExaMetadata exa, ExaIterator ctx) throws Exception {
+        ctx.emit("Success!");
+    }
+}
+```
+
+_other_script_A_:
+```
+%import other_script_B;
+class OtherClassA {
+    static void doSomething() {}
+}
+```
+
+_other_script_B_:
+```
+class OtherClassB {
+    static void doSomething() {}
+}
+```
+
+_other_script_C_:
+```
+%import other_script_A;
+class OtherClassC {
+    static void doSomething() {}
+}
+```
+
+The result must be:
+```
+class OtherClassB {
+    static void doSomething() {}
+}
+class OtherClassA {
+    static void doSomething() {}
+}
+class OtherClassC {
+    static void doSomething() {}
+}
+class JVMOPTION_TEST {
+    static void run(ExaMetadata exa, ExaIterator ctx) throws Exception {
+        ctx.emit("Success!");
+    }
+}
+```
+
+The following diagram shows how the scripts are collected in the recursive algorithm:
+
+![V2ImportScriptFlow](diagrams/V2ImportScriptFlow.drawio.png)
+
+
+
+## Runtime
+
+### Parser Implementation V1
+
+The legacy parser (V1) searches for one specific script option, starting from the beginning of the script code. If the option is found, the parser immediately removes it from the script code and returns the option value.
+
+### Parser Implementation V2
+
+The parser provides an interface to parse the UDF script for all options at once. All found script options need to be collected in an associative container. Internally, the parser uses [ctpg](https://github.com/peter-winter/ctpg) to parse the UDF script code line-by-line.
+
+## Cross-cutting Concerns
+
+
+## Design Decisions for V2
+
+### Parser Implementation
+`dsn~parser-implementation~1`
+
+Implement the parser using [ctpg](https://github.com/peter-winter/ctpg), an open-source parser library. This library will be used to define Lexer and Parser Rules in C++ code, ensuring no additional runtime dependencies exist.
+
+
+Covers:
+- `req~general-script-options-parsing~1`
+- `req~existing-parser-library-license~1`
+
+Tags: V2
+
+
+### Lexer and Parser Rules Option
+`dsn~lexer-parser-rules~1`
+
+Lexer and Parser rules recognize `%optionKey` and `optionValue`, with whitespace characters as separators. The Parser rules will define the grammar to correctly identify Script Options, manage multiple options with the same key, and handle duplicates.
+The regular expression for the lexer term for finding newline escape sequences or the semicolon escape sequence is `\\;|\\n|\\r|\\\\`. The regular expression for the lexer term for finding white space escape sequences is `\\ |\\t|\\f|\\v`.
+
+Covers:
+- `req~general-script-options-parsing~1`
+- `req~white-spaces~1`
+- `req~leading-white-spaces-script-options-parsing~1`
+
+Depends:
+- `dsn~parser-implementation~1`
+
+Tags: V2
+
+### Lexer and Parser Rules Not an option
+`dsn~lexer-parser-rules-not-an-option~1`
+
+Lexer and Parser rules to recognize anything that is not an option.
+
+
+Covers:
+- `req~ignore-none-script-options~1`
+
+Depends:
+- `dsn~parser-implementation~1`
+
+Tags: V2
+
+### Run Parser line-by-line
+`dsn~run-parser-line-by-line~1`
+
+The parser must be executed line-by-line, because script options can be placed at any location in the UDF. The options found in each line must be merged into the resulting map.
+
+
+Covers:
+- `req~multiple-lines-script-options-parsing~1`
+
+Depends:
+- `dsn~parser-implementation~1`
+
+Tags: V2
+
+### Ignore lines without script options
+`dsn~ignore-lines-without-script-options~1`
+
+To avoid a performance regression compared to the old implementation, the parser must run only on lines in which the first non-whitespace character is `%`.
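
The pre-filter from this design item can be sketched in a few lines of Python. This is a minimal sketch and not the C++ implementation: the function names are hypothetical, and the whitespace set (space, tab, form feed, vertical tab) is an assumption based on the white space characters listed in the lexer rules above.

```python
def may_contain_script_option(line: str) -> bool:
    """True if the first non-whitespace character of the line is '%'."""
    # Assumed whitespace set: space, tab, form feed, vertical tab.
    return line.lstrip(" \t\f\v").startswith("%")


def candidate_lines(script_code: str):
    """Yield (line_index, line) only for lines worth handing to the parser."""
    # split("\n") is used on purpose: str.splitlines() would also break
    # lines at form feed and vertical tab characters.
    for index, line in enumerate(script_code.split("\n")):
        if may_contain_script_option(line):
            yield index, line
```

A line such as `int x = 5 % 3;` is skipped without invoking the parser, so only lines that can start a script option incur the parsing cost.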
+
+
+Covers:
+- `req~multiple-lines-script-options-parsing~1`
+
+Depends:
+- `dsn~parser-implementation~1`
+
+Tags: V2
+
+
+### Handling Multiple and Duplicate Options
+`dsn~handling-multiple-duplicate-options~1`
+
+Create a mechanism within the parser to collect and aggregate multiple Script Options with the same key.
+The parser must return an associative container of the format:
+```
+{
+