Add pages for new documentation #25 #26

Merged: 69 commits, Sep 5, 2023
Commits
9b49eb1
Add pages for new documentation #25
TobiasNx Aug 7, 2023
a096b54
Update documentation
TobiasNx Aug 21, 2023
4b0f27e
Update Approaching a transformation with metafacture.md
TobiasNx Aug 30, 2023
9ca5a11
Update Approaching a transformation with metafacture.md
TobiasNx Aug 30, 2023
350309e
Update Approaching a transformation with metafacture.md
TobiasNx Aug 30, 2023
6b3942a
Update Approaching a transformation with metafacture.md
TobiasNx Aug 30, 2023
1fd8f78
Update Approaching a transformation with metafacture.md
TobiasNx Aug 30, 2023
653e34a
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
6d52a6e
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
e146ed8
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
de18548
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
6f0a32f
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
6335bc6
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
6e99a9f
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
b6cb81f
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
7b7ba32
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
7294b83
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
e5ad340
Update Home.md
TobiasNx Aug 30, 2023
91e3e53
Update Home.md
TobiasNx Aug 30, 2023
5278245
Update Home.md
TobiasNx Aug 30, 2023
1ad57d9
Update Getting-Started.md
TobiasNx Aug 30, 2023
deaa566
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
039a4b0
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
291a2cf
Update Flux-User-Guide.md
TobiasNx Aug 30, 2023
af3c0ad
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
9f8c9ef
Update Getting-Started.md
TobiasNx Aug 30, 2023
feb3fa4
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
70a252e
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
a8696f1
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
250467d
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
83e1416
Update Flux-User-Guide.md
TobiasNx Aug 30, 2023
0065f86
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
3ebde20
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
38a5890
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
ac368b0
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
af14837
Update Framework-User-Guide.md
TobiasNx Aug 30, 2023
2dee362
Update Flux-User-Guide.md
TobiasNx Aug 30, 2023
833b420
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
0677e65
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
4b0e0df
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
19b190d
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
8a92e39
Update Flux-User-Guide.md
TobiasNx Aug 30, 2023
aeffe64
Update Getting-Started.md
TobiasNx Aug 30, 2023
ce1e95d
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
3a3dc5f
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
9e0ff19
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
9afa67e
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
88b0aa9
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
1b304f7
Update Getting-Started.md
TobiasNx Aug 30, 2023
522c300
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
8f12f4f
Update Flux-User-Guide.md
TobiasNx Aug 30, 2023
6e9cf7a
Update Getting-Started.md
TobiasNx Aug 30, 2023
90340ea
Update Getting-Started.md
TobiasNx Aug 30, 2023
d5ecf6a
Update Framework-User-Guide.md
TobiasNx Aug 30, 2023
d728298
Update Getting-Started.md
TobiasNx Aug 30, 2023
01b6f39
Update Framework-User-Guide.md
TobiasNx Aug 30, 2023
1c3c881
Update Framework-User-Guide.md
TobiasNx Aug 30, 2023
bd740cb
Update Framework-User-Guide.md
TobiasNx Aug 30, 2023
7332013
Update Getting-Started.md
TobiasNx Aug 30, 2023
97261a8
Update documentation pages
TobiasNx Aug 30, 2023
d744489
Update Flux-User-Guide.md
TobiasNx Aug 30, 2023
f415526
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
6dcb20d
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
106c2e1
Update Fix-User-Guide.md
TobiasNx Aug 30, 2023
595eb8b
Update Flux-User-Guide.md
TobiasNx Aug 31, 2023
587fbd3
Update README.md
TobiasNx Aug 31, 2023
f3c248e
Update README.md
TobiasNx Sep 5, 2023
ff68810
Update Getting-Started.md
TobiasNx Sep 5, 2023
cfe5862
Update Fix-User-Guide.md
TobiasNx Sep 5, 2023
15 changes: 15 additions & 0 deletions Approaching a transformation with metafacture.md
@@ -0,0 +1,15 @@
Every approach to transforming metadata with Metafacture is quite similar:

- You need to know the type and source of the input and the type and destination of the output:
  e.g. transform MARC21 data from a certain folder into some kind of JSON data.
- You have to identify the commands that you need.
- Combine the commands without the transformation module and test whether the workflow runs through.
- Adjust the workflow until it works.
- Once the general workflow is working, move on to prepare the transformation.
- Get familiar with the incoming data:
  - e.g. use `| list-fix-paths| print` to check out the metadata element paths that are provided.
  - use `| list-fix-values ("specifiedElementPath")| print` to get all values of a certain element.
- Write your transformation step by step and `write` to a specific destination or `print` the result.
  - Start with one element that you want to transform and retain it.
  - If you are happy with the result, continue.
- Once you have finalized your transformation, include it in your application or transform the data you want for single reuse.
48 changes: 48 additions & 0 deletions Documentation-Maintainer-Guide.md
@@ -0,0 +1,48 @@

## how to change flux-commands.md

The entries in flux-commands.md describe the usage of the commands available in Flux.
flux-commands.md is generated fully automatically. To make this happen, one has to
fill in the proper annotations in the corresponding Java classes. E.g.

```
reset-object-batch
------------------
- description: Resets the downstream modules every batch-size objects
- options: batchsize (int)
- signature: Object -> Object
- java class: org.metafacture.flowcontrol.ObjectBatchResetter
```

is generated by reading the following annotations in [ObjectBatchResetter.java](https://github.com/metafacture/metafacture-core/blob/511b4af8b993c85a33d6a18322258a195684d133/metafacture-flowcontrol/src/main/java/org/metafacture/flowcontrol/ObjectBatchResetter.java):

```
@Description("Resets the downstream modules every batch-size objects")
@FluxCommand("reset-object-batch")
@In(Object.class)
@Out(Object.class)
```
The description of "options" is produced from all "public setter-methods", in this case:
```
public void setBatchSize(final int batchSize) { ...
```
The option's name is produced by cutting away the "set" from the method's name, leaving
"BatchSize", which is then lowercased. The parameter of this option is generated from the
parameter type of the method, here an "int"eger.
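
The naming rule described above can be sketched in a few lines of Java. This is only an illustration of the rule as stated in this guide, not the code Metafacture uses to generate flux-commands.md; the class and method names (`OptionNameSketch`, `optionName`) are made up for the example.

```java
import java.util.Locale;

// Illustrative sketch only, not Metafacture's actual generator code.
final class OptionNameSketch {

    // Derives an option name from a setter name by dropping the
    // leading "set" and lowercasing the remainder.
    static String optionName(final String setterName) {
        if (!setterName.startsWith("set") || setterName.length() == 3) {
            throw new IllegalArgumentException("not a setter: " + setterName);
        }
        return setterName.substring(3).toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        System.out.println(optionName("setBatchSize")); // prints "batchsize"
    }
}
```

The result matches the `options: batchsize (int)` entry in the generated example above.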

## how to publish flux-commands.md

If you have updated some of these annotations, say "description", and these changes are
merged into the master branch, generate a new flux-commands.md like this:

Go to metafacture-core, check out master, build a distribution and run flux.sh:
```bash
$ ./gradlew installDist
$ cd ./metafacture-runner/build/install/metafacture-core/
$ ./flux.sh > flux-commands.md
```

Open the generated flux-commands.md and remove some boilerplate at the beginning of the
file manually. Save it, copy it here, commit and push.

The [publishing process will be automated with a GitHub Action](https://github.com/metafacture/metafacture-core/issues/368).
251 changes: 251 additions & 0 deletions Fix-User-Guide.md
@@ -0,0 +1,251 @@
![logo](https://github.com/culturegraph/metafacture-core/wiki/img/metafacture_small.png)

# Fix User Guide

This document provides an introduction to the Metafacture Fix language (short: Metafix or Fix). The Fix language for Metafacture is introduced as an alternative to configuring data transformations with Metamorph. Inspired by Catmandu Fix, Metafix processes metadata not as a continuous data stream but as discrete records.

## Part of a Metafacture workflow
Metafacture Fix is a transformation module that can be used in a [Flux Workflow](/Flux-User-Guide.md). To do so, add the `fix` module to your pipeline:

Flux-Example:
```PERL
infile
| open-file
| as-lines
| decode-marc21
| fix(FLUX_DIR + "fixFile.fix")
| encode-json
| print
;
```

- When using Flux:
  - address the `fix` module
  - you can add variables
  - the Fix transformation can be written inline in the Flux, e.g. ``|fix("retain(`245??`)")``, which is usually useful for short fixes
  - or it can be kept in a separate external file that is referenced in the Flux process, as in the code snippet above
- When using it in a Java process, just add the library to your process

## Record-based and metadata manipulating approach
While Metafacture processes the data as a stream, the `fix` module does not. It buffers the incoming stream into distinct records.
Thus you can manipulate all metadata elements of a record at once and do not need to think about the order of the incoming stream, which was a big hassle in the stream-based MORPH.
The incoming record can then be manipulated: fields can be changed, removed or added. This also differs from the approach of the other transformation module, MORPH, where you construct a new record and a new data stream. With FIX you change things inside the record and "only" change the data stream in Metafacture.


## Basic concepts
The four main concepts of FIX (introduced by Catmandu) are [functions](https://librecat.org/Catmandu/#functions), [selectors](https://librecat.org/Catmandu/#selectors), [conditionals](https://librecat.org/Catmandu/#conditionals) and [binds](https://librecat.org/Catmandu/#binds). The following code snippet shows examples of each of these concepts:


```PERL

# Simple fix functions

add_field("hello", "world")
remove_field("my.deep.nested.junk")
copy_field("stats", "output.$append")

# Conditionals

if exists("error")
  set_field("is_valid", "no")
  log("error")
elsif exists("warning")
  set_field("is_valid", "yes")
  log("warning")
else
  set_field("is_valid", "yes")
end

# Binds - Loops

do list(path: "foo", "var": "$i")
  add_field("$i.bar", "baz")
end

# Selector
if exists("error")
  reject()
end

```

**Functions** are used to add, change, remove or otherwise manipulate elements.

**Conditionals** are used to control the processing of fix functions. The enclosed fix functions are not executed every time but only when the stated condition is met.

**Selectors** can be used to filter the records you want.

**Binds** are wrappers for one or more fixes. They give extra control functionality for fixes such as loops.
All binds have the same syntax:

```PERL
do Bind(params,…)
  fix(..)
  fix(..)
end
```

Find here a [list of all functions, selectors, binds and conditionals](/Fix-function-and-Cookbook.md).


## Addressing Pieces of Data: FIX-Path and the record structure in FIX

Internally, FIX knows arrays, objects/hashes and simple elements. How a format is translated depends on the `decode-...` command in the Metafacture workflow. One thing is specific to Fix, as in Catmandu: a repeated field is translated into a list, depending on the actual input data of the individual record. Elements with the suffix `[]` are interpreted as arrays.

Since functions manipulate, add or remove elements in a record, it is essential to understand the way you can address source or target elements.

e.g.:
```PERL
copy_field("<sourceField>", "<targetField>")
```

To address the source or target element here, you need to provide the path to the element.
Metafacture Fix uses a path syntax that is JSON-Path-like but not identical. It also uses dot notation, but the path structure differs for arrays and repeated fields, which matters especially when working with JSON, YAML, or records with repeated fields.

```
a : simpleField
b : c : objectField1
    d : objectField2
    e : objectField3
f : repeatedField1
f : repeatedField2
f : repeatedField3
g : - listElement1
    - listElement2
    - listElement3
h : - i : listObjectElement1.1
      j : listObjectElement1.2
    - i : listObjectElement2.1
      j : listObjectElement2.2
k : l : m : o : deepNestedField
```

A simple string element is addressed by stating its element name: `a`.
For fields with a deeper structure you add a dot `.` between the levels. Elements in nested objects are addressed by paths such as `b.c` or `k.l.m.o`.

Sometimes an element can have multiple instances. Different data models handle this differently. In XML records element repetition is possible and (partly) allowed in many schemas. Repeatable elements also exist in JSON and YAML but are unusual.

To point to a specific element you state its index number. To address the value `repeatedField2` the path would be `f.2`, since the repeated field is handled as a list and counting starts at 1.
Similarly, you address `listElement3` of the array/list by `g[].3`. The brackets are an array indicator created by the Flux command `decode-yaml` (or by `decode-json`). It helps to interpret a repeatable element as an array even if the list has only one value.

When working with nested structures and combinations of arrays and objects the path is a combination of element names, dots and index numbers.

`listObjectElement2.2` has the path: `h[].2.j`
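
To make the addressing rules above concrete, here is a minimal Java sketch that resolves such paths against nested maps and lists. It is a simplified illustration, not how Metafix resolves paths internally: the class `FixPathSketch` and its methods are hypothetical, wildcards are not handled, and the repeated field `f` is modelled directly as a list.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a simplified resolver, not Metafix's implementation.
final class FixPathSketch {

    // Walks a dot-separated path through nested maps and lists.
    // Numeric segments are treated as 1-based list positions.
    static Object resolve(final Object root, final String path) {
        Object current = root;
        for (final String segment : path.split("\\.")) {
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(segment);
            } else if (current instanceof List && segment.matches("\\d+")) {
                final List<?> list = (List<?>) current;
                final int index = Integer.parseInt(segment) - 1; // 1-based -> 0-based
                current = index >= 0 && index < list.size() ? list.get(index) : null;
            } else {
                return null; // nothing to descend into, or unsupported segment
            }
        }
        return current;
    }

    // Builds part of the example record shown above.
    static Map<String, Object> exampleRecord() {
        final Map<String, Object> record = new LinkedHashMap<>();
        record.put("a", "simpleField");
        record.put("b", Map.of("c", "objectField1", "d", "objectField2"));
        record.put("f", List.of("repeatedField1", "repeatedField2", "repeatedField3"));
        record.put("g[]", List.of("listElement1", "listElement2", "listElement3"));
        record.put("h[]", List.of(
                Map.of("i", "listObjectElement1.1", "j", "listObjectElement1.2"),
                Map.of("i", "listObjectElement2.1", "j", "listObjectElement2.2")));
        return record;
    }

    public static void main(final String[] args) {
        final Map<String, Object> record = exampleRecord();
        for (final String path : List.of("a", "b.c", "f.2", "g[].3", "h[].2.j")) {
            System.out.println(path + " -> " + resolve(record, path));
        }
    }
}
```

In this sketch the `[]` suffix is simply part of the key name, which mirrors how the array marker appears in the paths above.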

You need a path not only for your source element but also when you want to create a new element. But remember that Fix, as in Catmandu, treats repeated fields and arrays as lists, so if you want to create a repeated field you have to create an array without the suffix `[]`.

e.g.:
```PERL
copy_field("a", "z.y.x")
```

This would copy the value of `a` into a nested object:

```
z :
  y :
    x : simpleField
```


To address paths you can use wildcards. For instance the star-wildcard: `person*` would match all simple literals with element names starting with 'person': 'person\_name', 'person\_age', etc.
Apart from the `*` wildcard, the `?` wildcard is supported. It matches exactly one arbitrary character.
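
As a rough illustration of how the two wildcards behave, the following Java sketch translates a wildcard expression into a regular expression. It is not Metafix's matching code; the class `WildcardSketch` is hypothetical, and it assumes that `*` matches any number of characters (including none), while `?` matches exactly one character as stated above.

```java
import java.util.regex.Pattern;

// Illustrative sketch only, not Metafix's wildcard implementation.
final class WildcardSketch {

    // Translates a wildcard expression into a regex:
    // '*' becomes ".*", '?' becomes ".", every other character is matched literally.
    static Pattern toPattern(final String wildcard) {
        final StringBuilder regex = new StringBuilder();
        for (final char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true if the whole element name matches the wildcard expression.
    static boolean matches(final String wildcard, final String elementName) {
        return toPattern(wildcard).matcher(elementName).matches();
    }
}
```

With this sketch, `matches("person*", "person_name")` and `matches("245??", "24510")` are true, while `matches("245??", "245")` is false because each `?` needs a character.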

Alternation of paths is not fully supported yet.

Review comment (Member): But what exactly is / is not supported?

Reply (Collaborator Author): You can alternate `parentElement1|parentElement2`, but not `parentElement1.subElement1|parentElement2.subElement1` or `parentElement1.(subElement1|subElement2)`.


Besides path wildcards there are array/list wildcards that reference specific elements or all elements of an array. `g[].*` addresses all strings in the array `g[]`. `g[].$append` references a new element added at the end of the array. `g[].$last` references the last element of the array.

## Cleaning up the transformation

Since FIX is not constructing a new record stream but is manipulating the existing record you usually clean up after you transform the data. There are functions to remove all unnecessary elements and to remove all empty elements.

e.g.: if you transform MARC21 to JSON but you want to keep only certain elements that you created, you state them in a `retain` function:

```
retain("all",
  "elements",
  "that",
  "I",
  "want")
```
This function keeps only the listed elements and drops all others. At the moment this only works with top-level elements.

`vacuum()` deletes all empty elements.
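
The effect of the two cleanup steps can be sketched in Java on a record modelled as nested maps and lists. This is only an illustration of the behaviour described above, not the Metafix implementation; the class `CleanupSketch` is hypothetical, and treating `null`, empty strings, empty maps and empty lists as "empty" is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not Metafix's retain/vacuum implementation.
final class CleanupSketch {

    // Keeps only the named top-level fields and drops everything else.
    static void retain(final Map<String, Object> record, final String... fields) {
        record.keySet().retainAll(Arrays.asList(fields));
    }

    // Recursively removes empty values from mutable maps and lists.
    // Returns true if the given value itself is empty after the cleanup.
    static boolean vacuum(final Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            return ((String) value).isEmpty();
        }
        if (value instanceof Map) {
            final Map<?, ?> map = (Map<?, ?>) value;
            map.values().removeIf(CleanupSketch::vacuum);
            return map.isEmpty();
        }
        if (value instanceof List) {
            final List<?> list = (List<?>) value;
            list.removeIf(CleanupSketch::vacuum);
            return list.isEmpty();
        }
        return false; // any other value counts as non-empty
    }
}
```

Calling `retain(record, "title")` followed by `vacuum(record)` on a mutable record mirrors the usual order in a Fix file: first keep what you want, then remove whatever is left empty.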

## Defining Macros

Macros are defined with the `put_macro` bind and use the `$[NAME]` parameter
mechanism that is also available for Fix definitions.
Macros are called with the `call_macro` function. The named attributes
of the function call are used as parameters:

```PERL
do put_macro("concat-up")
  set_array("$[target_field]")
  copy_field("$[source_field]","$[target_field].$append")
  case("$[target_field].*")
  join_field("$[target_field]",", ")
end

call_macro("concat-up", source_field:"data1", target_field:"Data1")
call_macro("concat-up", source_field:"data2", target_field:"Data2")
```

In this case `target_field` and `source_field` serve as parameters (the names are arbitrary). In the macro definition itself, the parameters are addressed by `$[target_field]` and `$[source_field]`.

Parameters are scoped, which means that the ones provided with the `call_macro` function shadow global ones. Macros cannot be nested.
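
The scoping rule can be illustrated with a small Java sketch that substitutes `$[NAME]` placeholders in a line of Fix. It is not taken from Metafix; the class `ParameterSketch` and its `resolve` method are hypothetical, and failing on an undefined parameter is an assumption of the sketch rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only, not Metafix's parameter handling.
final class ParameterSketch {

    // Matches placeholders of the form $[NAME].
    private static final Pattern PARAM = Pattern.compile("\\$\\[([^\\]]+)\\]");

    // Replaces every $[NAME] in the template.
    // Parameters given at the call site shadow global ones of the same name.
    static String resolve(final String template,
                          final Map<String, String> globals,
                          final Map<String, String> callParams) {
        final Map<String, String> scope = new HashMap<>(globals);
        scope.putAll(callParams); // call-site values win
        final Matcher matcher = PARAM.matcher(template);
        final StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            final String name = matcher.group(1);
            if (!scope.containsKey(name)) {
                throw new IllegalArgumentException("undefined parameter: " + name);
            }
            matcher.appendReplacement(result, Matcher.quoteReplacement(scope.get(name)));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

Note that `$append` is left untouched by this sketch, because only the `$[...]` form is treated as a parameter.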

## Parameters to Fix definitions / Using variables

Fix definitions may contain parameters. They follow the pattern `$[NAME]`:

```perl
add_field("rights","$[rights]")
```

`$[rights]` in this case is a compile-time variable which is evaluated on
creation of the respective Fix object.

The `<vars>` section in the Metamorph definition can be used to set defaults:

```xml
<vars>
<var name="rights" value="CC0" />
</vars>
```

For Java implementations: Compile-time variables are passed to Fix as a constructor parameter.

```java
import java.util.HashMap;
import java.util.Map;

// Compile-time variables are handed to the Metafix constructor.
final Map<String, String> vars = new HashMap<>();
vars.put("rights", "CC-0");

final Metafix metafix = new Metafix("fixdef.fix", vars);
```



## Splitting Fixes for Reuse

In a complex project setting there may be several Fixes in use,
and it is likely that they share common parts. Imagine for instance a
transformation from MARC 21 records holding data on books to RDF, and another
from MARC 21 records holding data on authors to RDF. Both make use of a table assigning
country names to ISO country codes. Such a table should only exist once.

Another scenario would be to reduce the size of a single fix file by splitting it into several fix files used for different purposes.

To accommodate such reuse, Fix offers an include mechanism:

```
# Setup adds maps, macros and vars once
do once("setup")
  include ("./fix/maps.fix")
  include ("./fix/macros.fix")
  put_var("member", "-")
end
```

For performance reasons it is useful to wrap macros and maps that are used often in a `do once` bind.