diff --git a/content/post/project-segmentation-of-layout-based-documents.md b/content/post/project-segmentation-of-layout-based-documents.md
new file mode 100644
index 0000000..f41129f
--- /dev/null
+++ b/content/post/project-segmentation-of-layout-based-documents.md
@@ -0,0 +1,507 @@
+---
+title: "Segmentation of layout-based documents"
+date: 2020-12-30T10:51:00+01:00
+author: "Elias Kempf"
+authorAvatar: "img/project-segmentation-of-layout-based-documents/gopher.png"
+tags: []
+categories: []
+image: "img/project-segmentation-of-layout-based-documents/rect-title.jpg"
+draft: true
+---
+
+PDF is a widely used file format and in most cases very convenient for
+representing text. However, PDF is layout-based, i.e., text is stored
+character by character and not necessarily in reading order. This makes
+tasks like keyword search or text extraction pretty difficult. The goal of this
+project is to detect the individual words, text blocks, and reading order of a
+PDF document to allow for the reconstruction of plain text.
+
+## Content
+
+1. [Introduction](#introduction)
+
+2. [Problem definition](#problem-definition)
+
+3. [Designing the algorithm](#designing-the-algorithm)
+
+ - [Prerequisites](#1-prerequisites)
+
+ - [The algorithm](#2-the-algorithm)
+
+ - [Computing possible cuts](#2-1-computing-possible-cuts)
+
+ - [Choosing a "best cut"](#2-2-choosing-a-best-cut)
+
+ - [Splitting the page](#2-3-splitting-the-page)
+
+4. [Evaluation of segmentation](#evaluation-of-segmentation)
+
+ - [Visual evaluation](#1-visual-evaluation)
+
+ - [Algorithmic evaluation](#2-algorithmic-evaluation)
+
+ - [Evaluation results](#3-evaluation-results)
+
+5. [Problems and future improvements](#problems-and-future-improvements)
+
+## Introduction
+
+Most layout-based text document formats like PDF include neither information
+about word or paragraph boundaries nor the reading order. That is, in most
+layout-based documents, only information about individual characters, figures,
+or shapes and their corresponding attributes (e.g., position, font, size,
+color, etc.) is given. To further complicate things, no whitespace characters
+are provided, meaning we have to determine word or paragraph boundaries from the
+positions of the characters alone. Also, the distance between two words can vary
+within a document. While this does not pose a challenge for human readers, who
+can easily determine in which order headlines, columns, or paragraphs should be
+read, the same problems are much more difficult to solve for computers.
+
+For example, let us consider the layout of the PDF above. As readers, we can
+easily determine that we should read the headline first, then both of the
+authors, and lastly the left and right column. A computer solving this task,
+however, may not detect the columns of the document and thus mix up the lines of
+both columns.\
+Our goal is to devise an algorithm that is capable of segmenting a PDF document,
+meaning the algorithm should be able to recognize individual words and text
+blocks and their corresponding reading order. We define a text block as a set of
+words that are adjacent in the PDF's layout. Note that sections or paragraphs can
+consist of multiple text blocks (e.g., in the PDF above, the paragraph "Word
+identification" is split into two text blocks by the figure). In this project, we
+will only consider PDF documents, but the approach can easily be generalized to
+work with other visual text formats that provide the layout information described
+above.
+
+## Problem definition
+
+We now want to give a precise definition of the problem we are trying to solve.
+We are given a list of characters, figures, and shapes extracted from a PDF,
+where each of these objects comes with the coordinates of the lower-left and
+upper-right corner of its bounding box, its color, and in the case of a
+character also with its font name, font size, and the additional specifiers
+'bold' and 'italic'.\
+Our goal is to produce a list of rectangles that represent the bounding boxes
+of the detected words and text blocks ordered by reading order.
+
+## Designing the algorithm
+
+Before we can start designing an algorithm, we need to establish two things:
+the prerequisites that are necessary to run the algorithm and the approach
+the algorithm is supposed to use to solve our problem.
+
+### 1. Prerequisites
+
+Our algorithm is supposed to process layout-based text documents. So we need
+some data structures that can hold the relevant information of a document for
+the algorithm to run on.
+Naturally, we want to represent the characters of the document. So we consider a
+class 'Character' that holds all the necessary information of a character in a
+layout-based document. Similarly, we define classes 'Figure' and 'Shape' for
+figures and shapes in the document respectively.
+
+Finally, we may now represent our document page by page using a class 'Page'
+which holds a list of all its characters, figures, and shapes, and some
+additional information like page number and the dimensions of the page (these
+may differ from document to document or even from page to page!).
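+
+To make these data structures concrete, here is a minimal Python sketch of what
+such classes could look like. The attribute names and types are assumptions for
+illustration only, not the project's actual code.
+
+```python
+from dataclasses import dataclass, field
+from typing import List, Tuple
+
+# A bounding box is given by its lower-left and upper-right corner.
+BoundingBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
+
+@dataclass
+class Character:
+    bounding_box: BoundingBox
+    text: str
+    color: str
+    font_name: str
+    font_size: float
+    bold: bool = False
+    italic: bool = False
+
+@dataclass
+class Figure:
+    bounding_box: BoundingBox
+    color: str
+
+@dataclass
+class Shape:
+    bounding_box: BoundingBox
+    color: str
+
+@dataclass
+class Page:
+    page_number: int
+    width: float
+    height: float
+    characters: List[Character] = field(default_factory=list)
+    figures: List[Figure] = field(default_factory=list)
+    shapes: List[Shape] = field(default_factory=list)
+
+    def entities(self):
+        """All entities (characters, figures, shapes) on this page."""
+        return [*self.characters, *self.figures, *self.shapes]
+```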
+
+We have now established sound data structures we can later run our algorithm on,
+but we still need a way of extracting the information needed to instantiate our
+classes from a document. For this, we use a two-step process: first, we parse
+the document (in our case, a PDF) into a format that contains all characters,
+figures, and shapes along with their corresponding attributes; for this task, we
+use [PdfAct](https://github.com/ad-freiburg/pdfact). We then use a simple,
+self-implemented parser that converts the extracted information into instances
+of our classes.\
+We are now able to convert any given PDF into a list of page objects that
+hold all the information our algorithm will need later on.
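+
+As a rough illustration of the second step, the following sketch converts
+extracted JSON into Page objects. The JSON layout shown here is hypothetical and
+only serves to demonstrate the idea; PdfAct's actual output format may differ.
+
+```python
+import json
+
+def parse_page(raw_page: dict) -> Page:
+    """Convert one extracted page (hypothetical JSON layout) into a Page object."""
+    page = Page(raw_page["page"], raw_page["width"], raw_page["height"])
+    for c in raw_page.get("characters", []):
+        page.characters.append(Character(
+            bounding_box=tuple(c["bbox"]),   # (x1, y1, x2, y2)
+            text=c["text"],
+            color=c["color"],
+            font_name=c["font"],
+            font_size=c["font_size"],
+            bold=c.get("bold", False),
+            italic=c.get("italic", False),
+        ))
+    # Figures and shapes would be parsed analogously.
+    return page
+
+def parse_document(path: str) -> List[Page]:
+    """Parse the JSON produced in the first step into a list of Page objects."""
+    with open(path) as f:
+        data = json.load(f)
+    return [parse_page(p) for p in data["pages"]]
+```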
+
+### 2. The algorithm
+
+For our algorithm, we use a simple recursive XY-Cut approach (heavily inspired
+by the approach described in this [paper](https://www.ecse.rpi.edu/~nagy/PDF_chrono/1992_Seth_ViswanathanPrototype_IEEE_Computer92.pdf)).
+The structure of the
+algorithm then becomes extremely simple:
+```
+function segment-document(document):
+ for every page in document do
+ yield recursive-xy(page)
+```
+where recursive-xy is defined as
+```
+function recursive-xy(page):
+ cuts := possible-cuts(page)
+
+ best-cut := choose-best-cut(cuts)
+
+ if not best-cut then
+ return empty_list()
+
+ p1, p2 := split-page(page, best-cut)
+
+ return list(best-cut, recursive-xy(p1), recursive-xy(p2))
+```
+We define a cut to be a tuple \\(([a, b], \textrm{dir})\\) where the first entry
+is an interval that specifies where we are allowed to cut and the second entry
+is a boolean value indicating the direction of the cut (we choose \\(\text{dir}\\)
+to be \\(\textbf{true}\\) if we cut vertically through the page and
+\\(\textbf{false}\\) if we cut horizontally). From here on, cuts whose second
+entry is \\(\textbf{true}\\) will be referred to as "x-cuts" and cuts whose
+second entry is \\(\textbf{false}\\) as "y-cuts".\
+The algorithm is built upon three subroutines that compute all possible cuts on
+the current page, evaluate which of them should be made, and create two
+"subpages" by applying the chosen cut to the given page. The algorithm then
+builds a nested list representation of a binary tree (called an XY-tree) that
+has the form
+$$[\text{best-cut}, \ \text{recursive-xy(p1)}, \ \text{recursive-xy(p2)}],$$
+where the first entry is the current node that represents the cut made at this
+recursion level and the second and third entries represent the left and the right
+subtree, which are the results of the segmentation of the first and second
+subpage, respectively.\
+We will now go into more detail on each of these subroutines; the sketch below
+shows how the overall recursion might look in code.
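+
+The following is a minimal Python sketch of this recursion, continuing the class
+sketches from above. The exact signatures of the three subroutines are
+assumptions here; the subroutines themselves are discussed in the following
+subsections.
+
+```python
+from typing import List, Optional, Tuple
+
+# A cut is an interval [a, b] plus a direction flag:
+# True for an x-cut (vertical cut), False for a y-cut (horizontal cut).
+Cut = Tuple[Tuple[float, float], bool]
+
+def recursive_xy(page: Page) -> list:
+    """Build the nested-list XY-tree for a (sub)page."""
+    cuts = possible_cuts(page)              # all cuts that hit no entity (section 2.1)
+    best_cut = choose_best_cut(cuts, page)  # pick one cut, or None (section 2.2)
+
+    if best_cut is None:                    # no cut possible: this subpage is a leaf
+        return []
+
+    p1, p2 = split_page(page, best_cut)     # apply the cut (section 2.3)
+    return [best_cut, recursive_xy(p1), recursive_xy(p2)]
+
+def segment_document(pages: List[Page]):
+    """Yield one XY-tree per page of the document."""
+    for page in pages:
+        yield recursive_xy(page)
+```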
+
+### 2.1. Computing possible cuts
+
+The computation of possible cuts can be intuitively understood as the process
+of determining where a horizontal or vertical line may be drawn through the
+whole subpage without crossing any characters, figures, or shapes on the
+subpage. We will split the computation of all possible cuts into the calculation
+of x-cuts and the calculation of y-cuts.\
+For the sake of example, let us consider how to determine all the possible
+x-cuts on a given page. The given page will contain a list of entities (where an
+entity is a Character, Figure, or Shape), each of which has a "bounding box"
+(that is, the smallest rectangle the entity fits into). A bounding box may be
+represented by only two points (i.e., the lower left and the upper right corner
+of the rectangle). Let's consider a page with only three entities:
+$$[\\{e\_1: (0, 1), (2, 3)\\}, \\{e\_2: (1, 0), (3, 1)\\},
+\\{e\_3: (4, 2), (5, 4)\\}].$$
+Because we are only interested in x-cuts right now, we don't need to consider
+the y-coordinates of the bounding boxes. Therefore, we only consider the
+"projection" of the entities onto the x-axis:
+$$[\\{e\_1: [0, 2]\\}, \\{e\_2: [1, 3]\\}, \\{e\_3: [4, 5]\\}].$$
+This yields a list of intervals representing the sections of the x-axis that
+are "blocked" by our entities. We now compute the union of these intervals,
+which yields
+$$[0, 2] \cup [1, 3] \cup [4, 5] = [0, 3] \cup [4, 5].$$
+Note that in the resulting list none of the intervals overlap. We have now
+computed a list of intervals where we would collide with the bounding box of an
+entity if we cut through them. So our final step is to compute the "gaps" of
+this list. That is, if we assume the list of intervals to be sorted by their
+starting point, for each pair of consecutive intervals \\([a\_i, b\_i]\\) and
+\\([a\_{i+1}, b\_{i+1}]\\) we take the gap \\([b\_i, a\_{i+1}]\\) between them.
+Our example yields exactly one such gap, i.e., \\([3, 4]\\). This interval
+represents the first entry of a possible x-cut we can make on our example page.\
+In this fashion, we may compute all the intervals of x-cuts and y-cuts that are
+possible on a given page. We then pair these intervals with their corresponding
+boolean value indicating x- or y-cut. This yields a list of all cuts we need to
+consider for the given page.\
+Note that in a PDF, some characters of a word may have a very small space between
+them, which could then be detected as a possible interval for a cut. To avoid (or
+at least reduce) this problem, we define a threshold \\(m \geq 0\\) and only
+consider cuts whose interval has a length of at least \\(m\\). We use
+\\(m = 0.5\\), which for most documents is bigger than the distance between the
+characters of a word but still smaller than the distance between two words.
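+
+The following sketch shows how the computation of x-cuts could be implemented
+with the interval-union approach described above (y-cuts are computed
+analogously using the y-coordinates). It continues the Python sketches from the
+previous sections and is not the project's actual code.
+
+```python
+def possible_x_cuts(page: Page, m: float = 0.5) -> List[Cut]:
+    """Compute all x-cuts on a page: gaps on the x-axis not blocked by any entity."""
+    # Project every entity onto the x-axis.
+    intervals = sorted(
+        (e.bounding_box[0], e.bounding_box[2]) for e in page.entities()
+    )
+    if not intervals:
+        return []
+
+    # Merge overlapping intervals into a union of disjoint "blocked" intervals.
+    merged = [list(intervals[0])]
+    for start, end in intervals[1:]:
+        if start <= merged[-1][1]:
+            merged[-1][1] = max(merged[-1][1], end)
+        else:
+            merged.append([start, end])
+
+    # The gaps between consecutive blocked intervals are the possible cuts;
+    # gaps narrower than the threshold m are discarded.
+    cuts = []
+    for (_, b_i), (a_next, _) in zip(merged, merged[1:]):
+        if a_next - b_i >= m:
+            cuts.append(((b_i, a_next), True))  # True marks an x-cut
+    return cuts
+```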
+
+### 2.2. Choosing a "best cut"
+
+Now that we have computed all possible cuts, we need some way of choosing the
+best one of them. This begs the question of what a "best cut" is even supposed
+to be. Without going into too much detail, we say a cut is "good" (or even "the
+best") if it is consistent with the layout and the natural reading
+order of the current page (e.g., we split between headline
+and text body before splitting the body into individual lines or split into
+individual lines before splitting into individual words). There are many simple
+but naive ways to choose a "best" cut.\
+The most primitive approach used in this project is the "alternating-cut"
+strategy. This strategy simply returns a y-cut first and then alternates
+between x- and y-cuts. Although it is possible to get somewhat
+decent results on PDFs with a very specific layout (e.g., the first chosen y-cut
+separates headline and text body and the following x-cut separates the body
+into columns), this is in general not a useful way of choosing the best
+cut because it essentially ignores everything except the direction of the cut.\
+An obvious improvement, while still being extremely simple, is to choose cuts by
+their size (i.e., the length of the corresponding interval). We call this the
+"largest-cut" strategy. This is reasonable because in almost all PDFs, distance
+is used to clearly distinguish the different parts of the document visually
+(e.g., the distance between a headline and the following body of text will be
+larger than the distance between two lines within the text body). This strategy
+will perform much better at detecting things like columns in the text than the
+previous approach. For example, if we have text separated into two columns, the
+alternating strategy may choose to split horizontally between lines (and thus
+across the columns) instead of vertically between the columns simply because
+it chose an x-cut in the previous step. The largest-cut strategy will prefer
+splitting between columns over splitting between lines because the distance
+between columns will simply be larger. While better than just alternating
+between directions, this strategy also has many problems. One such problem is
+that in PDF, vertical distance and horizontal distance aren't always directly
+comparable, which leads us to our next strategy.\
+While distance is a key factor in determining the layout of a PDF,
+simply choosing the largest cut all the time is not always correct. In
+particular, not only the size but also the direction of a cut is important. For
+example, the distance between two columns can be as large as or even larger than
+the distance between a headline and the remaining text. In general, it is
+noticeable that, given an x-cut and a y-cut of similar size, it is often
+better to prefer the y-cut over the x-cut. This leads us to our next approach,
+the "weighted-largest-cut" strategy. This strategy yields a "largest" cut but
+favors y-cuts by multiplying their size by some factor \\(r \geq 1\\) (in my
+experiments, \\(r = 2.5\\) brought forth the best results). This strategy
+performed really well on my test corpus of about 20 PDF documents. Especially in
+splitting headlines, columns, and sections, the strategy produced good results.
+It is easy to see that even though this strategy might be a pretty decent
+heuristic for choosing a best cut, it is still far from perfect. It will most
+likely not work as well on a larger corpus of documents because the value of
+\\(r\\) was adjusted to work well on my set of test documents and need not be
+good in general.
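+
+A sketch of how the "weighted-largest-cut" strategy could be implemented,
+continuing the Python sketches from above (the function name and signature are
+assumptions):
+
+```python
+def choose_weighted_largest(cuts: List[Cut], r: float = 2.5) -> Optional[Cut]:
+    """Pick the cut with the largest (direction-weighted) interval.
+
+    Y-cut sizes are multiplied by r >= 1 so that y-cuts are preferred
+    over x-cuts of similar size.
+    """
+    if not cuts:
+        return None
+
+    def weighted_size(cut: Cut) -> float:
+        (a, b), is_x_cut = cut
+        size = b - a
+        return size if is_x_cut else r * size
+
+    return max(cuts, key=weighted_size)
+```
+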
+All the previous strategies focus on only one or two properties of a cut.
+So naturally, we would be interested in a strategy that can look at multiple (or
+better, arbitrarily many) properties at the same time. To achieve this, we will
+consider a more general approach. First, we define a parameter \\(P\\) to be a
+function from the set of possible cuts to the interval \\([0, 1]\\) that
+assesses the quality of a cut with respect to some property (we consider \\(1\\)
+to be the best achievable quality and \\(0\\) the lowest).\
+An example of such a parameter is \\(P\_{size}\\) which assigns each cut \\(c\\)
+its own size divided by the size of the largest cut \\(c\_{max}\\) that has the
+same direction as \\(c\\):
+$$P\_{size}\(c\) = \frac{size\(c\)}{size(c\_{max})}$$
+The larger the size of a cut, the higher its \\(P\_{size}\\) value will be; the
+largest cut will have a value of \\(1\\).\
+Now we define another function
+$$f : [0, 1]^n \rightarrow [0, 1]$$
+where \\(n\\) is the number of properties we want to consider and \\(f\\) has
+the following two properties: for all \\(x\_1,\dots,x\_n,y\_1,\dots,y\_n
+\in [0, 1]\\) the implication
+$$\bigwedge\_{i=1}^n x\_i \leq y\_i \Longrightarrow f(x\_1,\dots,x\_n) \leq
+f(y\_1,\dots,y\_n)$$
+holds (i.e., \\(f\\) is monotone in every variable) and for all
+\\(x\_1,\dots,x\_n \in [0, 1]\\)
+$$\bigwedge\_{i=1}^n x\_i = 1 \Longleftrightarrow f(x\_1,\dots,x\_n) = 1.$$
+We then use one parameter for every property we want to take into account
+(properties can be size, position, or direction of a cut, fonts and font sizes
+on both sides of the cut, etc.). This gives us a set of parameters
+\\(\\{P\_1,\dots,P\_n\\}\\). We may now choose a best cut by selecting a cut
+\\(c\\) that maximizes the expression \\(f(P\_1\(c\),\dots,P\_n\(c\))\\). A
+further refinement may be made by incorporating a weight
+\\(w\_i \in \mathbb{R}^+\\) for every parameter that determines how much impact
+a change to that parameter should have on the value of \\(f\\).\
+In practice, I mostly used the three parameters \\(P\_{pos}\\), \\(P\_{fs}\\), and
+the above-mentioned \\(P\_{size}\\).\
+\\(P\_{pos}\\) assigns each cut its relative position on the current subpage
+with respect to its direction. That is, for height \\(h\\) and width \\(w\\)
+of the subpage and a cut \\(c = ([a, b], \textrm{dir})\\):
+$$P\_{pos}\(c\) = \begin{cases} 1 - \frac{a}{w} & \textrm{if dir} =
+\textbf{true} \\\\ \frac{a}{h} & \textrm{if dir} = \textbf{false}\end{cases}.$$
+Note that because cuts cannot overlap, it does not matter whether we use \\(a\\),
+\\(b\\), or \\(a + \frac{b - a}{2}\\) in the above definition. \\(P\_{pos}\\) will
+prefer cuts whose positions match the natural reading order (we read from top to
+bottom and from left to right): the closer a y-cut is to the top of the page,
+the higher its value, and similarly, the closer an x-cut is to the left margin,
+the higher its value will be.\
+\\(P\_{fs}\\) will look at the different font sizes of the subpages a cut
+produces and prefer cuts whose subpages have similar font sizes. More precisely,
+let \\(p\_1\(c\)\\) and \\(p\_2\(c\)\\) be the two subpages created by a cut
+\\(c\\), \\(\textrm{avg-fs}(p\_i\(c\))\\) the average, and
+\\(\textrm{max-fs}(p\_i\(c\))\\) the maximum font size of subpage \\(p\_i\\).
+We now define \\(\textrm{dev-fs}(p\_i\(c\)) = \textrm{max-fs}(p\_i\(c\)) -
+\textrm{avg-fs}(p\_i\(c\))\\) as the deviation between the maximum and the
+average font size. We then have:
+$$P\_{fs}\(c\) = \frac{1}{1 + \max\\{\textrm{dev-fs}(p\_1\(c\)),
+\textrm{dev-fs}(p\_2\(c\))\\}}.$$
+The more uniform the font sizes within the worse of the two subpages are, the
+better a cut's \\(P\_{fs}\\) value will be.\
+Using these parameters produces reasonable results but is neither strictly
+better nor strictly worse than "weighted-largest-cut". It does really well with
+respect to word order within a line because it will split the words of the line
+from left to right. Similarly, it will almost always start by splitting the
+headline from the text, which is good for title pages but not for pages without
+headlines, where, for example, it would make more sense to first split between
+the text and the page number (usually located at the very bottom of the page).\
+While this approach is in general definitely more refined than the previous
+strategies, it is still nontrivial to find sets of parameters and weights that
+lead to proper results in the selection of cuts. Experimenting with these values
+by hand becomes tedious really fast and thus is not really feasible for tuning
+to achieve the best possible results. However, a machine learning model is very
+well suited to be trained for such a purpose, and we will briefly touch on this
+topic again in [a later section](#problems-and-future-improvements).
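+
+To illustrate the parameter-based approach, here is a sketch that combines
+\\(P\_{size}\\) and \\(P\_{pos}\\) with a weighted geometric mean as the
+function \\(f\\) (which is monotone in every argument and equals \\(1\\) exactly
+when all arguments are \\(1\\)). \\(P\_{fs}\\) is omitted because it requires
+splitting the page and inspecting font sizes; the signatures and default weights
+are assumptions for this sketch.
+
+```python
+def p_size(cut: Cut, cuts: List[Cut]) -> float:
+    """Size of a cut relative to the largest cut with the same direction."""
+    (a, b), direction = cut
+    same_dir = [d - c for (c, d), other_dir in cuts if other_dir == direction]
+    return (b - a) / max(same_dir)
+
+def p_pos(cut: Cut, page: Page) -> float:
+    """Relative position of a cut: prefers y-cuts near the top and x-cuts near the left."""
+    (a, _), is_x_cut = cut
+    return 1 - a / page.width if is_x_cut else a / page.height
+
+def f(scores: List[float], weights: List[float]) -> float:
+    """Weighted geometric mean: monotone in every score and 1 iff every score is 1."""
+    result = 1.0
+    for score, weight in zip(scores, weights):
+        result *= score ** weight
+    return result
+
+def choose_best_cut(cuts: List[Cut], page: Page,
+                    weights: Tuple[float, float] = (1.0, 1.0)) -> Optional[Cut]:
+    """Pick the cut that maximizes f over its parameter scores (P_fs omitted here)."""
+    if not cuts:
+        return None
+    return max(cuts, key=lambda c: f([p_size(c, cuts), p_pos(c, page)], list(weights)))
+```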
+
+### 2.3. Splitting the page
+
+Unlike the previous subroutine, applying a chosen cut to a given page is pretty
+straightforward. We create two subpages and assign all entities of the given
+page to one of those. Let \\(P\\) be the set of entities of the given page
+$$P = \\{\\{e\_i: (x\_1, y\_1), (x\_2, y\_2) \\} \ \vert \ 1 \leq i \leq n\\}.$$
+For an x-cut \\(([a, b], \textbf{true})\\), the first subpage then has the
+following set of entities
+$$P\_1 = \\{\\{e\_i: (x\_1, y\_1), (x\_2, y\_2)\\} \ \vert \ x\_2 \leq a\\},$$
+similarly the entity set of the second subpage is
+$$P\_2 = \\{\\{e\_i: (x\_1, y\_1), (x\_2, y\_2)\\} \ \vert \ x\_1 \geq b\\}.$$
+From the way we computed the possible cuts, we know that each entity in
+\\(P\\) has to be either in \\(P\_1\\) or in \\(P\_2\\).\
+Here is an example of entities on a page, each represented by their bounding
+box, being distributed to the first subpage (marked in green) and second subpage
+(marked in red) by an x-cut (marked in blue).
+
+In the case of a y-cut we would use \\(y\_1\\) and \\(y\_2\\) in the definition
+of \\(P\_1\\) and \\(P\_2\\) respectively.
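+
+A sketch of the splitting step, continuing the Python sketches from above; only
+characters are shown, figures and shapes are distributed in exactly the same way.
+
+```python
+def split_page(page: Page, cut: Cut) -> Tuple[Page, Page]:
+    """Distribute the entities of a page onto two subpages according to a cut."""
+    (a, b), is_x_cut = cut
+    p1 = Page(page.page_number, page.width, page.height)
+    p2 = Page(page.page_number, page.width, page.height)
+
+    for char in page.characters:
+        x1, y1, x2, y2 = char.bounding_box
+        if is_x_cut:
+            # Entities entirely left of the cut interval go to p1, the rest to p2;
+            # by construction of the cuts, no entity overlaps the interval [a, b].
+            (p1 if x2 <= a else p2).characters.append(char)
+        else:
+            # For a y-cut, the y-coordinates are used instead.
+            (p1 if y2 <= a else p2).characters.append(char)
+    return p1, p2
+```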
+
+## Evaluation of segmentation
+
+Now that we have a sound foundation for our algorithm, we need a way to assess
+how well the algorithm performs on a document with a given choosing strategy.
+
+### 1. Visual evaluation
+
+Initially, it suffices to evaluate the segmentation made by the algorithm
+"visually". That is, we traverse the resulting XY-tree via a depth-first search
+and compute the start and end point of every cut we encounter. We then use
+these points to draw a line for each cut using
+[pdf-drawer](https://ad-git.informatik.uni-freiburg.de/ck1028/pdf-drawer).\
+A visualization of a segmentation done using the "weighted-largest-cut" strategy
+(described [here](#2-2-choosing-a-best-cut)) created with pdf-drawer looks like
+this:
+
+An attempt was made to visualize the depth of a cut in the XY-tree by the
+thickness and color of the lines (cuts are drawn thinner and
+change color from red to green to blue the deeper they are in the tree).
+
+### 2. Algorithmic evaluation
+
+While a visual evaluation is great for getting a feeling of how a particular
+choosing strategy operates, we need a better method of assessing quality more
+objectively than "just by looking at it". We would also be interested in a
+quality measure that allows objective comparisons of segmentations. Therefore,
+we are looking for methods that measure how similar a segmentation by the
+algorithm is to a given ground truth for a document.\
+This leads us to the first and simplest approach to compare the result of our
+algorithm to a ground truth. We create an "optimal" XY-tree up to a certain
+depth for every page of the given document (we build the tree by manually
+choosing the cut that is best for every situation). For a segmentation computed
+by the algorithm, we can now traverse both trees the same way and for every node
+increase the current score if the cut represented by the node is correct with
+respect to our ground truth or decrease the score if it is not. For this
+approach, it is reasonable to weigh the adjustments made to the score less the
+deeper in the tree we currently are. This yields a score for each page and we
+may assign a score to the whole document by taking the mean of these scores:
+$$\text{score}(doc) = \frac{1}{\vert Pages \vert} \sum\_{p \in Pages}
+\text{score}(p).$$
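+
+A sketch of this depth-weighted tree comparison on the nested-list XY-trees; the
+helper `cuts_match`, which decides whether a computed cut agrees with the
+ground-truth cut, is hypothetical, and the halving of the weight per level is
+just one possible weighting.
+
+```python
+def tree_score(computed: list, ground_truth: list, depth: int = 0) -> float:
+    """Compare two XY-trees node by node with depth-weighted scores."""
+    if not computed or not ground_truth:
+        return 0.0
+
+    weight = 1.0 / (2 ** depth)  # deeper nodes contribute less to the score
+    score = weight if cuts_match(computed[0], ground_truth[0]) else -weight
+
+    # Recurse into the left and right subtrees.
+    score += tree_score(computed[1], ground_truth[1], depth + 1)
+    score += tree_score(computed[2], ground_truth[2], depth + 1)
+    return score
+```
+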
+This approach still has some major flaws. First of all, if a cut is not correct,
+there is no proper way to evaluate the correctness of the subtrees of that node
+(they might be correct with respect to the cut of the parent node, but that cut
+was not correct to begin with). Second, we have no easy way of generating the
+ground truth from a given document (generating a sound ground truth is about
+as complex as always choosing the correct cut right from the start).\
+Our second approach is to describe the layout of a page by an ordered list of
+rectangles. For example, we may describe the general layout of this page by
+using these rectangles (in the order: red, turquoise, green, yellow, blue):
+
+Note that we may adjust the accuracy of such a description by adding or
+removing rectangles as we see fit.\
+After that, we reconstruct a similar list of rectangles (similar in accuracy to
+the given ground truth) from the cuts in the XY-tree produced by the algorithm.
+We can now compare these two lists by applying the following quality measures.
+Let \\(L\_G\\) be the ordered list of ground truths and \\(L\_A\\) the list
+constructed from the result of our algorithm. We then define
+$$R^+ = \vert \\{r \ \vert \ r \in L\_A, r \not\in L\_G\\} \vert,$$
+$$R^- = \vert \\{r \ \vert \ r \in L\_G, r \not\in L\_A\\} \vert,$$
+$$Inv = \vert \\{ (r, r') \ \vert \ r, r' \in L\_A \cap L\_G,
+r <\_{L\_G} r', r' <\_{L\_A}r\\} \vert.$$
+That is, \\(R^+\\) is the number of elements that are in the list of the
+algorithm but not in the ground truth list, \\(R^-\\) is the number of elements
+that are in the ground truth list but not in the list of the algorithm, and
+\\(Inv\\) is the number of inversions of the elements in \\(L\_A \cap L\_G\\)
+in terms of the ordering of \\(L\_G\\) and \\(L\_A\\), respectively. Intuitively,
+\\(Inv\\) can be understood as the number of pairs of rectangles that appear in
+both lists but not in the same order.\
+We may interpret \\(R^+\\) as the number of "wrong" rectangles found by the
+algorithm, \\(R^-\\) as the number of "correct" rectangles the algorithm could
+not find, and \\(Inv\\) as a measure of the similarity of the orderings of the
+two lists. Similar to the previous approach, we may now either consider the
+individual \\(R^+\\), \\(R^-\\), and \\(Inv\\) scores per page or take the
+average over the whole document.
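+
+A sketch of how \\(R^+\\), \\(R^-\\), and \\(Inv\\) could be computed for a
+single page, continuing the Python sketches from above and assuming the
+rectangles are represented as hashable tuples so the two lists can be compared
+for equality.
+
+```python
+def evaluate_page(l_a: List[tuple], l_g: List[tuple]) -> Tuple[int, int, int]:
+    """Compute R+, R-, and Inv for the algorithm's list l_a and the ground truth l_g."""
+    set_a, set_g = set(l_a), set(l_g)
+
+    r_plus = len(set_a - set_g)   # rectangles found by the algorithm but not expected
+    r_minus = len(set_g - set_a)  # expected rectangles the algorithm did not find
+
+    # Count inversions: pairs of common rectangles appearing in different order.
+    common = [r for r in l_g if r in set_a]       # common rectangles in ground-truth order
+    pos_a = {r: i for i, r in enumerate(l_a)}     # position of each rectangle in l_a
+    inv = sum(
+        1
+        for i in range(len(common))
+        for j in range(i + 1, len(common))
+        if pos_a[common[i]] > pos_a[common[j]]
+    )
+    return r_plus, r_minus, inv
+```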
+
+### 3. Evaluation results
+
+To assess the quality of our cut-choosing strategies, we created "ground truths"
+for 10 of our PDF documents. Every ground truth consists of one ordered list of
+rectangles per page of the document (as described above). These were created
+manually and thus are neither perfect nor really in-depth (they mainly focus on
+correct headline and column detection).\
+The following table reports the results of the evaluation. All values are reported
+as the average over all documents. In addition to the absolute \\(R^+\\) and
+\\(R^-\\) values, we provide percentage values representing the ratio of "wrong"
+rectangles to all rectangles found in the case of \\(R^+\\) and the ratio of "missed"
+rectangles to all rectangles expected by the ground truth in the case of \\(R^-\\). For each
+of the evaluation criteria, the value of the worst-performing strategy is marked
+in red and the value of the best-performing one in blue.
+
+| Strategy | \\(R^+\\) | \\(R^+ \ (\\%)\\) | \\(R^-\\) | \\(R^- \ (\\%)\\) | \\(Inv\\) |
+|----------|-----------|------------------|-----------|------------------|-----------|
+| alternating | \\(\color{red}{1400.1275}\\) | \\(\color{red}{99.86}\\) | \\(\color{red}{2.7425}\\) | \\(\color{red}{85.56}\\) | \\(0.0\\) |
+| largest | \\(1398.4168\\) | \\(99.64\\) | \\(1.0318\\) | \\(29.6\\) | \\(0.0\\) |
+| weighted-largest | \\(\color{blue}{1398.2875}\\) | \\(\color{blue}{99.59}\\) | \\(\color{blue}{0.9025}\\) | \\(\color{blue}{26.08}\\) | \\(0.0\\) |
+| arbitrary (\\(P\_{size}, P\_{pos}, P\_{fs}\\)) | \\(1398.8075\\) | \\(99.61\\) | \\(1.4225\\) | \\(49.83\\) | \\(0.0\\) |
+
+As expected, the "alternating-cut" strategy performs worst out of all evaluated
+strategies. The "largest-cut" strategy outperforms it by a big margin but is still
+worse than the best-performing "weighted-largest-cut" strategy. Surprisingly, the
+approach for arbitrary parameters is outperformed by both the "largest-cut" and
+the "weighted-largest-cut" strategy. This could be due to a sub-par selection of
+parameters and weights, as opposed to the finely tuned (but most likely
+overfitted) weight of the "weighted-largest-cut" strategy. Another reason might
+be that headline and column detection can still be done pretty easily by these
+simple strategies, and that is what the ground truths mainly focus on. Also
+very noticeable is the complete lack of inversions on the evaluated documents. For the
+alternating strategy, this can be explained by the low number of correct
+rectangles found, because inversions are only computed on correctly
+identified rectangles. Furthermore, because the ground truths are relatively
+shallow and thus do not dictate many expected rectangles, low values for
+inversions are to be expected. The complete absence of detected inversions is
+still surprising and suggests that even these simple strategies have some
+capability to respect reading order (like considering headlines before
+the body, the left column before the right column, etc.).\
+Note that because the list of rectangles output by the algorithm is not yet
+filtered (i.e., adjusted to the granularity of the ground truth) before performing
+the evaluation, all strategies report unusually high \\(R^+\\) values.
+
+## Problems and future improvements
+
+Although up to this point we have established a stable baseline, there are still
+some problems that need to be addressed. First of all, we do not yet have a
+cut-choosing strategy that performs reasonably well on a larger set of documents.
+While it is somewhat doable to tweak a given choosing strategy to work well on a
+specific document, it is pretty difficult to generalize such a strategy to
+perform decently on a larger corpus. Second, the overall execution time of the
+algorithm and the JSON parser could still be improved, especially the computation
+of the possible cuts (on a page with many entities, many intervals are already
+contained in previously considered intervals) and the evaluation of the parameters
+used for choosing cuts. Lastly, we cannot yet use the results of our segmentation
+to extract text.\
+In the future, I want to include some extensions and further improvements to
+this project. One obvious improvement is the use of a machine learning model for
+choosing cuts. The choice and proper weighting of parameters (as described
+[here](#2-2-choosing-a-best-cut)) is a task well suited to be learned by
+a neural network, and I am confident a well-trained model can help us achieve a
+more general but still reasonable strategy for choosing cuts. It will also be
+exciting to apply the results of the algorithm to various tasks that require
+layout information (like the extraction of text). I look forward to continuing to
+work on this project and hopefully expanding it into a proper bachelor's thesis!
+
+## Acknowledgments
+
+Special thanks to the head of the Chair of Algorithms and Data Structures,
+[Hannah Bast](https://ad.informatik.uni-freiburg.de/staff/bast), and my
+supervisor,
+[Claudius Korzen](https://ad.informatik.uni-freiburg.de/staff/korzen), who
+wrote this compelling
+[paper](https://ad-publications.cs.uni-freiburg.de/benchmark.pdf), which piqued
+my interest in this particular topic and which was used in many examples
+throughout this post!
\ No newline at end of file
diff --git a/static/img/project-segmentation-of-layout-based-documents/benchmark.jpg b/static/img/project-segmentation-of-layout-based-documents/benchmark.jpg
new file mode 100644
index 0000000..0988318
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/benchmark.jpg differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/gopher.png b/static/img/project-segmentation-of-layout-based-documents/gopher.png
new file mode 100644
index 0000000..3bfe85d
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/gopher.png differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/rect-title.jpg b/static/img/project-segmentation-of-layout-based-documents/rect-title.jpg
new file mode 100644
index 0000000..646d26a
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/rect-title.jpg differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/rectangle-evaluation.jpg b/static/img/project-segmentation-of-layout-based-documents/rectangle-evaluation.jpg
new file mode 100644
index 0000000..dd3e781
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/rectangle-evaluation.jpg differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/split-page.png b/static/img/project-segmentation-of-layout-based-documents/split-page.png
new file mode 100644
index 0000000..dc7dfac
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/split-page.png differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/title-extended-v3.jpg b/static/img/project-segmentation-of-layout-based-documents/title-extended-v3.jpg
new file mode 100644
index 0000000..02492b9
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/title-extended-v3.jpg differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/uml-1.png b/static/img/project-segmentation-of-layout-based-documents/uml-1.png
new file mode 100644
index 0000000..b6206e5
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/uml-1.png differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/uml-2.png b/static/img/project-segmentation-of-layout-based-documents/uml-2.png
new file mode 100644
index 0000000..c1061c2
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/uml-2.png differ
diff --git a/static/img/project-segmentation-of-layout-based-documents/visual-evaluation.jpg b/static/img/project-segmentation-of-layout-based-documents/visual-evaluation.jpg
new file mode 100644
index 0000000..d2c2f50
Binary files /dev/null and b/static/img/project-segmentation-of-layout-based-documents/visual-evaluation.jpg differ