We built an open-source web harvester in Python that downloads the parallel documents of given web pages, aligns them, and extracts their text (including paragraphs). We used this web crawler to download the parallel documents of DEplain-web. For reproducibility, we have made the code and the list of web pages available. Please use this code to crawl the web documents with a closed license and so extend the document simplification data of DEplain-web. If you also use one of the alignment methods, you can extend the sentence simplification data of DEplain-web as well.
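To give a rough idea of what extracting paragraph text from a pair of parallel pages involves, here is a minimal sketch that uses only the Python standard library. It is not the harvester's actual implementation: the names `ParagraphExtractor`, `extract_paragraphs` and `pair_by_position` are hypothetical, the download step is left out, and pairing paragraphs by position is only a naive placeholder for a real alignment method.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def _flush(self):
        # Store the buffered paragraph with whitespace normalised.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # handles <p> elements without a closing tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def pair_by_position(complex_paragraphs, simple_paragraphs):
    """Naive placeholder alignment: pair paragraphs that share an index."""
    return list(zip(complex_paragraphs, simple_paragraphs))
```

In this sketch, `extract_paragraphs` would be called once on the original page and once on its simplified counterpart, and the two resulting lists passed to `pair_by_position`. Real pages usually differ in the number of paragraphs, which is why the harvester's own alignment methods should be preferred.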
You can find instructions on how to install and use the web harvester here: https://github.com/rstodden/data_collection_german_simplification.
The web harvester code is licensed under the GPL-3.0 license.