DEplain: Web Harvester

We have built an open-source web harvester in Python that downloads, aligns, and extracts the text of parallel documents from given web pages (including paragraphs). We used this web crawler to download the parallel documents of DEplain-web. For reproducibility, we have made the code and the list of web pages available. Please use this code to crawl the web documents with a closed license in order to extend the document simplification data of DEplain-web. If you use one of the alignment methods, you can also extend the sentence simplification data of DEplain-web.
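To illustrate the kind of paragraph-level extraction described above, here is a minimal sketch that uses only the Python standard library. It is not the harvester's actual code: the names `ParagraphExtractor`, `extract_paragraphs`, and `pair_by_position` are hypothetical, and pairing paragraphs by position is a naive stand-in for the alignment methods shipped with the harvester.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def pair_by_position(complex_html, simple_html):
    """Naive assumption: the i-th paragraphs of both documents correspond."""
    return list(zip(extract_paragraphs(complex_html),
                    extract_paragraphs(simple_html)))
```

For example, `pair_by_position(original_page, simplified_page)` would return a list of `(complex, simple)` paragraph tuples, truncated to the shorter document. Real parallel pages rarely line up this cleanly, which is why dedicated alignment methods are needed.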

Installation

You can find instructions on how to install and use the web harvester here: https://github.com/rstodden/data_collection_german_simplification.

License

This code is licensed under the GPL-3.0 license.
