Vincent Chen edited this page Apr 8, 2015 · 15 revisions

What is DIRT?

DIRT (Dynamic Identification of Reused Text) aims to let users, primarily academics, find passages that are shared by pairs of documents within a corpus. Users will be able to view a pair of documents alongside their common passages, and to see which other documents in the corpus share passages with one chosen document, known as the focus document.

DIRT also aims to be extensible to other languages, although ancient Chinese is the focus of the prototype. DIRT should be able to find matches in any UTF-8 encoded corpus regardless of language, with an optional language-specific module that makes matching more permissive.
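
To make the idea concrete, here is a minimal Python sketch of focus-document matching. It is not DIRT's actual implementation: the function names (`common_passages`, `matches_for_focus`, `strip_punctuation`), the `min_len` threshold, and the use of the standard library's `difflib.SequenceMatcher` are all assumptions chosen for illustration. The `normalize` hook stands in for a language-specific module; stripping punctuation is only one guess at what "more permissive" matching could mean.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

# Hypothetical type for a language-specific module: text in, normalized text out.
Normalizer = Callable[[str], str]


def strip_punctuation(text: str) -> str:
    """Example normalizer (an assumption): keep only letters and digits.

    CJK ideographs count as alphabetic in Python, so they are kept,
    while punctuation and whitespace are dropped.
    """
    return "".join(ch for ch in text if ch.isalnum())


def common_passages(a: str, b: str, min_len: int = 4,
                    normalize: Optional[Normalizer] = None) -> List[str]:
    """Return contiguous passages of at least min_len characters found in both texts."""
    if normalize is not None:
        a, b = normalize(a), normalize(b)
    # autojunk=False disables difflib's heuristic that ignores very frequent characters.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]


def matches_for_focus(corpus: Dict[str, str], focus: str, min_len: int = 4,
                      normalize: Optional[Normalizer] = None) -> Dict[str, List[str]]:
    """Map each other document name to the passages it shares with the focus document."""
    results: Dict[str, List[str]] = {}
    for name, text in corpus.items():
        if name == focus:
            continue
        passages = common_passages(corpus[focus], text, min_len, normalize)
        if passages:  # only report documents that share at least one passage
            results[name] = passages
    return results
```

Because Python strings are sequences of Unicode code points, the same code works on any decoded UTF-8 text without a language module. Note that `SequenceMatcher` reports an ordered, non-overlapping set of matching blocks and compares documents pairwise in roughly quadratic time, so a real tool for large corpora would likely need an indexed approach instead.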
