Home
Vincent Chen edited this page Apr 8, 2015 · 15 revisions
DIRT (Dynamic Identification of Reused Text) aims to let users, primarily academics, find passages shared by pairs of documents within a corpus. Users will be able to view a pair of documents alongside their common passages, and to see which documents in the corpus share passages with one particular document in that corpus, known as the focus document.
DIRT also aims to be extensible to other languages, although the prototype will focus on ancient Chinese. DIRT should be able to find matches in a UTF-8 encoded corpus in any language, with a language-specific module making matching more permissive.
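To make the idea concrete, here is a minimal Python sketch of language-agnostic matching with an optional language-specific hook. It is not DIRT's actual implementation: the names `common_passages`, `fold_variants`, and `VARIANTS`, the `min_length` threshold, and the use of the standard library's `difflib.SequenceMatcher` are all assumptions made for illustration. The sketch also assumes that the normalizer preserves string length, so that reported offsets still point into the original text.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def identity(text: str) -> str:
    """Default normalizer: compare the text exactly as written."""
    return text


# Hypothetical variant table for a Chinese-specific module: each key is
# folded into its value before comparison (one character to one character).
VARIANTS = str.maketrans({"说": "說"})


def fold_variants(text: str) -> str:
    """Illustrative language-specific normalizer (length-preserving)."""
    return text.translate(VARIANTS)


def common_passages(
    a: str,
    b: str,
    min_length: int = 4,
    normalize: Callable[[str], str] = identity,
) -> List[Tuple[int, int, int]]:
    """Return (offset_in_a, offset_in_b, length) for shared passages.

    Works on decoded Unicode strings, so any UTF-8 corpus can be compared
    once it has been read as text. SequenceMatcher yields an ordered
    alignment, so this is a simplification: it will not report passages
    that appear in a different order in the two documents.
    """
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_length
    ]
```

Calling `common_passages(a, b)` compares the raw characters, while `common_passages(a, b, normalize=fold_variants)` treats the listed variant characters as equal, so a shared passage is no longer cut short where the two documents use different variants. This is one way a language-specific module could make matching more permissive without changing the core matcher.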