-
Hey there, I'm currently evaluating investigation platforms to set up for a small organization. What are the main differences between Datashare and alternatives such as OCCRP's Aleph? Why did you opt to build Datashare instead of using something that already existed? Why would you recommend it over the others? Thanks for bothering :)
Replies: 1 comment 1 reply
-
Hi @ninoppp,

I won't start a long and exhaustive comparison with Aleph or Blacklight (which are both great!) but I can give you a little context about Datashare and why we created it.

In 2016, ICIJ (the NGO behind Datashare) started to develop a tool to extract named entities (people, organizations, locations) by combining the most popular Natural Language Processing libraries. At that time, we were a very small organization and we hired one developer to work on that specific task.

With named entities, we wanted to enrich our search results when dealing with huge investigations (such as the Panama Papers). So at the beginning, Datashare was just a standalone CLI tool that received a chunk of text and extracted all named entities by combining CoreNLP, OpenNLP, GateNLP, etc.

The first results were promising, so we decided to start integrating those named entities into our existing indexes. Since 2015, we had been using Blacklight (an open source Ruby on Rails app) to explore and share documents with our partners. The problem: Blacklight is based on the Apache Solr search engine, which was already struggling to deal with the massive trove of documents on our servers. Adding named entities to Blacklight would have been very hard. Some might even say impossible.

So we started to look for an alternative using Elasticsearch, the most popular and fastest search engine on the market at that time. Unfortunately, back in 2016, very few user interfaces for Elasticsearch existed. Aleph (which uses Elasticsearch) was only in its early phase and we needed a solution that was able to:
During the same period, our organization also got funding from the Brown Institute to build a secure protocol to share documents between journalists. Datashare needed to be super-powerful-extendable-scalable-fast-secure-reliable not only on our servers, but also on personal laptops, for people who wanted to mine their own documents without great tech skills. In other words, we needed some sort of unicorn.

Obviously, there was nothing on the market that was already doing all those things at the same time. This is why we progressively pivoted from a named-entities CLI tool to a self-hosted search engine for documents. That was by far the best option to build a platform that would fit all our needs. It was obviously very ambitious, but searching through millions of documents is at the very center of everything we do at ICIJ. It makes sense for us to keep some sort of control over the code that powers everything we do.

People tend to compare Datashare with Aleph. They are right, we have many features in common. But in fact we were mostly inspired by Blacklight in the first place, which we had used for many years, even before Aleph existed.

So Datashare, Blacklight, Aleph (OCCRP), Overview (Columbia) or Giant (The Guardian): what's the best solution for you? I couldn't say. All of these platforms are great, developed by amazing people with different objectives and agendas. The beauty of open source is that it offers users alternatives. I don't think it's a big deal to develop similar open source tools; on the contrary, it's a pretty healthy situation and a great way to encourage innovation.