CaillaudPA/elasticsearch-pdf-importer

Elasticsearch PDF importer

It allows you to import PDF files into Elasticsearch and search their contents.

Requirements

  • Elasticsearch (version 6)
  • ingest-attachment plugin (see the doc)

If you haven't installed the ingest-attachment plugin, run this on your server (from the Elasticsearch home directory):

sudo bin/elasticsearch-plugin install ingest-attachment

Installation

Installing the Composer package
composer require eze/elasticsearch-pdf-importer
Installing the Attachment Processor in a Pipeline

You need to create an ingest pipeline that uses the attachment processor. To do so, choose one of the following:

  • Create a Symfony command (see here)
  • Create a PHP file and run it (see here)
  • Or send the request directly to Elasticsearch (for example, with curl or the Kibana console):
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars": -1
      }
    }
  ]
}
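
For the "PHP file" option, the same pipeline definition can be built as a PHP array. The following is only a minimal sketch: the function name attachmentPipelineBody is made up for illustration, and the commented-out client call assumes that $client is an elasticsearch-php client, which this README does not confirm.

```php
<?php
declare(strict_types=1);

/**
 * Build the body of the "attachment" ingest pipeline shown above.
 * The function name is illustrative; it is not part of the package.
 */
function attachmentPipelineBody(): array
{
    return [
        'description' => 'Extract attachment information',
        'processors'  => [
            [
                'attachment' => [
                    'field'         => 'data', // field holding the base64-encoded PDF
                    'indexed_chars' => -1,     // -1 removes the limit on extracted characters
                ],
            ],
        ],
    ];
}

$body = attachmentPipelineBody();

// Print the JSON so it can be compared with the request above.
echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;

// Assumption: $client is an elasticsearch-php client instance.
// $client->ingest()->putPipeline(['id' => 'attachment', 'body' => $body]);
```

Running the file prints the JSON body; uncomment the last line only after checking that your client exposes the ingest API.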

How to use

The basic flow is to create an Index and a Document, then pass the document to the importer.

$client = (new \Eze\Elastic\Factory())->getClient('localhost:9200');
$resolver = new \Eze\Elastic\Importer\Reader\ReaderResolver([
    new \Eze\Elastic\Importer\Reader\UrlReader(),
    new \Eze\Elastic\Importer\Reader\FileReader()
]);
$importer = new \Eze\Elastic\Importer\AttachmentImporter($client, $resolver);

$file = 'PATH_TO_PDF_FILE.pdf';

$index = new \Eze\Elastic\Model\Index('INDEX', 'TYPE', 'ID:OPTIONAL');
$document = new \Eze\Elastic\Model\Document();
$document->setFile($file)->setIndex($index);
$id = $importer->import($document);
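
Once a document has been imported, its extracted text can be searched. The sketch below builds the search parameters as a plain array. It assumes the pipeline keeps the attachment processor's default target field, so the text lives in attachment.content, and that the base64 source stays in the data field. The function name buildContentSearch is hypothetical, and the commented-out call assumes an elasticsearch-php client.

```php
<?php
declare(strict_types=1);

/**
 * Build search parameters for a full-text match on the extracted PDF text.
 * Assumes the text is stored in "attachment.content" (the processor default).
 * The function name is illustrative; it is not part of the package.
 */
function buildContentSearch(string $index, string $text, int $size = 10): array
{
    return [
        'index' => $index,
        'body'  => [
            'size'    => $size,
            // Leave the large base64 "data" field out of the returned hits.
            '_source' => ['excludes' => ['data']],
            'query'   => [
                'match' => ['attachment.content' => $text],
            ],
            // Ask for highlighted snippets; stdClass encodes to an empty JSON object.
            'highlight' => [
                'fields' => ['attachment.content' => new \stdClass()],
            ],
        ],
    ];
}

$params = buildContentSearch('INDEX', 'invoice');
echo json_encode($params['body'], JSON_PRETTY_PRINT), PHP_EOL;

// Assumption: $client is an elasticsearch-php client instance.
// $results = $client->search($params);
```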

You can add more fields by calling addField:

$document->addField('FIELD-NAME-ONE', 'VALUE')
    ->addField('FIELD-NAME-TWO', 'VALUE')
    ->addField('FIELD-NAME-THREE', 'VALUE');

You can also process the data before it is sent to Elasticsearch; you only need to implement ProcessorInterface.
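
To illustrate the idea, here is a minimal sketch of two custom processors and a way to chain them. The interface is a hypothetical stand-in declared locally so the example runs on its own; the package's real ProcessorInterface may use a different method name or signature, so check its source before implementing it. All class names below are invented for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the package's ProcessorInterface.
// The real method name and signature may differ.
interface ExampleProcessorInterface
{
    public function process(string $content): string;
}

// Rejects content that does not start with a PDF header.
final class PdfHeaderCheckProcessor implements ExampleProcessorInterface
{
    public function process(string $content): string
    {
        if (strncmp($content, '%PDF-', 5) !== 0) {
            throw new \InvalidArgumentException('Content does not start with a PDF header.');
        }
        return $content;
    }
}

// Strips trailing whitespace from the content.
final class TrailingWhitespaceProcessor implements ExampleProcessorInterface
{
    public function process(string $content): string
    {
        return rtrim($content);
    }
}

// Applies each processor in order, feeding the output of one into the next.
final class ExampleChainProcessor implements ExampleProcessorInterface
{
    /** @var ExampleProcessorInterface[] */
    private $processors;

    public function __construct(array $processors)
    {
        $this->processors = $processors;
    }

    public function process(string $content): string
    {
        foreach ($this->processors as $processor) {
            $content = $processor->process($content);
        }
        return $content;
    }
}

$chain = new ExampleChainProcessor([
    new PdfHeaderCheckProcessor(),
    new TrailingWhitespaceProcessor(),
]);
$processed = $chain->process("%PDF-1.4 body\n\n");
```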

I have implemented a processor that reduces PDF size by calling Ghostscript from the command line.

Requirements: PHP must allow the exec function, and the server must have ghostscript, libgs-dev and imagemagick installed (package names for Ubuntu Server).

$client = (new \Eze\Elastic\Factory())->getClient('localhost:9200');
$resolver = new \Eze\Elastic\Importer\Reader\ReaderResolver([
    new \Eze\Elastic\Importer\Reader\UrlReader(),
    new \Eze\Elastic\Importer\Reader\FileReader()
]);
$processor = new \Eze\Elastic\Importer\Processor\GhostscriptProcessor();
$importer = new \Eze\Elastic\Importer\AttachmentImporter($client, $resolver, $processor);
// Or, to use several processors, wrap them in a MultiProcessor:
/*
$manyProcessor = new \Eze\Elastic\Importer\Processor\MultiProcessor([
    $processor1,
    $processor2,
    $processor3,
]);

$importer = new \Eze\Elastic\Importer\AttachmentImporter($client, $resolver, $manyProcessor);
*/

$file = 'PATH_TO_PDF_FILE.pdf';

$index = new \Eze\Elastic\Model\Index('INDEX', 'TYPE', 'ID:OPTIONAL');
$document = new \Eze\Elastic\Model\Document();
$document->setFile($file)->setIndex($index);
$id = $importer->import($document);
