diff --git a/docs/source/using_doctr/using_models.rst b/docs/source/using_doctr/using_models.rst
index 1a46c2bb7..007f8b295 100644
--- a/docs/source/using_doctr/using_models.rst
+++ b/docs/source/using_doctr/using_models.rst
@@ -23,26 +23,50 @@ Available architectures
The following architectures are currently supported:
* :py:meth:`linknet_resnet18 `
+* :py:meth:`linknet_resnet34 `
+* :py:meth:`linknet_resnet50 `
* :py:meth:`db_resnet50 `
* :py:meth:`db_mobilenet_v3_large `
We also provide 2 models working with any kind of rotated documents:
-* :py:meth:`linknet_resnet18_rotation `
-* :py:meth:`db_resnet50_rotation `
+* :py:meth:`linknet_resnet18_rotation ` (TensorFlow)
+* :py:meth:`db_resnet50_rotation ` (PyTorch)
For a comprehensive comparison, we have compiled a detailed benchmark on publicly available datasets:
-+------------------------------------------------------------------+----------------------------+----------------------------+---------+
-| | FUNSD | CORD | |
-+=================================+=================+==============+============+===============+============+===============+=========+
-| **Architecture** | **Input shape** | **# params** | **Recall** | **Precision** | **Recall** | **Precision** | **FPS** |
-+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+---------+
-| db_resnet50 | (1024, 1024, 3) | 25.2 M | 82.14 | 87.64 | 92.49 | 89.66 | 2.1 |
-+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+---------+
-| db_mobilenet_v3_large | (1024, 1024, 3) | 4.2 M | 79.35 | 84.03 | 81.14 | 66.85 | |
-+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+---------+
++-----------------------------------------------------------------------------------+----------------------------+----------------------------+--------------------+
+| | FUNSD | CORD | |
++================+=================================+=================+==============+============+===============+============+===============+====================+
+| **Backend** | **Architecture** | **Input shape** | **# params** | **Recall** | **Precision** | **Recall** | **Precision** | **sec/it (B: 1)** |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | db_resnet50 | (1024, 1024, 3) | 25.2 M | 81.22 | 86.66 | 92.46 | 89.62 | 1.2 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | db_mobilenet_v3_large | (1024, 1024, 3) | 4.2 M | 78.27 | 82.77 | 80.99 | 66.57 | 0.5 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | linknet_resnet18 | (1024, 1024, 3) | 11.5 M | 78.23 | 83.77 | 82.88 | 82.42 | 0.7 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | linknet_resnet18_rotation | (1024, 1024, 3) | 11.5 M | 81.12 | 82.13 | 83.55 | 80.14 | 0.6 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | linknet_resnet34 | (1024, 1024, 3) | 21.6 M | 82.14 | 87.64 | 85.55 | 86.02 | 0.8 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | linknet_resnet50 | (1024, 1024, 3) | 28.8 M | 79.00 | 84.79 | 85.89 | 65.75 | 1.1 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | db_resnet34 | (1024, 1024, 3) | 22.4 M | | | | | |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | db_resnet50 | (1024, 1024, 3) | 25.4 M | 79.17 | 86.31 | 92.96 | 91.23 | 1.1 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | db_resnet50_rotation | (1024, 1024, 3) | 25.4 M | 83.30 | 91.07 | 91.63 | 90.53 | 1.6 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | db_mobilenet_v3_large | (1024, 1024, 3) | 4.2 M | 80.06 | 84.12 | 80.51 | 66.51 | 0.5 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | linknet_resnet18 | (1024, 1024, 3) | 11.5 M | | | | | |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | linknet_resnet34 | (1024, 1024, 3) | 21.6 M | | | | | |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | linknet_resnet50 | (1024, 1024, 3) | 28.8 M | | | | | |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
All text detection models above have been evaluated using both the training and evaluation sets of FUNSD and CORD (cf. :ref:`datasets`).
@@ -50,7 +74,7 @@ Explanations about the metrics being used are available in :ref:`metrics`.
*Disclaimer: both FUNSD subsets combined have 199 pages which might not be representative enough of the model capabilities*
-FPS (Frames per second) is computed after a warmup phase of 100 tensors (where the batch size is 1), by measuring the average number of processed tensors per second over 1000 samples. Those results were obtained on a `c5.x12large `_ AWS instance (CPU Xeon Platinum 8275L).
+Seconds per iteration (with a batch size of 1) is computed after a warmup phase of 100 tensors, by measuring the average processing time per iteration over 1000 samples. Those results were obtained on an `11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz`.
Detection predictors
@@ -58,11 +82,13 @@ Detection predictors
^^^^^^^^^^^^^^^^^^^^
:py:meth:`detection_predictor ` wraps your detection model to make it easily usable with your favorite deep learning framework.
- >>> import numpy as np
- >>> from doctr.models import detection_predictor
- >>> predictor = detection_predictor('db_resnet50')
- >>> dummy_img = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
- >>> out = model([dummy_img])
+.. code:: python3
+
+ import numpy as np
+ from doctr.models import detection_predictor
+ predictor = detection_predictor('db_resnet50')
+ dummy_img = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
+ out = predictor([dummy_img])
You can pass specific boolean arguments to the predictor:
@@ -72,8 +98,10 @@ You can pass specific boolean arguments to the predictor:
For instance, this snippet will instantiate a detection predictor able to detect text on rotated documents while preserving the aspect ratio:
- >>> from doctr.models import detection_predictor
- >>> predictor = detection_predictor('db_resnet50_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)
+.. code:: python3
+
+ from doctr.models import detection_predictor
+ predictor = detection_predictor('db_resnet50_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)
NB: for the moment, `db_resnet50_rotation` is pretrained in PyTorch only and `linknet_resnet18_rotation` in TensorFlow only.
@@ -94,75 +122,81 @@ The following architectures are currently supported:
* :py:meth:`crnn_mobilenet_v3_large `
* :py:meth:`sar_resnet31 `
* :py:meth:`master `
+* :py:meth:`vitstr_small `
+* :py:meth:`vitstr_base `
+* :py:meth:`parseq `
For a comprehensive comparison, we have compiled a detailed benchmark on publicly available datasets:
-.. list-table:: Text recognition model zoo
- :header-rows: 1
-
- * - Architecture
- - Input shape
- - # params
- - FUNSD
- - CORD
- - FPS
- * - crnn_vgg16_bn
- - (32, 128, 3)
- - 15.8M
- - 87.18
- - 92.93
- - 12.8
- * - crnn_mobilenet_v3_small
- - (32, 128, 3)
- - 2.1M
- - 86.21
- - 90.56
- -
- * - crnn_mobilenet_v3_large
- - (32, 128, 3)
- - 4.5M
- - 86.95
- - 92.03
- -
- * - sar_resnet31
- - (32, 128, 3)
- - 56.2M
- - **87.70**
- - **93.41**
- - 2.7
- * - master
- - (32, 128, 3)
- - 67.7M
- - 87.62
- - 93.27
- -
++-----------------------------------------------------------------------------------+----------------------------+----------------------------+--------------------+
+| | FUNSD | CORD | |
++================+=================================+=================+==============+============+===============+============+===============+====================+
+| **Backend** | **Architecture** | **Input shape** | **# params** | **Exact** | **Partial** | **Exact** | **Partial** | **sec/it (B: 64)** |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | crnn_vgg16_bn | (32, 128, 3) | 15.8 M | 88.12 | 88.85 | 94.68 | 95.10 | 0.9 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | crnn_mobilenet_v3_small | (32, 128, 3) | 2.1 M | 86.88 | 87.61 | 92.28 | 92.73 | 0.25 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | crnn_mobilenet_v3_large | (32, 128, 3) | 4.5 M | 87.44 | 88.12 | 94.14 | 94.55 | 0.34 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | master | (32, 128, 3) | 58.8 M | 87.44 | 88.21 | 93.83 | 94.25 | 22.3 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | sar_resnet31 | (32, 128, 3) | 57.2 M | 87.67 | 88.48 | 94.21 | 94.66 | 7.1 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | vitstr_small | (32, 128, 3) | 21.4 M | 83.01 | 83.84 | 86.57 | 87.00 | 2.0 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | vitstr_base | (32, 128, 3) | 85.2 M | 85.98 | 86.70 | 90.47 | 90.95 | 5.8 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| TensorFlow | parseq | (32, 128, 3) | 23.8 M | 81.62 | 82.29 | 79.13 | 79.52 | 3.6 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | crnn_vgg16_bn | (32, 128, 3) | 15.8 M | 86.54 | 87.41 | 94.29 | 94.69 | 0.6 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | crnn_mobilenet_v3_small | (32, 128, 3) | 2.1 M | 87.25 | 87.99 | 93.91 | 94.34 | 0.05 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | crnn_mobilenet_v3_large | (32, 128, 3) | 4.5 M | 87.38 | 88.09 | 94.46 | 94.92 | 0.08 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | master | (32, 128, 3) | 58.7 M | | | | | 17.6 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | sar_resnet31 | (32, 128, 3) | 55.4 M | | | | | 4.9 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | vitstr_small | (32, 128, 3) | 21.4 M | | | | | 1.5 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | vitstr_base | (32, 128, 3) | 85.2 M | | | | | 4.1 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+| PyTorch | parseq | (32, 128, 3) | 23.8 M | | | | | 2.2 |
++----------------+---------------------------------+-----------------+--------------+------------+---------------+------------+---------------+--------------------+
+
All text recognition models above have been evaluated using both the training and evaluation sets of FUNSD and CORD (cf. :ref:`datasets`).
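+The evaluation relies on word-level exact string matching between predictions and ground truth. As a minimal sketch of how such a score can be reproduced with docTR's own utilities (assuming the current `TextMatch` API; the word lists below are purely illustrative):
+
+.. code:: python3
+
+    from doctr.utils.metrics import TextMatch
+
+    metric = TextMatch()
+    # hypothetical ground-truth and predicted word-level transcriptions
+    metric.update(gt=["Hello", "world"], pred=["hello", "world"])
+    print(metric.summary())  # exact-match rates under several normalizations
+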
Explanations about the metric being used (exact match) are available in :ref:`metrics`.
While most of our recognition models were trained on our French vocab (cf. :ref:`vocabs`), you can easily access the vocab of any model as follows:
- >>> from doctr.models import recognition_predictor
- >>> predictor = recognition_predictor('crnn_vgg16_bn')
- >>> print(predictor.model.cfg['vocab'])
+.. code:: python3
+
+ from doctr.models import recognition_predictor
+ predictor = recognition_predictor('crnn_vgg16_bn')
+ print(predictor.model.cfg['vocab'])
*Disclaimer: both FUNSD subsets combined have 30595 word-level crops which might not be representative enough of the model capabilities*
-FPS (Frames per second) is computed after a warmup phase of 100 tensors (where the batch size is 1), by measuring the average number of processed tensors per second over 1000 samples. Those results were obtained on a `c5.x12large `_ AWS instance (CPU Xeon Platinum 8275L).
+Seconds per iteration (with a batch size of 64) is computed after a warmup phase of 100 tensors, by measuring the average processing time per iteration over 1000 samples. Those results were obtained on an `11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz`.
Recognition predictors
^^^^^^^^^^^^^^^^^^^^^^
:py:meth:`recognition_predictor ` wraps your recognition model to make it easily usable with your favorite deep learning framework.
- >>> import numpy as np
- >>> from doctr.models import recognition_predictor
- >>> predictor = recognition_predictor('crnn_vgg16_bn')
- >>> dummy_img = (255 * np.random.rand(50, 150, 3)).astype(np.uint8)
- >>> out = model([dummy_img])
+.. code:: python3
+
+ import numpy as np
+ from doctr.models import recognition_predictor
+ predictor = recognition_predictor('crnn_vgg16_bn')
+ dummy_img = (255 * np.random.rand(50, 150, 3)).astype(np.uint8)
+ out = predictor([dummy_img])
End-to-End OCR
@@ -173,76 +207,74 @@ The task consists of both localizing and transcribing textual elements in a give
Available architectures
^^^^^^^^^^^^^^^^^^^^^^^
-You can use any combination of detection and recognition models supporte by docTR.
+You can use any combination of detection and recognition models supported by docTR.
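+For example, the following sketch pairs one detection architecture with one recognition architecture through `ocr_predictor` (this particular pair is illustrative; any supported combination works):
+
+.. code:: python3
+
+    from doctr.models import ocr_predictor
+
+    # mix and match any supported detection / recognition architectures
+    model = ocr_predictor(det_arch='linknet_resnet18', reco_arch='parseq', pretrained=True)
+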
For a comprehensive comparison, we have compiled a detailed benchmark on publicly available datasets: -+----------------------------------------+--------------------------------------+--------------------------------------+ -| | FUNSD | CORD | -+========================================+============+===============+=========+============+===============+=========+ -| **Architecture** | **Recall** | **Precision** | **FPS** | **Recall** | **Precision** | **FPS** | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| db_resnet50 + crnn_vgg16_bn | 71.25 | 76.02 | 0.85 | 84.00 | 81.42 | 1.6 | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| db_resnet50 + master | 71.03 | 76.06 | | 84.49 | 81.94 | | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| db_resnet50 + sar_resnet31 | 71.25 | 76.29 | 0.27 | 84.50 | **81.96** | 0.83 | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| db_resnet50 + crnn_mobilenet_v3_small | 69.85 | 74.80 | | 80.85 | 78.42 | 0.83 | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| db_resnet50 + crnn_mobilenet_v3_large | 70.57 | 75.57 | | 82.57 | 80.08 | 0.83 | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| db_mobilenet_v3_large + crnn_vgg16_bn | 67.73 | 71.73 | | 71.65 | 59.03 | | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| Gvision text detection | 59.50 | 62.50 | | 75.30 | 70.00 | | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| Gvision doc. 
text detection | 64.00 | 53.30 | | 68.90 | 61.10 | | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ -| AWS textract | **78.10** | **83.00** | | **87.50** | 66.00 | | -+----------------------------------------+------------+---------------+---------+------------+---------------+---------+ ++---------------------------------------------------------------------------+----------------------------+----------------------------+ +| | FUNSD | CORD | ++================+==========================================================+============================+============+===============+ +| **Backend** | **Architecture** | **Recall** | **Precision** | **Recall** | **Precision** | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + crnn_vgg16_bn | 70.82 | 75.56 | 83.97 | 81.40 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + crnn_mobilenet_v3_small | 69.63 | 74.29 | 81.08 | 78.59 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + crnn_mobilenet_v3_large | 70.01 | 74.70 | 83.28 | 80.73 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + sar_resnet31 | 68.75 | 73.76 | 78.56 | 76.24 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + master | 68.75 | 73.76 | 78.56 | 76.24 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + vitstr_small | 64.58 | 68.91 | 74.66 | 72.37 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + vitstr_base | 66.89 | 71.37 | 79.11 | 76.68 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| TensorFlow | db_resnet50 + parseq | 65.77 | 70.18 | 71.57 | 69.37 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| PyTorch | db_resnet50 + crnn_vgg16_bn | 67.82 | 73.35 | 84.84 | 83.27 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| PyTorch | db_resnet50 + crnn_mobilenet_v3_small | 67.89 | 74.01 | 84.43 | 82.85 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| PyTorch | db_resnet50 + crnn_mobilenet_v3_large | 68.45 | 74.63 | 84.86 | 83.27 | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| PyTorch | db_resnet50 + sar_resnet31 | | | | | ++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+ +| PyTorch | db_resnet50 + master | | | | | 
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| PyTorch | db_resnet50 + vitstr_small | | | | |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| PyTorch | db_resnet50 + vitstr_base | | | | |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| PyTorch | db_resnet50 + parseq | | | | |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| None | Gvision text detection | 59.50 | 62.50 | 75.30 | 70.00 |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| None | Gvision doc. text detection | 64.00 | 53.30 | 68.90 | 61.10 |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| None | AWS textract | 78.10 | 83.00 | 87.50 | 66.00 |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+| None | Azure Form Recognizer (v3.2) | 79.42 | 85.89 | 89.62 | 88.93 |
++----------------+----------------------------------------------------------+------------+---------------+------------+---------------+
+
All OCR models above have been evaluated using both the training and evaluation sets of FUNSD and CORD (cf. :ref:`datasets`).
Explanations about the metrics being used are available in :ref:`metrics`.
*Disclaimer: both FUNSD subsets combined have 199 pages which might not be representative enough of the model capabilities*
-FPS (Frames per second) is computed after a warmup phase of 100 tensors (where the batch size is 1), by measuring the average number of processed frames per second over 1000 samples. Those results were obtained on a `c5.x12large `_ AWS instance (CPU Xeon Platinum 8275L).
-
-Since you may be looking for specific use cases, we also performed this benchmark on private datasets with various document types below. Unfortunately, we are not able to share those at the moment since they contain sensitive information.
- - -+----------------------------------------------+----------------------------+----------------------------+----------------------------+----------------------------+----------------------------+----------------------------+ -| | Receipts | Invoices | IDs | US Tax Forms | Resumes | Road Fines | -+==============================================+============+===============+============+===============+============+===============+============+===============+============+===============+============+===============+ -| **Architecture** | **Recall** | **Precision** | **Recall** | **Precision** | **Recall** | **Precision** | **Recall** | **Precision** | **Recall** | **Precision** | **Recall** | **Precision** | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| db_resnet50 + crnn_vgg16_bn (ours) | 78.70 | 81.12 | 65.80 | 70.70 | 50.25 | 51.78 | 79.08 | 92.83 | | | | | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| db_resnet50 + master (ours) | **79.00** | **81.42** | 65.57 | 69.86 | 51.34 | 52.90 | 78.86 | 92.57 | | | | | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| db_resnet50 + sar_resnet31 (ours) | 78.94 | 81.37 | 65.89 | **70.79** | **51.78** | **53.35** | 79.04 | 92.78 | | | | | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| db_resnet50 + crnn_mobilenet_v3_small (ours) | 76.81 | 79.15 | 64.89 | 69.61 | 45.03 | 46.38 | 78.96 | 92.11 | 85.91 | 87.20 | 84.85 | 85.86 | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| db_resnet50 + crnn_mobilenet_v3_large (ours) | 78.01 | 80.39 | 65.36 | 70.11 | 48.00 | 49.43 | 79.39 | 92.62 | 87.68 | 89.00 | 85.65 | 86.67 | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| db_mobilenet_v3_large + crnn_vgg16_bn (ours) | 78.36 | 74.93 | 63.04 | 68.41 | 39.36 | 41.75 | 72.14 | 89.97 | | | | | -+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+ -| Gvision doc. 
text detection | 68.91 | 59.89 | 63.20 | 52.85 | 43.70 | 29.21 | 69.79 | 65.68 | | | | |
-+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+
-| AWS textract | 75.77 | 77.70 | **70.47** | 69.13 | 46.39 | 43.32 | **84.31** | **98.11** | | | | |
-+----------------------------------------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+------------+---------------+
-
Two-stage approaches
^^^^^^^^^^^^^^^^^^^^
Those architectures involve one stage of text detection, and one stage of text recognition. The text detection will be used to produce cropped images that will be passed into the text recognition block. Everything is wrapped up with :py:meth:`ocr_predictor `.
- >>> import numpy as np
- >>> from doctr.models import ocr_predictor
- >>> model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
- >>> input_page = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
- >>> out = model([input_page])
+.. code:: python3
+
+ import numpy as np
+ from doctr.models import ocr_predictor
+ model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
+ input_page = (255 * np.random.rand(800, 600, 3)).astype(np.uint8)
+ out = model([input_page])
You can pass specific boolean arguments to the predictor:
@@ -257,8 +289,10 @@ Those 3 are going straight to the detection predictor, as mentioned above (in th
For instance, this snippet instantiates an end-to-end ocr_predictor working with rotated documents, which preserves the aspect ratio of the documents, and returns polygons:
- >>> from doctr.model import ocr_predictor
- >>> model = ocr_predictor('linknet_resnet18_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)
+.. code:: python3
+
+ from doctr.models import ocr_predictor
+ model = ocr_predictor('linknet_resnet18_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)
What should I do with the output?
@@ -364,4 +398,3 @@ For reference, here is a sample XML byte string output:
-
diff --git a/pyproject.toml b/pyproject.toml
index 044fe0f83..025807c3d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -57,7 +57,7 @@ dependencies = [
[project.optional-dependencies]
tf = [
"tensorflow>=2.11.0,<3.0.0", # cf. https://github.com/mindee/doctr/pull/1182
- "tf2onnx>=1.14.0,<2.0.0",
+ "tf2onnx>=1.15.1,<2.0.0", # cf. https://github.com/onnx/tensorflow-onnx/releases/tag/v1.15.1
]
torch = [
"torch>=1.12.0,<3.0.0",
diff --git a/scripts/evaluate.py b/scripts/evaluate.py
index f4e8aaefe..20da633bd 100644
--- a/scripts/evaluate.py
+++ b/scripts/evaluate.py
@@ -40,6 +40,7 @@ def main(args):
args.recognition,
pretrained=True,
reco_bs=args.batch_size,
+ preserve_aspect_ratio=False,
assume_straight_pages=not args.rotation,
)
diff --git a/scripts/evaluate_kie.py b/scripts/evaluate_kie.py
index 1aaf3f9ae..3d16197d9 100644
--- a/scripts/evaluate_kie.py
+++ b/scripts/evaluate_kie.py
@@ -42,6 +42,7 @@ def main(args):
args.recognition,
pretrained=True,
reco_bs=args.batch_size,
+ preserve_aspect_ratio=False,
assume_straight_pages=not args.rotation,
)