Fix DLLs to be Tesseract 5.2 #667

Open
wants to merge 4 commits into
base: master
62 changes: 62 additions & 0 deletions .github/workflows/build-windows-libs.yml
@@ -0,0 +1,62 @@
name: build-windows-libs

on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

defaults:
  run:
    shell: cmd

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        arch: [x86, x64]
    runs-on: windows-latest
    steps:
      - name: Install dependencies
        run: vcpkg install giflib:${{ matrix.arch }}-windows-static libjpeg-turbo:${{ matrix.arch }}-windows-static liblzma:${{ matrix.arch }}-windows-static libpng:${{ matrix.arch }}-windows-static tiff:${{ matrix.arch }}-windows-static zlib:${{ matrix.arch }}-windows-static
      - name: Checkout Leptonica
        uses: actions/checkout@v4
        with:
          repository: DanBloomberg/leptonica
          ref: 1.82.0
          path: leptonica
      - name: Build Leptonica ${{ matrix.arch }}
        run: |
          mkdir vs17-${{ matrix.arch }}
          cd vs17-${{ matrix.arch }}
          cmake .. -G "Visual Studio 17 2022" -A ${{ matrix.arch == 'x86' && 'Win32' || 'x64' }} -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_TOOLCHAIN_FILE=%VCPKG_INSTALLATION_ROOT%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=${{ matrix.arch }}-windows-static -DCMAKE_INSTALL_PREFIX=..\..\build\${{ matrix.arch }}
          cmake --build . --config Release --target install
        working-directory: leptonica
      - name: Checkout Tesseract
        uses: actions/checkout@v4
        with:
          repository: tesseract-ocr/tesseract
          ref: 5.2.0
          path: tesseract
      - name: Build Tesseract ${{ matrix.arch }}
        run: |
          mkdir vs17-${{ matrix.arch }}
          cd vs17-${{ matrix.arch }}
          cmake .. -G "Visual Studio 17 2022" -A ${{ matrix.arch == 'x86' && 'Win32' || 'x64' }} -DAUTO_OPTIMIZE=OFF -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_INSTALL_PREFIX=..\..\build\${{ matrix.arch }}
          cmake --build . --config Release --target install
        working-directory: tesseract
      - name: Calculate hash of Leptonica
        run: certutil -hashfile leptonica\vs17-${{ matrix.arch }}\bin\Release\leptonica-1.82.0.dll SHA256
      - name: Calculate hash of Tesseract
        run: certutil -hashfile tesseract\vs17-${{ matrix.arch }}\bin\Release\tesseract52.dll SHA256
      - name: Archive Leptonica ${{ matrix.arch }}
        uses: actions/upload-artifact@v4
        with:
          name: leptonica-${{ matrix.arch }}
          path: leptonica\vs17-${{ matrix.arch }}\bin\Release\leptonica-1.82.0.dll
      - name: Archive Tesseract ${{ matrix.arch }}
        uses: actions/upload-artifact@v4
        with:
          name: tesseract-${{ matrix.arch }}
          path: tesseract\vs17-${{ matrix.arch }}\bin\Release\tesseract52.dll
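
Note on the two "Calculate hash" steps: they print a SHA-256 digest of each built DLL. One plausible use (an assumption, the PR does not state it) is to compare a downloaded artifact against the binary committed under `src/Tesseract/`. A minimal Python sketch of that comparison follows; the helper names and any paths passed to them are hypothetical and not part of this PR.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(built_path, committed_path):
    """True when both files have identical SHA-256 digests."""
    return sha256_of_file(built_path) == sha256_of_file(committed_path)
```

Called as `same_binary(artifact_dll, committed_dll)`, this should give the same verdict as comparing the `certutil -hashfile ... SHA256` output of the two files by eye.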
56 changes: 11 additions & 45 deletions .github/workflows/ci.yml
@@ -2,66 +2,32 @@ name: CI

on:
push:
branches: [ 'macos-fix' ]
branches: [master]
pull_request:
branches: [ 'main', 'develop' ]
paths:
- '**.cs'
- '**.csproj'

env:
DOTNET_NOLOGO: true
branches: [master]

jobs:
build-and-test:
name: build-and-test-${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ 'ubuntu-latest', 'windows-latest', 'macOS-latest' ]
os: [windows-2019, windows-2022]
runs-on: ${{ matrix.os }}

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4

- name: Setup .NET Core 3.1.x
uses: actions/setup-dotnet@v1
- name: Setup .NET
uses: actions/setup-dotnet@v4
with:
dotnet-version: '3.1.x'

- name: Setup .NET 6.0.x
uses: actions/setup-dotnet@v1
with:
dotnet-version: '6.0.x'

- name: Install Ubuntu dependencies
if: runner.os == 'Linux'
run: |
sudo apt-get update &&
sudo apt-get install -y tesseract-ocr libleptonica-dev &&
ls -la /usr/local/lib &&
ls -la /usr/lib/x86_64-linux-gnu

- name: Install macOS dependencies
if: runner.os == 'macOS'
run: |
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" &&
brew tap-new local/versions &&
brew extract --version=4.1.1 tesseract local/versions &&
brew install local/versions/[email protected] mono-libgdiplus

- name: Install dependencies
- name: Restore dependencies
run: dotnet restore src/Tesseract.sln

- name: Build
run: dotnet build --configuration Release --no-restore src/Tesseract.sln
run: dotnet build src/Tesseract.sln --no-restore

- name: Test
run: dotnet test --configuration Release --no-build --verbosity normal --logger trx --results-directory "TestResults" src/Tesseract.sln

- name: Upload dotnet test results
uses: actions/upload-artifact@v2
with:
name: dotnet-results
path: TestResults
if: ${{ always() }}
run: dotnet test src/Tesseract.sln --no-build --verbosity normal
15 changes: 8 additions & 7 deletions docs/Compling_tesseract_and_leptonica.md
@@ -2,7 +2,7 @@
* [Index](./ReadMe.md)

## Notes
Build instructions for Tesseract 4.1.1 and leptonica 1.80.0. Please note that build systems do change so while the following
Build instructions for Tesseract 5.2.0 and leptonica 1.82.0. Please note that build systems do change so while the following
has been tested with the listed versions building against any other versions including master may not work as expected and
aren't supported.

@@ -22,12 +22,12 @@ linking leptonica into tesseract which increases file size (since the leptonica
vcpkg install giflib:x64-windows-static libjpeg-turbo:x64-windows-static liblzma:x64-windows-static libpng:x64-windows-static tiff:x64-windows-static zlib:x64-windows-static
git clone https://github.com/DanBloomberg/leptonica.git & cd leptonica
git checkout -b 1.82.0 1.82.0
mkdir vs16-x86 & cd vs16-x86
mkdir vs17-x86 & cd vs17-x86
cmake .. -G "Visual Studio 17 2022" -A Win32 -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_TOOLCHAIN_FILE=%VCPKG_HOME%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=x86-windows-static -DCMAKE_INSTALL_PREFIX=..\..\build\x86
cmake --build . --config Release --target install
cd ..
mkdir vs16-x64 & cd vs16-x64
cmake .. -G "Visual Studio 17 2022" -A x64 -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_TOOLCHAIN_FILE=%VCPKG_HOME%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=x64-windows-static -DCMAKE_INSTALL_PREFIX=..\..\build\x64
mkdir vs17-x64 & cd vs17-x64
cmake .. -G "Visual Studio 17 2022" -A x64 -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_TOOLCHAIN_FILE=%VCPKG_HOME%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=x64-windows-static -DCMAKE_INSTALL_PREFIX=..\..\build\x64
cmake --build . --config Release --target install
```
4. Build Tesseract:
@@ -38,11 +38,11 @@ linking leptonica into tesseract which increases file size (since the leptonica
cd tesseract
git checkout -b 5.2.0 5.2.0
mkdir vs17-x86 & cd vs17-x86
cmake .. -G "Visual Studio 17 2022" -A Win32 -DAUTO_OPTIMIZE=OFF -DSW_BUILD=OFF -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_INSTALL_PREFIX=..\..\build\x86
cmake .. -G "Visual Studio 17 2022" -A Win32 -DAUTO_OPTIMIZE=OFF -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_INSTALL_PREFIX=..\..\build\x86
cmake --build . --config Release --target install
cd ..
mkdir vs17-x64 & cd vs17-x64
cmake .. -G "Visual Studio 17 2022" -A x64 -DAUTO_OPTIMIZE=OFF -DSW_BUILD=OFF -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_INSTALL_PREFIX=..\..\build\x64
cmake .. -G "Visual Studio 17 2022" -A x64 -DAUTO_OPTIMIZE=OFF -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_INSTALL_PREFIX=..\..\build\x64
cmake --build . --config Release --target install
```

@@ -53,7 +53,8 @@ linking leptonica into tesseract which increases file size (since the leptonica

### Tesseract Notes:

* Like Leptonica, Tesseract needs to be built to use shared libraries
* For portability, architecture optimizations have been disabled using ``-DAUTO_OPTIMIZE=OFF``.
This, however, disables platform-specific optimizations (AVX, SSE4.1, etc.), which would likely
result in better performance if you're guaranteed they will be available.
* Like leptonica Self Build has also been disabled using ``-DSW_BUILD=OFF``.
* Like Leptonica, Self Build has also been disabled using ``-DSW_BUILD=OFF``.
59 changes: 0 additions & 59 deletions docs/Compling_tesseract_and_leptonica.md.bak

This file was deleted.

Empty file removed src/InternalTrace.3044.log
Empty file.
Empty file removed src/InternalTrace.3144.log
Empty file.
Empty file removed src/InternalTrace.3536.log
Empty file.
Empty file removed src/InternalTrace.7132.log
Empty file.
Empty file removed src/InternalTrace.8476.log
Empty file.
4 changes: 2 additions & 2 deletions src/Tesseract.Net48Tests/Tesseract.Net48Tests.csproj
@@ -194,13 +194,13 @@
<None Include="..\Tesseract\x64\leptonica-1.82.0.dll" Link="x64\leptonica-1.82.0.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="..\Tesseract\x64\tesseract50.dll" Link="x64\tesseract50.dll">
<None Include="..\Tesseract\x64\tesseract52.dll" Link="x64\tesseract52.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="..\Tesseract\x86\leptonica-1.82.0.dll" Link="x86\leptonica-1.82.0.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="..\Tesseract\x86\tesseract50.dll" Link="x86\tesseract50.dll">
<None Include="..\Tesseract\x86\tesseract52.dll" Link="x86\tesseract52.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>
4 changes: 2 additions & 2 deletions src/Tesseract.NetCore31Tests/Tesseract.NetCore31Tests.csproj
@@ -192,13 +192,13 @@
<None Include="..\Tesseract\x64\leptonica-1.82.0.dll" Link="x64\leptonica-1.82.0.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="..\Tesseract\x64\tesseract50.dll" Link="x64\tesseract50.dll">
<None Include="..\Tesseract\x64\tesseract52.dll" Link="x64\tesseract52.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="..\Tesseract\x86\leptonica-1.82.0.dll" Link="x86\leptonica-1.82.0.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="..\Tesseract\x86\tesseract50.dll" Link="x86\tesseract50.dll">
<None Include="..\Tesseract\x86\tesseract52.dll" Link="x86\tesseract52.dll">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>
2 changes: 1 addition & 1 deletion src/Tesseract.Tests/BaseApiTests.cs
@@ -11,7 +11,7 @@ public class BaseApiTests
public void CanGetVersion()
{
var version = Interop.TessApi.BaseApiGetVersion();
Assert.That(version, Does.StartWith("5.0.0"));
Assert.That(version, Does.StartWith("5.2.0"));
}
}
}
8 changes: 4 additions & 4 deletions src/Tesseract.Tests/EngineTests.cs
@@ -1,4 +1,4 @@
using NUnit.Framework;
using NUnit.Framework;
using System;
using System.Collections.Generic;
using System.Drawing;
@@ -20,7 +20,7 @@ public void CanGetVersion()
{
using (var engine = CreateEngine())
{
Assert.That(engine.Version, Does.StartWith("5.0.0"));
Assert.That(engine.Version, Does.StartWith("5.2.0"));
}
}

@@ -71,7 +71,7 @@ public void CanParseMultipageTifOneByOne()
[TestCase(PageSegMode.SingleColumn, "This is a lot of 12 point text to test the")]
[TestCase(PageSegMode.SingleLine, "This is a lot of 12 point text to test the")]
[TestCase(PageSegMode.SingleWord, "This")]
[TestCase(PageSegMode.SingleChar, "T")]
[TestCase(PageSegMode.SingleChar, "hl")]
[TestCase(PageSegMode.SingleBlockVertText, "A line of text", Ignore = "#490")]

This file actually contains an image of the letter "T". However, the margin on the right-hand side of "T" is smaller than the left, and I think that's causing the auto-thresholding algorithm to invert the thresholding and recognize the text as "hl" instead. I updated the PNG and got the expected result:
OLD:
https://github.com/charlesw/tesseract/blob/master/src/Tesseract.Tests/Data/Ocr/PSM_SingleChar.png
NEW:
[image: PSM_SingleChar]

Methuselah96 (Contributor Author), Aug 22, 2024

I was planning to look into that to make sure it was a regression in the Tesseract library itself and not an issue with the C# wrapper; thanks for looking into it. I'm surprised that the Tesseract library would return more than one character when it's explicitly instructed to return only a single character.

You make an interesting point about Tesseract returning 2 characters despite the PageSegMode; might be worth digging into deeper as a potential Tesseract library defect.
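
To make the margin observation above easier to check, here is a minimal Python sketch that measures the blank columns to the left and right of a glyph and pads the narrower side with background pixels. It works on a plain list of grayscale rows rather than on the PNG itself, it does not call Tesseract or Leptonica, and the function names and the `dark_below` cutoff are hypothetical. Whether uneven margins really trigger the inverted thresholding is still the hypothesis from the comment, not something this sketch proves.

```python
def horizontal_margins(pixels, dark_below=128):
    """Return (left, right) counts of blank columns around the dark glyph.

    `pixels` is a list of equal-length rows of grayscale values (0 = black).
    Returns None when no pixel is darker than `dark_below`.
    """
    width = len(pixels[0])
    dark_columns = [
        x for x in range(width) if any(row[x] < dark_below for row in pixels)
    ]
    if not dark_columns:
        return None
    return dark_columns[0], width - 1 - dark_columns[-1]


def balance_margins(pixels, background=255, dark_below=128):
    """Return a copy of `pixels` padded so left and right margins are equal."""
    margins = horizontal_margins(pixels, dark_below)
    if margins is None:
        return [list(row) for row in pixels]
    left, right = margins
    pad_left = [background] * max(right - left, 0)
    pad_right = [background] * max(left - right, 0)
    return [pad_left + list(row) + pad_right for row in pixels]


# A tiny "T" with three blank columns on the left and one on the right.
glyph = [
    [255, 255, 255, 0, 0, 0, 255],
    [255, 255, 255, 255, 0, 255, 255],
]
print(horizontal_margins(glyph))
print(horizontal_margins(balance_margins(glyph)))
```

Running the same measurement on the old and new `PSM_SingleChar.png` pixel data would show whether the updated image actually evened out the margins.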

public void CanParseText_UsingMode(PageSegMode mode, String expectedText)
{
@@ -135,7 +135,7 @@ public void CanProcessBitmap()
var text = page.GetText();

const string expectedText =
"This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n";
"This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\n\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n\n";

Assert.That(text, Is.EqualTo(expectedText));
}
14 changes: 7 additions & 7 deletions src/Tesseract.Tests/Results/EngineTests/CanPrintVariables.txt
@@ -339,7 +339,7 @@ tessedit_train_from_boxes 0 Generate training data from boxed chars
tessedit_make_boxes_from_boxes 0 Generate more boxes from boxed chars
tessedit_train_line_recognizer 0 Break input into lines and remap boxes if present
tessedit_dump_pageseg_images 0 Dump intermediate images made during page segmentation
tessedit_do_invert 1 Try inverting the image in `LSTMRecognizeWord`
tessedit_do_invert 1 Try inverted line image if necessary (deprecated, will be removed in release 6, use the 'invert_threshold' parameter instead)
thresholding_debug 0 Debug the thresholding process
tessedit_ambigs_training 0 Perform training for ambiguities
tessedit_adaption_debug 0 Generate and print debug information for adaption
@@ -488,7 +488,6 @@ matcher_avg_noise_size 12 Avg. noise blob length
matcher_clustering_max_angle_delta 0.015 Maximum angle delta for prototype clustering
classify_misfit_junk_penalty 0 Penalty to apply when a non-alnum is vertically out of its expected textline position
rating_scale 1.5 Rating scaling factor
certainty_scale 20 Certainty scaling factor
tessedit_class_miss_scale 0.00390625 Scale factor for features not used
classify_adapted_pruning_factor 2.5 Prune poor adapted results this much worse than best result
classify_adapted_pruning_threshold -1 Threshold at which classify_adapted_pruning_factor starts
@@ -531,11 +530,12 @@ language_model_penalty_chartype 0.3 Penalty for inconsistent character type
language_model_penalty_font 0 Penalty for inconsistent font
language_model_penalty_spacing 0.05 Penalty for inconsistent spacing
language_model_penalty_increment 0.01 Penalty increment
thresholding_window_size 0.33 Window size for measuring local statistics (to be multiplied by image DPI). This parameter is used by the Sauvola thresolding method
thresholding_kfactor 0.34 Factor for reducing threshold due to variance. This parameter is used by the Sauvola thresolding method. Normal range: 0.2-0.5
thresholding_tile_size 0.33 Desired tile size (to be multiplied by image DPI). This parameter is used by the LeptonicaOtsu thresolding method
thresholding_smooth_kernel_size 0 Size of convolution kernel applied to threshold array (to be multiplied by image DPI). Use 0 for no smoothing. This parameter is used by the LeptonicaOtsu thresolding method
thresholding_score_fraction 0.1 Fraction of the max Otsu score. This parameter is used by the LeptonicaOtsu thresolding method. For standard Otsu use 0.0, otherwise 0.1 is recommended
invert_threshold 0.7 For lines with a mean confidence below this value, OCR is also tried with an inverted image
thresholding_window_size 0.33 Window size for measuring local statistics (to be multiplied by image DPI). This parameter is used by the Sauvola thresholding method
thresholding_kfactor 0.34 Factor for reducing threshold due to variance. This parameter is used by the Sauvola thresholding method. Normal range: 0.2-0.5
thresholding_tile_size 0.33 Desired tile size (to be multiplied by image DPI). This parameter is used by the LeptonicaOtsu thresholding method
thresholding_smooth_kernel_size 0 Size of convolution kernel applied to threshold array (to be multiplied by image DPI). Use 0 for no smoothing. This parameter is used by the LeptonicaOtsu thresholding method
thresholding_score_fraction 0.1 Fraction of the max Otsu score. This parameter is used by the LeptonicaOtsu thresholding method. For standard Otsu use 0.0, otherwise 0.1 is recommended
noise_cert_basechar -8 Hingepoint for base char certainty
noise_cert_disjoint -1 Hingepoint for disjoint certainty
noise_cert_punc -3 Threshold for new punc char certainty
2 changes: 1 addition & 1 deletion src/Tesseract/Interop/Constants.cs
@@ -8,7 +8,7 @@ namespace Tesseract.Interop
internal static class Constants
{
public const string LeptonicaDllName = "leptonica-1.82.0";
public const string TesseractDllName = "tesseract50";
public const string TesseractDllName = "tesseract52";

// tesseract uses an int to represent true false values.
public const int TRUE = 1;
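
The rename from `tesseract50` to `tesseract52` in this diff suggests the DLL base name is `tesseract` followed by the major and minor version digits. A small Python sketch of that pattern, inferred only from the two names that appear in this PR; the helper name is hypothetical, and the pattern is an assumption for any other release.

```python
def tesseract_dll_name(version):
    """Map a version such as '5.2.0' to a DLL base name such as 'tesseract52'.

    Assumption: the name is 'tesseract' + major + minor, as seen for 5.0.0
    and 5.2.0 in this PR; the patch component is ignored.
    """
    major, minor, *_ = version.split(".")
    return f"tesseract{int(major)}{int(minor)}"


print(tesseract_dll_name("5.2.0"))
```

A check like this could help keep `Constants.TesseractDllName`, the `.csproj` includes, and the version assertions in the tests in step the next time the binaries are updated.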
Binary file modified src/Tesseract/x64/leptonica-1.82.0.dll
Binary file not shown.
Binary file removed src/Tesseract/x64/tesseract.exe
Binary file not shown.
Binary file removed src/Tesseract/x64/tesseract50.dll
Binary file not shown.
Binary file added src/Tesseract/x64/tesseract52.dll
Binary file not shown.
Binary file modified src/Tesseract/x86/leptonica-1.82.0.dll
Binary file not shown.
Binary file removed src/Tesseract/x86/tesseract.exe
Binary file not shown.
Binary file removed src/Tesseract/x86/tesseract50.dll
Binary file not shown.
Binary file added src/Tesseract/x86/tesseract52.dll
Binary file not shown.