Fix DLLs to be Tesseract 5.2 #667
Open
Methuselah96 wants to merge 4 commits into charlesw:master from Methuselah96:fix-dlls-5.2
@@ -0,0 +1,62 @@
name: build-windows-libs

on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

defaults:
  run:
    shell: cmd

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        arch: [x86, x64]
    runs-on: windows-latest
    steps:
      - name: Install dependencies
        run: vcpkg install giflib:${{ matrix.arch }}-windows-static libjpeg-turbo:${{ matrix.arch }}-windows-static liblzma:${{ matrix.arch }}-windows-static libpng:${{ matrix.arch }}-windows-static tiff:${{ matrix.arch }}-windows-static zlib:${{ matrix.arch }}-windows-static
      - name: Checkout Leptonica
        uses: actions/checkout@v4
        with:
          repository: DanBloomberg/leptonica
          ref: 1.82.0
          path: leptonica
      - name: Build Leptonica ${{ matrix.arch }}
        run: |
          mkdir vs17-${{ matrix.arch }}
          cd vs17-${{ matrix.arch }}
          cmake .. -G "Visual Studio 17 2022" -A ${{ matrix.arch == 'x86' && 'Win32' || 'x64' }} -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_TOOLCHAIN_FILE=%VCPKG_INSTALLATION_ROOT%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=${{ matrix.arch }}-windows-static -DCMAKE_INSTALL_PREFIX=..\..\build\${{ matrix.arch }}
          cmake --build . --config Release --target install
        working-directory: leptonica
      - name: Checkout Tesseract
        uses: actions/checkout@v4
        with:
          repository: tesseract-ocr/tesseract
          ref: 5.2.0
          path: tesseract
      - name: Build Tesseract ${{ matrix.arch }}
        run: |
          mkdir vs17-${{ matrix.arch }}
          cd vs17-${{ matrix.arch }}
          cmake .. -G "Visual Studio 17 2022" -A ${{ matrix.arch == 'x86' && 'Win32' || 'x64' }} -DAUTO_OPTIMIZE=OFF -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_INSTALL_PREFIX=..\..\build\${{ matrix.arch }}
          cmake --build . --config Release --target install
        working-directory: tesseract
      - name: Calculate hash of Leptonica
        run: certutil -hashfile leptonica\vs17-${{ matrix.arch }}\bin\Release\leptonica-1.82.0.dll SHA256
      - name: Calculate hash of Tesseract
        run: certutil -hashfile tesseract\vs17-${{ matrix.arch }}\bin\Release\tesseract52.dll SHA256
      - name: Archive Leptonica ${{ matrix.arch }}
        uses: actions/upload-artifact@v4
        with:
          name: leptonica-${{ matrix.arch }}
          path: leptonica\vs17-${{ matrix.arch }}\bin\Release\leptonica-1.82.0.dll
      - name: Archive Tesseract ${{ matrix.arch }}
        uses: actions/upload-artifact@v4
        with:
          name: tesseract-${{ matrix.arch }}
          path: tesseract\vs17-${{ matrix.arch }}\bin\Release\tesseract52.dll
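
The two `certutil -hashfile ... SHA256` steps in this workflow print a digest for each built DLL. As a rough sketch of how someone might check a downloaded artifact against the digest shown in the job log, the Python below recomputes the SHA-256 locally. The function names `sha256_of_file` and `matches_logged_hash` are hypothetical and are not part of this PR; the normalization step is only an assumption that a copied digest may contain spaces or uppercase letters.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large DLL is not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_logged_hash(path, logged_hash):
    """Compare a file's digest with a hash copied from the CI log.

    Whitespace and letter case in `logged_hash` are ignored (assumption:
    the copied text may not be a clean lowercase hex string).
    """
    normalized = "".join(logged_hash.split()).lower()
    return sha256_of_file(path) == normalized
```

A call such as `matches_logged_hash("tesseract52.dll", "<digest from the log>")` would then return `True` only when the local file is byte-for-byte identical to the one hashed in CI.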
@@ -2,66 +2,32 @@ name: CI

on:
  push:
    branches: [ 'macos-fix' ]
    branches: [master]
  pull_request:
    branches: [ 'main', 'develop' ]
    paths:
    - '**.cs'
    - '**.csproj'

env:
  DOTNET_NOLOGO: true
    branches: [master]

jobs:
  build-and-test:
    name: build-and-test-${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ 'ubuntu-latest', 'windows-latest', 'macOS-latest' ]
        os: [windows-2019, windows-2022]
    runs-on: ${{ matrix.os }}

    steps:
    - uses: actions/checkout@v2
    - uses: actions/checkout@v4

    - name: Setup .NET Core 3.1.x
      uses: actions/setup-dotnet@v1
    - name: Setup .NET
      uses: actions/setup-dotnet@v4
      with:
        dotnet-version: '3.1.x'

    - name: Setup .NET 6.0.x
      uses: actions/setup-dotnet@v1
      with:
        dotnet-version: '6.0.x'

    - name: Install Ubuntu dependencies
      if: runner.os == 'Linux'
      run: |
        sudo apt-get update &&
        sudo apt-get install -y tesseract-ocr libleptonica-dev &&
        ls -la /usr/local/lib &&
        ls -la /usr/lib/x86_64-linux-gnu

    - name: Install macOS dependencies
      if: runner.os == 'macOS'
      run: |
        /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" &&
        brew tap-new local/versions &&
        brew extract --version=4.1.1 tesseract local/versions &&
        brew install local/versions/tesseract@4.1.1 mono-libgdiplus

    - name: Install dependencies
    - name: Restore dependencies
      run: dotnet restore src/Tesseract.sln

    - name: Build
      run: dotnet build --configuration Release --no-restore src/Tesseract.sln
      run: dotnet build src/Tesseract.sln --no-restore

    - name: Test
      run: dotnet test --configuration Release --no-build --verbosity normal --logger trx --results-directory "TestResults" src/Tesseract.sln

    - name: Upload dotnet test results
      uses: actions/upload-artifact@v2
      with:
        name: dotnet-results
        path: TestResults
      if: ${{ always() }}
      run: dotnet test src/Tesseract.sln --no-build --verbosity normal
This file was deleted.
Empty file.
Empty file.
Empty file.
Empty file.
Empty file.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file actually contains an image of the letter "T". However, the margin on the right-hand side of "T" is smaller than the left, and I think that's causing the auto-thresholding algorithm to invert the thresholding and recognize the text as "hl" instead. I updated the PNG and got the expected result:
OLD:
https://github.com/charlesw/tesseract/blob/master/src/Tesseract.Tests/Data/Ocr/PSM_SingleChar.png
NEW:
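
The margin asymmetry described in this comment can be measured directly. The sketch below is a toy illustration only: it works on a hand-written character grid instead of the actual PNG, the helper name `horizontal_margins` is hypothetical, and it does not reproduce Tesseract's thresholding. It simply counts the empty columns on each side of the glyph.

```python
def horizontal_margins(bitmap):
    """Return (left, right) counts of empty columns around the ink.

    `bitmap` is a list of equal-length rows; truthy cells are ink
    (dark pixels) and falsy cells are background.
    """
    width = len(bitmap[0])
    # Columns that contain at least one ink pixel, in left-to-right order.
    ink_columns = [x for x in range(width) if any(row[x] for row in bitmap)]
    if not ink_columns:
        raise ValueError("bitmap contains no ink")
    left = ink_columns[0]
    right = width - 1 - ink_columns[-1]
    return left, right


# A toy "T" drawn off-center: wider margin on the left than on the right.
GLYPH = [
    "....###..",
    ".....#...",
    ".....#...",
    ".....#...",
]
bitmap = [[cell == "#" for cell in row] for row in GLYPH]

left, right = horizontal_margins(bitmap)
print(f"left margin: {left}, right margin: {right}")
```

For this grid the left margin is 4 columns and the right margin is 2, the same kind of imbalance the comment reports for the original test image.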
I was planning to look into that and check whether it was a regression in the Tesseract library itself rather than an issue with the C# wrapper. Thanks for looking into it. I'm surprised that the Tesseract library would return more than one character when it's explicitly instructed to return only a single character.
You make an interesting point about Tesseract returning 2 characters despite the PageSegMode. It might be worth digging deeper into that as a potential Tesseract library defect.