Comparing changes

I removed that attribute in 68bc260 because the language files were no longer distributed as separate files. However, even if we only want to use the English training data, for example, the closure size of Tesseract then gets quite large (around 1.2 GB), which is a bit much just to be able to run NixOS VM tests.

For this reason I've also switched the VM tests back to using only the English language.

Tested using the following VM tests (the ones that have OCR enabled) on x86_64-linux:

* nixos/tests/chromium.nix -A stable
* nixos/tests/emacs-daemon.nix
* nixos/tests/installer.nix -A luksroot
* nixos/tests/lightdm.nix
* nixos/tests/plasma5.nix
* nixos/tests/sddm.nix

Signed-off-by: aszlig <aszlig@redmoonstudios.org>

The changes are a bit too extensive to include here in the commit message, so if you want the details of what changed, please visit this URL:

http://leptonica.org/source/version-notes.html

I have also provided openjpeg, giflib and libwebp as dependencies so that Leptonica is able to read/write those file formats.

Additionally I've added a patch that uses pkgconfig to resolve all dependencies (except giflib), because unlike AC_CHECK_LIB() the PKG_CHECK_MODULES() macro defines *_LIBS variables to include the linker search path.

Unfortunately that patch alone is not enough, because the *_LIBS variables are substituted by the upstream configure.ac to *not* include the linker search paths, so we need to remove the AC_SUBST() calls within PKG_CHECK_MODULES().

The only dependency that's not yet using PKG_CHECK_MODULES() is giflib, because giflib doesn't have a pkg-config description file. Therefore we're using substituteInPlace to insert the linker search path after the lept.pc file has been generated by configure.

Another thing that we no longer need is the dependency on libpng version 1.2, because Leptonica now also works with more recent libpng versions.

Tested by building the package itself and also the following packages that immediately depend on leptonica:

* k2pdfopt
* tesseract
* jbig2enc

All of these packages built successfully on x86_64-linux.

The main reason why I'm bumping Leptonica to version 1.74.1 is that we need at least version 1.74 to bump Tesseract to the latest upstream version.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>

Upstream changelog:

* Made some fine tuning to the hOCR output.
* Added TSV as another optional output format.
* Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
* text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
* Training tools - Replaced asserts with tprintf() and exit(1).
* Fixed Cygwin compatibility.
* Improved multipage tiff processing.
* Improved the embedded pdf font (pdf.ttf).
* Enable selection of OCR engine mode from command line.
* Changed tesseract command line parameter '-psm' to '--psm'.
* Added new C API for orientation and script detection, removed the old one.
* Increased minimum autoconf version to 2.59.
* Removed dead code.
* Fixed many compiler warnings.
* Fixed memory and resource leaks.
* Fixed some issues with the 'Cube' OCR engine.
* Fixed some openCL issues.
* Added option to build Tesseract with CMake build system.
* Implemented CPPAN support for easy Windows building.

The upstream URL of the change log is:

https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.00

Tested by building against the following packages that directly depend on it:

* vapoursynth (with ocrSupport = true)
* pyocr (fails)
* vobsub2srt

Also tested against the following NixOS VM tests that have OCR enabled:

* nixos/tests/chromium.nix -A stable
* nixos/tests/emacs-daemon.nix
* nixos/tests/installer.nix -A luksroot
* nixos/tests/lightdm.nix
* nixos/tests/plasma5.nix
* nixos/tests/sddm.nix

All of the packages and tests except pyocr build/succeed on x86_64-linux.

Fixing pyocr is outside of the scope of this commit and will happen very soon.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>

Upstream changes for version 0.4.5:

* Clean up exceptions raised when OCR fails:
  * Now, all tools raise only exceptions inheriting from pyocr.PyocrException
  * There is now one and only one TesseractError (shared between pyocr.libtesseract and pyocr.tesseract)

Upstream changes for version 0.4.6:

* hOCR outputs: Generate valid XHTML files

The full upstream changelog can be found at:

https://github.com/jflesch/pyocr/blob/master/ChangeLog

Note that because of the version bump of Tesseract, neither version 0.4.4 nor version 0.4.6 builds successfully, so we need to fix this up soon.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>

This is from the commit message I've written for the upstream pull request (openpaperwork/pyocr#62):

This is a bit more involved, because Tesseract 3.05.00 comes not only with improvements but also with a few quirks we need to deal with.

The first quirk is that the order of the arguments to the `tesseract' command now matters and the list of configurations has to be at the end of the command line. So we add a new attribute tesseract_flags to the BaseBuilder class that contains a list of all the flags to pass to `tesseract'. The tesseract_configs attribute remains pretty much the same, but now only contains a list of configs instead of being mixed with flag arguments.

Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00 now requires. Leptonica has special handling of files that reside in /tmp and assumes that such a file is an internal temporary file of Leptonica. In order to deal with this, we now run Tesseract in a temporary directory which contains the input/output files, and we use the relative names of these files, because Leptonica only searches for path names beginning with /tmp.

Fortunately the last item we need to address is not really a quirk, but an API change. In Tesseract 3.05.00 there is now a new function called TessBaseAPIDetectOrientationScript(), which doesn't fill the OSResults object anymore but instead lets us pass the values we're interested in directly by reference. We need to use this new function because the old function TessBaseAPIDetectOS() now *always* returns false.

I've tested this specifically on NixOS and in conjunction with Paperwork (the only package that's using pyocr so far) and all the tests of the dependency chain are now succeeding. However, I didn't do manual tests of Paperwork.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>
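The first two quirks described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the function names, the file names input.png/output and the output_ext parameter are hypothetical, and only the ordering (flags first, configs last) and the use of relative file names inside a private working directory are taken from the commit message.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(input_name, output_base, flags, configs):
    """Return an argv list with all flags placed before the trailing configs.

    Hypothetical helper: Tesseract 3.05.00 wants the list of config names
    at the very end of the command line, so flags and configs are kept apart.
    """
    return ["tesseract", input_name, output_base] + list(flags) + list(configs)


def run_in_private_workdir(image_bytes, flags, configs, output_ext="txt"):
    """Run tesseract using relative file names only (illustrative sketch).

    The working directory may itself live below /tmp, but tesseract is only
    ever given the relative names, so it never sees a path starting with /tmp.
    """
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, "input.png"), "wb") as handle:
            handle.write(image_bytes)
        cmd = build_tesseract_command("input.png", "output", flags, configs)
        subprocess.run(cmd, cwd=workdir, check=True)
        # Assumption: the caller knows which extension the chosen config
        # produces (for example "txt" by default or "hocr" for the hocr config).
        result_path = os.path.join(workdir, "output." + output_ext)
        with open(result_path, encoding="utf-8") as handle:
            return handle.read()
```

A call such as build_tesseract_command("input.png", "output", ["--psm", "3", "-l", "eng"], ["hocr"]) keeps "hocr" as the last argument, which is the ordering the commit message asks for.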

Tesseract 4 has got a new OCR engine based on a long short-term memory (LSTM) neural network, which really helps a lot in terms of accuracy and our VM tests.

I ran the new version across a bunch of different screenshots and compared the results to the 3.x branch, and it really makes a big difference, especially with various font rendering settings.

The only downside of this is that version 4 hasn't been released yet and is in alpha state right now, but it will eventually get there, and the only solutions that came to my mind for sticking to version 3 were really sub-par:

* Use several passes with different color negation on the screenshots.
* Train Tesseract 3 specifically for screenshots. This is sub-par because we'd need to do it for Tesseract 4 from scratch again.
* Change the test systems so that they use *only* an OCR font when displaying. I've actually tried this, but it also isn't accurate enough with our default font rendering setup.
* Turn off special font rendering settings for our tests. In conjunction with changing to an OCR font this might work, but it won't catch all the cases, because applications might use their own font rendering.

Given that version 4 is faster[1] when it comes to OCR detection, and given the points just mentioned, I think even using the alpha version just for tests isn't going to hurt anybody.

[1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

Signed-off-by: aszlig <aszlig@redmoonstudios.org>

First of all, we're now using ImageMagick to improve the screenshot so that Tesseract has an easier time recognizing the text.

The resulting image of this post-processing is a scaled-up black-and-white version with the backgrounds almost entirely removed and the text edges a bit blurred, so the screenshots now more or less resemble an image from a scanner, which is what Tesseract is trained for by default.

As mentioned in the previous commit, we now also use Tesseract 4, which further improves the quality of text recognition.

I've spent countless hours testing different post-processing variants to find out what works best for our tests, and this is the one that worked best so far. It's certainly not perfect and I'd like to avoid the scaling step, but we're way better off than before.

In addition to this, the OCR process is now done without an intermediate file, solely using pipes.

I've tested this using the following VM tests which have OCR enabled:

* nixos/tests/chromium.nix -A stable
* nixos/tests/emacs-daemon.nix
* nixos/tests/installer.nix -A luksroot
* nixos/tests/lightdm.nix
* nixos/tests/plasma5.nix
* nixos/tests/sddm.nix

All of the tests still succeed, and comparing some of the recognition results to the earlier results shows that it now also detects a lot more text than before this commit.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>
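The pipe-only approach described above could look roughly like the following sketch. It is not the test driver's actual implementation: the function names are hypothetical and the ImageMagick options are merely an assumed stand-in for the post-processing the commit describes (grayscale, scale up, threshold to black and white, slight blur). It relies on `convert' accepting `-' for standard input and `png:-' for standard output, and on `tesseract' accepting `stdin' and `stdout' as file names.

```python
import subprocess


def build_ocr_pipeline():
    """Return the two argv lists of the pipeline (illustrative values only)."""
    convert_cmd = [
        "convert", "-",          # read the screenshot from standard input
        "-colorspace", "Gray",   # drop colour information
        "-resize", "200%",       # scale up (assumed factor)
        "-threshold", "50%",     # force black and white (assumed level)
        "-blur", "0x1",          # soften the text edges a little
        "png:-",                 # write the result to standard output
    ]
    tesseract_cmd = ["tesseract", "stdin", "stdout"]
    return convert_cmd, tesseract_cmd


def ocr_screenshot(image_bytes):
    """Feed a screenshot through convert and tesseract without temp files."""
    convert_cmd, tesseract_cmd = build_ocr_pipeline()
    convert = subprocess.Popen(
        convert_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    tesseract = subprocess.Popen(
        tesseract_cmd, stdin=convert.stdout, stdout=subprocess.PIPE)
    # Close our copy so that convert notices if tesseract exits early.
    convert.stdout.close()
    convert.stdin.write(image_bytes)
    convert.stdin.close()
    text, _ = tesseract.communicate()
    convert.wait()
    return text.decode("utf-8", errors="replace")
```

Keeping the command construction separate from the process handling makes it easy to try other post-processing variants without touching the pipe logic.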


This reverts commit 0a6a063.

The commit replaced the text to search for from ALICE to BOB, because our OCR detection only caught "BOB FOOBAR" but missed "ALICE FOOBAR" completely.

With the improvements to our OCR system this is no longer the case, and the test passes successfully with this reverted.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>
Cc: @shlevy


Commits on Apr 11, 2017
