Comparing changes

base repository: NixOS/nixpkgs
base: c8c340b05ac2
...
head repository: NixOS/nixpkgs
compare: 5d5c0d590f35
  • 8 commits
  • 9 files changed
  • 1 contributor

Commits on Apr 11, 2017

  1. tesseract: Reintroduce enableLanguages

    I removed that attribute in 68bc260 because the language files were no
    longer distributed as separate files. However, even if we only want to
    use the English training data, the closure size of Tesseract gets quite
    large (around 1.2 GB), which is a bit much just to be able to run NixOS
    VM tests.
    
    For this reason I've also switched the VM tests back to using only the
    English language.
    
    Tested using the following VM tests (the ones that have OCR enabled) on
    x86_64-linux:
    
     * nixos/tests/chromium.nix -A stable
     * nixos/tests/emacs-daemon.nix
     * nixos/tests/installer.nix -A luksroot
     * nixos/tests/lightdm.nix
     * nixos/tests/plasma5.nix
     * nixos/tests/sddm.nix
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit 288a791
  2. leptonica: 1.72 -> 1.74.1

    The changes are too extensive to list in this commit message, so if
    you want the details of what changed, please visit this URL:
    
    http://leptonica.org/source/version-notes.html
    
    I have also provided openjpeg, giflib and libwebp as dependencies so
    that Leptonica is able to read/write those file formats.
    
    Additionally I've added a patch that uses pkgconfig to resolve all
    dependencies (except giflib), because unlike AC_CHECK_LIB() the
    PKG_CHECK_MODULES() macro defines *_LIBS variables to include the linker
    search path.
    
    Unfortunately that patch alone is not enough, because the upstream
    configure.ac substitutes the *_LIBS variables so that they do *not*
    include the linker search paths, so we need to remove the AC_SUBST()
    calls within PKG_CHECK_MODULES().
    
    The only dependency that's not yet using PKG_CHECK_MODULES() is giflib,
    because giflib doesn't have a pkg-config description file, therefore
    we're using substituteInPlace to insert the linker search path after the
    lept.pc file was generated by configure.
    
    Another thing that we no longer need is the dependency on libpng version
    1.2, because Leptonica now also works with more recent libpng versions.
    
    Tested by building the package itself and also the following packages
    that immediately depend on leptonica:
    
     * k2pdfopt
     * tesseract
     * jbig2enc
    
    All of these packages built successfully on x86_64-linux.
    
    The main reason why I'm bumping Leptonica to version 1.74.1 is that we
    need at least version 1.74 to bump Tesseract to the latest upstream
    version.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit 42bb63f
  3. tesseract: 3.04.01 -> 3.05.00

    Upstream changelog:
    
     * Made some fine tuning to the hOCR output.
     * Added TSV as another optional output format.
     * Fixed ABI break introduced in 3.04.00 with the AnalyseLayout()
       method.
     * text2image tool - Enable all OpenType ligatures available in a font.
       This feature requires Pango 1.38 or newer.
     * Training tools - Replaced asserts with tprintf() and exit(1).
     * Fixed Cygwin compatibility.
     * Improved multipage tiff processing.
     * Improved the embedded pdf font (pdf.ttf).
     * Enable selection of OCR engine mode from command line.
     * Changed tesseract command line parameter '-psm' to '--psm'.
     * Added new C API for orientation and script detection, removed the old
       one.
     * Increased minimum autoconf version to 2.59.
     * Removed dead code.
     * Fixed many compiler warnings.
     * Fixed memory and resource leaks.
     * Fixed some issues with the 'Cube' OCR engine.
     * Fixed some openCL issues.
     * Added option to build Tesseract with CMake build system.
     * Implemented CPPAN support for easy Windows building.
    
    The upstream URL of the change log is:
    
    https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.00
    
    Tested by building against the following packages that directly depend
    on it:
    
     * vapoursynth (with ocrSupport = true)
     * pyocr (fails)
     * vobsub2srt
    
    Also tested against the following NixOS VM tests that have OCR enabled:
    
     * nixos/tests/chromium.nix -A stable
     * nixos/tests/emacs-daemon.nix
     * nixos/tests/installer.nix -A luksroot
     * nixos/tests/lightdm.nix
     * nixos/tests/plasma5.nix
     * nixos/tests/sddm.nix
    
    All of the packages and tests except pyocr build/succeed on
    x86_64-linux.
    
    Fixing pyocr is outside of the scope of this commit and will happen very
    soon.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit c381fa9
  4. pyocr: 0.4.4 -> 0.4.6

    Upstream changes for version 0.4.5:
    
     * Clean up exceptions raised when OCR fails:
     * Now, all tools raise only exceptions inheriting from
       pyocr.PyocrException
     * There is now one and only one TesseractError (shared between
       pyocr.libtesseract and pyocr.tesseract)
    
    Upstream changes for version 0.4.6:
    
     * hOCR outputs: Generate valid XHTML files
    
    The full upstream changelog can be found at:
    
    https://github.com/jflesch/pyocr/blob/master/ChangeLog
    
    Note that because of the version bump of Tesseract, neither version
    0.4.4 nor version 0.4.6 builds successfully, so we need to fix this up
    soon.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit 121751e
  5. pyocr: Add patch to support Tesseract 3.05.00

    This is from the commit message I've written for the upstream pull
    request (openpaperwork/pyocr#62):
    
        This is a bit more involved, because Tesseract 3.05.00 comes not
        only with improvements but also with a few quirks we need to deal
        with.
    
        The first quirk is that the order of the arguments of the
        `tesseract' command now matters: the list of configurations has to
        be at the end of the command line. So we add a new attribute
        tesseract_flags to the BaseBuilder class that contains a list of all
        the flags to pass to `tesseract'. The tesseract_configs attribute
        remains pretty much the same, but now really only contains a list of
        configs instead of being mixed with flag arguments.
    
        Another quirk has to do with Leptonica >= 1.74 which Tesseract
        3.05.00 now requires. Leptonica has special handling of files that
        reside in /tmp and assumes that it's an internal temporary file of
        Leptonica. In order to deal with it, we now run Tesseract in a
        temporary directory, which contains the input/output files and use
        the relative name of these files because Leptonica only searches for
        path names beginning with /tmp.
    
        Fortunately the last item we need to address is not really a quirk,
        but an API change. In Tesseract 3.05.00 there is now a new function
        called TessBaseAPIDetectOrientationScript(), which doesn't fill the
        OSResults object anymore but instead passes back the values we're
        interested in directly by reference. We need to use this new
        function because the old function TessBaseAPIDetectOS() now *always*
        returns false.
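
The first two quirks can be illustrated with a minimal Python sketch. This is not pyocr's actual code: the function names, the `runner` parameter and the fixed `input`/`output` file names are hypothetical and chosen only for illustration. It places flags before config names and runs the command from a fresh working directory using relative file names, so no path passed to the command begins with /tmp.

```python
import os
import shutil
import subprocess
import tempfile


def build_tesseract_command(input_name, output_base, flags=(), configs=()):
    """Return an argv list with all flags first and all config names last."""
    return ["tesseract", input_name, output_base, *flags, *configs]


def run_in_private_tmpdir(image_path, flags=(), configs=(), runner=subprocess.run):
    """Run the command inside a fresh directory, using relative file names.

    `runner` is a hypothetical hook (defaulting to subprocess.run) so the
    sketch can be exercised without Tesseract being installed.
    Returns a dict mapping each produced file name to its contents.
    """
    # Keep the original extension, but use a short relative name.
    input_name = "input" + os.path.splitext(image_path)[1]
    with tempfile.TemporaryDirectory(prefix="ocr-") as workdir:
        shutil.copy(image_path, os.path.join(workdir, input_name))
        argv = build_tesseract_command(input_name, "output", flags, configs)
        # cwd=workdir makes the relative names resolve inside the directory.
        runner(argv, cwd=workdir, check=True)
        results = {}
        for name in sorted(os.listdir(workdir)):
            if name != input_name:
                with open(os.path.join(workdir, name), "rb") as handle:
                    results[name] = handle.read()
        return results
```

For example, `build_tesseract_command("input.png", "output", ["-l", "eng"], ["hocr"])` puts `-l eng` ahead of the `hocr` config, which matches the ordering requirement described above.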
    
    I've tested this specifically on NixOS and in conjunction with Paperwork
    (the only package that's using pyocr so far) and all the tests of the
    dependency chain are now succeeding. However, I didn't test Paperwork
    manually.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit 49cf934
  6. tesseract: Package version 4.x from Git master

    Tesseract 4 has a new OCR engine based on long short-term memory (LSTM)
    neural networks, which really helps a lot in terms of accuracy and our
    VM tests.
    
    I ran the new version across a bunch of different screenshots and
    compared the results to the 3.x branch, and it really makes a big
    difference, especially with various font rendering settings.
    
    The only downside of this is that version 4 hasn't been released yet and
    is in alpha state right now, but it will eventually get there, and the
    only alternatives I could think of that stick with version 3 were
    really sub-par:
    
     * Use several passes with different color negation on the screenshots.
     * Train Tesseract 3 specifically for screenshots. This is sub-par
       because we'd need to do it for Tesseract 4 from scratch again.
     * Change the test systems so that they use *only* an OCR font when
       displaying text. I've actually tried this, but it also isn't
       accurate enough with our default font rendering setup.
     * Turn off special font rendering settings for our tests. In
       conjunction with changing to an OCR font this might work but it won't
       catch all the cases, because applications might use their own font
       rendering.
    
    Given that version 4 is faster[1] when it comes to OCR detection, and
    given the points just mentioned, I think even using the alpha version
    just for tests isn't going to hurt anybody.
    
    [1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit 7b5263e
  7. nixos/testing: Improve quality of OCR

    First of all, we're now using ImageMagick to improve the screenshot so
    that Tesseract has an easier time recognizing the text. The result of
    this post-processing is a scaled-up black-and-white image with the
    backgrounds almost entirely removed and the text edges a bit blurred, so
    the screenshots now more or less resemble an image from a scanner. This
    is what Tesseract is trained for by default.
    
    As mentioned in the previous commit we now also use Tesseract 4, which
    further improves the quality of text recognition.
    
    I've spent countless hours testing different post-processing variants
    to find out what works best for our tests, and this is the one that
    worked best so far. It's certainly not perfect and I'd like to avoid the
    scaling step, but we're way better off than before.
    
    In addition to this, the OCR process is now done without an intermediate
    file, solely using pipes.
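
A pipe-only OCR step of this kind can be sketched in Python as follows. This is a hedged illustration, not the test driver's actual code: the function names are hypothetical, and the ImageMagick options shown (grayscale, upscaling, thresholding, slight blur) are illustrative assumptions rather than the exact settings chosen in this commit. The screenshot is fed to the first process on stdin, each process writes to the next one through a pipe, and the recognized text is read from the last process's stdout.

```python
import subprocess
import threading


def build_ocr_pipeline(scale_percent=200):
    """Return argv lists for an illustrative ImageMagick -> Tesseract chain."""
    convert = [
        "convert", "-",                  # read the screenshot from stdin
        "-colorspace", "Gray",
        "-resize", f"{scale_percent}%",  # scale up
        "-threshold", "50%",             # reduce to black and white
        "-blur", "0x1",                  # soften the text edges a bit
        "tiff:-",                        # write TIFF data to stdout
    ]
    tesseract = ["tesseract", "stdin", "stdout"]
    return [convert, tesseract]


def run_pipeline(commands, data):
    """Feed `data` (bytes) through the chained commands using only pipes."""
    if not commands:
        raise ValueError("at least one command is required")
    procs = []
    for argv in commands:
        stdin = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if procs:
            # Drop our copy so the previous process sees a closed pipe
            # if this one exits early.
            procs[-1].stdout.close()
        procs.append(proc)

    def feed():
        try:
            procs[0].stdin.write(data)
            procs[0].stdin.close()
        except BrokenPipeError:
            pass  # the first process exited before reading everything

    # Write in a separate thread so reading the output cannot deadlock.
    writer = threading.Thread(target=feed)
    writer.start()
    output = procs[-1].stdout.read()
    writer.join()
    for proc, argv in zip(procs, commands):
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, argv)
    return output
```

Assuming ImageMagick and Tesseract are installed, `run_pipeline(build_ocr_pipeline(), png_bytes)` would return the recognized text as bytes without writing any intermediate file.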
    
    I've tested this using the following VM tests which have OCR enabled:
    
     * nixos/tests/chromium.nix -A stable
     * nixos/tests/emacs-daemon.nix
     * nixos/tests/installer.nix -A luksroot
     * nixos/tests/lightdm.nix
     * nixos/tests/plasma5.nix
     * nixos/tests/sddm.nix
    
    All of the tests still succeed, and comparing some of the recognition
    results to the earlier results shows that it now also detects a lot more
    text than before this commit.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    aszlig committed Apr 11, 2017
    Commit a443bdc
  8. Revert "sddm: Fix test."

    This reverts commit 0a6a063.
    
    The commit replaced the text to search for from ALICE to BOB, because
    our OCR detection only caught "BOB FOOBAR" but missed "ALICE FOOBAR"
    completely.
    
    With the improvements to our OCR system this is no longer the case, and
    the test passes with this commit reverted.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>
    Cc: @shlevy
    aszlig committed Apr 11, 2017
    Commit 5d5c0d5