Major tesseract improvements #52379

erikarvstedt · 2018-12-16T14:26:48Z

Decouple Tesseract from the tessdata language corpus, thus allowing a
lightweight installation for users that want to dynamically provide their
language data via environment vars.
Most importantly, this allows users to override the supported languages without
triggering a time-consuming recompilation. This brings a significant speed boost.
Provide all languages as individual derivations.

Here's a quick demo:

{ pkgs ? (import <nixpkgs> {})}:

with pkgs;
{
  inherit tesseract tesseract4;

  # Assemble the tessdata from individual language derivations without
  # downloading the whole tessdata corpus.
  # This downloads in just a few seconds. Previously, this triggered a full
  # tessdata download (1.2 GB unzipped) + a full recompilation.
  tesseractWithCustomLangs = tesseract.override {
    enableLanguages = [ "eng" "fra" ];
  };

  # Low-level custom tessdata
  tesseractWithUserTessdata = tesseract.override {
    tessdata = [
      tesseract.languages.eng
      (fetchurl {
        url = "https://github.com/tesseract-ocr/tessdata/blob/3cf1e2df1fe1d1da29295c9ef0983796c7958b7d/tel.traineddata";
        sha256 = "1h4xn6ccd24hv4ps5kg53vk1cxfcqk6w4v1k48lps8f3zpn5aix8";
      })
    ];
  };
}

Obsolete, historical content (for the archives)

To be discussed:

I'd recommend that we provide all tessdata languages as individual derivations.
Bonuses:

Increased user friendliness.
Languages are no longer downloaded and stored twice in /nix/store
(once as an individual derivation, once as part of languages.all).
Although for --optimised stores, the storage concerns are irrelevant.
We can get rid of some special-case handling in the tessdata definition in default.nix.

The extra maintenance burden is negligible due to the get-language-hashes.sh helper.

Edit 2:
We've arrived at the following:
Keep a simple, single fixed output drv combining all languages that's used by pkgs.tesseract and is cached by Hydra, as before. This keeps eval times low.
Provide individual derivations for all languages, but don't cache them with Hydra. These drvs are downloaded individually from the original repo source when tesseract.enableLanguages is modified.

Edit:
I've added a commit that implements separate lang drvs.
The evaluation time of nixpkgs.tesseract (which includes all languages) is increased by 6.8% (361ms vs 338ms, via hyperfine) but I think the added flexibility and simplicity are worth the little extra eval time.

Here's a quick script to verify that the new tessdata files are identical to the old ones:

new=$(nix-build --no-out-link -E '(import <nixpkgs-this-PR>  { config = {}; overlays = []; }).tesseract')
old=$(nix-build --no-out-link -E '(import <nixpkgs-unstable> { config = {}; overlays = []; }).tesseract')
for lang in "$new"/share/tessdata/*.traineddata; do
    oldLang=$old/share/tessdata/$(basename "$lang")
    if ! cmp "$lang" "$oldLang"; then
        echo "$lang differs"
        exit 1
    fi
done

7c6f434c · 2018-12-16T14:40:54Z

Won't your approach to getting the language hashes fail if the language has already been downloaded before (but changed upstream)?

erikarvstedt · 2018-12-16T15:23:38Z

Thanks, of course it fails 🐱. Fixed here.

7c6f434c · 2018-12-16T15:29:16Z

If you already need to do all this, maybe just use nix-prefetch-url? I think nix-build doesn't check TLS certificates.

erikarvstedt · 2018-12-16T16:31:27Z

Good point! I've switched to nix-prefetch-url.

7c6f434c · 2018-12-16T16:38:15Z

pkgs/applications/graphics/tesseract/default.nix

 , leptonica, libpng, libtiff, icu, pango, opencl-headers
 # Supported list of languages or `null' for all available languages
 , enableLanguages ? null
 # if you want just a specific list of languages, optionally specify a hash
 # to make tessdata a fixed output derivation.
 , enableLanguagesHash ? (if enableLanguages == null # all languages
-                         then "1h48xfzabhn0ldbx5ib67cp9607pr0zpblsy8z6fs4knn0zznfnw"
+                         then "11bi1hj2ihqrgvi9cam8mi70p4spm3syljkpnbglf4s8jkpfn15a"


Is this default ever used? If enableLanguages is null, languages.all is passed through without any extra processing…

You're right, it's never used.
If we switch to single lang derivations only, we can get rid of enableLanguagesHash completely. Otherwise, we should remove the unused default value. Edit: I've removed the default value in a fixup commit.

I think if the selected-languages case works, killing it might be suboptimal: there are some substantially two-language texts.

I meant that if we switch to single lang derivations only, tessdata could be just a list of fixed-output lang derivations, so no extra output hash would be needed.

7c6f434c · 2018-12-16T16:41:15Z

pkgs/applications/graphics/tesseract/default.nix

+          buildCommand = ''
+            mkdir $out
+            cd ${languages.all}
+            cp ${stdenv.lib.concatMapStringsSep " " (x: x + ".traineddata") enableLanguages} $out


Is there anything large enough not copied in the downloaded language packs? If no, you could just ln -s, I guess?

Yes. If we just need a single exotic language, like cym, all other languages (1.2 GB) are needlessly kept in the store.
The old behaviour was also to copy the files.
Edit: Note that we can get rid of this if-branch when we switch to individual lang derivations.

Oops, my bad, I thought each language is copied from the individual language pack.

In case you missed it: I've added a line to my previous reply.

I don't understand that if now. «If we mention all languages, use individual derivations instead of all»?

If just a list of derivations selected by language names would work without copying, that sounds better than that extra composition step.

I'm actually tending towards 3. What do you think?

There is also the option 2.1, which fetches the whole repository and sets empty meta.hydraPlatforms for the individual languages. Then only English is stored twice on Hydra (used for VM tests). Arguably, we could even avoid keeping any language except English on Hydra… (option 2.2)

2.2 might introduce too large a dependence (and burden) on GitHub: every user fetching tesseract or pyocr would have to download the whole of tessdata from GitHub.

2.1 sounds nice, along with removing nixpkgs.tesseractLanguages. Would you recommend that?

Ah, silly me, by removing nixpkgs.tesseractLanguages Hydra wouldn't store the individual languages in the first place. So I guess that's the best solution...

Please see "Edit 2" in the main PR description for a short summary of our current approach. I think we've hit the sweet spot.

7c6f434c · 2018-12-16T20:00:34Z

pkgs/applications/graphics/tesseract/get-language-hashes.sh


-nixSrc=$(sed "s/TESSDATA_REV/$tessdataRev/" <<'EOF'
+nixSrc=$(sed -e "s/TESSDATA_REV/$tessdataRev/" -e "s/LANGUAGE_CODES/$langCodesExpr/" <<'EOF'


Is it just me, or would just using shell (and exporting the list from Nix using concatStringsSep, if needed) be simpler in the current approach?

Could you clarify a bit what you mean by this?

# Take the language codes from the local nixpkgs checkout or from the upstream repo listing.
if [[ $localLangs ]]; then
    langCodes=$(nix-instantiate --eval-only -E 'with import ../../../../ {config={}; overlays=[];}; builtins.toString (builtins.attrNames (builtins.removeAttrs tesseractLanguages [ "recurseForDerivations" "all" ])) ' | tr -d '"')
else
    langCodes=$(echo $(curl -s https://github.com/tesseract-ocr/tessdata/tree/$tessdataRev \
                       | grep -ohP "(?<=/)[^/]+?(?=\.traineddata)" | sort))
fi
# Print one `lang = hash` line per language.
for lang in $langCodes; do
    url="https://github.com/tesseract-ocr/tessdata/raw/${tessdataRev}/${lang}.traineddata"
    echo "${lang} = $(nix-prefetch-url "$url" 2>/dev/null)"
done

(obviously untested)

Yep, that's infinitely more elegant, thanks for pointing it out! Fixed.

7c6f434c · 2018-12-17T08:44:37Z

pkgs/top-level/all-packages.nix

@@ -19486,6 +19486,7 @@ in
  termtosvg = callPackage ../tools/misc/termtosvg { };

  tesseract = callPackage ../applications/graphics/tesseract { };
+  tesseractWithoutData = callPackage ../applications/graphics/tesseract { enableLanguages = false; };


Maybe just use [] for the no-languages case?

It is already available as both tesseract.tesseractWithoutData and (if you agree with point 1) tesseract.override {enableLanguages = [];}, does it need a top-level attribute or maybe just a top-level comment?

# you can pass `enableLanguages` to `tesseract` to install tesseract with fewer supported languages and save space,
# or even pass `enableLanguages = []` and then supply the language packs via environment.

I have no preference, I'll let you decide.

We need a top-level attr so that the no-languages drv is cached. Without it, we would incur a full compilation when enableLanguages is customized.

I think [] reduces the number of types (list or null seems to be a reasonable union type; list or null or false boolean should provide some value). So I would slightly prefer [].

Given the size of the compiled part (less than a single language model), maybe symlink the files from there and only copy the files that need to be edited to replace the path? I think sed can even break symlinks automatically in the inplace mode. Then the binary build becomes a dependency of the toplevel attribute.

(I am not sure if Hydra actually keeps the build dependencies — then the entire point is moot — but even if not, not a large problem)
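
The symlink-breaking behaviour mentioned above can be sketched as follows. This is only an illustration, assuming GNU sed: without `--follow-symlinks`, its in-place mode writes a new file and renames it over the link. The directory and file names are made up and do not correspond to anything in the actual Tesseract build.

```shell
#!/bin/sh
# Assumption: GNU sed. Without --follow-symlinks, `sed -i` writes a new file
# and renames it over the symlink, so the link target stays untouched.
# All paths below are hypothetical stand-ins for the base and wrapper outputs.
work=$(mktemp -d)
mkdir "$work/base" "$work/wrapped"
echo 'prefix=/base/share' > "$work/base/config.txt"

# The wrapper output starts as a symlink into the base output.
ln -s "$work/base/config.txt" "$work/wrapped/config.txt"

# Edit in place through the symlink.
sed -i 's|/base/share|/wrapped/share|' "$work/wrapped/config.txt"

# Record the result before cleaning up.
if [ -L "$work/wrapped/config.txt" ]; then link_state=symlink; else link_state=regular; fi
wrapped_content=$(cat "$work/wrapped/config.txt")
base_content=$(cat "$work/base/config.txt")
rm -rf "$work"

echo "wrapped file is now: $link_state"
echo "wrapped: $wrapped_content"
echo "base:    $base_content"
```

With GNU sed the wrapped file ends up as a regular file carrying the edited path, while the file in the base directory keeps its original content. That is the property that would let most files stay symlinked and only the edited ones be materialised.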

I was just typing up exactly what you proposed here. 😸
With this option, the build script of tesseractWithData will be a bit more complex, but we lose the slightly inelegant extra top-level attr. I think that's preferable.
I'll have breakfast and then implement it. Thanks for your very thoughtful feedback!

Well, a simple way is just to add a symlink to the original as an extra output. A bit more duplication and a bit less elegant outcome, though.

7c6f434c · 2018-12-17T08:55:01Z

Just in case @viric has any comments, I mention him (he is less active these days, though).

7c6f434c · 2018-12-17T08:55:28Z

pkgs/applications/graphics/tesseract/default.nix

+      description = "OCR engine";
+      homepage = https://github.com/tesseract-ocr/tesseract;
+      license = stdenv.lib.licenses.asl20;
+      maintainers = with stdenv.lib.maintainers; [viric];


You might want to add yourself as a maintainer.

erikarvstedt · 2018-12-17T16:05:13Z

There's still one very serious problem to overcome.
Sorry for not mentioning it earlier, I've just come to realize its full scope:
pkgs.tesseract.overrideAttrs is not backwards-compatible, because the overridden attrs are not passed on to the underlying no-languages tesseract drv.

Supporting the old overrideAttrs functionality, while still providing a way to build custom lang tesseract drvs without full recompilation is possible, but very obscure and hacky:
It would involve storing the attrs of tesseractWithoutData in its build output via __structuredAttrs. The tesseractWithData builder could then compare these attrs to its own __structuredAttrs and start a compilation (instead of a copy) when attrs differ. If you're interested, I could expand on that.

Currently, I can see only one simple solution to deploy our new features:
Add a new arg (like withLanguages) to the tesseract package function.

This stuff will be easy when nixpkgs uses nix-native modules and all override* mechanisms are unified, till then it's all rather unsatisfying.

7c6f434c · 2018-12-17T16:11:46Z

Well, full compatibility might be hard to reach (but we normally allow this if the refactoring improves other things).

I guess the cleanest approach would be to have an internal package set with newScope, so that one could do things like tesseract.override { tesseractWithoutData = tesseract.tesseractWithoutData.override {…}; }

erikarvstedt · 2018-12-17T18:13:04Z

Is there an existing package that's defined in a similar way?

7c6f434c · 2018-12-17T18:21:41Z

Well, I guess LibreOffice is similar.

erikarvstedt · 2018-12-17T18:24:47Z

Sure? <nixpkgs>/pkgs/applications/office/libreoffice doesn't contain the word scope.

7c6f434c · 2018-12-17T18:33:28Z

It's because it is called from the top-level; newScope per se is used, say, in winePackages.

Or maybe it should just be called a top-level tesseract-unwrapped — see, for example, weechat

erikarvstedt · 2018-12-17T19:12:51Z

If we use the weechat approach we're back to tesseractWithoutData, now called tesseract-unwrapped.

What about just adding a tesseractWithoutData arg to the tesseract package function, and also exposing tesseractWithoutData via passthru?
That would make your above example

tesseract.override { tesseractWithoutData = tesseract.tesseractWithoutData.override {…}; }

work with minimal effort.

k2pdfopt is the only place in nixpkgs where tesseract.overrideAttrs is used. It could be adapted to a similar style:

tesseract_modded = tesseract.override {
  tesseractWithoutData = tesseract.tesseractWithoutData.overrideAttrs (attrs: {…});
};

7c6f434c · 2018-12-17T19:19:11Z

If we use the `weechat` approach we're back to `tesseractWithoutData`, now called `tesseract-unwrapped`.

Well, yes.

What about just adding a `tesseractWithoutData` arg to the tesseract package function, and also exposing `tesseractWithoutData` via `passthru`?

I am separately not sure if using unwrapped (or `tesseract.tesseract`) is better for the public interface, given what other packages do.

k2pdfopt is the only place in nixpkgs where `tesseract.overrideAttrs` is used. It could be adapted to a similar style:

Indeed.

erikarvstedt · 2018-12-17T22:38:14Z

not sure if using unwrapped (or tesseract.tesseract) is better for the public interface

In the latest update I've settled for tesseract.tesseractBase.

I'm reasonably happy with what we have now. What do you think?

7c6f434c · 2018-12-19T13:41:30Z

@GrahamcOfBorg build tesseract tesseract_4 nixosTests.chromium python3Packages.pyocr k2pdfopt

7c6f434c · 2018-12-19T13:42:47Z

Well, in normal use all our version naming schemes do not seem to be too annoying… And of course we have all the possible approaches in use.

7c6f434c · 2018-12-19T13:44:05Z

pkgs/top-level/all-packages.nix

-  tesseract_4 = lowPrio (callPackage ../applications/graphics/tesseract/4.x.nix { });
+  inherit (callPackage ../applications/graphics/tesseract {})
+    tesseract
+    tesseract_4;


Missed that?

Weirdly, my grep missed that.

erikarvstedt · 2018-12-19T13:44:28Z

So better remove that alias again? I'd prefer tesseract4 as it's more in line with the most popular versioned pkg names.

7c6f434c · 2018-12-19T13:46:48Z

Well, alias is OK.

erikarvstedt · 2018-12-19T13:53:26Z

Alright, let's do a final borg build.

7c6f434c · 2018-12-19T13:58:43Z

Hm. Do you think having tesseract3 and tesseract just pointing to tesseract3 (see: love) is a bad idea?

erikarvstedt · 2018-12-19T14:05:57Z

Like so? erikarvstedt@4770ab7
Yeah, that would be fine.

erikarvstedt · 2018-12-19T14:07:43Z

Ah, you mean the game engine 😄

inherit (callPackage ../applications/graphics/tesseract {})
    tesseract3
    tesseract4;
tesseract = tesseract3;

That's fine, too, I'll add it.

erikarvstedt · 2018-12-19T14:14:28Z

Ah, nice screwup. Tesseract 4 expects a different TESSDATA_PREFIX format than version 3, that's why the chromium test fails.
I'll be back in an hour, then I'll fix it.
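
For readers unfamiliar with the difference, here is a rough sketch of it. It assumes (as far as I know, but not confirmed in this thread) that Tesseract 3 expects TESSDATA_PREFIX to be the parent directory of `tessdata`, while Tesseract 4 expects the `tessdata` directory itself. The function name `tessdata_prefix` and the example path are hypothetical and are not taken from the PR.

```shell
#!/bin/sh
# tessdata_prefix MAJOR_VERSION TESSDATA_DIR
# Hypothetical helper: prints the TESSDATA_PREFIX value for a given major
# version, under the assumption stated above (v3: parent directory of
# tessdata, v4 and later: the tessdata directory itself).
tessdata_prefix() {
    major=$1
    tessdata_dir=${2%/}   # drop one trailing slash, if present
    if [ "$major" -ge 4 ]; then
        printf '%s\n' "$tessdata_dir"
    else
        printf '%s\n' "$(dirname "$tessdata_dir")"
    fi
}

# Example with a made-up path:
echo "v3: TESSDATA_PREFIX=$(tessdata_prefix 3 /opt/tesseract/share/tessdata)"
echo "v4: TESSDATA_PREFIX=$(tessdata_prefix 4 /opt/tesseract/share/tessdata)"
```

A wrapper that always sets the version 3 style value would, under this assumption, point Tesseract 4 one directory too high, which would be consistent with the failing chromium test.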

erikarvstedt · 2018-12-19T16:40:43Z

@GrahamcOfBorg build tesseract_4 nixosTests.chromium python3Packages.pyocr k2pdfopt

7c6f434c · 2018-12-19T16:48:02Z

Is pyocr failure related or not?

erikarvstedt · 2018-12-19T16:51:01Z

It's unrelated to this PR.

erikarvstedt · 2018-12-19T16:59:58Z

Here's a similar failure from a PR that doesn't affect Tesseract. Unfortunately, I can't read the full logs right now.

7c6f434c · 2018-12-19T17:02:20Z

Ah indeed, it is about cuneiform.py

erikarvstedt · 2018-12-20T14:56:53Z

@GrahamcOfBorg' bring us two beers!

That was quite a ride. Thanks, Michael, for your patience and your great feedback!

7c6f434c · 2018-12-20T20:16:42Z

As for patience, it is definitely me who should thank you!

Profpatsch · 2018-12-21T23:52:37Z

If we split bin and man, we can't run man tesseract inside nix-shell -p tesseract (or even from profiles built with nix-env, I haven't tested that). So I'd strongly vote against that split.

Nope, that’s not true.

> du -Lhs /nix/store/kj687mc1zi37yg4c718d5lwla5cs4s47-tesseract-3.05.00
1.1G	/nix/store/kj687mc1zi37yg4c718d5lwla5cs4s47-tesseract-3.05.00

This means I have to download >300MB (and extract them to >1GB) just to read the manpages? That seems excessive to me, and was the reason I submitted #43973 in the first place.

At least the man/doc/devdoc outputs should be separate, even if lib/bin/dev are shared in out.

7c6f434c · 2018-12-22T00:06:14Z

This means I have to download >300MB (and extract them to >1GB) just to read the manpages?

Well, just to read manpages you would download tesseract.tesseractBase (4.5 MiB unpacked)…

GrahamcOfBorg added 8.has: package (new) 10.rebuild-darwin: 1-10 10.rebuild-linux: 11-100 labels Dec 16, 2018

erikarvstedt force-pushed the tesseract branch from 3c4ff54 to fba3881 Compare December 16, 2018 14:47

7c6f434c reviewed Dec 16, 2018

GrahamcOfBorg added 10.rebuild-linux: 101-500 10.rebuild-linux: 11-100 and removed 10.rebuild-linux: 11-100 10.rebuild-linux: 101-500 labels Dec 16, 2018

erikarvstedt force-pushed the tesseract branch from 01e3738 to b1e8686 Compare December 16, 2018 22:03

7c6f434c reviewed Dec 17, 2018

erikarvstedt force-pushed the tesseract branch 2 times, most recently from 1e504e8 to d743a0c Compare December 17, 2018 22:46

7c6f434c reviewed Dec 19, 2018

GrahamcOfBorg added the 6.topic: nixos label Dec 19, 2018

erikarvstedt force-pushed the tesseract branch from 9fa055b to 3f345e0 Compare December 19, 2018 13:52

erikarvstedt force-pushed the tesseract branch from 3f345e0 to 6d149ee Compare December 19, 2018 13:55

GrahamcOfBorg added the 8.has: clean-up label Dec 19, 2018

erikarvstedt added 5 commits December 19, 2018 18:07

tesseract: change file layout

45d2a2d

Rename default.nix -> tesseract3.nix Rename 4.x.nix -> tesseract4.nix This is needed for the following commits.

tesseract: add a wrapper to setup languages

aaaed13

Tesseract is now decoupled from the tessdata language corpus. This avoids recompilation when building Tesseract with a custom set of languages. Update k2pdfopt to use the new wrapper interface.

tesseract: add separate language derivations

b818997

This frees users from downloading all languages when building Tesseract with a custom set of languages. `enableLanguagesHash` is now obsolete.

tesseract: rename to tesseract4, add alias

8d1ba99

This is more consistent with the naming of the most popular versioned pkgs.

tesseract: add tesseract3 top-level attr

0289f4a

erikarvstedt force-pushed the tesseract branch from 9381296 to 0289f4a Compare December 19, 2018 17:12

7c6f434c merged commit ede54f9 into NixOS:master Dec 20, 2018

