Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major tesseract improvements #52379

Merged
merged 5 commits into from Dec 20, 2018
Merged

Major tesseract improvements #52379

merged 5 commits into from Dec 20, 2018

Conversation

erikarvstedt
Copy link
Member

@erikarvstedt erikarvstedt commented Dec 16, 2018

  1. Decouple Tesseract from the tessdata language corpus, thus allowing a
    lightweight installation for users that want to dynamically provide their
    language data via environment vars.
  2. Most importantly, this allows users to override the supported languages without
    triggering a time-consuming recompilation. This brings a significant speed boost.
  3. Provide all languages as individual derivations.

Here's a quick demo:

{ pkgs ? (import <nixpkgs> {})}:

with pkgs;
{
  inherit tesseract tesseract4;

  # Assemble the tessdata from individual language derivations without
  # downloading the whole tessdata corpus.
  # This downloads in a just a few seconds. Previously, this triggered a full
  # tessdata download (1.2 GB unzipped) + a full recompilation.
  tesseractWithCustomLangs = tesseract.override {
    enableLanguages = [ "eng" "fra" ];
  };

  # Low-level custom tessdata
  tesseractWithUserTessdata = tesseract.override {
    tessdata = [
      tesseract.languages.eng
      (fetchurl {
        url = "https://github.com/tesseract-ocr/tessdata/blob/3cf1e2df1fe1d1da29295c9ef0983796c7958b7d/tel.traineddata";
        sha256 = "1h4xn6ccd24hv4ps5kg53vk1cxfcqk6w4v1k48lps8f3zpn5aix8";
      })
    ];
  };
}

Obsolete, historical content (for the archives)

To be discussed:

I'd recommend that we provide all tessdata languages as individual derivations.
Bonuses:

  • Increased user friendliness.
  • Languages are no longer downloaded and stored twice in /nix/store
    (once as an individual derivation, once as part of languages.all).
    Although for --optimised stores, the storage concerns are irrelevant.
  • We can get rid of some special-case handling in the tessdata definition in default.nix.

The extra maintenance burden is negligible due to the get-language-hashes.sh helper.

Edit 2:
We've arrived at the following:
Keep a simple, single fixed output drv combining all languages that's used by pkgs.tesseract and is cached by Hydra, as before. This keeps eval times low.
Provide individual derivations for all languages, but don't cache them with Hydra. These drvs are downloaded individually from the original repo source when tesseract.enableLanguages is modified.

Edit:
I've added a commit that implements separate lang drvs.
The evaluation time of nixpkgs.tesseract (which includes all languages) is increased by 6.8% (361ms vs 338ms, via hyperfine) but I think the added flexibility and simplicity is worth the little extra eval time.

Here's a quick script to verify that the new tessdata files are identical to the old ones:

new=$(nix-build --no-out-link -E '(import <nixpkgs-this-PR>  { config = {}; overlays = []; }).tesseract')
old=$(nix-build --no-out-link -E '(import <nixpkgs-unstable> { config = {}; overlays = []; }).tesseract')
for lang in $new/share/tessdata/*.traineddata; do
    oldLang=$old/share/tessdata/$(basename $lang)
    if ! cmp $lang $oldLang; then
        echo $lang differs
        exit 1
    fi
done

@7c6f434c
Copy link
Member

Won't your approach to getting the language hashes fail if the language has already been downloaded before (but changed upstream)?

@erikarvstedt
Copy link
Member Author

Thanks, of course it fails 🐱. Fixed here.

@7c6f434c
Copy link
Member

If you already need to do all this, maybe just use nix-prefetch-url? I tihnk nix-build doesn't check TLS certificates.

@erikarvstedt
Copy link
Member Author

Good point! I've switched to nix-prefetch-url.

, leptonica, libpng, libtiff, icu, pango, opencl-headers
# Supported list of languages or `null' for all available languages
, enableLanguages ? null
# if you want just a specific list of languages, optionally specify a hash
# to make tessdata a fixed output derivation.
, enableLanguagesHash ? (if enableLanguages == null # all languages
then "1h48xfzabhn0ldbx5ib67cp9607pr0zpblsy8z6fs4knn0zznfnw"
then "11bi1hj2ihqrgvi9cam8mi70p4spm3syljkpnbglf4s8jkpfn15a"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this default ever used? If enableLanguages is null, languages.all is passed through without any extra processing…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your're right, it's never used.
If we switch to single lang derivations only, we can get rid of enableLanguagesHash completely. Otherwise, we should remove the unused default value. Edit: I've removed the default value in a fixup commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if selected-languages case works, killing it might be suboptimal: there are some substantially two-language texts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that if we switch to single lang derivations only, tessdata could be just a list of fixed-output lang derivations, so no extra output hash would be needed.

buildCommand = ''
mkdir $out
cd ${languages.all}
cp ${stdenv.lib.concatMapStringsSep " " (x: x + ".traineddata") enableLanguages} $out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything large enough not copied in the downloaded language packs? If no, you could just ln -s, I guess?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If we just need a single exotic language, like cym, all other languages (1.2 GB) are needlessly kept in the store.
The old behaviour was also to copy the files.
Edit: Note that we can get rid of this if-branch when we switch to individual lang derivations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad, I thought each language is copied from the individual language pack.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you missed it: I've added a line to my previous reply.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand that if now. «If we mention all languages, use individual derivations instead of all»?

If just a list of derivations selected by language names would work without copying, that sounds better than that extra composition step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually tending towards 3. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also the option 2.1, which fetches the whole repository and sets empty meta.hydraPlatforms for the individual languages. Then only English is stored twice on Hydra (used for VM tests). Arguably, we could even avoid keeping any language except English on Hydra… (option 2.2)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.2 might introduce a too large dependence (and burden) on GitHub: Every user fetching tesseract or pyocr would have download the whole of tessdata from GitHub.

2.1 sounds nice, along with removing nixpkgs.tesseractLanguages. Would you recommend that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, silly me, by removing nixpkgs.tesseractLanguages Hydra wouldn't store the individual languages in the first place. So I guess that's the best solution...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see "Edit 2" in the main PR description for a short summary of our current approach. I think we've hit the sweet spot.


nixSrc=$(sed "s/TESSDATA_REV/$tessdataRev/" <<'EOF'
nixSrc=$(sed -e "s/TESSDATA_REV/$tessdataRev/" -e "s/LANGUAGE_CODES/$langCodesExpr/" <<'EOF'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just me, or would just using shell (and exporting the list from Nix using concatStringsSep, if needed) be simpler in the current approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify a bit what you mean by this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if [[ $localLangs ]]; then
    langCodes=$(nix-instantiate --eval-only -E 'with import ../../../../ {config={}; overlays=[];};
               builtins.toString (builtins.attrNames (builtins.removeAttrs
                            tesseractLanguages [ "recurseForDerivations" "all" ])) ' | tr -d '"')
else
    langCodes=$(echo $(curl -s https://github.com/tesseract-ocr/tessdata/tree/$tessdataRev \
                       | grep -ohP "(?<=/)[^/]+?(?=\.traineddata)" | sort))
fi

for lang in $langCodes; do
  url = "https://github.com/tesseract-ocr/tessdata/raw/${tessdataRev}/${lang}.traineddata";
  echo "${lang} = $(nix-prefetch-url "$url" 2>/dev/null)"
done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(obviously untested)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's infinitely more elegant, thanks for pointing it out! Fixed.

@@ -19486,6 +19486,7 @@ in
termtosvg = callPackage ../tools/misc/termtosvg { };

tesseract = callPackage ../applications/graphics/tesseract { };
tesseractWithoutData = callPackage ../applications/graphics/tesseract { enableLanguages = false; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Maybe just use [] for the no-languages case?

  2. It is already available as both tesseract.tesseractWithoutData and (if you agree with point 1) tesseract.override {enableLanguages = [];}, does it need a top-level attribute or maybe just a top-level comment?

# you can pass `enableLanguages` to `tesseract` to install tessearct with fewer supported languages and save space,
# or even pass `enableLanguages = []` and then supply the language packs via environment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I have no preference, I'll let you decide.
  2. We need a top-level attr so that the no-languages drv is cached. Without it, we would incur a full compilation when enableLanguages is customized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think [] reduces the number of types (list or null seems to be a reasonable union type, list of null or false boolean should provide some value). So I would slightly prefer []

Given the size of the compiled part (less than a single language model), maybe symlink the files from there and only copy the files that need to be edited to replace the path? I think sed can even break symlinks automatically in the inplace mode. Then the binary build becomes a dependency of the toplevel attribute.

(I am not sure if Hydra actually keeps the build depepndencies — then the entire point is moot — but even if not, not a large problem)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just typing up exactly what you proposed here. 😸
With this option, the build script of tesseractWithData will be a bit more complex, but we lose the slightly unelegant extra top-level attr. I think that's preferable.
I'll have breakfast and then implement it. Thanks for your very thoughtful feedback!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, a simple way is just to add a symlink to the original as an extra output. A bit more duplication and a bit less elegant outcome, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@7c6f434c
Copy link
Member

Just in case @viric has any comments, I mention him (he is less active these days, though).

description = "OCR engine";
homepage = https://github.com/tesseract-ocr/tesseract;
license = stdenv.lib.licenses.asl20;
maintainers = with stdenv.lib.maintainers; [viric];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to add yourself as a maintainer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikarvstedt
Copy link
Member Author

There's still one very serious problem to overcome.
Sorry for not mentioning it earlier, I've just come to realize its full scope:
pkgs.tesseract.overrideAttrs is not backwards-compatible, because the overridden attrs are not passed on to the underlying no-languages tesseract drv.

Supporting the old overrideAttrs functionality, while still providing a way to build custom lang tesseract drvs without full recompilation is possible, but very obscure and hacky:
It would involve storing the attrs of tesseractWithoutData in its build output via __structuredAttrs. The tesseractWithData builder could then compare these attrs to its own __structuredAttrs and start a compilation (instead of a copy) when attrs differ. If you're interested, I could expand on that.

Currently, I can see only one simple solution to deploy our new features:
Add a new arg (like withLanguages) to the tesseract package function.

This stuff will be easy when nixpkgs uses nix-native modules and all override* mechanisms are unified, till then it's all rather unsatisfying.

@7c6f434c
Copy link
Member

Well, full compatibility might be hard to reach (but we normally allow this is the refactoring improves other things).

I guess the cleanest appoach would be to have an internal package set with newScope, so that one could do things like tesseract.override { tesseractWithoutData = tesseract.tesseractWithoutData.override {…}; }

@erikarvstedt
Copy link
Member Author

Is there an existing package that's defined in a similar way?

@7c6f434c
Copy link
Member

Well, I guess LibreOffice is similar.

@erikarvstedt
Copy link
Member Author

Sure? <nixpkgs>/pkgs/applications/office/libreoffice doesn't contain the word scope.

@7c6f434c
Copy link
Member

It's because it is called from the top-level; newScope per se is used, say, in winePackages.

Or maybe it should just be called a top-level tesseract-unwrapped — see, for example, weechat

@erikarvstedt
Copy link
Member Author

If we use the weechat approach we're back to tesseractWithoutData, now called tesseract-unwrapped.

What about just adding a tesseractWithoutData arg to the tesseract package function, and also exposing tesseractWithoutData via passthru?
That would make your above example

tesseract.override { tesseractWithoutData = tesseract.tesseractWithoutData.override {…}; }

work with minimal effort.

k2pdfopt is the only place in nixpkgs where tesseract.overrideAttrs is used. It could adapted to a similar style:

tesseract_modded = tesseract.override {
  tesseractWithoutData = tesseract.tesseractWithoutData.overrideAttrs (attrs: {…})
}

@7c6f434c
Copy link
Member

7c6f434c commented Dec 17, 2018 via email

@erikarvstedt
Copy link
Member Author

not sure if using unwrapped (or tesseract.tesseract) is better for the public interface

In the latest update I've settled for tesseract.tesseractBase.

I'm reasonably happy with what we have now. What do you think?

@erikarvstedt erikarvstedt force-pushed the tesseract branch 2 times, most recently from 1e504e8 to d743a0c Compare December 17, 2018 22:46
@7c6f434c
Copy link
Member

@GrahamcOfBorg build tesseract tesseract_4 nixosTests.chromium python3Packages.pyocr k2pdfopt

@7c6f434c
Copy link
Member

Well, in normal use all our version naming schemes do not seem to be too annoying… And of course we have all the possible approaches in use.

tesseract_4 = lowPrio (callPackage ../applications/graphics/tesseract/4.x.nix { });
inherit (callPackage ../applications/graphics/tesseract {})
tesseract
tesseract_4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weirdly, my grep missed that.

@erikarvstedt
Copy link
Member Author

So better remove that alias again? I'd prefer tesseract4 as it's more in line with the most popular versioned pkg names.

@7c6f434c
Copy link
Member

Well, alias is OK.

@erikarvstedt
Copy link
Member Author

Alright, let's do a final borg build.

@7c6f434c
Copy link
Member

Hm. Do you think haveing tesseract3 and tesseract just pointing to tesseract3 (see: love) is a bad idea?

@erikarvstedt
Copy link
Member Author

Like so? erikarvstedt@4770ab7
Yeah, that would be fine.

@erikarvstedt
Copy link
Member Author

Ah, you mean the game engine 😄

inherit (callPackage ../applications/graphics/tesseract {})
    tesseract3
    tesseract4;
tesseract = tesseract3;

That's fine, too, I'll add it.

@erikarvstedt
Copy link
Member Author

Ah, nice screwup. Tesseract 4 expects a different TESSDATA_PREFIX format than version 3, that's why the chromium test fails.
I'll be back in an hour, then I'll fix it.

@erikarvstedt
Copy link
Member Author

@GrahamcOfBorg build tesseract_4 nixosTests.chromium python3Packages.pyocr k2pdfopt

@7c6f434c
Copy link
Member

Is pyocr failure related or not?

@erikarvstedt
Copy link
Member Author

It's unrelated to this PR.

@erikarvstedt
Copy link
Member Author

Here's a similar failure from a PR that doesn't affect Tesseract. Unfortunately, I can't read the full logs right now.

@7c6f434c
Copy link
Member

Ah indeed, it is about cuneiform.py

Rename default.nix -> tesseract3.nix
Rename 4.x.nix -> tesseract4.nix

This is needed for the following commits.
Tesseract is now decoupled from the tessdata language corpus.

This avoids recompilation when building Tesseract with a custom set
of languages.

Update k2pdfopt to use the new wrapper interface.
This frees users from downloading all languages when building
Tesseract with a custom set of languages.

`enableLanguagesHash` is now obsolete.
This is more consistent with the naming of the most popular versioned pkgs.
@erikarvstedt
Copy link
Member Author

@GrahamcOfBorg' bring us two beers!

That was quite a ride. Thanks, Michael, for your patience and your great feedback!

@7c6f434c
Copy link
Member

As of patience, it is definitely me who should thank you!

@Profpatsch
Copy link
Member

Profpatsch commented Dec 21, 2018

  1. If we split bin and man, we can't run man tesseract inside nix-shell -p tesseract (or even from profiles built with nix-env, I haven't tested that). So I'd strongly vote against that split.

Nope, that’s not true.

> du -Lhs /nix/store/kj687mc1zi37yg4c718d5lwla5cs4s47-tesseract-3.05.00
1.1G	/nix/store/kj687mc1zi37yg4c718d5lwla5cs4s47-tesseract-3.05.00

This means I have to download >300MB (and extract them to >1GB) just to read the manpages? That seems excessive to me, and was the reason I submitted #43973 in the first place.

At least the man/doc/devdoc outputs should be separate, even if lib/bin/dev are shared in out.

@7c6f434c
Copy link
Member

7c6f434c commented Dec 22, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants