New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Major tesseract improvements #52379
Major tesseract improvements #52379
Conversation
Won't your approach to getting the language hashes fail if the language has already been downloaded before (but changed upstream)? |
3c4ff54
to
fba3881
Compare
Thanks, of course it fails 🐱. Fixed here. |
If you already need to do all this, maybe just use |
Good point! I've switched to |
, leptonica, libpng, libtiff, icu, pango, opencl-headers | ||
# Supported list of languages or `null' for all available languages | ||
, enableLanguages ? null | ||
# if you want just a specific list of languages, optionally specify a hash | ||
# to make tessdata a fixed output derivation. | ||
, enableLanguagesHash ? (if enableLanguages == null # all languages | ||
then "1h48xfzabhn0ldbx5ib67cp9607pr0zpblsy8z6fs4knn0zznfnw" | ||
then "11bi1hj2ihqrgvi9cam8mi70p4spm3syljkpnbglf4s8jkpfn15a" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this default ever used? If enableLanguages
is null
, languages.all
is passed through without any extra processing…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your're right, it's never used.
If we switch to single lang derivations only, we can get rid of enableLanguagesHash
completely. Otherwise, we should remove the unused default value. Edit: I've removed the default value in a fixup commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if selected-languages case works, killing it might be suboptimal: there are some substantially two-language texts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that if we switch to single lang derivations only, tessdata
could be just a list of fixed-output lang derivations, so no extra output hash would be needed.
buildCommand = '' | ||
mkdir $out | ||
cd ${languages.all} | ||
cp ${stdenv.lib.concatMapStringsSep " " (x: x + ".traineddata") enableLanguages} $out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything large enough not copied in the downloaded language packs? If no, you could just ln -s
, I guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If we just need a single exotic language, like cym
, all other languages (1.2 GB) are needlessly kept in the store.
The old behaviour was also to copy the files.
Edit: Note that we can get rid of this if
-branch when we switch to individual lang derivations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, my bad, I thought each language is copied from the individual language pack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case you missed it: I've added a line to my previous reply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand that if
now. «If we mention all languages, use individual derivations instead of all
»?
If just a list of derivations selected by language names would work without copying, that sounds better than that extra composition step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually tending towards 3. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also the option 2.1, which fetches the whole repository and sets empty meta.hydraPlatforms
for the individual languages. Then only English is stored twice on Hydra (used for VM tests). Arguably, we could even avoid keeping any language except English on Hydra… (option 2.2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2.2 might introduce a too large dependence (and burden) on GitHub: Every user fetching tesseract
or pyocr
would have download the whole of tessdata from GitHub.
2.1 sounds nice, along with removing nixpkgs.tesseractLanguages
. Would you recommend that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, silly me, by removing nixpkgs.tesseractLanguages
Hydra wouldn't store the individual languages in the first place. So I guess that's the best solution...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see "Edit 2" in the main PR description for a short summary of our current approach. I think we've hit the sweet spot.
|
||
nixSrc=$(sed "s/TESSDATA_REV/$tessdataRev/" <<'EOF' | ||
nixSrc=$(sed -e "s/TESSDATA_REV/$tessdataRev/" -e "s/LANGUAGE_CODES/$langCodesExpr/" <<'EOF' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it just me, or would just using shell (and exporting the list from Nix using concatStringsSep, if needed
) be simpler in the current approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify a bit what you mean by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if [[ $localLangs ]]; then
langCodes=$(nix-instantiate --eval-only -E 'with import ../../../../ {config={}; overlays=[];};
builtins.toString (builtins.attrNames (builtins.removeAttrs
tesseractLanguages [ "recurseForDerivations" "all" ])) ' | tr -d '"')
else
langCodes=$(echo $(curl -s https://github.com/tesseract-ocr/tessdata/tree/$tessdataRev \
| grep -ohP "(?<=/)[^/]+?(?=\.traineddata)" | sort))
fi
for lang in $langCodes; do
url = "https://github.com/tesseract-ocr/tessdata/raw/${tessdataRev}/${lang}.traineddata";
echo "${lang} = $(nix-prefetch-url "$url" 2>/dev/null)"
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(obviously untested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's infinitely more elegant, thanks for pointing it out! Fixed.
01e3738
to
b1e8686
Compare
pkgs/top-level/all-packages.nix
Outdated
@@ -19486,6 +19486,7 @@ in | |||
termtosvg = callPackage ../tools/misc/termtosvg { }; | |||
|
|||
tesseract = callPackage ../applications/graphics/tesseract { }; | |||
tesseractWithoutData = callPackage ../applications/graphics/tesseract { enableLanguages = false; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Maybe just use
[]
for the no-languages case? -
It is already available as both
tesseract.tesseractWithoutData
and (if you agree with point 1)tesseract.override {enableLanguages = [];}
, does it need a top-level attribute or maybe just a top-level comment?
# you can pass `enableLanguages` to `tesseract` to install tessearct with fewer supported languages and save space,
# or even pass `enableLanguages = []` and then supply the language packs via environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I have no preference, I'll let you decide.
- We need a top-level attr so that the no-languages drv is cached. Without it, we would incur a full compilation when
enableLanguages
is customized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think []
reduces the number of types (list or null seems to be a reasonable union type, list of null or false boolean should provide some value). So I would slightly prefer []
Given the size of the compiled part (less than a single language model), maybe symlink the files from there and only copy the files that need to be edited to replace the path? I think sed
can even break symlinks automatically in the inplace mode. Then the binary build becomes a dependency of the toplevel attribute.
(I am not sure if Hydra actually keeps the build depepndencies — then the entire point is moot — but even if not, not a large problem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just typing up exactly what you proposed here. 😸
With this option, the build script of tesseractWithData
will be a bit more complex, but we lose the slightly unelegant extra top-level attr. I think that's preferable.
I'll have breakfast and then implement it. Thanks for your very thoughtful feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, a simple way is just to add a symlink to the original as an extra output. A bit more duplication and a bit less elegant outcome, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case @viric has any comments, I mention him (he is less active these days, though). |
description = "OCR engine"; | ||
homepage = https://github.com/tesseract-ocr/tesseract; | ||
license = stdenv.lib.licenses.asl20; | ||
maintainers = with stdenv.lib.maintainers; [viric]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to add yourself as a maintainer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still one very serious problem to overcome. Supporting the old Currently, I can see only one simple solution to deploy our new features: This stuff will be easy when nixpkgs uses nix-native modules and all |
Well, full compatibility might be hard to reach (but we normally allow this is the refactoring improves other things). I guess the cleanest appoach would be to have an internal package set with |
Is there an existing package that's defined in a similar way? |
Well, I guess LibreOffice is similar. |
Sure? |
It's because it is called from the top-level; Or maybe it should just be called a top-level |
If we use the What about just adding a
work with minimal effort. k2pdfopt is the only place in nixpkgs where
|
If we use the `weechat` approach we're back to `tesseractWithoutData`, now called `tesseract-unwrapped`.
Well, yes.
What about just adding a `tesseractWithoutData` arg to the tesseract package function, and also exposing `tesseractWithoutData` via `passthru`?
I am separately not sure if using unwrapped (or `tesseract.tesseract`) is better for the public interface, given what other packages do.
k2pdfopt is the only place in nixpkgs where `tesseract.overrideAttrs` is used. It could adapted to a similar style:
Indeed.
|
In the latest update I've settled for I'm reasonably happy with what we have now. What do you think? |
1e504e8
to
d743a0c
Compare
@GrahamcOfBorg build tesseract tesseract_4 nixosTests.chromium python3Packages.pyocr k2pdfopt |
Well, in normal use all our version naming schemes do not seem to be too annoying… And of course we have all the possible approaches in use. |
pkgs/top-level/all-packages.nix
Outdated
tesseract_4 = lowPrio (callPackage ../applications/graphics/tesseract/4.x.nix { }); | ||
inherit (callPackage ../applications/graphics/tesseract {}) | ||
tesseract | ||
tesseract_4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weirdly, my grep missed that.
So better remove that alias again? I'd prefer |
Well, alias is OK. |
9fa055b
to
3f345e0
Compare
Alright, let's do a final borg build. |
3f345e0
to
6d149ee
Compare
Hm. Do you think haveing |
Like so? erikarvstedt@4770ab7 |
Ah, you mean the game engine 😄
That's fine, too, I'll add it. |
Ah, nice screwup. Tesseract 4 expects a different |
@GrahamcOfBorg build tesseract_4 nixosTests.chromium python3Packages.pyocr k2pdfopt |
Is pyocr failure related or not? |
It's unrelated to this PR. |
Here's a similar failure from a PR that doesn't affect Tesseract. Unfortunately, I can't read the full logs right now. |
Ah indeed, it is about |
Rename default.nix -> tesseract3.nix Rename 4.x.nix -> tesseract4.nix This is needed for the following commits.
Tesseract is now decoupled from the tessdata language corpus. This avoids recompilation when building Tesseract with a custom set of languages. Update k2pdfopt to use the new wrapper interface.
This frees users from downloading all languages when building Tesseract with a custom set of languages. `enableLanguagesHash` is now obsolete.
This is more consistent with the naming of the most popular versioned pkgs.
9381296
to
0289f4a
Compare
@GrahamcOfBorg' bring us two beers! That was quite a ride. Thanks, Michael, for your patience and your great feedback! |
As of patience, it is definitely me who should thank you! |
Nope, that’s not true.
This means I have to download >300MB (and extract them to >1GB) just to read the manpages? That seems excessive to me, and was the reason I submitted #43973 in the first place. At least the man/doc/devdoc outputs should be separate, even if lib/bin/dev are shared in out. |
This means I have to download >300MB (and extract them to >1GB) just to read the manpages?
Well, just to read manpages you would download tesseract.tesseractBase (4.5 MiB unpacked)…
|
lightweight installation for users that want to dynamically provide their
language data via environment vars.
triggering a time-consuming recompilation. This brings a significant speed boost.
Here's a quick demo:
Obsolete, historical content (for the archives)
To be discussed:
I'd recommend that we provide all tessdata languages as individual derivations.
Bonuses:
(once as an individual derivation, once as part of languages.all).
Although for
--optimise
d stores, the storage concerns are irrelevant.default.nix
.The extra maintenance burden is negligible due to the
get-language-hashes.sh
helper.Edit 2:
We've arrived at the following:
Keep a simple, single fixed output drv combining all languages that's used by
pkgs.tesseract
and is cached by Hydra, as before. This keeps eval times low.Provide individual derivations for all languages, but don't cache them with Hydra. These drvs are downloaded individually from the original repo source when
tesseract.enableLanguages
is modified.Edit:
I've added a commit that implements separate lang drvs.
The evaluation time of
nixpkgs.tesseract
(which includes all languages) is increased by 6.8% (361ms vs 338ms, via hyperfine) but I think the added flexibility and simplicity is worth the little extra eval time.Here's a quick script to verify that the new tessdata files are identical to the old ones: