Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

python3Packages.tokenizers: init at 0.8.0 #91342

Merged
merged 1 commit into from Jul 2, 2020
Merged

Conversation

danieldk
Copy link
Contributor

@danieldk danieldk commented Jun 23, 2020

Motivation for this change

I wanted to update the transformers Python package. However, newer versions require the tokenizers Python package. This package is quite interesting by itself:

  • It main package is implemented in Rust.
  • It also contains some Python files.
  • It uses the Rust pyo3 crate for binding Python.
  • Rust pyo3 needs Rust nightly.
  • We can't just use RUSTC_BOOTSTRAP, since pyo3s build script checks the actual version.

So, to actually make this work I chose to:

  • Use buildRustPackage as the main driver, so that Rust crates are properly vendored.
  • Use maturin to build a Python wheel from the combined Rust + Python sources.
  • Use pip to install the wheel in the right location.

I had to patch the build script of the vendored pyo3 to remove the explicit Rust version check or it is not happy with the RUSTC_BOOTSTRAP cheat code.

As icing on the 馃嵃, the tests attempt to download some vocabularies. So, we have to put them in place ourselves for purity.

I haven't been able to test this on macOS yet, it will probably need Security. Builds without issues on my MacBook.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@danieldk
Copy link
Contributor Author

@GrahamcOfBorg build python3Packages.tokenizers

@danieldk
Copy link
Contributor Author

Also cc @pashashocky , since they are the maintainer of the tranformers derivation.

Copy link
Contributor

@jonringer jonringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to improve upon expression, so LGTM
able to be imported fine from python

https://github.com/NixOS/nixpkgs/pull/91342
2 packages built:
python37Packages.tokenizers python38Packages.tokenizers

still would like to see @FRidh 's thoughts on this

@danieldk
Copy link
Contributor Author

danieldk commented Jun 26, 2020

Added one more substituteInPlace to ensure that the wheel gets the correct name (tokenizers and not tokenizers-python).

@danieldk
Copy link
Contributor Author

@FRidh does this look good to you? Would be nice to be able to update the transformers package, since it's trailing behind quite a bit.

@danieldk danieldk changed the title python3Packages.tokenizers: init at 0.7.0 python3Packages.tokenizers: init at 0.8.0 Jul 1, 2020
@danieldk
Copy link
Contributor Author

danieldk commented Jul 1, 2020

Huggingface transformers 3.0.0 is out, together with tokenizers 0.8.0. So, I have updated this PR to 0.8.0.

@drewrisinger
Copy link
Contributor

I had a similar issue when packaging python3Packages.retworkx which uses pyo3 #84945, I'm honestly amazed that you got this to work at all. My solution was to use a pre-compiled wheel, which looks like it might be an option here? https://pypi.org/project/tokenizers/#files

Copy link
Contributor

@drewrisinger drewrisinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving the package disable to the tokenizers file, in keeping with the style of the python-packages.nix file.

pkgs/development/python-modules/tokenizers/default.nix Outdated Show resolved Hide resolved
pkgs/top-level/python-packages.nix Outdated Show resolved Hide resolved
@danieldk
Copy link
Contributor Author

danieldk commented Jul 2, 2020

I had a similar issue when packaging python3Packages.retworkx which uses pyo3 #84945, I'm honestly amazed that you got this to work at all. My solution was to use a pre-compiled wheel, which looks like it might be an option here? https://pypi.org/project/tokenizers/#files

I find it highly preferable to build from source if possible. Also, the derivation can be simplified once tokenizers switches to pyo3 0.11.0, since it does not require Rust nightly.

@drewrisinger
Copy link
Contributor

I find it highly preferable to build from source if possible. Also, the derivation can be simplified once tokenizers switches to pyo3 0.11.0, since it does not require Rust nightly.

Agree about build from source. I didn't realize that the path of building via buildRustPlatform was an option (or worth my time), so thanks for demonstrating that :) 馃帀 . Also, good to know about pyo3 0.11.0 & non-nightly. When I looked into it ~1-2 months ago, it wasn't an option, so glad to hear things are progressing. I'll probably update python3Packages.retworkx then in the future to use something like this or pyo3.

Copy link
Contributor

@drewrisinger drewrisinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Diff LGTM (except minor comments below)
  • Commits LGTM
  • Builds clean
  • Tests clean with ~ python; import tokenizers

pkgs/development/python-modules/tokenizers/default.nix Outdated Show resolved Hide resolved
pkgs/development/python-modules/tokenizers/default.nix Outdated Show resolved Hide resolved
pkgs/development/python-modules/tokenizers/default.nix Outdated Show resolved Hide resolved
Copy link
Member

@FRidh FRidh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. You may want to use buildRustPackage to build the wheel, and then pass that wheel to buildPythonPackage, but there is no need for that.

pkgs/top-level/python-packages.nix Outdated Show resolved Hide resolved
Copy link
Contributor

@drewrisinger drewrisinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(repeat of above checklist)

  • Diff LGTM
  • Commits LGTM
  • Builds clean
  • Tests clean with ~ python; import tokenizers

@FRidh FRidh merged commit 66d0b2a into NixOS:master Jul 2, 2020
@FRidh
Copy link
Member

FRidh commented Jul 2, 2020

Thank you for doing these explicit approvals, it makes it so much easier to filter on what may be merged and what not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants