Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glibc: add patch to use utf-8 when LANG is unset #61202

Closed

Conversation

matthewbauer
Copy link
Member

@matthewbauer matthewbauer commented May 9, 2019

This allows the default builder to use utf-8 encoding. Previously,
when LANG and LC_ALL was left unset, glibc would default to “C”. This
is not good in many.

As a result of this, I think we can get rid of a lot of uses of LC_ALL
in Nixpkgs, where utf-8 was needed. These can be found by grepping
for:

(?:LC_ALL|LANG) *= *["']?(?:en_US|C)\.UTF-8["']?

A lot of times the ones using en_US.UTF-8 also add glibcLocales.

/cc @vaibhavsagar

@matthewbauer
Copy link
Member Author

For reference, this is what musl already does:

https://wiki.musl-libc.org/functional-differences-from-glibc.html#Default-locale

@Mic92
Copy link
Member

Mic92 commented May 9, 2019

@matthewbauer I would love to see this. However we should also take macOS into account. If we start to use UTF-8 by default on linux we also should on macOS or otherwise we break accidentally packages for macOS.

I don't have access to macOS however I roughly described in #54485 (comment) what needs to be done in order to get C.UTF-8 support.

@matthewbauer
Copy link
Member Author

@matthewbauer I would love to see this. However we should also take macOS into account. If we start to use UTF-8 by default on linux we also should on macOS or otherwise we break accidentally packages for macOS.

I don't have access to macOS however I roughly described in #54485 (comment) what needs to be done in order to get C.UTF-8 support.

Have you ever seen this problem appear on macOS? It's possible it happens but I can't recall it happening. Usually these hacks are glibc specific.

Copy link
Member

@peti peti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will glibc provide that UTF-8 locale without referencing the glibcLocale files? My understanding is that those files are necessary at run-time, but glibc does not know their location.

@matthewbauer
Copy link
Member Author

Some possible things that may work:

  • we can use PATH_LOCALE to put custom locales in the stdenv
  • create a C.UTF-8, using the same method done in https://reviews.freebsd.org/D17833
  • just set LC_CTYPE=UTF-8 everywhere on macOS

@matthewbauer
Copy link
Member Author

How will glibc provide that UTF-8 locale without referencing the glibcLocale files? My understanding is that those files are necessary at run-time, but glibc does not know their location.

I don't think it's necessary for C.UTF-8. It's definitely necessary for en_US.UTF-8, but that is due to the US localizations. UTF-8 and C should be baked in to glibc.

@Mic92
Copy link
Member

Mic92 commented May 9, 2019

@matthewbauer you don't want to set a invalid locale on macOS or things start to fail.

@@ -37,6 +37,7 @@ in rec {
export CMAKE_OSX_ARCHITECTURES=x86_64
# Workaround for https://openradar.appspot.com/22671534 on 10.11.
export gl_cv_func_getcwd_abort_bug=no
export LC_CTYPE=UTF-8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be still great if we could set LANG on macOS as well. Reason is, that $LC_CTYPE has a different semantics/priority compared to $LANG. LANG has a lower priority then all LC_* variables, so if an existing package already sets LC_ALL or similar to some value, it would be preferred.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to work correctly:

$ env -i LANG=UTF-8 locale
LANG="UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

vs.

$ env -i LANG=en_US.UTF-8 locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

vs.

$ env -i LANG=en_US.UTF-8 locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

This is the same thing described in the FreeBSD proposal:

Introduce C.UTF-8 locale, using the same common CTYPE map as other UTF-8 locales ... and having all other components use C locale

UTF-8 is only valid for LC_CTYPE, which makes sense. This is a BSDism it looks like though, so documentation on it would be helpful. My thinking on this is that if a package really needs a non-UTF-8 encoding in 2019, they should specifically request it through setting LC_CTYPE or LC_ALL. Also just to show that LC_ALL does take precedence:

$ env -i LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 locale
LANG=
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

Copy link
Member

@Mic92 Mic92 May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I meant by having a C.UTF-8 locale generated on macOS instead of using LC_CTYPE=UTF-8.

@ofborg ofborg bot added the 6.topic: stdenv Standard environment label May 9, 2019
@Mic92
Copy link
Member

Mic92 commented May 9, 2019

How will glibc provide that UTF-8 locale without referencing the glibcLocale files? My understanding is that those files are necessary at run-time, but glibc does not know their location.

The code of glibc does first look into the archive in its own installation prefix before falling back to $LOCALE_ARCHIVE, which is quite handy for our use case. In nixpkgs we put C.UTF-8 into $out/lib/locale/locale-archive of the glibc package.

@matthewbauer
Copy link
Member Author

I should verify that c65124a3ce0b59dcf71e4ad9c72f6616f67b7079 works correctly for all of the changed files. It's possible that somewhere the locale vars are set and they specifically want one of the glibcLocales included. This might make sense for certain kinds of tests, but also seems like kind of bad practice.

@edolstra
Copy link
Member

I don't know, it seems scary to patch Glibc in this way. Does upstream have an opinion about this?

@matthewbauer
Copy link
Member Author

I don't know, it seems scary to patch Glibc in this way. Does upstream have an opinion about this?

@edolstra Upstream ticket and discussions:

It looks like it's part of the roadmap, just never been done? I can't find any distros that have done the switch for the default, though many provide C.UTF-8 in their glibc package (as we already do). It seems reasonable to me to enable UTF-8 by default, though.

@vcunat
Copy link
Member

vcunat commented May 15, 2019

Why do these even use the huge full glibcLocales and not one with only this single language? 🤷‍♀️

@matthewbauer
Copy link
Member Author

Yeah unfortunately C.UTF-8 only works on Glibc/Musl. So we need to to either use something that works on macOS like en_US.UTF-8 or else apply a patch like this.

@matthewbauer matthewbauer changed the base branch from master to staging May 18, 2019 19:58
@FRidh FRidh added this to the 19.09 milestone May 28, 2019
@cw789 cw789 mentioned this pull request Jun 23, 2019
10 tasks
This allows the default builder to use utf-8 encoding. Previously,
when LANG and LC_ALL was left unset, glibc would default to “C”. This
is not good in many.

As a result of this, I think we can get rid of a lot of uses of LC_ALL
in Nixpkgs, where utf-8 was needed. These can be found by grepping
for:

  (?:LC_ALL|LANG) *= *["']?(?:en_US|C)\.UTF-8["']?
A couple of places set LANG=“en_US.UTF-8” to get unicode working in
the Nix builder. This is very very bad! Not everyone who uses Nix
lives in the U.S. and is an English speaker. Instead we should use the
“C.UTF-8”, which is an UTF-8 capable version of the unlocalized C
locale. Please use this one!

The previous commit makes Glibc fall back to this locale when LANG is
unset. This is in line with the Musl behavior. As a result, the Nix
builder is made UTF-8 capable without any LOCALE_ARCHIVE or LANG hacks.
Unfortunately, the default macOS locale does not have UTF-8 support.
You need to set this for correct behavior.
These break due to default locale not being what gnulib expects.
@@ -92,6 +92,11 @@ stdenv.mkDerivation ({
url = "https://salsa.debian.org/glibc-team/glibc/raw/49767c9f7de4828220b691b29de0baf60d8a54ec/debian/patches/localedata/locale-C.diff";
sha256 = "0irj60hs2i91ilwg5w7sqrxb695c93xg0ik7yhhq9irprd7fidn4";
})
(fetchurl {
url = "https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=874160;filename=0001-Default-to-C.UTF-8-on-setlocale-.-if-no-env-vars-are.patch;msg=5";
name = "0001-Default-to-C.UTF-8-on-setlocale-.-if-no-env-vars-are.patch";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So debian switched to C.UTF-8 now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is just a proposed patch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewbauer matthewbauer modified the milestones: 19.09, 20.03 Sep 4, 2019
@matthewbauer matthewbauer removed this from the 20.03 milestone Nov 7, 2019
@lovesegfault
Copy link
Member

@matthewbauer This needs a rebase :)

@Mic92
Copy link
Member

Mic92 commented Mar 4, 2020

I got a osx vm now and could actually test the patch I proposed here: #54485 (comment)

@Mic92
Copy link
Member

Mic92 commented May 26, 2020

Alacritty also switched to LC_CTYPE=UTF-8 by default on macOS: alacritty/alacritty#3608 Maybe we just do the same.

@Mic92
Copy link
Member

Mic92 commented May 26, 2020

Instead of patching glibc to default to C.UTF-8 we could start instead with a less intrusive version and just that LANG=C.UTF-8 in stdenv.

@matthewbauer
Copy link
Member Author

Closing for now - glibc probably shouldn't be patched like this & probably should just set LANG=C.UTF-8 in stdenv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants