New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace ICU ParagraphLayout with something else? #6922
Comments
I would strongly considering the option of "just winging it". Of all the languages OpenTTD is currently translated into, all except three use regular ASCII spaces to separate words and indicate acceptable line break points. The exceptions are the two Chinese langauges, and Japanese, and for those three you can more or less insert line breaks at any point. A slightly more correct algorithm would avoid inserting breaks before punctuation characters and after opening brackets, but I think those can be looked up in the Unicode database which should be included in ICU. The full Unicode algorithm is rather long. Implementing it in full should be outside the scope of OpenTTD, but implementing a heavily simplified version might be okay? |
IIRC the ASCII-space approach is already implemented when compiling without ICU, so adapting that with a minimal approach for non-ASCII locales might be the easiest way to implement your suggestion. OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes. Hm. Also, I'm not entirely sure if word-breaking is the only (or hardest) part of ICU-layout that needs to be replaced. I guess more investigation in the current code is needed, but I can't really justify the time needed for that... |
Just had a look at the code, it seems that there's already a split between Line 466 in 9cf999b
If we take this route, I wonder if we still need Harfbuzz? These (ancient) Harfbuzz docs suggest Harfbuzz only does shaping of text (given a run of text, a font, language, etc. decide what glyphs to use and how to position them). I believe the shaping in the fallback layouter is fairly simple and consists of simply summing the character widths (and I suspect the glyph selection is also a simple direct mapping of Unicode code points to glyphs in the font). I'm not sure if this would be sufficient or would really need something like Harfbuzz (which also poses the question how this interacts with the fonts we use, which I suspect are custom OpenTTD bitmap fonts, or do we use freetype stuff?). |
Shaping of text is important for various languages/writing systems, most visible for Arabic, but also some indian and asian scripts. Also, proper handling of mixed RTL/LTR content (e.g. English NewGRF texts embedded in arabic) is somewhat complicated. Aditionally, even for latin-based scripts, there are some things which need combining diacritical glyphs. For the Windows Uniscribe implementation I took the approach to handle the Bitmap font with the fallback layouter, as our bitmap font doesn't include anything that needs complex glyph shaping. If you select a language that needs more glyphs, we try to switch to a FreeType font anyway. |
I don't think anyone will work on a replacement as long as the current one is in place. So, for debian I recommend to configure with "--without-icu-layout". |
I tried running without ICU layout and compared the results with the language set to Arabic. I was actually expecting the layout and word wrapping to become worse, but it actually seems the entire glyph selection changed as well. E.g., this is the settings screen without ICU layout: And here's the settings screen with ICU layout: For easy comparison, here's a closeup. Without: With: I don't read Arabic, so it looks to me like the text is completely different (and if it is correct with ICU layout, it would be likely that it is wrong without ICU layout). Perhaps the text is essentially the same, but some characters are intended to kerned or otherwise combined, changing their appearance? I suspect that we are now no longer using ICU for any layouting, while I believe we could still use ICU for glyph selection and maybe some other stuff too? Any Arabic speakers reading along here that can tell me how the text in the two versions compares? Is the second one incorrect, or just a bit less elegant? Does it even say the same thing? Another thing is that on startup, OpenTTD now shows an error message saying there is no support for RTL-languages (which might not be entirely correct, there is at least some support) and suggesting to compile with ICU enabled (which might be counterproductive). Summarizing: I'm not quite ready to drop ICU layout support for the Debian packages just yet, this needs a bit more investigation (but I am intending to drop it, as suggested by Frosch to perhaps force some progress, or at least make the life of the ICU Debian maintainers a bit easier). |
The first image is completely broken (the text is left-to-right instead of right-to-left, and the glyphs has the wrong contextual shapes), the second one is mostly correct but the mixed Arabic and English text seems to be incorrectly re-ordered (the order of the Arabic and English text makes no sense, which likely means bidirectional text is incorrectly set up). This is surprising, since the ICU layout version should be the correct, which makes me wonder if the source text is correct (can you kink to it) or whether some thing is broken with ICU layout or how it is used. |
Oh, I swapped the images, w00ps. I just updated thet comment to fix the labeling (the images are still in the same order, so your comment still applies). So indeed, disabling ICU layout completely breaks RTL languages, then. That's a bit more severe than I had hoped, but perhaps fixing glyph selection / shaping and adding RTL detection would not be too hard. I also compared the Chinese (Traditional) language, and as @nielsmh predicted that only has some small word-wrapping differences, but otherwise looks identical. Looking at the ICU API, it seems it still offers some useful building blocks that could be used to implement the missing piecies:
An alternative for shaping could be to use Harfbuzz, which AFAIU supports (only) shaping. It also seems to suggest shaping is needed for more than arabic (also syriac, indic, thai, ...), but perhaps that's what ICU means with "arabic languages"? |
I also checked Hebrew, the other RTL language supported by OpenTTD now, which also has the character order reversed (but the glyphs seem to be the same, so it probably requires no shaping). |
Arabic languages are quite specifically the languages using the same family of writing systems. There's some differences in orthography and alphabet but I believe they are all supported by a single shaping engine and Unicode block. The ICU Indian languages like Devanagari (easily identified by their "hanging baseline") are a separate beast, and while they don't require the bi-directional algorithm, they instead require different, complex reordering of glyphs. They depend on tables in the font data for this, and for combining multiple code points into single glyphs. (The latter should be very similar to OpenType automatic ligatures for Latin alphabet text.) The conclusion must be, if only ICU is available then |
So, it sounds like adding harfbuzz as a dependency for shaping makes sense. We'll need something else for the bidirectionality detection then - The Harfbuzz docs suggest using ICU or fribidi. For linebreaking, the Harfbuzz docs also suggest using the ICU break iterator API. I'm about to upload a version to Debian that compiles without the ICU layout API. After that, I probably won't find the time to actually implement Harfbuzz in OpenTTD. Any takers for this project? |
I haven’t checked what ICU Layout does exactly, but ICU + HarfBuzz should have all the API one needs to replicate its functionality, so I think it isn’t a bad idea for OpenTTD to just copy it, replace the uses of the deprecated API with direct HarfBuzz calls and always use it. ICU shaping is limited and supports only a subset of Arabic and in general is of limited use, HarfBuzz should be used instead. |
I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib. However, now that ICU is Linux-only (well, not Windows or OSX) anyway, it's probably less of an issue |
Should we perhaps flag UI languages that require shaping in a way that prevents them from being used in builds that lack complex text support? An additional issue, tangential to this, is: What about user-supplied text, what is the primary reading direction for that? Things such as company names, signs, and chat messages. Just assume the UI language directionality, or somehow extend user-supplied strings with input language used? For instance, a German player downloading a savegame from a Saudi-Arabian player, that might cause some labels to render wrong. |
HarfBuzz does not require GLib (Pango does), it has it an optional dependency for Unicode functions, alternative dependency for this is ICU so projects using ICU naturally use it for Unicode functions in HarfBuzz. |
https://packages.debian.org/stretch/libharfbuzz0b Doesn't seem to be what Debian (or Arch) thinks... |
AFAIU, the direction can be derived from the individual characters. Currently, mixing RTL and LTR text in a single string is, AFAIK already used (for things like "OpenTTD" or "readme.txt" inside a translated string, see the screenshots a few comments back) and supported (using ICU, possibly through ICU layout to figure out what parts are RTL and what parts are LTR). So, that should work out, I believe. |
@LordAro, I suspect you can choose to compile HB with or without glib integration. Debian uses with glib, but since it is easy to install it, there's only a bigger install footprint for using HB on Debian, but if you need to compile everything yourself, you can just leave it out. |
This issue has been automatically marked as stale because it has not had any activity in the last two months. |
Although we all agree this needs a solution, it is not happening for 1.9. Another attempt in 1.10? Sorry, I hope this is not too much of an issue :( |
Aaand bumping again! Turns out rewriting the text rendering engine is quite a lot of work |
I don't know if it helps since we don't understand BiDi that well, but for OpenRCT2 we used FriBidi to reorder Arabic. This was implemented in OpenRCT2/OpenRCT2#12837. |
oh hi
or it's just done and there's not much more to do with it.
It's painful for everyone. I have https://unicode-org.atlassian.net/browse/ICU-6829 filed with the at-that-time ideas. But I'm no longer funded for ICU. Pull requests welcome if anyone wants to help out. |
OpenTTD currently uses the ICU ParagraphLayout API for doing
internationalized wordwrapping. This usage is a problematic, since
ICU deprecated and removed their layout API for being buggy and
unmaintained. The ParagraphLayout API that OpenTTD uses is still present
in ICU, but it relies on the removed Layout API. This gap can be be
filled using the icu-le-hb library, which implements the layout API on
top of the Harfbuzz library.
However, it seems icu-le-hb is more of a proof of concept and not really
maintained either. Additionally, because ParagraphLayout is in ICU and
icu-le-hb depends on ICU, this gives an undesirable circular dependency,
which is problematic for at least Debian, since it complicates updating
to new ICU versions,.
See https://bugs.debian.org/894159 for more info on this, in particular
https://bugs.debian.org/894159#45 and onwards. The Debian version of
this issue is https://bugs.debian.org/897233
It seems it would be beneficial to move away from the ICU
ParagraphLayout API, so it can be removed from Debian (OpenTTD is the
only user in Debian), but also because it has indeed proven to be buggy
in the past (OpenTTD talks about ICU-related crashes, though this was
before the icu-le-hb implementation).
In the above linked bug, the following was written:
Since the above was written, the ICU-in-Debian folks have asked about
any progress on the OpenTTD side of things, since they really like us to
remove this dependency from OpenTTD.
This has been discussed on IRC on a few occasions. Some observations:
when using Pango for layout, we'd need ICU anyway. Pango is fairly
heavy-handed, depending on (I think) glib and gdk rendering as well.
directionality detection and multi-line layouting.
an example implementation than a production library. I believe it is
mostly an API translation layer from HB to ICU Layout, without (much)
code that actually does something.
(parts of) it, that could be a feature. It seems we'll still need it
for collating, but it could be that the ParagraphLayout code is the
most buggy part (wishful thinking)?
on ICU ParagraphLayout which depends on ICU Layout (now provided by
icu-le-hb). Vendoring both icu-le-hb and the ParagraphLayout code
from ICU could solve this.
new library from that might be an interesting approach. The code can
then probably be cleaned up from there, since the ICU Layout API is
no longer required, and can be removed internally. It might also not
be needed to preserve the ParagraphLayout API exactly, if we feel
that should be cleaned up (to be closer to the HB API for example),
we can also do that.
LibreOffice can probably do this thing as well). Does everyone use
Pango? Or do they implement their own word-wrapping on top of HB
layouting?
Did anyone make any progress on this subject in the last few months? @LordAro, @TrueBrain?
The text was updated successfully, but these errors were encountered: