Fix: String validation could leave invalid Utf8 encoded strings #9096

rubidium42 · 2021-04-24T17:24:30Z

Motivation / Problem

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.

For example, a character array with: (bits) 11000000 10000000 'a', '\0' would yield 11000000, '\0'. Based on the existing could I would have expected it to yield '\0', but then I found that the Utf8Decode decodes the first two bytes to 0 but then checks whether the decode value is more than 0x80. If that is not the case, it return a question mark and a length of 1. This means that 1 byte is copied. Then it starts the loop again, seeing 1000000 which is not a valid first byte for a Utf8Encoded character thus it assumes a length of 4, which would read beyond the buffer and it terminates.

Description

Rewrote the logic. See the documentation in the diff for the details.
The example about will now return yield 'a', '\0' as validated string.

Limitations

If someone rewrites Utf8Decode to not be as strict, a multibyte encoded '\0' would not terminate the string in the validation, only the length of the string would. Later things that use Utf8Decode to get characters might then terminate on '\0'. But, this is all under the premise that someone wants to make it less strict in the first place, and then most of the logic in the validation would be moot.

Checklist for review

Some things are not automated, and forgotten often. This list is a reminder for the reviewers.

The bug fix is important enough to be backported? (label: 'backport requested')
This PR affects the save game format? (label 'savegame upgrade')
This PR affects the GS/AI API? (label 'needs review: Script API')
- ai_changelog.hpp, gs_changelog.hpp need updating.
- The compatibility wrappers (compat_*.nut) need updating.
This PR affects the NewGRF API? (label 'needs review: NewGRF')
- newgrf_debug_data.h may need updating.
- PR must be added to API tracker

src/string.cpp

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.

…TTD#9096) In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.

TrueBrain reviewed Apr 24, 2021

View reviewed changes

src/string.cpp Outdated Show resolved Hide resolved

rubidium42 force-pushed the string_validation_improvement branch from 4e89581 to 03b092c Compare April 24, 2021 18:06

rubidium42 added the backport requested This PR should be backport to current release (RC / stable) label Apr 25, 2021

PeterN approved these changes Apr 29, 2021

View reviewed changes

PeterN merged commit f00564e into OpenTTD:master Apr 29, 2021

rubidium42 deleted the string_validation_improvement branch April 30, 2021 05:10

rubidium42 mentioned this pull request May 2, 2021

Prepare for 1.11.2 release #9160

Merged

LordAro added backported This PR is backported to a current release (RC / stable) and removed backport requested This PR should be backport to current release (RC / stable) labels May 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: String validation could leave invalid Utf8 encoded strings #9096

Fix: String validation could leave invalid Utf8 encoded strings #9096

rubidium42 commented Apr 24, 2021

Fix: String validation could leave invalid Utf8 encoded strings #9096

Fix: String validation could leave invalid Utf8 encoded strings #9096

Conversation

rubidium42 commented Apr 24, 2021

Motivation / Problem

Description

Limitations

Checklist for review