Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: String validation could leave invalid Utf8 encoded strings #9096

Merged
merged 1 commit into from Apr 29, 2021

Conversation

rubidium42
Copy link
Contributor

Motivation / Problem

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.

For example, a character array with: (bits) 11000000 10000000 'a', '\0' would yield 11000000, '\0'. Based on the existing could I would have expected it to yield '\0', but then I found that the Utf8Decode decodes the first two bytes to 0 but then checks whether the decode value is more than 0x80. If that is not the case, it return a question mark and a length of 1. This means that 1 byte is copied. Then it starts the loop again, seeing 1000000 which is not a valid first byte for a Utf8Encoded character thus it assumes a length of 4, which would read beyond the buffer and it terminates.

Description

Rewrote the logic. See the documentation in the diff for the details.
The example about will now return yield 'a', '\0' as validated string.

Limitations

If someone rewrites Utf8Decode to not be as strict, a multibyte encoded '\0' would not terminate the string in the validation, only the length of the string would. Later things that use Utf8Decode to get characters might then terminate on '\0'. But, this is all under the premise that someone wants to make it less strict in the first place, and then most of the logic in the validation would be moot.

Checklist for review

Some things are not automated, and forgotten often. This list is a reminder for the reviewers.

  • The bug fix is important enough to be backported? (label: 'backport requested')
  • This PR affects the save game format? (label 'savegame upgrade')
  • This PR affects the GS/AI API? (label 'needs review: Script API')
    • ai_changelog.hpp, gs_changelog.hpp need updating.
    • The compatibility wrappers (compat_*.nut) need updating.
  • This PR affects the NewGRF API? (label 'needs review: NewGRF')

src/string.cpp Outdated Show resolved Hide resolved
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
@rubidium42 rubidium42 force-pushed the string_validation_improvement branch from 4e89581 to 03b092c Compare April 24, 2021 18:06
@rubidium42 rubidium42 added the backport requested This PR should be backport to current release (RC / stable) label Apr 25, 2021
@PeterN PeterN merged commit f00564e into OpenTTD:master Apr 29, 2021
@rubidium42 rubidium42 deleted the string_validation_improvement branch April 30, 2021 05:10
LordAro pushed a commit to LordAro/OpenTTD that referenced this pull request May 1, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro pushed a commit to LordAro/OpenTTD that referenced this pull request May 1, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro pushed a commit to LordAro/OpenTTD that referenced this pull request May 2, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro pushed a commit to LordAro/OpenTTD that referenced this pull request May 2, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro pushed a commit that referenced this pull request May 3, 2021
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
@LordAro LordAro added backported This PR is backported to a current release (RC / stable) and removed backport requested This PR should be backport to current release (RC / stable) labels May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported This PR is backported to a current release (RC / stable)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants