New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: String validation could leave invalid Utf8 encoded strings #9096
Merged
PeterN
merged 1 commit into
OpenTTD:master
from
rubidium42:string_validation_improvement
Apr 29, 2021
Merged
Fix: String validation could leave invalid Utf8 encoded strings #9096
PeterN
merged 1 commit into
OpenTTD:master
from
rubidium42:string_validation_improvement
Apr 29, 2021
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
TrueBrain
reviewed
Apr 24, 2021
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
rubidium42
force-pushed
the
string_validation_improvement
branch
from
April 24, 2021 18:06
4e89581
to
03b092c
Compare
rubidium42
added
the
backport requested
This PR should be backport to current release (RC / stable)
label
Apr 25, 2021
PeterN
approved these changes
Apr 29, 2021
LordAro
pushed a commit
to LordAro/OpenTTD
that referenced
this pull request
May 1, 2021
…TTD#9096) In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro
pushed a commit
to LordAro/OpenTTD
that referenced
this pull request
May 1, 2021
…TTD#9096) In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro
pushed a commit
to LordAro/OpenTTD
that referenced
this pull request
May 2, 2021
…TTD#9096) In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro
pushed a commit
to LordAro/OpenTTD
that referenced
this pull request
May 2, 2021
…TTD#9096) In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro
pushed a commit
that referenced
this pull request
May 3, 2021
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened. Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro
added
backported
This PR is backported to a current release (RC / stable)
and removed
backport requested
This PR should be backport to current release (RC / stable)
labels
May 3, 2021
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation / Problem
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
For example, a character array with: (bits) 11000000 10000000 'a', '\0' would yield 11000000, '\0'. Based on the existing could I would have expected it to yield '\0', but then I found that the Utf8Decode decodes the first two bytes to 0 but then checks whether the decode value is more than 0x80. If that is not the case, it return a question mark and a length of 1. This means that 1 byte is copied. Then it starts the loop again, seeing 1000000 which is not a valid first byte for a Utf8Encoded character thus it assumes a length of 4, which would read beyond the buffer and it terminates.
Description
Rewrote the logic. See the documentation in the diff for the details.
The example about will now return yield 'a', '\0' as validated string.
Limitations
If someone rewrites Utf8Decode to not be as strict, a multibyte encoded '\0' would not terminate the string in the validation, only the length of the string would. Later things that use Utf8Decode to get characters might then terminate on '\0'. But, this is all under the premise that someone wants to make it less strict in the first place, and then most of the logic in the validation would be moot.
Checklist for review
Some things are not automated, and forgotten often. This list is a reminder for the reviewers.