Fix regex to remove UTF-8 chars invalid in XML 1.0
The regex introduced in the string_html_specialchars() function by commit
2b5d662 caused problems with multibyte
UTF-8 chars, because PCRE requires them to be specified as '\x{NNNN}';
the syntax without braces, '\xNN', only supports up to 2 hex digits [1].

Fixes #14744

[1] http://php.net/regexp.reference.escape
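
As a rough illustration of the difference described above, the sketch below applies both patterns to the same input. The helper names strip_old() and strip_new() are hypothetical and exist only for this comparison; the two patterns are copied from the diff. Without braces, PCRE reads '\xD7FF' as the character \xD7 followed by the literal letters "FF", so the old character class no longer covers the range U+0100 to U+FFFF and strips valid characters such as the euro sign.

```php
<?php
# Hypothetical helpers for comparison only; the patterns come from the diff.

# Pattern before this commit: '\xD7FF' is parsed as \xD7 followed by "FF"
function strip_old( $p_string ) {
	return preg_replace( '/[^\x9\xA\xD\x20-\xD7FF\xE000-\xFFFD\x{10000}-\x{10FFFF}]/u', '', $p_string );
}

# Pattern after this commit: code points above 0xFF use the braced form
function strip_new( $p_string ) {
	return preg_replace( '/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $p_string );
}

$t_euro = "\xE2\x82\xAC";          # U+20AC EURO SIGN, valid in XML 1.0
$t_ctrl = "abc\x01def\x0B";        # contains U+0001 and U+000B, both invalid in XML 1.0

var_dump( strip_old( $t_euro ) === '' );        # old pattern drops the euro sign
var_dump( strip_new( $t_euro ) === $t_euro );   # new pattern keeps it
var_dump( strip_new( $t_ctrl ) === 'abcdef' );  # invalid control chars are still removed
```

All three var_dump() calls are expected to print bool(true) on a PHP build whose PCRE has UTF-8 support, which the /u modifier requires.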
dregad committed Nov 15, 2012
1 parent 7ae2d9a commit ff2e650
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion core/string_api.php
@@ -912,7 +912,7 @@ function string_html_entities( $p_string ) {
 function string_html_specialchars( $p_string ) {
 	# Remove any invalid character from the string per XML 1.0 specification
 	# http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Char
-	$p_string = preg_replace( '/[^\x9\xA\xD\x20-\xD7FF\xE000-\xFFFD\x{10000}-\x{10FFFF}]/u', '', $p_string );
+	$p_string = preg_replace( '/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $p_string );
 
 	# achumakov: @ added to avoid warning output in unsupported codepages
 	# e.g. 8859-2, windows-1257, Korean, which are treated as 8859-1.