Merge branch 'master' into truffle-head
Showing 17 changed files with 1,544 additions and 60 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
require_relative '../stdlib/unicode_normalize' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
require_relative '../../stdlib/unicode_normalize/' + File.basename(__FILE__) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
require_relative '../../stdlib/unicode_normalize/' + File.basename(__FILE__) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
# coding: utf-8 | ||
|
||
# Copyright Ayumu Nojima (野島 歩) and Martin J. Dürst (duerst@it.aoyama.ac.jp) | ||
|
||
# additions to class String for Unicode normalization | ||
class String | ||
# === Unicode Normalization | ||
# | ||
# :call-seq: | ||
# str.unicode_normalize(form=:nfc) | ||
# | ||
# Returns a normalized form of +str+, using Unicode normalizations | ||
# NFC, NFD, NFKC, or NFKD. The normalization form used is determined | ||
# by +form+, which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. | ||
# The default is :nfc. | ||
# | ||
# If the string is not in a Unicode Encoding, then an Exception is raised. | ||
# In this context, 'Unicode Encoding' means any of UTF-8, UTF-16BE/LE, | ||
# and UTF-32BE/LE, as well as GB18030, UCS_2BE, and UCS_4BE. Any encoding | ||
# other than UTF-8 is handled by converting to UTF-8 first, | ||
# which makes it slower than UTF-8. | ||
# | ||
# _Examples_ | ||
# | ||
# "a\u0300".unicode_normalize #=> 'à' (same as "\u00E0") | ||
# "a\u0300".unicode_normalize(:nfc) #=> 'à' (same as "\u00E0") | ||
# "\u00E0".unicode_normalize(:nfd) #=> 'à' (same as "a\u0300") | ||
# "\xE0".force_encoding('ISO-8859-1').unicode_normalize(:nfd) | ||
# #=> Encoding::CompatibilityError raised | ||
# | ||
def unicode_normalize(form = :nfc) | ||
require 'unicode_normalize/normalize.rb' unless defined? UnicodeNormalize | ||
## The following line can be uncommented to avoid repeated checking for | ||
## UnicodeNormalize. However, tests didn't show any noticeable speedup | ||
## when doing this. This comment also applies to the commented out lines | ||
## in String#unicode_normalize! and String#unicode_normalized?. | ||
# String.send(:define_method, :unicode_normalize, ->(form = :nfc) { UnicodeNormalize.normalize(self, form) } ) | ||
UnicodeNormalize.normalize(self, form) | ||
end | ||
|
||
# :call-seq: | ||
# str.unicode_normalize!(form=:nfc) | ||
# | ||
# Destructive version of String#unicode_normalize, doing Unicode | ||
# normalization in place. | ||
# | ||
def unicode_normalize!(form = :nfc) | ||
require 'unicode_normalize/normalize.rb' unless defined? UnicodeNormalize | ||
# String.send(:define_method, :unicode_normalize!, ->(form = :nfc) { replace(unicode_normalize(form)) } ) | ||
replace(unicode_normalize(form)) | ||
end | ||
|
||
# :call-seq: | ||
# str.unicode_normalized?(form=:nfc) | ||
# | ||
# Checks whether +str+ is in Unicode normalization form +form+, | ||
# which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. | ||
# The default is :nfc. | ||
# | ||
# If the string is not in a Unicode Encoding, then an Exception is raised. | ||
# For details, see String#unicode_normalize. | ||
# | ||
# _Examples_ | ||
# | ||
# "a\u0300".unicode_normalized? #=> false | ||
# "a\u0300".unicode_normalized?(:nfd) #=> true | ||
# "\u00E0".unicode_normalized? #=> true | ||
# "\u00E0".unicode_normalized?(:nfd) #=> false | ||
# "\xE0".force_encoding('ISO-8859-1').unicode_normalized? | ||
# #=> Encoding::CompatibilityError raised | ||
# | ||
def unicode_normalized?(form = :nfc) | ||
require 'unicode_normalize/normalize.rb' unless defined? UnicodeNormalize | ||
# String.send(:define_method, :unicode_normalized?, ->(form = :nfc) { UnicodeNormalize.normalized?(self, form) } ) | ||
UnicodeNormalize.normalized?(self, form) | ||
end | ||
end |
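
The documented behaviour of the three `String` methods above can be exercised with a short script. This is a minimal sketch, assuming a Ruby (2.2 or later) in which `String#unicode_normalize`, `String#unicode_normalize!` and `String#unicode_normalized?` are available; the variable names are illustrative only and do not come from the commit.

```ruby
# Sketch only: assumes String#unicode_normalize and friends are available.
decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT
composed   = "\u00E0"    # precomposed LATIN SMALL LETTER A WITH GRAVE

nfc = decomposed.unicode_normalize        # form defaults to :nfc
nfd = composed.unicode_normalize(:nfd)

puts nfc == composed                      # composed form after NFC
puts nfd == decomposed                    # decomposed form after NFD
puts decomposed.unicode_normalized?       # checks NFC by default
puts decomposed.unicode_normalized?(:nfd)

# The destructive variant rewrites the receiver in place.
buffer = decomposed.dup
buffer.unicode_normalize!

# Other Unicode encodings are normalized via UTF-8 and converted back.
utf16 = decomposed.encode(Encoding::UTF_16LE).unicode_normalize

# Non-Unicode encodings are rejected.
latin1 = "\xE0".dup.force_encoding(Encoding::ISO_8859_1)
error_class = nil
begin
  latin1.unicode_normalize(:nfd)
rescue Encoding::CompatibilityError => e
  error_class = e.class
end
```

The UTF-16LE case takes the slower path described in the comments, since the string is transcoded to UTF-8, normalized, and then transcoded back to its original encoding.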
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,168 @@ | ||
# coding: utf-8 | ||
|
||
# Copyright Ayumu Nojima (野島 歩) and Martin J. Dürst (duerst@it.aoyama.ac.jp) | ||
|
||
require 'unicode_normalize/tables.rb' | ||
|
||
|
||
module UnicodeNormalize | ||
## Constant for max hash capacity to avoid DoS attack | ||
MAX_HASH_LENGTH = 18000 # enough for all test cases, otherwise tests get slow | ||
|
||
## Regular Expressions and Hash Constants | ||
REGEXP_D = Regexp.compile(REGEXP_D_STRING, Regexp::EXTENDED) | ||
REGEXP_C = Regexp.compile(REGEXP_C_STRING, Regexp::EXTENDED) | ||
REGEXP_K = Regexp.compile(REGEXP_K_STRING, Regexp::EXTENDED) | ||
NF_HASH_D = Hash.new do |hash, key| | ||
hash.shift if hash.length>MAX_HASH_LENGTH # prevent DoS attack | ||
hash[key] = nfd_one(key) | ||
end | ||
NF_HASH_C = Hash.new do |hash, key| | ||
hash.shift if hash.length>MAX_HASH_LENGTH # prevent DoS attack | ||
hash[key] = nfc_one(key) | ||
end | ||
NF_HASH_K = Hash.new do |hash, key| | ||
hash.shift if hash.length>MAX_HASH_LENGTH # prevent DoS attack | ||
hash[key] = nfkd_one(key) | ||
end | ||
|
||
## Constants For Hangul | ||
# for details such as the meaning of the identifiers below, please see | ||
# http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf, pp. 144/145 | ||
SBASE = 0xAC00 | ||
LBASE = 0x1100 | ||
VBASE = 0x1161 | ||
TBASE = 0x11A7 | ||
LCOUNT = 19 | ||
VCOUNT = 21 | ||
TCOUNT = 28 | ||
NCOUNT = VCOUNT * TCOUNT | ||
SCOUNT = LCOUNT * NCOUNT | ||
|
||
# Unicode-based encodings (except UTF-8) | ||
UNICODE_ENCODINGS = [Encoding::UTF_16BE, Encoding::UTF_16LE, Encoding::UTF_32BE, Encoding::UTF_32LE, | ||
Encoding::GB18030, Encoding::UCS_2BE, Encoding::UCS_4BE] | ||
|
||
## Hangul Algorithm | ||
def self.hangul_decomp_one(target) | ||
syllable_index = target.ord - SBASE | ||
return target if syllable_index < 0 || syllable_index >= SCOUNT | ||
l = LBASE + syllable_index / NCOUNT | ||
v = VBASE + (syllable_index % NCOUNT) / TCOUNT | ||
t = TBASE + syllable_index % TCOUNT | ||
(t==TBASE ? [l, v] : [l, v, t]).pack('U*') + target[1..-1] | ||
end | ||
|
||
def self.hangul_comp_one(string) | ||
length = string.length | ||
if length>1 and 0 <= (lead =string[0].ord-LBASE) and lead < LCOUNT and | ||
0 <= (vowel=string[1].ord-VBASE) and vowel < VCOUNT | ||
lead_vowel = SBASE + (lead * VCOUNT + vowel) * TCOUNT | ||
if length>2 and 0 <= (trail=string[2].ord-TBASE) and trail < TCOUNT | ||
(lead_vowel + trail).chr(Encoding::UTF_8) + string[3..-1] | ||
else | ||
lead_vowel.chr(Encoding::UTF_8) + string[2..-1] | ||
end | ||
else | ||
string | ||
end | ||
end | ||
|
||
## Canonical Ordering | ||
def self.canonical_ordering_one(string) | ||
sorting = string.each_char.collect { |c| [c, CLASS_TABLE[c]] } | ||
(sorting.length-2).downto(0) do |i| # almost, but not exactly bubble sort | ||
(0..i).each do |j| | ||
later_class = sorting[j+1].last | ||
if 0<later_class and later_class<sorting[j].last | ||
sorting[j], sorting[j+1] = sorting[j+1], sorting[j] | ||
end | ||
end | ||
end | ||
return sorting.collect(&:first).join('') | ||
end | ||
|
||
## Normalization Forms for Patterns (not whole Strings) | ||
def self.nfd_one(string) | ||
string = string.chars.map! {|c| DECOMPOSITION_TABLE[c] || c}.join('') | ||
canonical_ordering_one(hangul_decomp_one(string)) | ||
end | ||
|
||
def self.nfkd_one(string) | ||
string.chars.map! {|c| KOMPATIBLE_TABLE[c] || c}.join('') | ||
end | ||
|
||
def self.nfc_one(string) | ||
nfd_string = nfd_one string | ||
start = nfd_string[0] | ||
last_class = CLASS_TABLE[start]-1 | ||
accents = '' | ||
nfd_string[1..-1].each_char do |accent| | ||
accent_class = CLASS_TABLE[accent] | ||
if last_class<accent_class and composite = COMPOSITION_TABLE[start+accent] | ||
start = composite | ||
else | ||
accents << accent | ||
last_class = accent_class | ||
end | ||
end | ||
hangul_comp_one(start+accents) | ||
end | ||
|
||
def self.normalize(string, form = :nfc) | ||
encoding = string.encoding | ||
case encoding | ||
when Encoding::UTF_8 | ||
case form | ||
when :nfc then | ||
string.gsub REGEXP_C, NF_HASH_C | ||
when :nfd then | ||
string.gsub REGEXP_D, NF_HASH_D | ||
when :nfkc then | ||
string.gsub(REGEXP_K, NF_HASH_K).gsub REGEXP_C, NF_HASH_C | ||
when :nfkd then | ||
string.gsub(REGEXP_K, NF_HASH_K).gsub REGEXP_D, NF_HASH_D | ||
else | ||
raise ArgumentError, "Invalid normalization form #{form}." | ||
end | ||
when Encoding::US_ASCII | ||
string | ||
when *UNICODE_ENCODINGS | ||
normalize(string.encode(Encoding::UTF_8), form).encode(encoding) | ||
else | ||
raise Encoding::CompatibilityError, "Unicode Normalization not appropriate for #{encoding}" | ||
end | ||
end | ||
|
||
def self.normalized?(string, form = :nfc) | ||
encoding = string.encoding | ||
case encoding | ||
when Encoding::UTF_8 | ||
case form | ||
when :nfc then | ||
string.scan REGEXP_C do |match| | ||
return false if NF_HASH_C[match] != match | ||
end | ||
true | ||
when :nfd then | ||
string.scan REGEXP_D do |match| | ||
return false if NF_HASH_D[match] != match | ||
end | ||
true | ||
when :nfkc then | ||
normalized?(string, :nfc) and string !~ REGEXP_K | ||
when :nfkd then | ||
normalized?(string, :nfd) and string !~ REGEXP_K | ||
else | ||
raise ArgumentError, "Invalid normalization form #{form}." | ||
end | ||
when Encoding::US_ASCII | ||
true | ||
when *UNICODE_ENCODINGS | ||
normalized? string.encode(Encoding::UTF_8), form | ||
else | ||
raise Encoding::CompatibilityError, "Unicode Normalization not appropriate for #{encoding}" | ||
end | ||
end | ||
|
||
end # module |
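
The Hangul arithmetic used by `hangul_decomp_one` and `hangul_comp_one` can be checked in isolation. The following is a minimal sketch, not the commit's code: it reuses the same constant values but works on integer code points, and the module name `HangulSketch` and its method names are hypothetical.

```ruby
# Sketch only: HangulSketch and its methods are illustrative names.
module HangulSketch
  SBASE  = 0xAC00   # first precomposed Hangul syllable
  LBASE  = 0x1100   # first leading consonant
  VBASE  = 0x1161   # first vowel
  TBASE  = 0x11A7   # one before the first trailing consonant
  LCOUNT = 19
  VCOUNT = 21
  TCOUNT = 28
  NCOUNT = VCOUNT * TCOUNT   # 588 syllables per leading consonant
  SCOUNT = LCOUNT * NCOUNT   # 11172 syllables in total

  # Split one syllable code point into its jamo code points.
  def self.decompose(codepoint)
    index = codepoint - SBASE
    return [codepoint] if index < 0 || index >= SCOUNT
    l = LBASE + index / NCOUNT
    v = VBASE + (index % NCOUNT) / TCOUNT
    t = TBASE + index % TCOUNT
    t == TBASE ? [l, v] : [l, v, t]   # no trailing consonant when t == TBASE
  end

  # Inverse mapping: combine jamo code points into one syllable code point.
  def self.compose(l, v, t = TBASE)
    SBASE + ((l - LBASE) * VCOUNT + (v - VBASE)) * TCOUNT + (t - TBASE)
  end
end

han  = 0xD55C                            # U+D55C
jamo = HangulSketch.decompose(han)       # [0x1112, 0x1161, 0x11AB]
back = HangulSketch.compose(*jamo)       # 0xD55C again
ga   = HangulSketch.decompose(0xAC00)    # [0x1100, 0x1161], no trailing jamo
```

For U+D55C the syllable index is 10588, which gives a leading index of 18, a vowel index of 0 and a trailing index of 4, so the round trip through `compose` returns the original code point.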
lib/ruby/truffle/stdlib/unicode_normalize/tables.rb
1,163 changes: 1,163 additions & 0 deletions
Large diffs are not rendered by default.
This file was deleted.
This file was deleted.
This file was deleted.
This file was deleted.