Dir.glob returns UTF-8 string with Windows-31J encoding #4693

Closed
jakago opened this issue Jun 28, 2017 · 3 comments

jakago commented Jun 28, 2017

C:/blah/α.rb

# coding: utf-8

# 'α'.bytes => [206, 177]
path = 'C:/blah/α.rb'
p path.encoding
p path.bytes

Dir.glob(path) do |file|
  p file.encoding
  p file.bytes
end

Environment

jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) Client VM 24.65-b04 on 1.7.0_65-b19 +jit [mswin32-x86]

Windows 7 Ultimate Service Pack 1 32-bit

Expected Behavior

C:\blah>ruby α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

Actual Behavior

C:\blah>jruby α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:Windows-31J>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

jakago commented Jun 29, 2017

Quick fix: pass -Eutf-8 to set the default external encoding.

C:\blah>jruby -Eutf-8 α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

I checked the source code for Dir.glob.
core/src/main/java/org/jruby/RubyDir.java:229:

Encoding enc = runtime.getDefaultExternalEncoding();

On Japanese Windows 7, Encoding.default_external is Windows-31J (also known as cp932 or ms932). This code applies that encoding to the multi-byte file name built by Java, even though the name's bytes are UTF-8.
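
For anyone who cannot change the command line, a minimal sketch of a script-side workaround follows. It assumes the returned bytes really are UTF-8, as in the output above; `retag_as_utf8` is a hypothetical helper, not part of JRuby or Ruby, and the mislabeled string is simulated with `force_encoding` rather than produced by `Dir.glob`.

```ruby
# coding: utf-8

# Hypothetical helper (not a JRuby or Ruby API): relabel a string as UTF-8
# when its bytes already form valid UTF-8. The bytes are never changed.
def retag_as_utf8(name)
  copy = name.dup.force_encoding(Encoding::UTF_8)
  copy.valid_encoding? ? copy : name
end

# Simulate the result from the report: UTF-8 bytes carrying a Windows-31J tag.
mislabeled = 'α.rb'.dup.force_encoding('Windows-31J')

fixed = retag_as_utf8(mislabeled)
p fixed.encoding
p fixed.bytes
```

Because `force_encoding` only changes the label, this is safe only when the bytes are known to be UTF-8. A name that is genuinely Windows-31J would need `encode` instead.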

ahorek commented Sep 2, 2017

Ruby takes the encoding of each result from the input pattern. I tried to fix it, but this case still doesn't work:

Dir.glob(['/tmp'.force_encoding('utf-8'), '/tmp'.force_encoding('windows-1250')])
=> [utf-8, windows-1250]
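
The expected per-pattern behavior can be emulated with a rough sketch like the one below. `glob_with_pattern_encoding` is a hypothetical helper for illustration only; it is not how MRI or JRuby implement `Dir.glob`, and it assumes that relabeling each match with its pattern's encoding is acceptable.

```ruby
require 'tmpdir'

# Hypothetical helper: glob each pattern on its own and tag every match
# with the encoding of the pattern that produced it.
def glob_with_pattern_encoding(patterns)
  Array(patterns).flat_map do |pattern|
    Dir.glob(pattern).map { |match| match.dup.force_encoding(pattern.encoding) }
  end
end

# Use a fresh temporary directory so both patterns are guaranteed to match.
dir = Dir.mktmpdir
patterns = [
  dir.dup.force_encoding('utf-8'),
  dir.dup.force_encoding('windows-1250')
]

encodings = glob_with_pattern_encoding(patterns).map(&:encoding)
p encodings

Dir.rmdir(dir)
```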

headius commented Sep 7, 2017

Reproduced on Unix by forcing Windows-31J as external encoding:

[] ~/projects/jruby $ jruby -EWindows-31J α.rb
#<Encoding:UTF-8>
[206, 177, 46, 114, 98]
#<Encoding:Windows-31J>
[206, 177, 46, 114, 98]

```ruby
# coding: utf-8

# 'α'.bytes => [206, 177]
path = 'α.rb'
p path.encoding
p path.bytes

Dir.glob(path) do |file|
  p file.encoding
  p file.bytes
end
```

@ahorek's fix in #4773 does address the primary issue for this bug, but I'm working on a patch that fixes the Dir.glob([...]) case too.

headius closed this as completed in 49d1eb3 on Sep 7, 2017
headius added this to the JRuby 9.2.0.0 milestone on Sep 7, 2017