Invalid encoding with nokogiri #2089

Closed
ahorek opened this issue Nov 3, 2014 · 3 comments

ahorek commented Nov 3, 2014

How to reproduce this:

# encoding: UTF-8
require 'nokogiri'

Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

File.open("test.txt", "r:UTF-8") do |f|
  str = f.read
  puts "Original: #{str}"
  puts "Encoding: #{str.encoding}"
  puts
  doc = Nokogiri::HTML::DocumentFragment.parse(str, 'UTF-8')
  puts "Parsed: #{doc}"
  puts "Encoding: #{doc.to_s.encoding}"
end

Results:
ruby 2.1.3p242 (2014-09-19 revision 47630) [i386-mingw32]
Original: < p >ěščřžýáíé< /p >
Encoding: UTF-8

Parsed: < p >ěščřžýáíé< /p >
Encoding: UTF-8

jruby 1.7.16.1 (1.9.3p392) 2014-10-28 4e93f31 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b18 +jit [Windows 8.1-amd64]
Original: < p >ěščřžýáíé< /p >
Encoding: UTF-8

Parsed: < p >?š??žýáíé< /p > ?????????
Encoding: UTF-8
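
One observation about the JRuby output: the characters that became `?` (ě, č, ř) are exactly the ones Windows-1252 cannot represent, while š, ž, ý, á, í and é all exist in it. The sketch below, in plain Ruby with no nokogiri involved, reproduces the first part of the garbled line by transcoding through Windows-1252 with `?` as the replacement. It only illustrates the pattern. That something similar happens inside the JRuby code path is an assumption, and the sketch does not account for the trailing run of `?`.

```ruby
# encoding: UTF-8
# Sketch: push the sample text through Windows-1252 and back, replacing
# characters that Windows-1252 cannot represent with "?".
original = "ěščřžýáíé"

round_tripped = original
  .encode("Windows-1252", undef: :replace, replace: "?")
  .encode("UTF-8")

puts "Original:      #{original}"
puts "Round-tripped: #{round_tripped}"
```

The round-tripped string is still labelled UTF-8, which would explain why checking `encoding` alone does not reveal the damage.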

From nokogiri 1.6.3.1, /lib/nokogiri/html/document_fragment.rb, with the encodings I observed added as comments:

module Nokogiri
  module HTML
    class DocumentFragment < Nokogiri::XML::DocumentFragment
      attr_accessor :errors

      ####
      # Create a Nokogiri::XML::DocumentFragment from +tags+, using +encoding+
      def self.parse tags, encoding = nil

        ####################################
        # tags.encoding => #<Encoding:UTF-8>
        ####################################

        doc = HTML::Document.new

        encoding ||= tags.respond_to?(:encoding) ? tags.encoding.name : 'UTF-8'
        doc.encoding = encoding

        new(doc, tags)
      end

      def initialize document, tags = nil, ctx = nil

        ###########################################
        # tags.encoding => #<Encoding:Windows-1252> ????????
        ###########################################

        return self unless tags

        if ctx
          preexisting_errors = document.errors.dup
          node_set = ctx.parse("<div>#{tags}</div>")
          node_set.first.children.each { |child| child.parent = self } unless node_set.empty?
          self.errors = document.errors - preexisting_errors
        else
          # This is a horrible hack, but I don't care
          if tags.strip =~ /^<body/i
            path = "/html/body"
          else
            path = "/html/body/node()"
          end

          temp_doc = HTML::Document.parse "<html><body>#{tags}", nil, document.encoding
          temp_doc.xpath(path).each { |child| child.parent = self }
          self.errors = temp_doc.errors
        end
        children
      end
    end
  end
end
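
The two comment banners above show the encoding label changing between `parse` and `initialize`, but a label alone does not say whether the bytes were relabelled or actually transcoded. A minimal diagnostic sketch that could be dropped in at both points to tell the two apart; `describe_string` is a hypothetical helper, not part of nokogiri:

```ruby
# encoding: UTF-8
# Hypothetical helper: summarize a string's encoding label and raw bytes.
def describe_string(label, str)
  {
    label: label,
    encoding: str.encoding.name,
    valid: str.valid_encoding?,
    bytes: str.bytes.map { |b| format("%02x", b) }.join(" ")
  }
end

sample = "ěš"

# Same bytes, different label (what force_encoding does).
relabelled = sample.dup.force_encoding("Windows-1252")

# Different bytes: characters missing from Windows-1252 become "?".
transcoded = sample.encode("Windows-1252", undef: :replace, replace: "?")

puts describe_string("original", sample).inspect
puts describe_string("relabelled", relabelled).inspect
puts describe_string("transcoded", transcoded).inspect
```

If the bytes match the original and only the label differs, the string was relabelled. If the bytes differ, it was transcoded and any `?` bytes (`3f`) mark characters that were lost.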

ahorek commented Dec 30, 2014

Tested on Linux (Ubuntu): the parsed string was encoded correctly there, so this looks like a Windows-specific or Java locale problem.

I tried freezing the original string first:

...
str = f.read
+ str.freeze
...

This has no effect on MRI (the string is unchanged), but JRuby raises an error:

  chomp! at org/jruby/RubyString.java:5775
     new at nokogiri/XmlDocumentFragment.java:93
   parse at C:/jruby-1.7.16/lib/ruby/gems/shared/gems/nokogiri-1.6.5-java/lib/nokogiri/html/document_fragment.rb:14
  (root) at tt.rb:13
    open at org/jruby/RubyIO.java:1181
  (root) at tt.rb:7

https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/XmlDocumentFragment.java#L93
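
The `chomp!` frame in the trace suggests the Java implementation modifies the string it is given. A minimal sketch of that behaviour, assuming nothing about nokogiri's internals beyond the trace; `parse_in_place` is a hypothetical stand-in, not nokogiri code:

```ruby
# Hypothetical stand-in for a parser entry point that trims its input in
# place, as the chomp! frame in the trace suggests.
def parse_in_place(tags)
  tags.chomp!
  tags
end

input = "<p>text</p>\n".freeze

begin
  parse_in_place(input)
  outcome = :no_error
rescue RuntimeError # FrozenError on newer Rubies is a RuntimeError subclass
  outcome = :raised
end

# dup returns an unfrozen copy, so the caller's string is left untouched.
copy_result = parse_in_place(input.dup)

puts outcome
puts copy_result
```

Passing `str.dup` to the parser would be one way to keep the caller's string out of reach of any in-place modification, though that is untested against nokogiri itself.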

Any ideas?


ahorek commented Jan 15, 2015

Update: it works on master, 9.0.0.0 (JRuby + Windows).


kares commented Jun 27, 2017

Working in 9K, and 1.7.x is EOL, thus closing.
