Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XML::Node.normalized_text #4020

Closed
wants to merge 2 commits into from

Conversation

Rinkana
Copy link

@Rinkana Rinkana commented Feb 11, 2017

XML::Node has three text functions that all return the same thing: #content, #text, #inner_text.
However they all return the current node's text including the text from its children.

This function only returns the text that the current node has

Example:

node = XML.parse("<foo>foo<bar>bar</bar></foo>")

node.children.first.content # => foobar
node.children.first.normalized_text # => foo

@straight-shoota
Copy link
Member

straight-shoota commented Feb 15, 2017

Why would you call this normalized? That's definitely not self-explanatory. In the context of XML nodes, normalization usually refers to merging adjacent text nodes in a tree and removing empty ones.

Besides naming, I am not sure why this would be necessary, anyway. No XML parser API I could think of provides such a method. What's your use case? I can't think of a common application where you would need only text content from direct descendants. Seems to me like this would probably mean a flawed XML schema... When it's needed, you can just easily write children.select(&.text?).join("", &.to_s) directly in your code and don't need a general library method for that.

@Rinkana
Copy link
Author

Rinkana commented Feb 15, 2017

@straight-shoota:
We where talking about it on IRC and the name popped up.

Also it is not true that no XML parser does this. The ruby XML parser does this by default for the .text method on a node. That's how i came across it, i was migrating Ruby code to Crystal and stumbled upon this difference in results.

@asterite
Copy link
Member

@Rinkana What's this Ruby library that has this normalized_text method?

@asterite
Copy link
Member

By the way:

$ irb
irb(main):001:0> require "nokogiri"
=> true
irb(main):002:0> Nokogiri::XML.parse("<foo>text<bar>another</bar></foo>")
=> #<Nokogiri::XML::Document:0x3ffed2826b60 name="document" children=[#<Nokogiri::XML::Element:0x3ffed28267a0 name="foo" children=[#<Nokogiri::XML::Text:0x3ffed28265c0 "text">, #<Nokogiri::XML::Element:0x3ffed28264d0 name="bar" children=[#<Nokogiri::XML::Text:0x3ffed28262f0 "another">]>]>]>
irb(main):003:0> xml = _
=> #<Nokogiri::XML::Document:0x3ffed2826b60 name="document" children=[#<Nokogiri::XML::Element:0x3ffed28267a0 name="foo" children=[#<Nokogiri::XML::Text:0x3ffed28265c0 "text">, #<Nokogiri::XML::Element:0x3ffed28264d0 name="bar" children=[#<Nokogiri::XML::Text:0x3ffed28262f0 "another">]>]>]>
irb(main):004:0> xml.text
=> "textanother"
irb(main):005:0> xml.normalized_text
NoMethodError: undefined method `normalized_text' for #<Nokogiri::XML::Document:0x007ffda504d6c0>

I also checked Nokogiri's source code, the content method (and also text, which is an alias) invoked xmlNodeGetContent, just like we do.

@straight-shoota
Copy link
Member

@Rinkana Do you mean Nokogiri::XML::Node#content? That's based on xmlNodeGetContent from libxml2 and returns text content from the entire sub-tree. XML::Node#content in crystal does exactly the same.

@Rinkana
Copy link
Author

Rinkana commented Feb 15, 2017

@asterite i've checked. Its REXML that does this:

require "rexml/document"
include REXML

string = <<EOF
  <foo>text<bar>another</bar></foo>
EOF

doc = Document.new string
REXML::XPath.first(doc, "/foo").text # text

@straight-shoota
Copy link
Member

According to the REXML API REXML::Element#text returns the text content of only the first child element. For whatever reason you would need that... 🤔
So this only returns foo:

REXML::Document.new('<foo>foo<bar>bar</bar>baz</foo>').root.text

That's something different than what you are proposing with normalize_text, which would return foobaz. And I still don't see, why a general purpose XML parser should provide such a method.

@Rinkana
Copy link
Author

Rinkana commented Feb 15, 2017

I can give you an example in what usecase it can be useful. I'm using it to parse the OpenGL specs file (gl.xml).

In this file XML types are defined this way:

<types>
  <type>typedef unsigned int <name>GLenum</name>;</type>
  <type>typedef double <name>GLdouble</name>;</type>
</types>

In this usecase i just need the typedefs but without the nested name because they need to be parsed separately.

This is also how i came across it. But this is just a proposal, if you fell that this function is better suited with another name i'd gladly change it.

@straight-shoota
Copy link
Member

Well in that case you could just go with type_elem.children.first.content.

@Rinkana
Copy link
Author

Rinkana commented Feb 16, 2017

Well yeah, you are correct about this case. However that's not the point that i want to address.

Crystal's XML and Ruby's REXML .text method can produce different results.
To help negate any issues coming up migrating Ruby code to Crystal this provides a method that can reproduce Ruby's result. This way the difference is easy to spot and easy to fix.

@straight-shoota
Copy link
Member

straight-shoota commented Feb 16, 2017

Yeah they do. But I don't see why anyone would it to return what rexml does. That doesn't make any sense and can be accomplished in a straightforward way. The #text method should be expected to return all text content inside the node. That's exactly what Nokogiri does, the de-facto standard for XML parsing in Ruby. So if you want compatibility with Ruby APIs, it's much more sensible to relate to Nokogiri.

@spalladino
Copy link
Contributor

Have to agree with @straight-shoota here. The normalized_text method does not match any other API (since REXML returns just the text for the first node), and it seems better to be manually implemented (especially taking into account that the implementation is quite straightforward) than to pollute the XML API with a method with a fairly rare use case.

Still, thank you for the contribution @Rinkana, and please feel free to comment if you think of another use case or argument of why this particular implementation should be included.

@spalladino spalladino closed this Feb 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants