Add XML::Node.normalized_text #4020

Rinkana · 2017-02-11T18:16:12Z

XML::Node has three text functions that all return the same thing: #content, #text, #inner_text.
However they all return the current node's text including the text from its children.

This function returns only the text that belongs directly to the current node, excluding the text of its children.

Example:

node = XML.parse("<foo>foo<bar>bar</bar></foo>")

node.children.first.content # => foobar
node.children.first.normalized_text # => foo

straight-shoota · 2017-02-15T20:48:53Z

Why would you call this normalized? That's definitely not self-explanatory. In the context of XML nodes, normalization usually refers to merging adjacent text nodes in a tree and removing empty ones.

Besides naming, I am not sure why this would be necessary, anyway. No XML parser API I could think of provides such a method. What's your use case? I can't think of a common application where you would need only text content from direct descendants. Seems to me like this would probably mean a flawed XML schema... When it's needed, you can just easily write children.select(&.text?).join("", &.to_s) directly in your code and don't need a general library method for that.

Rinkana · 2017-02-15T21:08:06Z

@straight-shoota:
We were talking about it on IRC and the name popped up.

Also, it is not true that no XML parser does this. The Ruby XML parser does this by default for the .text method on a node. That's how I came across it: I was migrating Ruby code to Crystal and stumbled upon this difference in results.

asterite · 2017-02-15T21:20:36Z

@Rinkana What's this Ruby library that has this normalized_text method?

asterite · 2017-02-15T21:22:31Z

By the way:

$ irb
irb(main):001:0> require "nokogiri"
=> true
irb(main):002:0> Nokogiri::XML.parse("<foo>text<bar>another</bar></foo>")
=> #<Nokogiri::XML::Document:0x3ffed2826b60 name="document" children=[#<Nokogiri::XML::Element:0x3ffed28267a0 name="foo" children=[#<Nokogiri::XML::Text:0x3ffed28265c0 "text">, #<Nokogiri::XML::Element:0x3ffed28264d0 name="bar" children=[#<Nokogiri::XML::Text:0x3ffed28262f0 "another">]>]>]>
irb(main):003:0> xml = _
=> #<Nokogiri::XML::Document:0x3ffed2826b60 name="document" children=[#<Nokogiri::XML::Element:0x3ffed28267a0 name="foo" children=[#<Nokogiri::XML::Text:0x3ffed28265c0 "text">, #<Nokogiri::XML::Element:0x3ffed28264d0 name="bar" children=[#<Nokogiri::XML::Text:0x3ffed28262f0 "another">]>]>]>
irb(main):004:0> xml.text
=> "textanother"
irb(main):005:0> xml.normalized_text
NoMethodError: undefined method `normalized_text' for #<Nokogiri::XML::Document:0x007ffda504d6c0>

I also checked Nokogiri's source code: the content method (and also text, which is an alias) invokes xmlNodeGetContent, just like we do.

straight-shoota · 2017-02-15T21:24:03Z

@Rinkana Do you mean Nokogiri::XML::Node#content? That's based on xmlNodeGetContent from libxml2 and returns text content from the entire sub-tree. XML::Node#content in Crystal does exactly the same.

Rinkana · 2017-02-15T21:43:55Z

@asterite I've checked. It's REXML that does this:

require "rexml/document"
include REXML

string = <<EOF
  <foo>text<bar>another</bar></foo>
EOF

doc = Document.new string
REXML::XPath.first(doc, "/foo").text # => "text"

straight-shoota · 2017-02-15T21:58:43Z

According to the REXML API, REXML::Element#text returns the content of only the first child text node. For whatever reason you would need that... 🤔
So this only returns foo:

REXML::Document.new('<foo>foo<bar>bar</bar>baz</foo>').root.text

That's something different from what you are proposing with normalized_text, which would return foobaz. And I still don't see why a general-purpose XML parser should provide such a method.
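For reference, the three behaviours compared in this thread can be sketched in Ruby with REXML. This is only an illustration: the helper names direct_text and subtree_text are hypothetical and not part of REXML, Nokogiri, or Crystal. direct_text approximates what the proposed method would return, and subtree_text approximates what an xmlNodeGetContent-based #content returns.

```ruby
require "rexml/document"

# Hypothetical helper: joins only the text nodes that are direct children
# of the element (the behaviour proposed in this PR).
def direct_text(element)
  element.texts.map(&:value).join
end

# Hypothetical helper: joins every text node in the element's subtree,
# similar to what a libxml2-based #content returns.
def subtree_text(element)
  element.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then subtree_text(child)
    else ""
    end
  end.join
end

root = REXML::Document.new("<foo>foo<bar>bar</bar>baz</foo>").root

root.text          # => "foo"        (REXML: first direct text node only)
direct_text(root)  # => "foobaz"     (all direct text nodes)
subtree_text(root) # => "foobarbaz"  (whole subtree)
```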

Rinkana · 2017-02-15T22:06:34Z

I can give you an example of a use case where it can be useful. I'm using it to parse the OpenGL specs file (gl.xml).

In this file XML types are defined this way:

<types>
  <type>typedef unsigned int <name>GLenum</name>;</type>
  <type>typedef double <name>GLdouble</name>;</type>
</types>

In this use case I just need the typedefs without the nested name, because the names need to be parsed separately.

This is also how I came across it. But this is just a proposal; if you feel that this function would be better suited with another name, I'd gladly change it.
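As a rough sketch of that use case in Ruby with REXML, using only the two type entries quoted above (the real gl.xml is much larger and may contain other shapes). The constant GL_SAMPLE and the helper split_typedefs are hypothetical names introduced for illustration.

```ruby
require "rexml/document"

# Sample copied from the snippet above; not the full gl.xml.
GL_SAMPLE = <<~XML
  <types>
    <type>typedef unsigned int <name>GLenum</name>;</type>
    <type>typedef double <name>GLdouble</name>;</type>
  </types>
XML

# Hypothetical helper: returns [declaration_without_name, name] pairs.
def split_typedefs(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "/types/type").map do |type|
    # Join only the direct text nodes, skipping the nested <name> element.
    declaration = type.texts.map(&:value).join
    name = type.elements["name"].text
    [declaration, name]
  end
end

split_typedefs(GL_SAMPLE).first # => ["typedef unsigned int ;", "GLenum"]
```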

straight-shoota · 2017-02-15T23:39:10Z

Well in that case you could just go with type_elem.children.first.content.

Rinkana · 2017-02-16T07:59:39Z

Well yeah, you are correct about this case. However, that's not the point that I want to address.

Crystal's XML and Ruby's REXML .text method can produce different results.
To help avoid issues that come up when migrating Ruby code to Crystal, this provides a method that can reproduce Ruby's result. This way the difference is easy to spot and easy to fix.

straight-shoota · 2017-02-16T08:41:40Z

Yeah, they do. But I don't see why anyone would want it to return what REXML does. That doesn't make any sense and can be accomplished in a straightforward way. The #text method should be expected to return all text content inside the node. That's exactly what Nokogiri does, the de facto standard for XML parsing in Ruby. So if you want compatibility with Ruby APIs, it's much more sensible to relate to Nokogiri.

spalladino · 2017-02-16T17:16:01Z

Have to agree with @straight-shoota here. The normalized_text method does not match any other API (since REXML returns just the text for the first node), and it seems better to be manually implemented (especially taking into account that the implementation is quite straightforward) than to pollute the XML API with a method with a fairly rare use case.

Still, thank you for the contribution @Rinkana, and please feel free to comment if you think of another use case or argument of why this particular implementation should be included.

Rinkana added 2 commits February 11, 2017 19:06

Add XML::Node.normalized_text

24e4560

Added spec for XML::Node.normalized_text

64763e0

spalladino closed this Feb 16, 2017
