Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds digits method to the Int type #5598

Closed
wants to merge 7 commits into from
Closed

Adds digits method to the Int type #5598

wants to merge 7 commits into from

Conversation

FrankKair
Copy link

Hi, this is my first pull request for a programming language.
I implemented the digits method for the Int type, like Ruby did in its 2.4 release.

@straight-shoota
Copy link
Member

straight-shoota commented Jan 16, 2018

Hey @FrankKair, thanks for participating in making Crystal better!
What is the use case for this method, why is it needed? "because Ruby has it" is unlikely going to convince somebody to include this in the standard library.

Besides the general question, the implementation using to_s is by far not optimal. It is conceptually simple, but realistically it is a very slow algorithm and unnecessarily allocates a string just to throw it away.

@woodruffw
Copy link
Contributor

What is the use case for this method, why is it needed?

Personally, I've used Integer#digits for checksum and check-digit algorithms in Ruby. It also comes in handy for code golfing/exercises, although that's less of a real-world use 🙂


Aside from performance issues, there are some differences with Ruby's Integer#digits that should be addressed:

  • -123.digits raises an error in Ruby, while this returns [1, 2, 3]. IMO, it makes sense to preserve that behavior here (i.e., raise an error instead of silently trimming the negative sign).
  • 123.digits returns [3, 2, 1] in Ruby, while the current implementation returns [1, 2, 3] (MSD order). I think MSD order makes more sense, but a more efficient implementation will probably be naturally LSD.

@FrankKair
Copy link
Author

FrankKair commented Jan 17, 2018

(First of all, thank you for the comments!)

Hey @straight-shoota, thanks for pointing that out, I probably should've been a little more specific and explain why I think it's a nice addition to the standard library.

So, as @woodruffw just said, it's interesting for checksum, digits manipulation and some algorithms (and exercises/coding challenges, why not).

I thought it was so nice when Ruby implemented this method that I wanted Crystal to have it too. Otherwise, people would have to type self.to_s.gsub('-', "").each_char.map(&.to_i).to_a themselves, and I think that's something the Int type could take care of.

And, as you just said, this is rather inefficient. So maybe that's a good thing, instead of letting people write inefficient code (as I just did), we could provide them a clean, performant API to work with.

@FrankKair
Copy link
Author

FrankKair commented Jan 17, 2018

I was trying to think of a better idea for this method and I've come across this Haskell implementation:

digs :: Integral x => x -> [x]
digs 0 = []
digs x = digs (x `div` 10) ++ [x `mod` 10]

And I also saw the Elixir implementation:

  def digits(integer, base \\ 10)
      when is_integer(integer) and is_integer(base) and base >= 2 do
    do_digits(integer, base, [])
  end

  defp do_digits(integer, base, acc) when abs(integer) < base, do: [integer | acc]

  defp do_digits(integer, base, acc),
    do: do_digits(div(integer, base), base, [rem(integer, base) | acc])

This kind of approach would be more efficient, I suppose.
What do you guys think?

@woodruffw I also think that the normal order makes more sense, because it's what most people expect.
As for the negative / error raising, I really don't know. At first I just thought it was an edge case I should deal with. Whatever you guys say/prefer, I still don't have a position on this one.

@FrankKair
Copy link
Author

FrankKair commented Jan 17, 2018

What about this?

def digits
  quotient = self.abs / 10
  remainder = self.abs % 10
  if quotient == 0
    return [remainder]
  end
  digits(quotient) << remainder
end

I'm using self.abs so the method still works with negative numbers.
But well, as I said previously, I still don't have a position on this one.

CPU wise, I tested the method with BigInts with 30k digits and the method was running smoothly.
Memory wise, I don't know, but I guess it's better than allocating strings, as @straight-shoota pointed out.

@woodruffw
Copy link
Contributor

That implementation looks like a solid improvement to me. You might want to compare it to an iterative version, though -- my suspicion is that LLVM can optimize a while loop better than it can a non-tail-recursive function.

@luislavena
Copy link
Contributor

Hello!

Thank you for your interest and contribution to Crystal!. If the reason to bring this to Crystal is because Ruby has it, then the implementation needs to be as close as possible to the original implementation.

This means that 123.digits on both Crystal and Ruby should have similar results, which this implementation does not produce them.

Also, as pointed out:

  • to_s will allocate a new String in memory
  • gsub will result in a new string allocated
  • map(...) converts each character into integer
  • to_a allocates a new array and copies each of the elements of the iteration in.

Doing a quick benchmark:

require "benchmark"

struct Int
  def digits
    self.to_s.gsub('-', "").each_char.map(&.to_i).to_a
  end
end

Benchmark.ips do |x|
  x.report "123.digits (PR)" { 123.digits }
  x.report "-123.digits (PR)" { -123.digits }
end

Results in the following:

 123.digits (PR)   2.44M ( 410.6ns) (± 6.28%)  163 B/op        fastest
-123.digits (PR)   1.45M (687.87ns) (± 5.10%)  289 B/op   1.68× slower

163 bytes allocated per operation. Worse case scenario, dealing with negative values, results in 289 bytes.

There are other differences:

  • Ruby's implementation accepts a different base other than 10
  • Your code uses to_s, which doesn't work when dealing with base 16 representation (fails to convert string back to integer)

A possible alternate implementation might look like this:

struct Int
  def alt_digits(base = 10)
    arr = [] of typeof(self)
    num = self

    while num != 0
      arr << num.remainder(base).abs
      num = num.tdiv(base)
    end

    arr
  end
end

Which shows the following on the benchmarks:

 123.digits (PR)   2.63M (379.52ns) (± 6.92%)  163 B/op   4.12× slower
123.digits (alt)  10.85M ( 92.15ns) (±13.23%)   48 B/op        fastest

Faster iterations and less allocations. However, the naive implementation shown above doesn't deal with negative numbers, so that requires to be addressed if the objective is provide an implementation that is similar to Ruby (and if that is desired).

Cheers.

@FrankKair
Copy link
Author

FrankKair commented Jan 17, 2018

Hi, @luislavena, thank you for the in-depth feedback!

I do believe that digits is a nice addition to the std lib regardless of Ruby (I just mentioned it because I thought people would remember that it's a nice function).

Having said that, I think that the method should return the numbers in order, because it's more natural. It's not intuitive to ask for digits and receive them from lsd to msd. At least for me 😅

The first implementation with to_s was naive, the second one (which I draw inspiration from the Haskell and Elixir snippets I posted) seems more mature.

Here is my benchmark:

require "benchmark"

Benchmark.ips do |x|
  x.report "digits_1" { 12312312312312312312.digits_1 }
  x.report "digits_2" { 12312312312312312312.digits_2 }
  x.report "digits_3" { 12312312312312312312.digits_3 }
end

struct Int
  def digits_1
    self.abs.to_s.each_char.map(&.to_i).to_a
  end

  def digits_2
    quotient = self.abs.tdiv(10)
    remainder = self.abs.remainder(10)
    if quotient == 0
      return [remainder]
    end
    quotient.digits_2 << remainder
  end

  def digits_3(base = 10)
    arr = [] of typeof(self)
    num = self

    while num != 0
      arr << num.remainder(base).abs
      num = num.tdiv(base)
    end

    arr
  end
end
digits_1   1.66M (604.05ns) (± 2.48%)  1.59× slower
digits_2   1.68M (594.63ns) (± 3.81%)  1.56× slower
digits_3   2.63M (380.09ns) (± 3.96%)       fastest

At least for me, I do prefer to return the digits in order, so I tried adding reverse to your implementation (just returning arr.reverse).

digits_1    1.6M (626.85ns) (± 8.30%)  1.05× slower
digits_2   1.68M (595.21ns) (± 3.99%)       fastest
digits_3   1.67M (598.71ns) (± 4.06%)  1.01× slower

That way, the numbers changed a bit.
(These numbers are getting pretty close to each other with these implementations).

What do you guys think should be the proper behaviour and which implementation should remain?

@RX14
Copy link
Contributor

RX14 commented Jan 17, 2018

How is this useful for checksums and check-digits? Besides, if those two (pretty rare) usecases are the only usecases is it worth having in the stdlib? I can't say i've ever wanted a method like this.

@woodruffw
Copy link
Contributor

How is this useful for checksums and check-digits?

It's pretty useful, at least in my experience. One of the common check-digit strategies is "perform function f on every kth digit and function g on every other digit, sum the results, and compare to the last digit," and a digits method makes that a bit easier.

Of course, like you said, that's not a common use case. It's also (now that I think about it) a bit of an abuse of the Integer type, since "numbers" that require checksums and check-digits should really be represented as strings or discrete types. However, just like phone numbers, that won't stop people from using numeric types to store them.


On an unrelated note, if this gets added, I'm in agreement with @FrankKair that MSD order is preferable (and thereby break with Ruby's behavior in that aspect).

@asterite
Copy link
Member

All of the use cases you mention don't need to allocate memory. Just keep diving by ten and you get the digits. If no one can provide a real (as in "here's a real project I used it in") I don't think this should be go in the std.

@RX14
Copy link
Contributor

RX14 commented Jan 17, 2018

At the end of the day, you're still allocating an array for this method which makes calculating a checksum suboptimal. Perhaps it'd be better to put these checksum calculations inside the shards that require them.

If we accept this though, it should accept a base.

@FrankKair
Copy link
Author

FrankKair commented Jan 17, 2018

I refactored the code to this, following @RX14's idea of accepting a base:

def digits(base = 10)
  quotient = self.abs.tdiv(base)
  remainder = self.abs.remainder(base)
  if quotient == 0
    return [remainder]
  end
  quotient.digits << remainder
end

This is my first PR in a programming language, and even if it doesn't get added to the std lib, I'm quite grateful and I've learned a lot. It was very nice interacting with much more knowledgeable people, receiving feedback and discussing aspects that I didn't even consider in the first place.

Well, I still think it's nice and useful (and of course it would be awesome to contribute to the project!), but ultimately this is up to you all.

Cheers! 💎

@lbguilherme
Copy link
Contributor

At least for me, I do prefer to return the digits in order, so I tried adding reverse to your implementation (just returning arr.reverse).

digits_1    1.6M (626.85ns) (± 8.30%)  1.05× slower
digits_2   1.68M (595.21ns) (± 3.99%)       fastest
digits_3   1.67M (598.71ns) (± 4.06%)  1.01× slower

Actually calling arr.reverse will copy the array then reverse the copy and return it. Better to reverse in-place:

  def digits_3(base = 10)
    arr = [] of typeof(self)
    num = self

    while num != 0
      arr << num.remainder(base).abs
      num = num.tdiv(base)
    end

    arr.reverse!

    arr
  end

This leads to:

digits_1   1.65M (606.38ns) (±11.00%)  1.43× slower
digits_2   1.68M (593.69ns) (± 4.77%)  1.40× slower
digits_3   2.37M (422.78ns) (±11.48%)       fastest

I can go one step further by pre-allocating the array capacity to the right size before everything to avoid dynamic resizing:

  def digits_4(base = 10)
    arr = Array(typeof(self)).new(Math.log10(self).ceil.to_i)
    num = self

    while num != 0
      arr << num.remainder(base).abs
      num = num.tdiv(base)
    end

    arr.reverse!

    arr
  end

Benchmark:

digits_1   1.71M (585.69ns) (± 4.10%)  2.00× slower
digits_2   1.55M (645.22ns) (±12.82%)  2.20× slower
digits_3   2.49M (402.04ns) (± 7.79%)  1.37× slower
digits_4   3.41M (293.01ns) (± 6.93%)       fastest

About the topic of whether or not it should be accepted into the standard library, I also question where should method could be useful outside some code golf scenario. And if it is indeed useful, its probably better to have an each_digit method instead.

@straight-shoota
Copy link
Member

@FrankKair If you like to proceed contributing to the language, you might take a look at issues suited for newcomers.

@FrankKair
Copy link
Author

FrankKair commented Jan 20, 2018

Think that way, code golfing such as Project Euler is nice to provide comparison and benchmarks between languages. That was my first contact with Crystal, because I wanted to see how faster it could be than the other languages I used so far (such as Ruby, Python, Swift, Elixir...). I have a friend who also had his first contact with the language by trying to test its raw power.

I may be wrong, but I don't think most people start experimenting with a language trying to build web frameworks or complicated CLIs, but rather trying to compare its features with other languages previously used. So I still think it's a good thing to have a nice performant digits method on the std lib, such as this one because then people wouldn't have the bad idea I had in the beginning of this pull request (using to_s and the like), and thus they would be happy about the performance of the language.

On the Elixir digits pull request, the author of the PR said:

"While this is a pretty basic function, the way that 9/10 programmers tend to implement it is very inefficient and error prone, and it is in common usage. Thus, I would suggest that having a proper and efficient implementation in the standard lib is desirable."

And then someone else said:

"I like this! Having used the (non-standard) Haskell library Data.Digits for different tasks both yesterday and today, I can really see the benefit."

Not only that, but some languages implemented this method in a questionable way, such as Ruby. The to_s.each_char.map method is faster the std lib's digits, which is really counter intuitive. I posted a question about this on SO: Ruby digits performance. And here we seem to have a nice implementation.

So yeah, I still think it's a nice addition to the std lib.

@asterite
Copy link
Member

I guess we can add this, it does no harm. It needs to be documented, and digits should probably always be an Int32 array. Plus we need to think about what to do with negative numbers.

@Sija
Copy link
Contributor

Sija commented Jan 20, 2018

I guess we can add this, it does no harm. It needs to be documented, and digits should probably always be an Int32 array.

Why not UInt32 or even UInt8?

Plus we need to think about what to do with negative numbers.

From my understanding -123 is still composed of the same digits as 123, what's different is the #sign.

src/int.cr Outdated Show resolved Hide resolved
@FrankKair
Copy link
Author

FrankKair commented Jan 20, 2018

@asterite The method is always returning Array(Int32), which is fine by me. Do you think it's better to put this on the method's signature?

"From my understanding -123 is still composed of the same digits as 123"
I agree with that one. At least for me, it makes sense to return 123.

src/int.cr Outdated Show resolved Hide resolved
@asterite
Copy link
Member

Digits only go from 0 to 9. Int32 is the most common int type. Digits are unrelated to the number. At least that's my thinking. For example, multiply all digits then multiply by some other number, with Int8 chance of overflow. But maybe returning the same type is ok.

I wish we had a universal integer number type, it's so messy with so many integer types.

src/int.cr Outdated Show resolved Hide resolved
@FrankKair
Copy link
Author

@asterite @Sija @RX14 Any other thoughts?

@Sija
Copy link
Contributor

Sija commented Mar 28, 2018

@FrankKair LGTM

@RX14
Copy link
Contributor

RX14 commented Mar 28, 2018

I still don't see much use in this method, if two other people want to approve it go ahead but I won't.

Copy link
Contributor

@woodruffw woodruffw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a member, but I'm 👍 on adding a digits method 😄

Edit: Reiterating my desire for this in 2019.

@FrankKair
Copy link
Author

Two approvals, that's nice! 😊

FrankKair wants to merge 7 commits into crystal-lang:master from unknown repository

This pull request is open for over a year now, so I deleted my repo, thinking that it wouldn't go anywhere.

Will GitHub be able to merge it or should I fork the repo again? 😅

@straight-shoota
Copy link
Member

It's only one approval from a core member, and as far as I see it, most others are rather skeptical of this. I don't see this as a useful addition and thus won't approve it.

In any case, this needs a rebase to be merged. It might work if you simply restore your fork and the branch.

@straight-shoota
Copy link
Member

Closing this in favour of #8554
Should the RFC be approved we can decide to either reopen this PR or create a new one.

Thank you anyways, @FrankKair ❤️

@FrankKair
Copy link
Author

Thank you very much @straight-shoota, feeling like a visionary now 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants