Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented #sorted_insert for Array using binary search. #5153

Closed
wants to merge 1 commit into from
Closed

Implemented #sorted_insert for Array using binary search. #5153

wants to merge 1 commit into from

Conversation

Groogy
Copy link

@Groogy Groogy commented Oct 20, 2017

I noticed there's no nice sorted_insert method for Array so I decided to make one. It's fairly simple and it uses the bsearch_index method to find where to put the provided object.

This is my first pull request so let me know if I missed anything or if you want me to add something more to the feature.

src/array.cr Outdated Show resolved Hide resolved
src/array.cr Outdated Show resolved Hide resolved
src/array.cr Outdated Show resolved Hide resolved
@Groogy
Copy link
Author

Groogy commented Oct 20, 2017

Fixed the things @RX14 wanted fixed + another spec on an empty array just in case

@asterite
Copy link
Member

Thank you, but what's the use case of this method?

@RX14
Copy link
Contributor

RX14 commented Oct 20, 2017

@asterite you have a pre-sorted array and want to insert something into the array in a manner to keep it sorted but it's much faster than just iterating and inserting.

Personally i've needed it to implement a scheduler, where I had a queue of about 150,000 work items which needed to be kept ordered by computed next run timestamp, but insert then sort was way way too slow.

@Groogy
Copy link
Author

Groogy commented Oct 20, 2017

What @RX14 said.
I have a cache which will grow to quite a large size and to not keep it sorted and search through the array is not really an option. What the method implements can be done manually but this just looks better and in my opinion makes the resulting code a bit more "crystal-esque" if that makes sense?

arr.sorted_insert(5)

looks better than

index = arr.bsearch_index(5)`
if index
  arr.insert(index, 5)
else
  arr.push(5)
end

It's convenient and something you need when you start working with really large data sets and you need to find specific elements inside it.

@asterite
Copy link
Member

Mmm... what if I to insert an item in a sorted fashion, but not insert it if it's there already?

I'm not convinced that this method is needed. You can just write those 3 lines, it's probably not something that you'll have to write all the time. Plus with that method in place one might be tempted to insert multiple values at once to have a sorted array, when it's faster to insert multiple items and then do a single sort.

Additionally, Ruby doesn't have this method. And I don't know if other languages provide a method such as this.

@asterite
Copy link
Member

And finally, if you need to keep an array sorted, it's probably better to use another data structure like a heap.

@Groogy
Copy link
Author

Groogy commented Oct 20, 2017

Our library at work provides it since it is a very common thing for us that we need. The operation is of logarithmic complexity on insertion which isn't that bad. If you don't know about a big set of data forehand to fill the array with, then you aren't losing that much. Primary use case is when collection grows over time compared to when you have a pre-defined set of data to process.

On using a Heap data structure, the most efficient implementation of a heap is done using adaptation of this on top of an array.

e: On complexity and speed, quicksort worst-case is O(n2) and average O(nlog(n)) while binary search insertion should be O(nlog(n)) in both if you also count in moving elements. So on large enough random data where the quicksort might get unlucky the insertion should technically be faster?(tested, result: nope, it's the reverse. If working with a chunk of data at once, this will excel on lower numbers) If you don't count the allocation overhead of course.

@RX14
Copy link
Contributor

RX14 commented Oct 20, 2017

@asterite there are plenty of methods in Array which are "just a few lines". The point is that because this is named sorted_insert you know at a glance what the operation is doing, instead of having to work it out from the implementation. I really don't think that this usecase is that rare.

@konovod
Copy link
Contributor

konovod commented Oct 20, 2017

I would prefer to have an always ordered data structure, with insert = sorted_insert, find = bsearch_index, so I can't accidentially break the contract by using << on it. On the other hand, such data structure isn't hard to write and stdlib can live without it.

@funny-falcon
Copy link
Contributor

funny-falcon commented Oct 21, 2017

@RX14

Personally i've needed it to implement a scheduler, where I had a queue of about 150,000 work items which needed to be kept ordered by computed next run timestamp,

With 150,000 items sorted insertion could be also slow, if timestamp is a bit random (ie must be inserted close to beginning). And without optimised shift it will be a way slow to remove 'next to run'.
Such queue unavoidably must be implemented with heap. What about adding "#heapify" and "#heappop" methods? Python has great heapq module

For binary_insert to be relatively fast up to several thousand elements, there should be more progress in unification of array and Deque (#5148) :

  • insert_at should detect insertion is done near beginning of array, and then allocate room before @buffer, thus following insertion near beginning could be done by moving small number of elements to left, instead of moving almost whole array to right.

I used such technique once when I didn't what to use binary search tree, and I was sure array will not be larger than several dozen thousands of pointers.

@yxhuvud
Copy link
Contributor

yxhuvud commented Oct 21, 2017

Personally i've needed it to implement a scheduler, where I had a queue of about 150,000 work items which needed to be kept ordered by computed next run timestamp, but insert then sort was way way too slow.

The typical way to implement a scheduler is to use a priority queue and while a sorted array can substitute for one, you are very likely to get better performance by using other data structures.

@Groogy You get O(n) cost for insertion this way which is waaay slow comparatively speaking. Yes, you find the spot in O(logn), but moving the entries out of place cost O(n). Sure, it is a somewhat fast O(n) due to memmove being fast, but it is still bad.

@RX14: See for example https://github.com/yxhuvud/pairing_heap/blob/master/src/priority_queue.cr for one that seems to have somewhat decent performance - you get O(1) for insertion and O(logn) for deletion of the topmost entry. Which is about as good as you get.

Can we please not add convenience methods that give us bad performance?

@jhass
Copy link
Member

jhass commented Oct 30, 2019

It seems nobody feels strong enough to about this to push it further. The usecases this addresses seem good for a shard, providing usecase optimized data structures and/or usecase optimized extensions to stdlib data structures. Then when that shard proliferates we can revisit which parts of it might make sense for stdlib.

@jhass jhass closed this Oct 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants