
Implement parallel is_heap and is_heap_until. #2654

Merged — 10 commits, May 31, 2017

Conversation

taeguk (Member) commented May 28, 2017

Checklist

  • implementation of is_heap.
  • unit tests for is_heap.
  • implementation of is_heap_until.
  • unit tests for is_heap_until.
  • benchmark codes for is_heap.
  • benchmark codes for is_heap_until.

I wrote benchmark code, and I found that my implementation performs better than the std::* version and scales as the number of cores increases.

Benchmark for is_heap_until (fixed number of cores, varying N)

  • The range [0, N) is a heap.
  • Environment: Rostam (8 physical cores / 16 logical cores)
  • Number of cores used: 16
  • Time unit: seconds

| kind \ N  | 100,000 | 1,000,000 | 10,000,000 | 100,000,000 | 1,000,000,000 |
|-----------|---------|-----------|------------|-------------|---------------|
| std       | 0.00019 | 0.00149   | 0.01172    | 0.11785     | 1.02402       |
| seq       | 0.00018 | 0.00140   | 0.01171    | 0.11217     | 1.01263       |
| par       | 0.00053 | 0.00033   | 0.00125    | 0.01657     | 0.16828       |
| par_unseq | 0.00014 | 0.00018   | 0.00118    | 0.01629     | 0.16750       |

Benchmark for is_heap_until (fixed N, varying M = number of cores used)

  • The range [0, N) is a heap.
  • N = 100,000,000
  • Environment: Rostam (8 physical cores / 16 logical cores)
  • std::* version: 0.10221
  • Time unit: seconds

| kind \ M | 1       | 2       | 4       | 8       | 12      | 16      | 32      |
|----------|---------|---------|---------|---------|---------|---------|---------|
| par      | 0.10097 | 0.05089 | 0.02685 | 0.01554 | 0.01595 | 0.01636 | 0.04736 |

Fix a compile error about shadowing template parameter.
Fix a compile error about dependent type name.
hkaiser (Member) commented May 28, 2017

> I'm writing an application for measuring and analyzing the performance of is_heap_until. I want to include it in the HPX code base. Where should I add it?

How about you add it here: https://github.com/STEllAR-GROUP/hpx/tree/master/tests/performance/local?

taeguk (Member Author) commented May 29, 2017

@hkaiser I'll add many performance tests this summer. I think it's better to create a new folder and put them there. Which path do you think is good?

  • /tests/performance/local/parallel_algorithms/
  • /tests/performance/parallel_algorithms/local/ (I think this one is better.)
  • etc...

sequential(ExPolicy && policy, RandIter first, RandIter last,
Comp && comp)
{
return std::is_heap_until(first, last, std::forward<Comp>(comp));
Contributor:

I don't know whether it's really important here, but first and last are not perfectly forwarded to is_heap_until.

Member:

Yah, we do it inconsistently throughout the algorithm implementations. Most of the time iterators are passed by value, sometimes not. The Standard prescribes for the algorithms to take iterators by value (on the outermost layer), everything else is up to the implementation I guess.

Member Author:

Iterators are usually cheap to copy, so pass-by-value is faster than pass-by-reference.
And pass-by-reference is dangerous, as in the example below, where the caller's iterators are silently modified:

iterator func(iterator& first, iterator& last, int val) {
    while (first != last) {
        if (*first == val) return first;
        ++first;
    }
    return last;
}
func(first, last, val);  // after this call, the caller's 'first' has moved

So I think pass-by-value is better than perfect forwarding.

hkaiser (Member), May 29, 2017:

> Iterators are usually cheap to copy. So, pass-by-value is faster than pass-by-reference.

While I agree with the first statement (even if the keyword here is usually), I don't see how pass-by-value can be faster than pass-by-reference. Could you elaborate, please?

taeguk (Member Author), May 30, 2017:

@hkaiser The compiler implements pass-by-reference with a pointer in the generated assembly, so accessing the element through the reference incurs a dereference. That's why I think pass-by-value is faster than pass-by-reference.

Member Author:

Can zip_iterator<iterator1, iterator2, ...> be used with parallel algorithms?

Member Author:

@Naios @hkaiser I agree that the size of an iterator can be larger than a pointer. I'd like to hear your thoughts on what we should use for the parallel algorithm function interfaces. Which choice do you think is better?

Member:

As I said above, the Standard mandates for the algorithms to take the iterators by value (at the outermost level). So that's what we should do. Otherwise we have not applied a uniform policy. I think almost everywhere we pass iterators by value, but I'm not sure. In the end we shouldn't be able to measure any difference, so I think you can leave things the way you have it right now.

Once you have resolved the merge conflict with master we can go ahead and merge this PR.

Contributor:

To be honest, my personal opinion is that one should avoid additional copies in API-internal code, since you don't know how large the object the user passes will be. Also, @hkaiser has shown that compound iterators can have a significantly larger size (:eyes: ranges). Hence, call-by-value can have a performance drawback when the copy operation can't be optimized out (type-erased objects).

Member:

> Can zip_iterator<iterator1, iterator2, ...> be used with parallel algorithms?

Yes, and we do so in the implementation of some of the algorithms.

Formally, zip_iterator can be at best an input iterator, as the type returned by dereferencing it is not a true reference. However, we applied a somewhat illegal trick that lets us pretend zip_iterator is more than an input iterator. We know that dereferencing our zip_iterator returns a tuple of stable references, which allows it to be used in most contexts normally requiring better iterators. The only exception to this is sort_by_key, which relies on swapping elements dereferenced by the underlying iterator. For this reason we added this overload for swapping tuples of references: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/util/tuple.hpp#L985-L1001, which allows us to implement sort_by_key on top of sort using a zip_iterator.

if (count <= 1)
return result::get(std::move(last));

RandIter second = first + 1;
Naios (Contributor), May 29, 2017:

Just a minor improvement: maybe you could use std::next over + 1?

Member:

is_heap requires random access iterators, so this should be fine.

difference_type;

return is_heap_until_helper()(
std::forward<ExPolicy>(policy), first, last,
Contributor:

Same for first and second as above

/// That is, the last iterator it for which range [first, it) is a max heap.
///
template <typename ExPolicy, typename RandIter, typename Comp = detail::less,
HPX_CONCEPT_REQUIRES_(
Contributor:

👍

"HPX main exited with non-zero status");

return hpx::util::report_errors();
}
Naios (Contributor), May 29, 2017:

missing newline at eof

returned_from_algorithm = true;
f.get();

HPX_TEST(false);
Naios (Contributor), May 29, 2017:

It's far out of scope for this PR, but it would be nice if there were something like an HPX_TEST_FAIL(); macro, like the one GTest provides.
I just checked, and something like this doesn't seem to exist in the current codebase.

Member:

@Naios feel free to add it, we'd be happy to accept such a PR. Thanks!

@taeguk changed the title from "Implement parallel is_heap_until." to "Implement parallel is_heap and is_heap_until." on May 29, 2017
hkaiser (Member) commented May 29, 2017

> I'll add many performance tests in summer. I think creating new folder and put them into it is better. So which path is good do you think?

/tests/performance/parallel_algorithms/local/ looks good to me.

hkaiser (Member) left a review comment:

Looks mostly very good, thanks!

[table Heap operations (In Header: <hpx/include/parallel_algorithm.hpp>)
[[Name] [Description] [In Header]]
[[ [algoref is_heap] ]
[Returns `true` if the range is max heap.]
Member:

What is a 'max heap' ?

Member Author:

@hkaiser In the standard library, is_heap checks whether the given range is a max heap.
http://en.cppreference.com/w/cpp/algorithm/is_heap

Member:

You might want to add a link to cppreference.com as this has an explanation for a max heap (see also: 47e122e). BTW, the actual Standard does not talk about max heaps, see: http://eel.is/c++draft/alg.heap.operations.

Member Author:

Hmm... But I think using the word "max heap" is right. With std::less as the predicate, the biggest element is the root of the heap.

Member:

Sure, just please add a link to cppreference.com like I did for the other algorithms (47e122e).


typedef execution::is_sequential_execution_policy<ExPolicy> is_seq;

return detail::is_heap<RandIter>().call(
Member:

Is there a benefit of explicitly implementing is_heap instead of simply relying on is_heap_until:

 bool is_heap(first, last) { return is_heap_until(first, last) == last; }

?

Member Author:

There is overhead in doing it that way.
is_heap: https://github.com/taeguk/hpx/blob/176f86e11388b38dcc6afe9516b099246a7b6505/hpx/parallel/algorithms/is_heap.hpp#L77
is_heap_until: https://github.com/taeguk/hpx/blob/176f86e11388b38dcc6afe9516b099246a7b6505/hpx/parallel/algorithms/is_heap.hpp#L227
In the case of is_heap, the computation can be cancelled earlier than for is_heap_until.

Of course, this duplicates a lot of code.
I did try to use a single helper with no overhead instead of two (is_heap_helper and is_heap_until_helper), but I failed: because util::partitioner returns algorithm_result<ExPolicy, R>, I can't determine whether to use the result directly or to call result.get().
https://github.com/taeguk/hpx/blob/176f86e11388b38dcc6afe9516b099246a7b6505/hpx/parallel/algorithms/is_heap.hpp#L216
I could do it with template metaprogramming and enable_if, but that's complex, and I don't think it would be good.

Member:

> In the case of is_heap, the computation can be cancelled earlier than for is_heap_until.

You should be able to break early for is_heap_until as well.

Member Author:

In the case of is_heap_until, I can't cancel a partition that checks indices lower than the index of the element that breaks the heap; I can only cancel the partitions checking higher indices.
But in the case of is_heap, when any partition finds an element that breaks the heap, I can cancel all partitions.

Member:

Right, makes sense.

{
if (this->name < t.name)
return true;
else if (this->name < t.name)
Member:

This does not make any sense.

Member Author:

I'll fix it.

auto result = hpx::parallel::is_heap(policy,
iterator(boost::begin(c)), iterator(boost::end(c)), pred);
auto solution = std::is_heap(std::begin(c), std::end(c), pred);

Member:

I'd suggest to use bool here instead of auto. Same amount of typing, better readability...

Member Author:

I'll fix it.

iterator(boost::begin(c)), iterator(boost::end(c)));
auto result = f.get();
auto solution = std::is_heap(std::begin(c), std::end(c));

Member:

Same here. Let's make those return types explicit.

Member Author:

I'll fix it.

… mistakes in them.

Following reviews from @hkaiser, refactor and fix code related to parallel is_heap and is_heap_until.
base_idx, it, part_size, tok,
[&tok, first, &comp](type const& v, std::size_t i)
{
if (comp(*(first + i / 2), v))
Member Author:

@hkaiser @Naios What do you think about the overhead of dereferencing comp?
In line 73, comp is passed by reference, and the dereference is executed many times, so I think this is worth considering.
I think passing comp by value would be better. What do you think?
This issue also applies to find_if (https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/parallel/algorithms/find.hpp#L208).

biddisco (Contributor):

Since the comparison function is a function object, passing by reference here is fine. Passing by value requires a copy operation on the FO which will be more costly - in cases like this, where a lambda captures the FO (by reference) the compiler only has to call the function at the address supplied - it does not have to 'dereference' the pointer every time, so this is a zero cost reference capture in effect.

Member Author:

Am I obsessing over something too small, just one memory access per loop iteration?

biddisco (Contributor):

Possibly. Ultimately, you should time the algorithm, use a profiler if possible, then make changes and measure again. If the dereferencing is a problem, you'll see a difference; if not, don't worry about it and move on to bigger things. Usually in cases like this the algorithm itself can be O(N), O(N^2), O(log N), etc., and tweaking the way the algorithm works will make a much bigger impact than the micro-optimizations.

There will be cases where passing by reference/copy makes a difference, and it's good to keep an eye open for them, so don't stop looking (but time things if you can to confirm your hypotheses).

taeguk (Member Author), May 31, 2017:

> it does not have to 'dereference' the pointer every time, so this is a zero cost reference capture in effect.

@biddisco Did you mean that the compiler optimizes such dereferences away?

biddisco (Contributor):

If the function object has no 'state', the only thing that matters is the call operator, which is just a function pointer: an address. So when you pass the function object by reference, all the compiler has to do is call the function at that address. So yes, I suspect that when you have a function object with no member data, the compiler really ought to be smart enough to know that there is no need to pass a reference to the object itself, but rather just the function. However, as I write this down, I realize that it probably isn't true and there might well be a dereference going on...
I think I need to try a small example with an assembler to be sure...

But passing a function object by value won't be any better, because you'll have to create a copy on the stack and then call the function call operator, which will be the same pointer (after all, the call operator will be a static function stored somewhere).

Maybe the call operator will be inlined anyway for a simple lambda... I need to research what's going on before I write more!

biddisco (Contributor):

Hold on. The call operator is just a function address, and when it is called, it requires the args and a 'this' pointer (to the lambda in this case). So passing by reference just requires the this pointer (the reference) and the function call. Passing by value requires a copy and then a call. So reference wins (even if the copy is made only once at the start of a loop, for example), because both require a function call and a this pointer; and if the object has no state, the this pointer isn't even needed, so the compiler will probably drop it.

taeguk (Member Author), May 31, 2017:

@biddisco I've got it! My earlier thinking was wrong.
The dereference will not occur even if the FO is passed by reference, because a function call needs a pointer anyway.
Obviously, pass-by-reference is better than pass-by-value in this case.

taeguk (Member Author) commented May 31, 2017

@hkaiser Finally, this is ready to be merged!!!

hkaiser merged commit ac340b2 into STEllAR-GROUP:master on May 31, 2017
hkaiser mentioned this pull request on May 31, 2017 (47 tasks)
taeguk (Member Author) commented Jun 5, 2017

Just a note: this implementation's performance seems to be limited by memory bandwidth.

hkaiser (Member) commented Jun 5, 2017

> Just a note: this implementation's performance seems to be limited by memory bandwidth.

Is this caused by the algorithm itself or by the implementation?

taeguk (Member Author) commented Jun 6, 2017

@hkaiser
As an algorithm, is_heap just iterates a loop and performs comparisons. There is not much computation, but there are a lot of memory accesses, so performance may be limited by memory bandwidth.
I think utilizing the cache better is one way to get better performance.

if (comp(*(first + i / 2), v))

In the code above, accessing the parent node is bad for the cache, but I have no solution to that problem while keeping the current algorithm for is_heap.
If anyone wants better performance from is_heap, they may have to find a different algorithm.
This is a note for someone in the future.

hkaiser (Member) commented Jun 6, 2017

@taeguk, I think we can leave the code as it is for now as it is functionally correct and reasonably fast.
