New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ROCm to 3.7.0 #95489
Update ROCm to 3.7.0 #95489
Conversation
This only bumps the tag. But the tag points at the same commit, so no sha256 changes.
rocm-runtime-ext is not needed anymore for OpenCL image support.
Wow, you made this upgrade fast, and made overall improvements while doing it! The integration of image support isn't even mentioned in the ROCm 3.7 release notes, which is sort of shocking given how long it's been sparking Issues on their repositories (an example that remains open). Everything looks good to me, but I haven't yet ported the diffs to the overlay to push it through CI there. |
Thanks, that was a fast update! Two things I noticed: I tried to run the same OpenCL samples as last time, though it seems like they hang my GPU now ( |
Odd, I haven't bumped into that problem.
Is the loader
Which sample hangs? I'd like to try out if it is fine on my RX580. |
I know I wasn’t asked, but I like to try out |
Sample code
Both of them hang for me with ROCm 3.7.
test.cpp: //
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
// A minimalist OpenCL program.
#include <CL/cl.h>
#include <stdio.h>
#define NWITEMS 512
// A simple memset kernel
const char *source =
"kernel void memset( global uint *dst ) \n"
"{ \n"
" dst[get_global_id(0)] = get_global_id(0); \n"
"} \n";
int main(int argc, char ** argv)
{
// 1. Get a platform.
cl_platform_id platform;
clGetPlatformIDs( 1, &platform, NULL );
// 2. Find a gpu device.
cl_device_id device;
clGetDeviceIDs( platform,
CL_DEVICE_TYPE_GPU,
1,
&device, NULL);
// 3. Create a context and command queue on that device.
cl_context context = clCreateContext( NULL,
1,
&device,
NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue( context,
device,
0, NULL );
// 4. Perform runtime source compilation, and obtain kernel entry point.
cl_program program = clCreateProgramWithSource( context,
1,
&source,
NULL, NULL );
clBuildProgram( program, 1, &device, NULL, NULL, NULL );
cl_kernel kernel = clCreateKernel( program, "memset", NULL );
// 5. Create a data buffer.
cl_mem buffer = clCreateBuffer( context,
CL_MEM_WRITE_ONLY,
NWITEMS * sizeof(cl_uint),
NULL, NULL );
// 6. Launch the kernel. Let OpenCL pick the local work size.
size_t global_work_size = NWITEMS;
clSetKernelArg(kernel, 0, sizeof(buffer), (void*) &buffer);
clEnqueueNDRangeKernel( queue,
kernel,
1,
NULL,
&global_work_size,
NULL,
0,
NULL, NULL);
clFinish( queue );
// 7. Look at the results via synchronous buffer map.
cl_uint *ptr;
ptr = (cl_uint *) clEnqueueMapBuffer( queue,
buffer,
CL_TRUE,
CL_MAP_READ,
0,
NWITEMS * sizeof(cl_uint),
0, NULL, NULL, NULL );
int i;
for(i=0; i < NWITEMS; i++)
printf("%d %d\n", i, ptr[i]);
return 0;
} test2.cpp: //
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"
#define NDEVS 1
// A parallel min() kernel that works well on CPU and GPU
const char *kernel_source =
" \n"
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable \n"
" \n"
"// 9. The source buffer is accessed as 4-vectors. \n"
"__kernel void minp(__global uint4 *src, \n"
" __global uint *gmin, \n"
" __local uint *lmin, \n"
" __global uint *dbg, \n"
" int nitems, \n"
" uint dev ) \n"
"{ \n"
" \n"
" // 10. Set up global memory access pattern. \n"
" \n"
" uint count = ( nitems / 4 ) / get_global_size(0); \n"
" uint idx = (dev == 0) ? get_global_id(0) * count \n"
" : get_global_id(0); \n"
" uint stride = (dev == 0) ? 1 : get_global_size(0); \n"
" uint pmin = (uint) -1; \n"
" // 11. First, compute private min, for this work-item. \n"
" for( int n=0; n < count; n++, idx += stride ) \n"
" { \n"
" pmin = min( pmin, src[idx].x ); \n"
" pmin = min( pmin, src[idx].y ); \n"
" pmin = min( pmin, src[idx].z ); \n"
" pmin = min( pmin, src[idx].w ); \n"
" } \n"
" \n"
" // 12. Reduce min values inside work-group. \n"
" if( get_local_id(0) == 0 ) \n"
" lmin[0] = (uint) -1; \n"
" barrier( CLK_LOCAL_MEM_FENCE ); \n"
" (void) atom_min( lmin, pmin ); \n"
" barrier( CLK_LOCAL_MEM_FENCE ); \n"
" // Write out to __global. \n"
" if( get_local_id(0) == 0 ) \n"
" gmin[ get_group_id(0) ] = lmin[0]; \n"
" // Dump some debug information. \n"
" if( get_global_id(0) == 0 ) \n"
" { \n"
" dbg[0] = get_num_groups(0); \n"
" dbg[1] = get_global_size(0); \n"
" dbg[2] = count; \n"
" dbg[3] = stride; \n"
" } \n"
"} \n"
" \n"
"// 13. Reduce work-group min values from __global to __global. \n"
"kernel void reduce(__global uint4 *src, \n"
" __global uint *gmin ) \n"
"{ \n"
" (void) atom_min( gmin, gmin[get_global_id(0)] ); \n"
"}; \n";
int main(int argc, char ** argv)
{
cl_platform_id platform;
int dev, nw;
cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_GPU };
cl_uint *src_ptr;
unsigned int num_src_items = 4096*4096;
// 1. quick & dirty MWC random init of source buffer.
// Random seed (portable).
time_t ltime;
time(<ime);
src_ptr = (cl_uint *) malloc( num_src_items * sizeof(cl_uint) );
cl_uint a = (cl_uint) ltime, b = (cl_uint) ltime;
cl_uint min = (cl_uint) -1;
// Do serial computation of min() for result verification.
for( int i=0; i < num_src_items; i++ )
{
src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));
min = src_ptr[i] < min ? src_ptr[i] : min;
}
// Get a platform.
clGetPlatformIDs( 1, &platform, NULL );
// 3. Iterate over devices.
for(dev=0; dev < NDEVS; dev++)
{
cl_device_id device;
cl_context context;
cl_command_queue queue;
cl_program program;
cl_kernel minp;
cl_kernel reduce;
cl_mem src_buf;
cl_mem dst_buf;
cl_mem dbg_buf;
cl_uint *dst_ptr,
*dbg_ptr;
printf("\n%s: ", dev == 0 ? "CPU" : "GPU");
// Find the device.
clGetDeviceIDs( platform,
devs[dev],
1,
&device,
NULL);
// 4. Compute work sizes.
cl_uint compute_units;
size_t global_work_size;
size_t local_work_size;
size_t num_groups;
clGetDeviceInfo( device,
CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_uint),
&compute_units,
NULL);
if( devs[dev] == CL_DEVICE_TYPE_CPU )
{
global_work_size = compute_units * 1; // 1 thread per core
local_work_size = 1;
}
else
{
cl_uint ws = 64;
global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD
printf("Size: %u, %u\n", global_work_size, compute_units);
while( (num_src_items / 4) % global_work_size != 0 )
global_work_size += ws;
local_work_size = ws;
}
num_groups = global_work_size / local_work_size;
// Create a context and command queue on that device.
context = clCreateContext( NULL,
1,
&device,
NULL, NULL, NULL);
queue = clCreateCommandQueue( context,
device,
0,
NULL);
// Minimal error check.
if( queue == NULL )
{
printf("Compute device setup failed\n");
return(-1);
}
// Perform runtime source compilation, and obtain kernel entry point.
program = clCreateProgramWithSource( context,
1,
&kernel_source,
NULL, NULL );
//Tell compiler to dump intermediate .il and .isa GPU files.
cl_int ret = clBuildProgram( program,
1,
&device,
"-save-temps",
NULL, NULL );
// 5. Print compiler error messages
if(ret != CL_SUCCESS)
{
printf("clBuildProgram failed: %d\n", ret);
char buf[0x10000];
clGetProgramBuildInfo( program,
device,
CL_PROGRAM_BUILD_LOG,
0x10000,
buf,
NULL);
printf("\n%s\n", buf);
return(-1);
}
minp = clCreateKernel( program, "minp", NULL );
reduce = clCreateKernel( program, "reduce", NULL );
// Create input, output and debug buffers.
src_buf = clCreateBuffer( context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
num_src_items * sizeof(cl_uint),
src_ptr,
NULL );
dst_buf = clCreateBuffer( context,
CL_MEM_READ_WRITE,
num_groups * sizeof(cl_uint),
NULL, NULL );
dbg_buf = clCreateBuffer( context,
CL_MEM_WRITE_ONLY,
global_work_size * sizeof(cl_uint),
NULL, NULL );
clSetKernelArg(minp, 0, sizeof(void *), (void*) &src_buf);
clSetKernelArg(minp, 1, sizeof(void *), (void*) &dst_buf);
clSetKernelArg(minp, 2, 1*sizeof(cl_uint), (void*) NULL);
clSetKernelArg(minp, 3, sizeof(void *), (void*) &dbg_buf);
clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);
clSetKernelArg(minp, 5, sizeof(dev), (void*) &dev);
clSetKernelArg(reduce, 0, sizeof(void *), (void*) &src_buf);
clSetKernelArg(reduce, 1, sizeof(void *), (void*) &dst_buf);
CPerfCounter t;
t.Reset();
t.Start();
// 6. Main timing loop.
#define NLOOPS 500
cl_event ev;
int nloops = NLOOPS;
while(nloops--)
{
clEnqueueNDRangeKernel( queue,
minp,
1,
NULL,
&global_work_size,
&local_work_size,
0,
NULL,
&ev);
clEnqueueNDRangeKernel( queue,
reduce,
1,
NULL,
&num_groups,
NULL,
1,
&ev,
NULL);
}
clFinish( queue );
t.Stop();
printf("B/W %.2f GB/sec, ", ((float) num_src_items * sizeof(cl_uint) * NLOOPS) / t.GetElapsedTime() / 1e9 );
// 7. Look at the results via synchronous buffer map.
dst_ptr = (cl_uint *) clEnqueueMapBuffer( queue,
dst_buf,
CL_TRUE,
CL_MAP_READ,
0,
num_groups * sizeof(cl_uint),
0,
NULL, NULL, NULL );
dbg_ptr = (cl_uint *) clEnqueueMapBuffer( queue,
dbg_buf,
CL_TRUE,
CL_MAP_READ,
0,
global_work_size * sizeof(cl_uint),
0,
NULL, NULL, NULL );
// 8. Print some debug info.
printf("%d groups, %d threads, count %d, stride %d\n", dbg_ptr[0], dbg_ptr[1], dbg_ptr[2], dbg_ptr[3] );
if( dst_ptr[0] == min )
printf("result correct\n");
else
printf("result INcorrect\n");
}
printf("\n");
return 0;
} Timer.h: /**********************************************************************
Copyright ?012 Advanced Micro Devices, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
?Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
?Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************/
#ifndef _TIMER_H_
#define _TIMER_H_
/**
* \file Timer.h
* \brief A timer class that provides a cross platform timer for use
* in timing code progress with a high degree of accuracy.
*/
#ifdef _WIN32
/**
* \typedef __int64 i64
* \brief Maps the windows 64 bit integer to a uniform name
*/
#if defined(__MINGW64__) || defined(__MINGW32__)
typedef long long i64;
#else
typedef __int64 i64;
#endif
#else
/**
* \typedef long long i64
* \brief Maps the linux 64 bit integer to a uniform name
*/
typedef long long i64;
#endif
/**
* \class CPerfCounter
* \brief Counter that provides a fairly accurate timing mechanism for both
* windows and linux. This timer is used extensively in all the samples.
*/
class CPerfCounter {
public:
/**
* \fn CPerfCounter()
* \brief Constructor for CPerfCounter that initializes the class
*/
CPerfCounter();
/**
* \fn ~CPerfCounter()
* \brief Destructor for CPerfCounter that cleans up the class
*/
~CPerfCounter();
/**
* \fn void Start(void)
* \brief Start the timer
* \sa Stop(), Reset()
*/
void Start(void);
/**
* \fn void Stop(void)
* \brief Stop the timer
* \sa Start(), Reset()
*/
void Stop(void);
/**
* \fn void Reset(void)
* \brief Reset the timer to 0
* \sa Start(), Stop()
*/
void Reset(void);
/**
* \fn double GetElapsedTime(void)
* \return Amount of time that has accumulated between the \a Start()
* and \a Stop() function calls
*/
double GetElapsedTime(void);
private:
i64 _freq;
i64 _clocks;
i64 _start;
};
#endif // _TIMER_H_ Timer.cpp: /**********************************************************************
Copyright ?012 Advanced Micro Devices, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
?Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
?Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************/
#include "Timer.h"
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#include <time.h>
#endif
CPerfCounter::CPerfCounter() : _clocks(0), _start(0)
{
#ifdef _WIN32
QueryPerformanceFrequency((LARGE_INTEGER *)&_freq);
#else
_freq = 1000;
#endif
}
CPerfCounter::~CPerfCounter()
{
// EMPTY!
}
void
CPerfCounter::Start(void)
{
#ifdef _WIN32
QueryPerformanceCounter((LARGE_INTEGER *)&_start);
#else
struct timespec s;
clock_gettime( CLOCK_REALTIME, &s );
_start = (i64)s.tv_sec * 1e9 + (i64)s.tv_nsec;
#endif
}
void
CPerfCounter::Stop(void)
{
i64 n;
#ifdef _WIN32
QueryPerformanceCounter((LARGE_INTEGER *)&n);
#else
struct timespec s;
clock_gettime( CLOCK_REALTIME, &s );
n = (i64)s.tv_sec * 1e9 + (i64)s.tv_nsec;
#endif
n -= _start;
_start = 0;
_clocks += n;
}
void
CPerfCounter::Reset(void)
{
_clocks = 0;
}
double
CPerfCounter::GetElapsedTime(void)
{
#if _WIN32
return (double)_clocks / (double) _freq;
#else
return (double)_clocks / (double) 1e9;
#endif
} I found the problem with clinfo: I was using the clinfo from
|
Thanks for the pointer! Works well. |
Thanks! Both work on RX580.
Ah, good catch. We should probably remove that from |
Added a commit that removes |
I think it's fine, but I can't say I've ever seen the issue reported by @Flakebi, and I do not see it with my ROCm 3.5 install. The reason the derivation existed was due to AMD having some extensions relating to OpenCL. This has been a very confused situation that has caused issues with having a suitable headers for AMD's implementation, but it has never been well-documented. I can keep it i the overlay in case it is an issue. |
@ofborg build rocm-opencl-icd |
I'm porting the patches to the overlay, and I think the |
We currently do not have Are we happy with merging this now, or do we want to wait to see what is happening on the Navi front? Since Navi is not officially supported yet, I think we should probably merge it. Then 3.7.0 will also be in NixOS 20.09. |
I'm fine with you making that call. I have a build failure with |
I made some progress, but |
I’m fine with merging it as it seems to work fine on your system. |
I think the overlay is actually good to go now, too. I haven't updated all of the downstream middleware, but |
Cool! That's great to hear. |
Motivation for this change
Tested with
clinfo
and Darktable.The good news is that this release removes the need for
rocm-runtime-ext
, since image support is added torocm-runtime
. Also restructured dependencies a bit, sincerocm-runtime
now requiresclang
androcm-device-libs
.Things done
sandbox
innix.conf
on non-NixOS linux)nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
./result/bin/
)nix path-info -S
before and after)