Update ROCm to 3.7.0 #95489

danieldk · 2020-08-15T06:08:10Z

Motivation for this change

Tested with clinfo and Darktable.

The good news is that this release removes the need for rocm-runtime-ext, since image support is added to rocm-runtime. Also restructured dependencies a bit, since rocm-runtime now requires clang and rocm-device-libs.

Things done

This only bumps the tag. But the tag points at the same commit, so no sha256 changes.

rocm-runtime-ext is not needed anymore for OpenCL image support.

danieldk · 2020-08-21T06:53:52Z

@acowley @Flakebi this is now ready for review.

acowley · 2020-08-21T07:32:10Z

Wow, you made this upgrade fast, and made overall improvements while doing it! The integration of image support isn't even mentioned in the ROCm 3.7 release notes, which is sort of shocking given how long it's been sparking Issues on their repositories (an example that remains open).

Everything looks good to me, but I haven't yet ported the diffs to the overlay to push it through CI there.

Flakebi · 2020-08-21T09:20:54Z

Thanks, that was a fast update! clinfo now reports image support without needing to install the -ext package.

Two things I noticed:
clinfo does not seem to run by default: clinfo: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
Setting set -x LD_LIBRARY_PATH (nix-build -E 'import <nixpkgs>' -A 'gcc.cc.lib')/lib64 makes it work. That does not seem to be a problem of this PR, nix-shell -p rocm-opencl-runtime shows the same behavior.

I tried to run the same OpenCL samples as last time, though it seems like they hang my GPU now (set -x OCL_ICD_VENDORS ~/.cache/nixpkgs-review/pr-95489/results/rocm-opencl-icd/etc/OpenCL/vendors/ hangs while set -x OCL_ICD_VENDORS (nix-build -E 'import <nixpkgs>' -A 'rocm-opencl-icd')/etc/OpenCL/vendors/ works fine). I’m on navi/gfx10 tough, which is not yet officially supported by ROCm, so I guess it’s ok, if it doesn’t run there.

danieldk · 2020-08-21T09:30:00Z

Two things I noticed:
clinfo does not seem to run by default: clinfo: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
Setting set -x LD_LIBRARY_PATH (nix-build -E 'import <nixpkgs>' -A 'gcc.cc.lib')/lib64 makes it work. That does not seem to
be a problem of this PR, nix-shell -p rocm-opencl-runtime shows the same behavior.

Odd, I haven't bumped into that problem. clinfo doesn't seem to link against libstdc++:

$ ldd /nix/store/za5hzhbrvddyc7c29w7rpr9wqcr25l8m-clinfo-2.2.18.04.06/bin/clinfo
	linux-vdso.so.1 (0x00007ffd1d39d000)
	libOpenCL.so.1 => /nix/store/a483z5di2217hp57klm4rss49fhih0y6-ocl-icd-2.2.10/lib/libOpenCL.so.1 (0x00007fecbdaca000)
	libdl.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libdl.so.2 (0x00007fecbdac5000)
	libc.so.6 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libc.so.6 (0x00007fecbd906000)
	/nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/ld-linux-x86-64.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib64/ld-linux-x86-64.so.2 (0x00007fecbdaed000)

Is the loader dlopen()ing something that requires libstdc++? The ROCm OpenCL runtime seems to be properly linked:

ldd /nix/store/z0yjba7iwzs782sa74ay8khwkpdy9kyj-rocm-opencl-runtime-3.7.0/lib/libamdocl64.so | grep stdc
	libstdc++.so.6 => /nix/store/v589pqjhvxrj73g3r0xb41yr84z5pwb7-gcc-9.3.0-lib/lib/libstdc++.so.6 (0x00007f7c528f7000)

I tried to run the same OpenCL samples as last time, though it seems like they hang my GPU now (set -x OCL_ICD_VENDORS ~/.cache/nixpkgs-review/pr-95489/results/rocm-opencl-icd/etc/OpenCL/vendors/ hangs while set -x OCL_ICD_VENDORS (nix-build -E 'import <nixpkgs>' -A 'rocm-opencl-icd')/etc/OpenCL/vendors/ works fine). I’m on navi/gfx10 tough, which is not yet officially supported by ROCm, so I guess it’s ok, if it doesn’t run there.

Which sample hangs? I'd like to try out if it is fine on my RX580.

acowley · 2020-08-21T09:34:36Z

I know I wasn’t asked, but I like to try out clpeak as a quick smoke test (one can abort it early for an even quicker test). That’s how it ended up in the overlay.

Flakebi · 2020-08-21T09:49:21Z

Sample code

test can be compiled with g++ -o test.o -c test.cpp -g && g++ -o test test.o -lOpenCL
test2 can be compiled with g++ -o test2.o -c test2.cpp -g && g++ -o Timer.o -c Timer.cpp -g && g++ -o test2 test2.o Timer.o -lOpenCL

Both of them hang for me with ROCm 3.7.

test2 prints that it’s running on a CPU for me, though I’m pretty sure it’s not.

test.cpp:

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//

// A minimalist OpenCL program.

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512
// A simple memset kernel
const char *source =
"kernel void memset(   global uint *dst )             \n"
"{                                                    \n"
"    dst[get_global_id(0)] = get_global_id(0);        \n"
"}                                                    \n";

int main(int argc, char ** argv)
{
  // 1. Get a platform.
  cl_platform_id platform;
  clGetPlatformIDs( 1, &platform, NULL );

  // 2. Find a gpu device.
  cl_device_id device;
  clGetDeviceIDs( platform,
                  CL_DEVICE_TYPE_GPU,
                  1,
                  &device, NULL);

  // 3. Create a context and command queue on that device.
  cl_context context = clCreateContext( NULL,
                                        1,
                                        &device,
                                        NULL, NULL, NULL);

  cl_command_queue queue = clCreateCommandQueue( context,
                                                 device,
                                                 0, NULL );

  // 4. Perform runtime source compilation, and obtain kernel entry point.
  cl_program program = clCreateProgramWithSource( context,
                                                  1,
                                                  &source,
                                                  NULL, NULL );

  clBuildProgram( program, 1, &device, NULL, NULL, NULL );

  cl_kernel kernel = clCreateKernel( program, "memset", NULL );

  // 5. Create a data buffer.
  cl_mem buffer = clCreateBuffer( context,
                                  CL_MEM_WRITE_ONLY,
                                  NWITEMS * sizeof(cl_uint),
                                  NULL, NULL );

  // 6. Launch the kernel. Let OpenCL pick the local work size.
  size_t global_work_size = NWITEMS;
  clSetKernelArg(kernel, 0, sizeof(buffer), (void*) &buffer);

  clEnqueueNDRangeKernel( queue,
                          kernel,
                          1,
                          NULL,
                          &global_work_size,
                          NULL,
                          0,
                          NULL, NULL);

  clFinish( queue );

  // 7. Look at the results via synchronous buffer map.
  cl_uint *ptr;
  ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                        buffer,
                                        CL_TRUE,
                                        CL_MAP_READ,
                                        0,
                                        NWITEMS * sizeof(cl_uint),
                                        0, NULL, NULL, NULL );

  int i;

  for(i=0; i < NWITEMS; i++)
      printf("%d %d\n", i, ptr[i]);

  return 0;
}

test2.cpp:

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"

#define NDEVS      1

// A parallel min() kernel that works well on CPU and GPU

const char *kernel_source =
"                                                                                 \n"
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable            \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable           \n"
"                                                                                 \n"
"// 9. The source buffer is accessed as 4-vectors.                                \n"
"__kernel void minp(__global uint4 *src,                                          \n"
"                   __global uint *gmin,                                          \n"
"                   __local uint *lmin,                                           \n"
"                   __global uint *dbg,                                           \n"
"                   int nitems,                                                   \n"
"                   uint dev )                                                    \n"
"{                                                                                \n"
"                                                                                 \n"
"  // 10. Set up   global memory access pattern.                                  \n"
"                                                                                 \n"
"  uint count = ( nitems / 4 ) / get_global_size(0);                              \n"
"  uint idx   = (dev == 0) ? get_global_id(0) * count                             \n"
"                          :  get_global_id(0);                                   \n"
"  uint stride = (dev == 0) ? 1 : get_global_size(0);                             \n"
"  uint pmin  = (uint) -1;                                                        \n"
"  // 11. First, compute private min, for this work-item.                         \n"
"  for( int n=0; n < count; n++, idx += stride )                                  \n"
"  {                                                                              \n"
"    pmin = min( pmin, src[idx].x );                                              \n"
"    pmin = min( pmin, src[idx].y );                                              \n"
"    pmin = min( pmin, src[idx].z );                                              \n"
"    pmin = min( pmin, src[idx].w );                                              \n"
"  }                                                                              \n"
"                                                                                 \n"
"  // 12. Reduce min values inside work-group.                                    \n"
"  if( get_local_id(0) == 0 )                                                     \n"
"    lmin[0] = (uint) -1;                                                         \n"
"  barrier( CLK_LOCAL_MEM_FENCE );                                                \n"
"  (void) atom_min( lmin, pmin );                                                 \n"
"  barrier( CLK_LOCAL_MEM_FENCE );                                                \n"
"  // Write out to __global.                                                      \n"
"  if( get_local_id(0) == 0 )                                                     \n"
"    gmin[ get_group_id(0) ] = lmin[0];                                           \n"
"  // Dump some debug information.                                                \n"
"  if( get_global_id(0) == 0 )                                                    \n"
"  {                                                                              \n"
"    dbg[0] = get_num_groups(0);                                                  \n"
"    dbg[1] = get_global_size(0);                                                 \n"
"    dbg[2] = count;                                                              \n"
"    dbg[3] = stride;                                                             \n"
"  }                                                                              \n"
"}                                                                                \n"
"                                                                                 \n"
"// 13. Reduce work-group min values from __global to __global.                   \n"
"kernel void reduce(__global uint4 *src,                                          \n"
"                   __global uint *gmin )                                         \n"
"{                                                                                \n"
"  (void) atom_min( gmin, gmin[get_global_id(0)] );                               \n"
"};                                                                               \n";

int main(int argc, char ** argv)
{
  cl_platform_id      platform;

  int dev, nw;
  cl_device_type      devs[NDEVS] = { CL_DEVICE_TYPE_GPU };

  cl_uint     *src_ptr;
  unsigned int        num_src_items = 4096*4096;

  // 1. quick & dirty MWC random init of source buffer.
  // Random seed (portable).
  time_t ltime;
  time(&ltime);

  src_ptr = (cl_uint *) malloc( num_src_items * sizeof(cl_uint) );

  cl_uint a = (cl_uint) ltime, b =    (cl_uint) ltime;
  cl_uint min = (cl_uint) -1;
  // Do serial computation of min() for result verification.
  for( int i=0; i < num_src_items; i++ )
  {
    src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));
    min = src_ptr[i] < min ? src_ptr[i] : min;
  }

  // Get a platform.
  clGetPlatformIDs( 1, &platform, NULL );

  // 3. Iterate over devices.
  for(dev=0; dev < NDEVS; dev++)
  {
    cl_device_id      device;
    cl_context        context;
    cl_command_queue  queue;
    cl_program        program;
    cl_kernel         minp;
    cl_kernel         reduce;

    cl_mem            src_buf;
    cl_mem            dst_buf;
    cl_mem            dbg_buf;

    cl_uint           *dst_ptr,
                      *dbg_ptr;

    printf("\n%s: ", dev == 0 ? "CPU" : "GPU");
    // Find the device.
    clGetDeviceIDs( platform,
                    devs[dev],
                    1,
                    &device,
                    NULL);

    // 4. Compute work sizes.
    cl_uint compute_units;
    size_t global_work_size;
    size_t local_work_size;
    size_t num_groups;

    clGetDeviceInfo( device,
                     CL_DEVICE_MAX_COMPUTE_UNITS,
                     sizeof(cl_uint),
                     &compute_units,
                     NULL);

    if( devs[dev] == CL_DEVICE_TYPE_CPU )
    {
      global_work_size = compute_units * 1;   // 1 thread per core
      local_work_size = 1;
    }
    else
    {
      cl_uint ws = 64;
      global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD
      printf("Size: %u, %u\n", global_work_size, compute_units);
      while( (num_src_items / 4) % global_work_size != 0 )
        global_work_size += ws;
      local_work_size = ws;
    }
    num_groups = global_work_size / local_work_size;
    // Create a context and command queue on that device.
    context = clCreateContext( NULL,
                               1,
                               &device,
                               NULL, NULL, NULL);

    queue = clCreateCommandQueue( context,
                                  device,
                                  0,
                                  NULL);
    // Minimal error check.
    if( queue == NULL )
    {
      printf("Compute device setup failed\n");
      return(-1);
    }

    // Perform runtime source compilation, and obtain kernel entry point.
    program = clCreateProgramWithSource( context,
                                         1,
                                         &kernel_source,
                                         NULL, NULL );

    //Tell compiler to dump intermediate .il and .isa GPU files.
    cl_int ret = clBuildProgram( program,
                          1,
                          &device,
                          "-save-temps",
                          NULL, NULL );

    // 5. Print compiler error messages
    if(ret != CL_SUCCESS)
    {
      printf("clBuildProgram failed: %d\n", ret);

      char buf[0x10000];

      clGetProgramBuildInfo( program,
                             device,
                             CL_PROGRAM_BUILD_LOG,
                             0x10000,
                             buf,
                             NULL);
      printf("\n%s\n", buf);
      return(-1);
    }

    minp      = clCreateKernel( program, "minp", NULL );
    reduce = clCreateKernel( program, "reduce", NULL );
    // Create input, output and debug buffers.
    src_buf = clCreateBuffer( context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              num_src_items * sizeof(cl_uint),
                              src_ptr,
                              NULL );

    dst_buf = clCreateBuffer( context,
                              CL_MEM_READ_WRITE,
                              num_groups * sizeof(cl_uint),
                              NULL, NULL );

    dbg_buf = clCreateBuffer( context,
                              CL_MEM_WRITE_ONLY,
                              global_work_size * sizeof(cl_uint),
                              NULL, NULL );

    clSetKernelArg(minp, 0, sizeof(void *),        (void*) &src_buf);
    clSetKernelArg(minp, 1, sizeof(void *),        (void*) &dst_buf);
    clSetKernelArg(minp, 2, 1*sizeof(cl_uint),     (void*) NULL);
    clSetKernelArg(minp, 3, sizeof(void *),        (void*) &dbg_buf);
    clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);
    clSetKernelArg(minp, 5, sizeof(dev),           (void*) &dev);

    clSetKernelArg(reduce, 0, sizeof(void *),      (void*) &src_buf);
    clSetKernelArg(reduce, 1, sizeof(void *),      (void*) &dst_buf);

    CPerfCounter t;
    t.Reset();
    t.Start();

    // 6. Main timing loop.
    #define NLOOPS 500

    cl_event ev;
    int nloops = NLOOPS;

    while(nloops--)
    {
      clEnqueueNDRangeKernel( queue,
                              minp,
                              1,
                              NULL,
                              &global_work_size,
                              &local_work_size,
                              0,
                              NULL,
                              &ev);

      clEnqueueNDRangeKernel( queue,
                              reduce,
                              1,
                              NULL,
                              &num_groups,
                              NULL,
                              1,
                              &ev,
                              NULL);
    }

    clFinish( queue );
    t.Stop();

    printf("B/W %.2f GB/sec, ", ((float) num_src_items * sizeof(cl_uint) * NLOOPS) / t.GetElapsedTime() / 1e9 );

    // 7. Look at the results via synchronous buffer map.
    dst_ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                              dst_buf,
                                              CL_TRUE,
                                              CL_MAP_READ,
                                              0,
                                              num_groups * sizeof(cl_uint),
                                              0,
                                              NULL, NULL, NULL );

    dbg_ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                              dbg_buf,
                                              CL_TRUE,
                                              CL_MAP_READ,
                                              0,
                                              global_work_size * sizeof(cl_uint),
                                              0,
                                              NULL, NULL, NULL );

    // 8. Print some debug info.
    printf("%d groups, %d threads, count %d, stride %d\n", dbg_ptr[0], dbg_ptr[1], dbg_ptr[2], dbg_ptr[3] );

    if( dst_ptr[0] == min )
      printf("result correct\n");
    else
      printf("result INcorrect\n");
  }

  printf("\n");
  return 0;
}

Timer.h:

/**********************************************************************
Copyright ?012 Advanced Micro Devices, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
?Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
?Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
 other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************/

#ifndef _TIMER_H_
#define _TIMER_H_
/**
 * \file Timer.h
 * \brief A timer class that provides a cross platform timer for use
 * in timing code progress with a high degree of accuracy.
 */
#ifdef _WIN32
/**
 * \typedef __int64 i64
 * \brief Maps the windows 64 bit integer to a uniform name
 */
#if defined(__MINGW64__) || defined(__MINGW32__)
typedef long long i64;
#else
typedef __int64 i64;
#endif
#else
/**
 * \typedef long long i64
 * \brief Maps the linux 64 bit integer to a uniform name
 */
typedef long long i64;
#endif

/**
 * \class CPerfCounter
 * \brief Counter that provides a fairly accurate timing mechanism for both
 * windows and linux. This timer is used extensively in all the samples.
 */
class CPerfCounter {

public:
    /**
     * \fn CPerfCounter()
     * \brief Constructor for CPerfCounter that initializes the class
     */
    CPerfCounter();
    /**
     * \fn ~CPerfCounter()
     * \brief Destructor for CPerfCounter that cleans up the class
     */
    ~CPerfCounter();
    /**
     * \fn void Start(void)
     * \brief Start the timer
     * \sa Stop(), Reset()
     */
    void Start(void);
    /**
     * \fn void Stop(void)
     * \brief Stop the timer
     * \sa Start(), Reset()
     */
    void Stop(void);
    /**
     * \fn void Reset(void)
     * \brief Reset the timer to 0
     * \sa Start(), Stop()
     */
    void Reset(void);
    /**
     * \fn double GetElapsedTime(void)
     * \return Amount of time that has accumulated between the \a Start()
     * and \a Stop() function calls
     */
    double GetElapsedTime(void);

private:

    i64 _freq;
    i64 _clocks;
    i64 _start;
};

#endif // _TIMER_H_

Timer.cpp:

/**********************************************************************
Copyright ?012 Advanced Micro Devices, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
?Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
?Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
 other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************/

#include "Timer.h"

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#include <time.h>
#endif

CPerfCounter::CPerfCounter() : _clocks(0), _start(0)
{

#ifdef _WIN32
    QueryPerformanceFrequency((LARGE_INTEGER *)&_freq);
#else
    _freq = 1000;
#endif

}

CPerfCounter::~CPerfCounter()
{
    // EMPTY!
}

void
CPerfCounter::Start(void)
{

#ifdef _WIN32
    QueryPerformanceCounter((LARGE_INTEGER *)&_start);
#else
    struct timespec s;
    clock_gettime( CLOCK_REALTIME, &s );
    _start = (i64)s.tv_sec * 1e9 + (i64)s.tv_nsec;
#endif

}

void
CPerfCounter::Stop(void)
{
    i64 n;

#ifdef _WIN32
    QueryPerformanceCounter((LARGE_INTEGER *)&n);
#else
    struct timespec s;
    clock_gettime( CLOCK_REALTIME, &s );
    n = (i64)s.tv_sec * 1e9 + (i64)s.tv_nsec;
#endif

    n -= _start;
    _start = 0;
    _clocks += n;
}

void
CPerfCounter::Reset(void)
{

    _clocks = 0;
}

double
CPerfCounter::GetElapsedTime(void)
{
#if _WIN32
    return (double)_clocks / (double) _freq;
#else
    return (double)_clocks / (double) 1e9;
#endif

}

I found the problem with clinfo: I was using the clinfo from rocm-opencl-runtime, which links against libstdc++:

$ ldd /nix/store/z0yjba7iwzs782sa74ay8khwkpdy9kyj-rocm-opencl-runtime-3.7.0/bin/clinfo
        linux-vdso.so.1 (0x00007ffdfe2bd000)
        libOpenCL.so.1 => /nix/store/z0yjba7iwzs782sa74ay8khwkpdy9kyj-rocm-opencl-runtime-3.7.0/lib/libOpenCL.so.1 (0x00007f5963f15000)
        libdl.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libdl.so.2 (0x00007f5963f10000)
        libstdc++.so.6 => not found
        libm.so.6 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libm.so.6 (0x00007f5963dcf000)
        libgcc_s.so.1 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libgcc_s.so.1 (0x00007f5963db5000)
        libpthread.so.0 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libpthread.so.0 (0x00007f5963d94000)
        libc.so.6 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libc.so.6 (0x00007f5963bd3000)
        /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/ld-linux-x86-64.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib64/ld-linux-x86-64.so.2 (0x00007f5963f21000)

danieldk · 2020-08-21T09:52:15Z

I know I wasn’t asked, but I like to try out clpeak as a quick smoke test (one can abort it early for an even quicker test). That’s how it ended up in the overlay.

Thanks for the pointer! Works well.

danieldk · 2020-08-21T09:59:33Z

Sample code

Thanks! Both work on RX580.

I found the problem with clinfo: I was using the clinfo from rocm-opencl-runtime, which links against libstdc++:

Ah, good catch. We should probably remove that from rocm-opencl-runtime's output anyway. Any objections?

danieldk · 2020-08-21T12:34:20Z

Added a commit that removes clinfo.

acowley · 2020-08-21T16:48:23Z

I think it's fine, but I can't say I've ever seen the issue reported by @Flakebi, and I do not see it with my ROCm 3.5 install. The reason the derivation existed was due to AMD having some extensions relating to OpenCL. This has been a very confused situation that has caused issues with having a suitable headers for AMD's implementation, but it has never been well-documented. I can keep it i the overlay in case it is an issue.

danieldk · 2020-08-21T17:17:12Z

@ofborg build rocm-opencl-icd

acowley · 2020-08-22T04:35:29Z

I'm porting the patches to the overlay, and I think the rocminfo upgrade is missing here. The sha256 for 3.7.0 is 0yc6z0migxzw3l12vbvq1sww3cmwl8flbj5a2vj2fnpdxkkqfgsl.

danieldk · 2020-08-22T05:46:31Z

I'm porting the patches to the overlay, and I think the rocminfo upgrade is missing here. The sha256 for 3.7.0 is 0yc6z0migxzw3l12vbvq1sww3cmwl8flbj5a2vj2fnpdxkkqfgsl.

We currently do not have rocminfo, but I did forget rocm-smi.

Are we happy with merging this now, or do we want to wait to see what is happening on the Navi front? Since Navi is not officially supported yet, I think we should probably merge it. Then 3.7.0 will also be in NixOS 20.09.

acowley · 2020-08-22T06:22:26Z

I'm fine with you making that call. I have a build failure with rocminfo which blocks CI for the overlay, so I've not run anything on 3.7 yet. But I think two of you have now run it, though with some differing results, so perhaps that's enough.

acowley · 2020-08-22T08:16:45Z

I made some progress, but hip can't link at the moment with an error regarding libclang_rt.builtins-x86_64. This may require some poking into the LLVM packages.

Flakebi · 2020-08-22T10:01:45Z

I’m fine with merging it as it seems to work fine on your system.

acowley · 2020-08-22T10:39:38Z

I think the overlay is actually good to go now, too. I haven't updated all of the downstream middleware, but hip is happy now.

danieldk · 2020-08-22T11:29:43Z

I think the overlay is actually good to go now, too. I haven't updated all of the downstream middleware, but hip is happy now.

Cool! That's great to hear.

rocm-thunk: 3.5.0 -> 3.7.0

20656b7

danieldk force-pushed the rocm-3.7.0 branch from 4c81387 to 6d5e390 Compare August 15, 2020 06:09

ofborg bot added 11.by: package-maintainer 10.rebuild-darwin: 0 10.rebuild-linux: 1-10 labels Aug 15, 2020

danieldk added 10 commits August 21, 2020 08:50

llvmPackages_rocm: 3.5.1 -> 3.7.0

89567e2

rocm-device-libs: 3.5.1 -> 3.7.0

acc68bb

rocm-runtime: 3.5.0 -> 3.7.0

d4e40af

rocm-comgr: 3.5.0 -> 3.7.0

f511f43

rocm-cmake: 3.5.0 -> 3.7.0

8c2bc9a

This only bumps the tag. But the tag points at the same commit, so no sha256 changes.

rocclr: 3.5.0 -> 3.7.0

e0ef874

rocm-opencl-runtime: 3.5.0 -> 3.7.0

29775a4

rocm-opencl-icd: set version to rocm-opencl-runtime.version

cc47462

rocm-runtime-ext: remove

5ce029d

nixos/manual: remove references to rocm-runtime-ext

95e5089

rocm-runtime-ext is not needed anymore for OpenCL image support.

danieldk force-pushed the rocm-3.7.0 branch from 6d5e390 to 95e5089 Compare August 21, 2020 06:51

danieldk marked this pull request as ready for review August 21, 2020 06:51

danieldk requested a review from matthewbauer as a code owner August 21, 2020 06:51

danieldk requested review from Flakebi and acowley August 21, 2020 06:52

rocm-opencl-runtime: remove clinfo from the output

b88ac04

ofborg bot added 6.topic: nixos 8.has: documentation 8.has: clean-up labels Aug 21, 2020

rocm-smi: 3.5.0 -> 3.7.0

d75d3eb

acowley approved these changes Aug 22, 2020

View reviewed changes

danieldk merged commit 280147c into NixOS:master Aug 22, 2020

danieldk deleted the rocm-3.7.0 branch August 22, 2020 11:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update ROCm to 3.7.0 #95489

Update ROCm to 3.7.0 #95489

danieldk commented Aug 15, 2020 •

edited

danieldk commented Aug 21, 2020

acowley commented Aug 21, 2020

Flakebi commented Aug 21, 2020

danieldk commented Aug 21, 2020 •

edited

acowley commented Aug 21, 2020

Flakebi commented Aug 21, 2020

danieldk commented Aug 21, 2020

danieldk commented Aug 21, 2020 •

edited

danieldk commented Aug 21, 2020

acowley commented Aug 21, 2020

danieldk commented Aug 21, 2020

acowley commented Aug 22, 2020

danieldk commented Aug 22, 2020

acowley commented Aug 22, 2020

acowley commented Aug 22, 2020

Flakebi commented Aug 22, 2020

acowley commented Aug 22, 2020

danieldk commented Aug 22, 2020

Update ROCm to 3.7.0 #95489

Update ROCm to 3.7.0 #95489

Conversation

danieldk commented Aug 15, 2020 • edited

Motivation for this change

Things done

danieldk commented Aug 21, 2020

acowley commented Aug 21, 2020

Flakebi commented Aug 21, 2020

danieldk commented Aug 21, 2020 • edited

acowley commented Aug 21, 2020

Flakebi commented Aug 21, 2020

danieldk commented Aug 21, 2020

danieldk commented Aug 21, 2020 • edited

danieldk commented Aug 21, 2020

acowley commented Aug 21, 2020

danieldk commented Aug 21, 2020

acowley commented Aug 22, 2020

danieldk commented Aug 22, 2020

acowley commented Aug 22, 2020

acowley commented Aug 22, 2020

Flakebi commented Aug 22, 2020

acowley commented Aug 22, 2020

danieldk commented Aug 22, 2020

danieldk commented Aug 15, 2020 •

edited

danieldk commented Aug 21, 2020 •

edited

danieldk commented Aug 21, 2020 •

edited