Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ROCm to 3.7.0 #95489

Merged
merged 13 commits into from Aug 22, 2020
Merged

Update ROCm to 3.7.0 #95489

merged 13 commits into from Aug 22, 2020

Conversation

danieldk
Copy link
Contributor

@danieldk danieldk commented Aug 15, 2020

Motivation for this change

Tested with clinfo and Darktable.

The good news is that this release removes the need for rocm-runtime-ext, since image support is added to rocm-runtime. Also restructured dependencies a bit, since rocm-runtime now requires clang and rocm-device-libs.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@danieldk danieldk marked this pull request as ready for review August 21, 2020 06:51
@danieldk
Copy link
Contributor Author

@acowley @Flakebi this is now ready for review.

@acowley
Copy link
Contributor

acowley commented Aug 21, 2020

Wow, you made this upgrade fast, and made overall improvements while doing it! The integration of image support isn't even mentioned in the ROCm 3.7 release notes, which is sort of shocking given how long it's been sparking Issues on their repositories (an example that remains open).

Everything looks good to me, but I haven't yet ported the diffs to the overlay to push it through CI there.

@Flakebi
Copy link
Member

Flakebi commented Aug 21, 2020

Thanks, that was a fast update! clinfo now reports image support without needing to install the -ext package.

Two things I noticed:
clinfo does not seem to run by default: clinfo: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
Setting set -x LD_LIBRARY_PATH (nix-build -E 'import <nixpkgs>' -A 'gcc.cc.lib')/lib64 makes it work. That does not seem to be a problem of this PR, nix-shell -p rocm-opencl-runtime shows the same behavior.

I tried to run the same OpenCL samples as last time, though it seems like they hang my GPU now (set -x OCL_ICD_VENDORS ~/.cache/nixpkgs-review/pr-95489/results/rocm-opencl-icd/etc/OpenCL/vendors/ hangs while set -x OCL_ICD_VENDORS (nix-build -E 'import <nixpkgs>' -A 'rocm-opencl-icd')/etc/OpenCL/vendors/ works fine). I’m on navi/gfx10 tough, which is not yet officially supported by ROCm, so I guess it’s ok, if it doesn’t run there.

@danieldk
Copy link
Contributor Author

danieldk commented Aug 21, 2020

Two things I noticed:
clinfo does not seem to run by default: clinfo: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
Setting set -x LD_LIBRARY_PATH (nix-build -E 'import <nixpkgs>' -A 'gcc.cc.lib')/lib64 makes it work. That does not seem to
be a problem of this PR, nix-shell -p rocm-opencl-runtime shows the same behavior.

Odd, I haven't bumped into that problem. clinfo doesn't seem to link against libstdc++:

$ ldd /nix/store/za5hzhbrvddyc7c29w7rpr9wqcr25l8m-clinfo-2.2.18.04.06/bin/clinfo
	linux-vdso.so.1 (0x00007ffd1d39d000)
	libOpenCL.so.1 => /nix/store/a483z5di2217hp57klm4rss49fhih0y6-ocl-icd-2.2.10/lib/libOpenCL.so.1 (0x00007fecbdaca000)
	libdl.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libdl.so.2 (0x00007fecbdac5000)
	libc.so.6 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libc.so.6 (0x00007fecbd906000)
	/nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/ld-linux-x86-64.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib64/ld-linux-x86-64.so.2 (0x00007fecbdaed000)

Is the loader dlopen()ing something that requires libstdc++? The ROCm OpenCL runtime seems to be properly linked:

ldd /nix/store/z0yjba7iwzs782sa74ay8khwkpdy9kyj-rocm-opencl-runtime-3.7.0/lib/libamdocl64.so | grep stdc
	libstdc++.so.6 => /nix/store/v589pqjhvxrj73g3r0xb41yr84z5pwb7-gcc-9.3.0-lib/lib/libstdc++.so.6 (0x00007f7c528f7000)

I tried to run the same OpenCL samples as last time, though it seems like they hang my GPU now (set -x OCL_ICD_VENDORS ~/.cache/nixpkgs-review/pr-95489/results/rocm-opencl-icd/etc/OpenCL/vendors/ hangs while set -x OCL_ICD_VENDORS (nix-build -E 'import <nixpkgs>' -A 'rocm-opencl-icd')/etc/OpenCL/vendors/ works fine). I’m on navi/gfx10 tough, which is not yet officially supported by ROCm, so I guess it’s ok, if it doesn’t run there.

Which sample hangs? I'd like to try out if it is fine on my RX580.

@acowley
Copy link
Contributor

acowley commented Aug 21, 2020

I know I wasn’t asked, but I like to try out clpeak as a quick smoke test (one can abort it early for an even quicker test). That’s how it ended up in the overlay.

@Flakebi
Copy link
Member

Flakebi commented Aug 21, 2020

Sample code

test can be compiled with g++ -o test.o -c test.cpp -g && g++ -o test test.o -lOpenCL
test2 can be compiled with g++ -o test2.o -c test2.cpp -g && g++ -o Timer.o -c Timer.cpp -g && g++ -o test2 test2.o Timer.o -lOpenCL

Both of them hang for me with ROCm 3.7.

test2 prints that it’s running on a CPU for me, though I’m pretty sure it’s not.

test.cpp:

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//

// A minimalist OpenCL program.

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512
// A simple memset kernel
const char *source =
"kernel void memset(   global uint *dst )             \n"
"{                                                    \n"
"    dst[get_global_id(0)] = get_global_id(0);        \n"
"}                                                    \n";

int main(int argc, char ** argv)
{
  // 1. Get a platform.
  cl_platform_id platform;
  clGetPlatformIDs( 1, &platform, NULL );

  // 2. Find a gpu device.
  cl_device_id device;
  clGetDeviceIDs( platform,
                  CL_DEVICE_TYPE_GPU,
                  1,
                  &device, NULL);

  // 3. Create a context and command queue on that device.
  cl_context context = clCreateContext( NULL,
                                        1,
                                        &device,
                                        NULL, NULL, NULL);

  cl_command_queue queue = clCreateCommandQueue( context,
                                                 device,
                                                 0, NULL );

  // 4. Perform runtime source compilation, and obtain kernel entry point.
  cl_program program = clCreateProgramWithSource( context,
                                                  1,
                                                  &source,
                                                  NULL, NULL );

  clBuildProgram( program, 1, &device, NULL, NULL, NULL );

  cl_kernel kernel = clCreateKernel( program, "memset", NULL );

  // 5. Create a data buffer.
  cl_mem buffer = clCreateBuffer( context,
                                  CL_MEM_WRITE_ONLY,
                                  NWITEMS * sizeof(cl_uint),
                                  NULL, NULL );

  // 6. Launch the kernel. Let OpenCL pick the local work size.
  size_t global_work_size = NWITEMS;
  clSetKernelArg(kernel, 0, sizeof(buffer), (void*) &buffer);

  clEnqueueNDRangeKernel( queue,
                          kernel,
                          1,
                          NULL,
                          &global_work_size,
                          NULL,
                          0,
                          NULL, NULL);

  clFinish( queue );

  // 7. Look at the results via synchronous buffer map.
  cl_uint *ptr;
  ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                        buffer,
                                        CL_TRUE,
                                        CL_MAP_READ,
                                        0,
                                        NWITEMS * sizeof(cl_uint),
                                        0, NULL, NULL, NULL );

  int i;

  for(i=0; i < NWITEMS; i++)
      printf("%d %d\n", i, ptr[i]);

  return 0;
}

test2.cpp:

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"

#define NDEVS      1

// A parallel min() kernel that works well on CPU and GPU

const char *kernel_source =
"                                                                                 \n"
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable            \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable           \n"
"                                                                                 \n"
"// 9. The source buffer is accessed as 4-vectors.                                \n"
"__kernel void minp(__global uint4 *src,                                          \n"
"                   __global uint *gmin,                                          \n"
"                   __local uint *lmin,                                           \n"
"                   __global uint *dbg,                                           \n"
"                   int nitems,                                                   \n"
"                   uint dev )                                                    \n"
"{                                                                                \n"
"                                                                                 \n"
"  // 10. Set up   global memory access pattern.                                  \n"
"                                                                                 \n"
"  uint count = ( nitems / 4 ) / get_global_size(0);                              \n"
"  uint idx   = (dev == 0) ? get_global_id(0) * count                             \n"
"                          :  get_global_id(0);                                   \n"
"  uint stride = (dev == 0) ? 1 : get_global_size(0);                             \n"
"  uint pmin  = (uint) -1;                                                        \n"
"  // 11. First, compute private min, for this work-item.                         \n"
"  for( int n=0; n < count; n++, idx += stride )                                  \n"
"  {                                                                              \n"
"    pmin = min( pmin, src[idx].x );                                              \n"
"    pmin = min( pmin, src[idx].y );                                              \n"
"    pmin = min( pmin, src[idx].z );                                              \n"
"    pmin = min( pmin, src[idx].w );                                              \n"
"  }                                                                              \n"
"                                                                                 \n"
"  // 12. Reduce min values inside work-group.                                    \n"
"  if( get_local_id(0) == 0 )                                                     \n"
"    lmin[0] = (uint) -1;                                                         \n"
"  barrier( CLK_LOCAL_MEM_FENCE );                                                \n"
"  (void) atom_min( lmin, pmin );                                                 \n"
"  barrier( CLK_LOCAL_MEM_FENCE );                                                \n"
"  // Write out to __global.                                                      \n"
"  if( get_local_id(0) == 0 )                                                     \n"
"    gmin[ get_group_id(0) ] = lmin[0];                                           \n"
"  // Dump some debug information.                                                \n"
"  if( get_global_id(0) == 0 )                                                    \n"
"  {                                                                              \n"
"    dbg[0] = get_num_groups(0);                                                  \n"
"    dbg[1] = get_global_size(0);                                                 \n"
"    dbg[2] = count;                                                              \n"
"    dbg[3] = stride;                                                             \n"
"  }                                                                              \n"
"}                                                                                \n"
"                                                                                 \n"
"// 13. Reduce work-group min values from __global to __global.                   \n"
"kernel void reduce(__global uint4 *src,                                          \n"
"                   __global uint *gmin )                                         \n"
"{                                                                                \n"
"  (void) atom_min( gmin, gmin[get_global_id(0)] );                               \n"
"};                                                                               \n";

int main(int argc, char ** argv)
{
  cl_platform_id      platform;

  int dev, nw;
  cl_device_type      devs[NDEVS] = { CL_DEVICE_TYPE_GPU };

  cl_uint     *src_ptr;
  unsigned int        num_src_items = 4096*4096;

  // 1. quick & dirty MWC random init of source buffer.
  // Random seed (portable).
  time_t ltime;
  time(&ltime);

  src_ptr = (cl_uint *) malloc( num_src_items * sizeof(cl_uint) );

  cl_uint a = (cl_uint) ltime, b =    (cl_uint) ltime;
  cl_uint min = (cl_uint) -1;
  // Do serial computation of min() for result verification.
  for( int i=0; i < num_src_items; i++ )
  {
    src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));
    min = src_ptr[i] < min ? src_ptr[i] : min;
  }

  // Get a platform.
  clGetPlatformIDs( 1, &platform, NULL );

  // 3. Iterate over devices.
  for(dev=0; dev < NDEVS; dev++)
  {
    cl_device_id      device;
    cl_context        context;
    cl_command_queue  queue;
    cl_program        program;
    cl_kernel         minp;
    cl_kernel         reduce;

    cl_mem            src_buf;
    cl_mem            dst_buf;
    cl_mem            dbg_buf;

    cl_uint           *dst_ptr,
                      *dbg_ptr;

    printf("\n%s: ", dev == 0 ? "CPU" : "GPU");
    // Find the device.
    clGetDeviceIDs( platform,
                    devs[dev],
                    1,
                    &device,
                    NULL);

    // 4. Compute work sizes.
    cl_uint compute_units;
    size_t global_work_size;
    size_t local_work_size;
    size_t num_groups;

    clGetDeviceInfo( device,
                     CL_DEVICE_MAX_COMPUTE_UNITS,
                     sizeof(cl_uint),
                     &compute_units,
                     NULL);

    if( devs[dev] == CL_DEVICE_TYPE_CPU )
    {
      global_work_size = compute_units * 1;   // 1 thread per core
      local_work_size = 1;
    }
    else
    {
      cl_uint ws = 64;
      global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD
      printf("Size: %u, %u\n", global_work_size, compute_units);
      while( (num_src_items / 4) % global_work_size != 0 )
        global_work_size += ws;
      local_work_size = ws;
    }
    num_groups = global_work_size / local_work_size;
    // Create a context and command queue on that device.
    context = clCreateContext( NULL,
                               1,
                               &device,
                               NULL, NULL, NULL);

    queue = clCreateCommandQueue( context,
                                  device,
                                  0,
                                  NULL);
    // Minimal error check.
    if( queue == NULL )
    {
      printf("Compute device setup failed\n");
      return(-1);
    }

    // Perform runtime source compilation, and obtain kernel entry point.
    program = clCreateProgramWithSource( context,
                                         1,
                                         &kernel_source,
                                         NULL, NULL );

    //Tell compiler to dump intermediate .il and .isa GPU files.
    cl_int ret = clBuildProgram( program,
                          1,
                          &device,
                          "-save-temps",
                          NULL, NULL );

    // 5. Print compiler error messages
    if(ret != CL_SUCCESS)
    {
      printf("clBuildProgram failed: %d\n", ret);

      char buf[0x10000];

      clGetProgramBuildInfo( program,
                             device,
                             CL_PROGRAM_BUILD_LOG,
                             0x10000,
                             buf,
                             NULL);
      printf("\n%s\n", buf);
      return(-1);
    }

    minp      = clCreateKernel( program, "minp", NULL );
    reduce = clCreateKernel( program, "reduce", NULL );
    // Create input, output and debug buffers.
    src_buf = clCreateBuffer( context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              num_src_items * sizeof(cl_uint),
                              src_ptr,
                              NULL );

    dst_buf = clCreateBuffer( context,
                              CL_MEM_READ_WRITE,
                              num_groups * sizeof(cl_uint),
                              NULL, NULL );

    dbg_buf = clCreateBuffer( context,
                              CL_MEM_WRITE_ONLY,
                              global_work_size * sizeof(cl_uint),
                              NULL, NULL );

    clSetKernelArg(minp, 0, sizeof(void *),        (void*) &src_buf);
    clSetKernelArg(minp, 1, sizeof(void *),        (void*) &dst_buf);
    clSetKernelArg(minp, 2, 1*sizeof(cl_uint),     (void*) NULL);
    clSetKernelArg(minp, 3, sizeof(void *),        (void*) &dbg_buf);
    clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);
    clSetKernelArg(minp, 5, sizeof(dev),           (void*) &dev);

    clSetKernelArg(reduce, 0, sizeof(void *),      (void*) &src_buf);
    clSetKernelArg(reduce, 1, sizeof(void *),      (void*) &dst_buf);

    CPerfCounter t;
    t.Reset();
    t.Start();

    // 6. Main timing loop.
    #define NLOOPS 500

    cl_event ev;
    int nloops = NLOOPS;

    while(nloops--)
    {
      clEnqueueNDRangeKernel( queue,
                              minp,
                              1,
                              NULL,
                              &global_work_size,
                              &local_work_size,
                              0,
                              NULL,
                              &ev);

      clEnqueueNDRangeKernel( queue,
                              reduce,
                              1,
                              NULL,
                              &num_groups,
                              NULL,
                              1,
                              &ev,
                              NULL);
    }

    clFinish( queue );
    t.Stop();

    printf("B/W %.2f GB/sec, ", ((float) num_src_items * sizeof(cl_uint) * NLOOPS) / t.GetElapsedTime() / 1e9 );

    // 7. Look at the results via synchronous buffer map.
    dst_ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                              dst_buf,
                                              CL_TRUE,
                                              CL_MAP_READ,
                                              0,
                                              num_groups * sizeof(cl_uint),
                                              0,
                                              NULL, NULL, NULL );

    dbg_ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                              dbg_buf,
                                              CL_TRUE,
                                              CL_MAP_READ,
                                              0,
                                              global_work_size * sizeof(cl_uint),
                                              0,
                                              NULL, NULL, NULL );

    // 8. Print some debug info.
    printf("%d groups, %d threads, count %d, stride %d\n", dbg_ptr[0], dbg_ptr[1], dbg_ptr[2], dbg_ptr[3] );

    if( dst_ptr[0] == min )
      printf("result correct\n");
    else
      printf("result INcorrect\n");
  }

  printf("\n");
  return 0;
}

Timer.h:

/**********************************************************************
Copyright ?012 Advanced Micro Devices, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
?Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
?Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
 other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************/

#ifndef _TIMER_H_
#define _TIMER_H_
/**
 * \file Timer.h
 * \brief A timer class that provides a cross platform timer for use
 * in timing code progress with a high degree of accuracy.
 */
#ifdef _WIN32
/**
 * \typedef __int64 i64
 * \brief Maps the windows 64 bit integer to a uniform name
 */
#if defined(__MINGW64__) || defined(__MINGW32__)
typedef long long i64;
#else
typedef __int64 i64;
#endif
#else
/**
 * \typedef long long i64
 * \brief Maps the linux 64 bit integer to a uniform name
 */
typedef long long i64;
#endif

/**
 * \class CPerfCounter
 * \brief Counter that provides a fairly accurate timing mechanism for both
 * windows and linux. This timer is used extensively in all the samples.
 */
class CPerfCounter {

public:
    /**
     * \fn CPerfCounter()
     * \brief Constructor for CPerfCounter that initializes the class
     */
    CPerfCounter();
    /**
     * \fn ~CPerfCounter()
     * \brief Destructor for CPerfCounter that cleans up the class
     */
    ~CPerfCounter();
    /**
     * \fn void Start(void)
     * \brief Start the timer
     * \sa Stop(), Reset()
     */
    void Start(void);
    /**
     * \fn void Stop(void)
     * \brief Stop the timer
     * \sa Start(), Reset()
     */
    void Stop(void);
    /**
     * \fn void Reset(void)
     * \brief Reset the timer to 0
     * \sa Start(), Stop()
     */
    void Reset(void);
    /**
     * \fn double GetElapsedTime(void)
     * \return Amount of time that has accumulated between the \a Start()
     * and \a Stop() function calls
     */
    double GetElapsedTime(void);

private:

    i64 _freq;
    i64 _clocks;
    i64 _start;
};

#endif // _TIMER_H_

Timer.cpp:

/**********************************************************************
Copyright ?012 Advanced Micro Devices, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
?Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
?Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
 other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************/

#include "Timer.h"

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#include <time.h>
#endif

CPerfCounter::CPerfCounter() : _clocks(0), _start(0)
{

#ifdef _WIN32
    QueryPerformanceFrequency((LARGE_INTEGER *)&_freq);
#else
    _freq = 1000;
#endif

}

CPerfCounter::~CPerfCounter()
{
    // EMPTY!
}

void
CPerfCounter::Start(void)
{

#ifdef _WIN32
    QueryPerformanceCounter((LARGE_INTEGER *)&_start);
#else
    struct timespec s;
    clock_gettime( CLOCK_REALTIME, &s );
    _start = (i64)s.tv_sec * 1e9 + (i64)s.tv_nsec;
#endif

}

void
CPerfCounter::Stop(void)
{
    i64 n;

#ifdef _WIN32
    QueryPerformanceCounter((LARGE_INTEGER *)&n);
#else
    struct timespec s;
    clock_gettime( CLOCK_REALTIME, &s );
    n = (i64)s.tv_sec * 1e9 + (i64)s.tv_nsec;
#endif

    n -= _start;
    _start = 0;
    _clocks += n;
}

void
CPerfCounter::Reset(void)
{

    _clocks = 0;
}

double
CPerfCounter::GetElapsedTime(void)
{
#if _WIN32
    return (double)_clocks / (double) _freq;
#else
    return (double)_clocks / (double) 1e9;
#endif

}

I found the problem with clinfo: I was using the clinfo from rocm-opencl-runtime, which links against libstdc++:

$ ldd /nix/store/z0yjba7iwzs782sa74ay8khwkpdy9kyj-rocm-opencl-runtime-3.7.0/bin/clinfo
        linux-vdso.so.1 (0x00007ffdfe2bd000)
        libOpenCL.so.1 => /nix/store/z0yjba7iwzs782sa74ay8khwkpdy9kyj-rocm-opencl-runtime-3.7.0/lib/libOpenCL.so.1 (0x00007f5963f15000)
        libdl.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libdl.so.2 (0x00007f5963f10000)
        libstdc++.so.6 => not found
        libm.so.6 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libm.so.6 (0x00007f5963dcf000)
        libgcc_s.so.1 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libgcc_s.so.1 (0x00007f5963db5000)
        libpthread.so.0 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libpthread.so.0 (0x00007f5963d94000)
        libc.so.6 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/libc.so.6 (0x00007f5963bd3000)
        /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib/ld-linux-x86-64.so.2 => /nix/store/mh78fk3x12q2a77srgkzv16h0irl8r61-glibc-2.31/lib64/ld-linux-x86-64.so.2 (0x00007f5963f21000)

@danieldk
Copy link
Contributor Author

I know I wasn’t asked, but I like to try out clpeak as a quick smoke test (one can abort it early for an even quicker test). That’s how it ended up in the overlay.

Thanks for the pointer! Works well.

@danieldk
Copy link
Contributor Author

danieldk commented Aug 21, 2020

Sample code

Thanks! Both work on RX580.

I found the problem with clinfo: I was using the clinfo from rocm-opencl-runtime, which links against libstdc++:

Ah, good catch. We should probably remove that from rocm-opencl-runtime's output anyway. Any objections?

@danieldk
Copy link
Contributor Author

Added a commit that removes clinfo.

@acowley
Copy link
Contributor

acowley commented Aug 21, 2020

I think it's fine, but I can't say I've ever seen the issue reported by @Flakebi, and I do not see it with my ROCm 3.5 install. The reason the derivation existed was due to AMD having some extensions relating to OpenCL. This has been a very confused situation that has caused issues with having a suitable headers for AMD's implementation, but it has never been well-documented. I can keep it i the overlay in case it is an issue.

@danieldk
Copy link
Contributor Author

@ofborg build rocm-opencl-icd

@acowley
Copy link
Contributor

acowley commented Aug 22, 2020

I'm porting the patches to the overlay, and I think the rocminfo upgrade is missing here. The sha256 for 3.7.0 is 0yc6z0migxzw3l12vbvq1sww3cmwl8flbj5a2vj2fnpdxkkqfgsl.

@danieldk
Copy link
Contributor Author

I'm porting the patches to the overlay, and I think the rocminfo upgrade is missing here. The sha256 for 3.7.0 is 0yc6z0migxzw3l12vbvq1sww3cmwl8flbj5a2vj2fnpdxkkqfgsl.

We currently do not have rocminfo, but I did forget rocm-smi.

Are we happy with merging this now, or do we want to wait to see what is happening on the Navi front? Since Navi is not officially supported yet, I think we should probably merge it. Then 3.7.0 will also be in NixOS 20.09.

@acowley
Copy link
Contributor

acowley commented Aug 22, 2020

I'm fine with you making that call. I have a build failure with rocminfo which blocks CI for the overlay, so I've not run anything on 3.7 yet. But I think two of you have now run it, though with some differing results, so perhaps that's enough.

@acowley
Copy link
Contributor

acowley commented Aug 22, 2020

I made some progress, but hip can't link at the moment with an error regarding libclang_rt.builtins-x86_64. This may require some poking into the LLVM packages.

@Flakebi
Copy link
Member

Flakebi commented Aug 22, 2020

I’m fine with merging it as it seems to work fine on your system.

@acowley
Copy link
Contributor

acowley commented Aug 22, 2020

I think the overlay is actually good to go now, too. I haven't updated all of the downstream middleware, but hip is happy now.

@danieldk
Copy link
Contributor Author

I think the overlay is actually good to go now, too. I haven't updated all of the downstream middleware, but hip is happy now.

Cool! That's great to hear.

@danieldk danieldk merged commit 280147c into NixOS:master Aug 22, 2020
@danieldk danieldk deleted the rocm-3.7.0 branch August 22, 2020 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants