Integrate faster kernels for delay calculation #58

Open
maedoc opened this issue Feb 16, 2024 · 1 comment
maedoc commented Feb 16, 2024

In another repo we prototyped (much) faster delay calculation. The kernels are not particularly complicated, and everything else can be written in Jax without problems. However, Jax relies on pybind11 and a fair amount of machinery to register new primitives, so the binding code would probably end up about as long as the kernels themselves:

// delayed sparse coupling: for each of nv nodes, accumulate weighted, delayed
// neighbour states from a per-node ring buffer of length nh (CSR connectivity)
void delays2(int nv, int nh, int t,
             float *out1, float *out2,
             float *buf, float *weights, int *idelays, int *indices, int *indptr)
{
    // nh is power of two, so x&(nh-1) is faster way to compute x%nh
    int nhm = nh - 1;
    #pragma omp parallel for
    for (int i=0; i<nv; i++)
    {
        // compute coupling terms for both Heun stages
        float acc1 = 0.0f, acc2 = 0.0f;
        #pragma omp simd reduction(+:acc1,acc2)
        for (int j=indptr[i]; j<indptr[i+1]; j++) {
            float *b = buf + indices[j]*nh;
            float w = weights[j];
            int roll_t = nh + t - idelays[j];
            acc1 += w * b[(roll_t+0) & nhm];
            acc2 += w * b[(roll_t+1) & nhm];
        }
        out1[i] = acc1;
        out2[i] = acc2;
    }
}

// variant which updates the buf with current state
void delays2_upbuf(int nv, int nh, int t,
             float *out1, float *out2,
             float *buf, float *weights, int *idelays, int *indices, int *indptr,
             float *x)
{
    // nh is power of two, so x&(nh-1) is faster way to compute x%nh
    int nhm = nh - 1;
    #pragma omp parallel for
    for (int i=0; i<nv; i++)
    {
        // update buffer
        buf[i*nh + ((nh + t) & nhm)] = x[i];
        // compute coupling terms for both Heun stages
        float acc1 = 0.0f, acc2 = 0.0f;
        #pragma omp simd reduction(+:acc1,acc2)
        for (int j=indptr[i]; j<indptr[i+1]; j++) {
            float *b = buf + indices[j]*nh;
            float w = weights[j];
            int roll_t = nh + t - idelays[j];
            acc1 += w * b[(roll_t+0) & nhm];
            acc2 += w * b[(roll_t+1) & nhm];
        }
        out1[i] = acc1;
        out2[i] = acc2;
    }
}
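For context, a minimal sketch of how delays2_upbuf might be driven from a time-stepping loop; the outer function and heun_step are hypothetical, not part of the prototyped code, and assume the CSR arrays and idelays were prepared elsewhere with nh a power of two larger than the maximum delay:

// sketch of a driver loop (run and heun_step are illustrative names only)
void run(int nv, int nh, int nt,
         float *buf, float *weights, int *idelays, int *indices, int *indptr,
         float *x, float *c1, float *c2)
{
    for (int t = 0; t < nt; t++)
    {
        // write the current state into the ring buffer and compute the
        // delayed coupling terms for both Heun stages in one pass
        delays2_upbuf(nv, nh, t, c1, c2, buf, weights, idelays, indices, indptr, x);
        // hypothetical model update using c1 (predictor) and c2 (corrector)
        heun_step(nv, x, c1, c2);
    }
}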
maedoc commented Feb 16, 2024

These kernels don't operate on batches, which would be even better/faster.
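A hedged sketch of what a batched variant could look like, assuming a layout where the nb batch elements of each (node, time) sample are contiguous in buf and out; this is only an illustration, not the prototyped code:

// possible batched layout: buf[(node*nh + time)*nb + batch], out[node*nb + batch];
// the innermost batch loop is contiguous and vectorizes well
void delays2_batched(int nb, int nv, int nh, int t,
                     float *out1, float *out2,
                     float *buf, float *weights, int *idelays, int *indices, int *indptr)
{
    int nhm = nh - 1;
    #pragma omp parallel for
    for (int i = 0; i < nv; i++)
    {
        for (int b = 0; b < nb; b++)
            out1[i*nb + b] = out2[i*nb + b] = 0.0f;
        for (int j = indptr[i]; j < indptr[i+1]; j++)
        {
            float w = weights[j];
            int roll_t = nh + t - idelays[j];
            // batch-contiguous slices at the two delayed time points
            float *b1 = buf + (indices[j]*nh + ((roll_t+0) & nhm)) * nb;
            float *b2 = buf + (indices[j]*nh + ((roll_t+1) & nhm)) * nb;
            #pragma omp simd
            for (int b = 0; b < nb; b++)
            {
                out1[i*nb + b] += w * b1[b];
                out2[i*nb + b] += w * b2[b];
            }
        }
    }
}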
