Intro

Optical communication has evolved tremendously from the systems based on intensity modulation from just 20 years ago. A modern system will use any number of techniques largely adopted from wireless communications, such as coherent (QAM) modulation and pulse shaping, to boost the spectral efficiency and throughput of the transmission.

Digital signal processing (DSP) has become a crucial component of these systems to compensate for the impairments which these signals inevitably experience upon propagation, and understanding of the required signal processing is extremely important for running state-of-the-art experiments. While real systems employ custom designed ASICs to cope with the high symbol rates which easily exceed 40 Gbaud, in research and development people largely use offline processing where data is captured using an ultra-fast real-time oscilloscope, such as this or this, and process the data using matlab typically.

I am a big Python fan, so the natural thing to do was write our DSP in Python. However, we are typically capture and process large amounts of data (TB are not uncommon) and when running experiments you want quick feedback so you can see the result of a change or adjustment. Therefore you really want to optimise your code for speed. Python as well as Matlab are not fast enough by themselves. At the same time, I want to avoid rewriting in C or C++, because we are scientists first not programmers and I want the PhD students working on this to understand the code without learning C or C++.

In this notebook we are going explore how to improve performance using cython

Constant Modulus Algorithm

The problem that we test here is a very common function in communications and optical communications; the adaptive equaliser, which is essentially training an adaptive FIR filter to compensate for channel impairments. The problem is the following:

Consider a transmitted signal of independently and identically distributed symbols $a(n)$ transmitted through a channel modelled by an FIR filter and addititve white Gaussian noise (AWGN). The received signal can be expressed as:

$$ x(n) = \mathbf{h^H} \mathbf{a}(n) + v(n) $$

where $\mathbf{a}(n) = [a(n), a(n-1), \dots, (a(n-N+1)]^T$. $\mathbf{h} = [h_0, h_1, \dots, h_{N-1}]^T$ is the vector of the impulse response of the channel, N is the channel length (or memory) and $v(n)$ the additive noise. Note that $x, h, a$ and $v$ are in general complex.

The objective is to find a receiver filter that minimizes inter symbol interference (ISI) and therefore compensates for the channel impairments. The equaliser output then becomes: $$ y(n) = \mathbf{w}^H \mathbf{x}(n) $$ where $\mathbf{w} = [w_0(n), w_1(n), \dots, w_{N-1}(n)]^T$ is the tap weights of the equaliser impulse response and $\mathbf{x}(n)=[x(n), x(n-1), \dots, x(n-N+1)]^T$ is the equaliser input.

One of the most famous adaptive blind (without knowledge of the transmitted signal) equalisers is the so-called constant-modulus algorithm (CMA) [1]. It is designed for PSK modulation but is also used as a initial preconvergence algorithm for higher-order QAM signals. The algorithm takes advantage of the fact that the modulus of PSK modulation is a constant, or in other words that PSK modulated symbols lay on a circle in the IQ-plane. The cost function of our equaliser is therefore: $$ J(\mathbf{w})=E \left[ (R-|y(n)|^2)^2] \right] $$ where $R$ is the radius of the PSK modulation. Using stochastic gradient decent we get an filter update function:

$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu y(n)(R-|y(n)|^2) \mathbf{x}^\ast(n) $$

Here $\mu$ is a step size parameter.

In optics the channel includes polarisation and therefore we need to extend the algorithm to a $2\times2$ MIMO which is quite straight forward (see e.g. [2])

References

[1] D. Godard, “Self-recovering equalization and carrier tracking in two dimensional data communication systems,” IEEE Trans. Commun., vol. 28, pp. 1867–1875 (1980).

[2] Seb J. Savory, “Digital filters for coherent optical receivers,” Opt. Express, vol. 16, pp. 804-817 (2008)

Simulations and optimizations

To concentrate on the implementation of the CMA algorithm we will use QAMpy for the signal generation and simulation of our system. QAMpy is a is a optical communications DSP chain and simulation toolbox we have written for simulating and processing optical communications signals. You can find it on github here

Initialise packages

Let’s first import all relevant packages

In [1]:
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
from qampy import signals, impairments, equalisation
import cython 
from timeit import default_timer as timer
%load_ext cython

Numpy

To establish a performance baseline we are not going to implement the algorithm in pure python, but instead go directly to numpy.

We are going to separate the CMA algorithm into two functions. A function which applys the filter to the signal and the actual equaliser training, which loops over both polarisations and all samples of the signal.

Apply filter

Here is the function that applies the filter to the signal. We’d like to make this as fast as possible because it’s being called a lot of times. Note that we could also implement this as np.sum(E*wx.conj()) but the below implementation is actually quite a bit faster.

In [2]:
def apply_filter_npy(E,wx):
    Xest = 0 +0j
    for i in range(E.shape[0]):
        Xest += np.dot(E[i], wx[i].conj().T)
    return Xest

Equaliser training

Here is the training function for the equaliser. It takes 5 parameters the 2d complex signal, the initial equaliser taps (3d) and the step-size mu, radius R (both float constants) and the oversampling factor . Note how we are training the filter for both polarisations in the outer loop.

In [3]:
def cma_numpy(E, wxy, mu, R, os):
    pols, L = E.shape
    assert wxy.shape[0] == pols
    assert wxy.shape[1] == pols
    ntaps = wxy.shape[-1]
    N = (L//os//ntaps-1)*ntaps
    err = np.zeros((pols, L), dtype=np.complex128)
    for k in range(pols):
        for i in range(N):
            X = E[:, i*os:i*os+ntaps]
            Xest = apply_filter_npy(X, wxy[k])
            err[k,i] = (R-abs(Xest)**2)*Xest
            wxy[k] += mu*np.conj(err[k,i])*X
    return err, wxy

Simulation

Ok lets set up the simulation environment. We are using functions from QAMpy to make signal generation and impairments easier. Feel free to check out the documentation.

First we set set up the signal which is a dual-polarisation signal with 100,000 symbols and a symbolrate of 10 Gbaud.

In [4]:
s = signals.SignalQAMGrayCoded(4, 10**5, nmodes=2, fb=10e9)

We are making it easy for us and are only simulating a signal with gaussian noise and pulse-shaping. Thus the adaptive filter is essentially just doing a match-filter. The following code $2\times$ oversamples the signal and applies a root-raised cosine pulse-shape with 1% roll-off. We then change the SNR to 10 dB.

In [5]:
s2 = s.resample(s.fb*2, beta=0.01)
s3 =  impairments.change_snr(s2, 10)

Let have a look at the constellation diagram.

In [6]:
plt.figure()
plt.hist2d(s3[0,::2].real, s3[0,::2].imag, bins=200)
plt.show()

What we see here is a combination of the SNR and the ISI from the lack of matched filtering for the pulse-shaping. We can also calculate the BER.

In [7]:
s3[:,::2].copy().cal_ber()
Out[7]:
array([0.01266 , 0.012585])

This code initialises the filter taps for x and y polarisations to be orthogonal. We are using a filter with 21 taps here and the filter has to have the shape (2,2,21).

In [8]:
wxy0 = np.zeros((2,2,21), dtype=np.complex128)
wxy0[0][0][21//2] =1
wxy0[1][1][21//2] = 1

So let’s time the performance and check the error rate after the filter.

In [9]:
wxyin = wxy0.copy() # we are copying the initial taps so we don't overwrite them
start = timer()
err, wxy = cma_numpy(s3, wxyin, 1e-3, 1, s3.os) 
end=timer()
t_numpy = end-start
print("Numpy time = {:.3f}".format(t_numpy))
out = equalisation.apply_filter(s3, wxy)
print("BER = {}".format(out.cal_ber()))
Numpy time = 5.484
BER = [0.00109511 0.00123012]

Numpy took about 5.5 s for processing the 100,000 symbols. We can also see that the adaptive filter improved performance by a factor of 10.

Generally 100,000 symbols is on the lower end of what we measure and 5 s is quite a wait so we really want to go faster than this.

Cython

The most well known way to improve python performance is to use cython which is a superset of the python language and can be compiled straight to C yielding some impressive performance improvements. The performance critical parts of QAMpy are written in cython at the moment. Below is the straight direct translation of the above numpy code into cython, note we have made the apply_filter function a C-only function (indicated by the cdef), because we don’t need to call it from python. We also disabled array boundary checks and wraparound which speeds up loops significantly.

In [10]:
%%cython
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
cdef double complex apply_filter_pyx(np.ndarray[np.complex128_t, ndim=2] E, np.ndarray[np.complex128_t, ndim=2] wx):
    cdef int j, k, M, N
    cdef double complex Xest =0
    M = E.shape[0]
    for i in range(M):
        Xest += np.dot(E[i], np.conj(wx[i]))
    return Xest


@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def cma_pyx1(np.ndarray[np.complex128_t, ndim=2] E, np.ndarray[np.complex128_t, ndim=3] wxy, double mu, double R, int os):
    cdef int i,j,k,pols, N, ntaps
    cdef np.ndarray[np.complex128_t, ndim=2] err
    cdef np.ndarray[np.complex128_t, ndim=2] X
    cdef double complex Xest
    ntaps = wxy.shape[2]
    pols = E.shape[0]
    L = E.shape[1]
    N = (L//os//ntaps-1)*ntaps
    err = np.zeros((pols, L), dtype=np.complex128)
    for k in range(pols):
        for i in range(N):
            X = E[:, i*os:i*os+ntaps]
            Xest = apply_filter_pyx(X, wxy[k])
            err[k,i] = (R-abs(Xest)**2)*Xest
            wxy[k] += mu*np.conj(err[k,i])*X
    return err,wxy

Let us see if that improved performance

In [11]:
start = timer()
err, wxy = cma_pyx1(s3, wxyin, 1e-3, 1, s3.os)
end=timer()
t_cython1 = end-start
print("Cython initial time = {:.3f}".format(t_cython1))
out = equalisation.apply_filter(s3, wxy)
print("BER = {}".format(out.cal_ber()))
Cython initial time = 4.651
BER = [0.00122512 0.00122512]

This is rather disappointing. It took 4.7s on my maching so only a very modest speedup.

Cython annotate

Let’s have a look why. Cython offers -a commandline switch which outputs an annotated html file highlighting the sections which contain calls to python code. Those sections are in general the bottlenecks and you want to eliminate as many as possible of those, so that the code runs at pure C-code speed.

In [12]:
%%cython -a
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
cdef double complex apply_filter_pyx(np.ndarray[np.complex128_t, ndim=2] E, np.ndarray[np.complex128_t, ndim=2] wx):
    cdef int j, k, M, N
    cdef double complex Xest =0
    M = E.shape[0]
    for i in range(M):
        Xest += np.dot(E[i], np.conj(wx[i]))
    return Xest


@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def cma_pyx1(np.ndarray[np.complex128_t, ndim=2] E, np.ndarray[np.complex128_t, ndim=3] wxy, double mu, double R, int os):
    cdef int i,j,k,pols, N, ntaps
    cdef np.ndarray[np.complex128_t, ndim=2] err
    cdef np.ndarray[np.complex128_t, ndim=2] X
    cdef double complex Xest
    ntaps = wxy.shape[2]
    pols = E.shape[0]
    L = E.shape[1]
    N = (L//os//ntaps-1)*ntaps
    err = np.zeros((pols, L), dtype=np.complex128)
    for k in range(pols):
        for i in range(N):
            X = E[:, i*os:i*os+ntaps]
            Xest = apply_filter_pyx(X, wxy[k])
            err[k,i] = (R-abs(Xest)**2)*Xest
            wxy[k] += mu*np.conj(err[k,i])*X
    return err,wxy
Out[12]:
Cython: _cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec.pyx

Generated by Cython 0.29.20

Yellow lines hint at Python interaction.
Click on a line that starts with a “+” to see the C code that Cython generated for it.

+01: import numpy as np
  __pyx_t_1 = __Pyx_Import(__pyx_n_s_numpy, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_1) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
/* … */
  __pyx_t_1 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_1) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 02: cimport numpy as np
 03: import cython
 04: 
 05: @cython.boundscheck(False) # turn off bounds-checking for entire function
 06: @cython.wraparound(False)  # turn off negative index wrapping for entire function
+07: cdef double complex apply_filter_pyx(np.ndarray[np.complex128_t, ndim=2] E, np.ndarray[np.complex128_t, ndim=2] wx):
static __pyx_t_double_complex __pyx_f_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_apply_filter_pyx(PyArrayObject *__pyx_v_E, PyArrayObject *__pyx_v_wx) {
  int __pyx_v_M;
  __pyx_t_double_complex __pyx_v_Xest;
  int __pyx_v_i;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_E;
  __Pyx_Buffer __pyx_pybuffer_E;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_wx;
  __Pyx_Buffer __pyx_pybuffer_wx;
  __pyx_t_double_complex __pyx_r;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("apply_filter_pyx", 0);
  __pyx_pybuffer_E.pybuffer.buf = NULL;
  __pyx_pybuffer_E.refcount = 0;
  __pyx_pybuffernd_E.data = NULL;
  __pyx_pybuffernd_E.rcbuffer = &__pyx_pybuffer_E;
  __pyx_pybuffer_wx.pybuffer.buf = NULL;
  __pyx_pybuffer_wx.refcount = 0;
  __pyx_pybuffernd_wx.data = NULL;
  __pyx_pybuffernd_wx.rcbuffer = &__pyx_pybuffer_wx;
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_E.rcbuffer->pybuffer, (PyObject*)__pyx_v_E, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 2, 0, __pyx_stack) == -1)) __PYX_ERR(0, 7, __pyx_L1_error)
  }
  __pyx_pybuffernd_E.diminfo[0].strides = __pyx_pybuffernd_E.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_E.diminfo[0].shape = __pyx_pybuffernd_E.rcbuffer->pybuffer.shape[0]; __pyx_pybuffernd_E.diminfo[1].strides = __pyx_pybuffernd_E.rcbuffer->pybuffer.strides[1]; __pyx_pybuffernd_E.diminfo[1].shape = __pyx_pybuffernd_E.rcbuffer->pybuffer.shape[1];
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_wx.rcbuffer->pybuffer, (PyObject*)__pyx_v_wx, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 2, 0, __pyx_stack) == -1)) __PYX_ERR(0, 7, __pyx_L1_error)
  }
  __pyx_pybuffernd_wx.diminfo[0].strides = __pyx_pybuffernd_wx.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_wx.diminfo[0].shape = __pyx_pybuffernd_wx.rcbuffer->pybuffer.shape[0]; __pyx_pybuffernd_wx.diminfo[1].strides = __pyx_pybuffernd_wx.rcbuffer->pybuffer.strides[1]; __pyx_pybuffernd_wx.diminfo[1].shape = __pyx_pybuffernd_wx.rcbuffer->pybuffer.shape[1];
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_XDECREF(__pyx_t_6);
  __Pyx_XDECREF(__pyx_t_7);
  __Pyx_XDECREF(__pyx_t_8);
  __Pyx_XDECREF(__pyx_t_9);
  __Pyx_XDECREF(__pyx_t_10);
  __Pyx_XDECREF(__pyx_t_11);
  { PyObject *__pyx_type, *__pyx_value, *__pyx_tb;
    __Pyx_PyThreadState_declare
    __Pyx_PyThreadState_assign
    __Pyx_ErrFetch(&__pyx_type, &__pyx_value, &__pyx_tb);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_E.rcbuffer->pybuffer);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_wx.rcbuffer->pybuffer);
  __Pyx_ErrRestore(__pyx_type, __pyx_value, __pyx_tb);}
  __Pyx_WriteUnraisable("_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec.apply_filter_pyx", __pyx_clineno, __pyx_lineno, __pyx_filename, 1, 0);
  __pyx_r = __pyx_t_double_complex_from_parts(0, 0);
  goto __pyx_L2;
  __pyx_L0:;
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_E.rcbuffer->pybuffer);
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_wx.rcbuffer->pybuffer);
  __pyx_L2:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
 08:     cdef int j, k, M, N
+09:     cdef double complex Xest =0
  __pyx_v_Xest = __pyx_t_double_complex_from_parts(0, 0);
+10:     M = E.shape[0]
  __pyx_v_M = (__pyx_v_E->dimensions[0]);
+11:     for i in range(M):
  __pyx_t_1 = __pyx_v_M;
  __pyx_t_2 = __pyx_t_1;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_i = __pyx_t_3;
+12:         Xest += np.dot(E[i], np.conj(wx[i]))
    __pyx_t_4 = __pyx_PyComplex_FromComplex(__pyx_v_Xest); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_4);
    __Pyx_GetModuleGlobalName(__pyx_t_6, __pyx_n_s_np); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_6);
    __pyx_t_7 = __Pyx_PyObject_GetAttrStr(__pyx_t_6, __pyx_n_s_dot); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_7);
    __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
    __pyx_t_6 = __Pyx_GetItemInt(((PyObject *)__pyx_v_E), __pyx_v_i, int, 1, __Pyx_PyInt_From_int, 0, 0, 0); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_6);
    __Pyx_GetModuleGlobalName(__pyx_t_9, __pyx_n_s_np); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_9);
    __pyx_t_10 = __Pyx_PyObject_GetAttrStr(__pyx_t_9, __pyx_n_s_conj); if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_10);
    __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
    __pyx_t_9 = __Pyx_GetItemInt(((PyObject *)__pyx_v_wx), __pyx_v_i, int, 1, __Pyx_PyInt_From_int, 0, 0, 0); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_9);
    __pyx_t_11 = NULL;
    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_10))) {
      __pyx_t_11 = PyMethod_GET_SELF(__pyx_t_10);
      if (likely(__pyx_t_11)) {
        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_10);
        __Pyx_INCREF(__pyx_t_11);
        __Pyx_INCREF(function);
        __Pyx_DECREF_SET(__pyx_t_10, function);
      }
    }
    __pyx_t_8 = (__pyx_t_11) ? __Pyx_PyObject_Call2Args(__pyx_t_10, __pyx_t_11, __pyx_t_9) : __Pyx_PyObject_CallOneArg(__pyx_t_10, __pyx_t_9);
    __Pyx_XDECREF(__pyx_t_11); __pyx_t_11 = 0;
    __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
    if (unlikely(!__pyx_t_8)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_8);
    __Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
    __pyx_t_10 = NULL;
    __pyx_t_12 = 0;
    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_7))) {
      __pyx_t_10 = PyMethod_GET_SELF(__pyx_t_7);
      if (likely(__pyx_t_10)) {
        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_7);
        __Pyx_INCREF(__pyx_t_10);
        __Pyx_INCREF(function);
        __Pyx_DECREF_SET(__pyx_t_7, function);
        __pyx_t_12 = 1;
      }
    }
    #if CYTHON_FAST_PYCALL
    if (PyFunction_Check(__pyx_t_7)) {
      PyObject *__pyx_temp[3] = {__pyx_t_10, __pyx_t_6, __pyx_t_8};
      __pyx_t_5 = __Pyx_PyFunction_FastCall(__pyx_t_7, __pyx_temp+1-__pyx_t_12, 2+__pyx_t_12); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_XDECREF(__pyx_t_10); __pyx_t_10 = 0;
      __Pyx_GOTREF(__pyx_t_5);
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
      __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
    } else
    #endif
    #if CYTHON_FAST_PYCCALL
    if (__Pyx_PyFastCFunction_Check(__pyx_t_7)) {
      PyObject *__pyx_temp[3] = {__pyx_t_10, __pyx_t_6, __pyx_t_8};
      __pyx_t_5 = __Pyx_PyCFunction_FastCall(__pyx_t_7, __pyx_temp+1-__pyx_t_12, 2+__pyx_t_12); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_XDECREF(__pyx_t_10); __pyx_t_10 = 0;
      __Pyx_GOTREF(__pyx_t_5);
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
      __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
    } else
    #endif
    {
      __pyx_t_9 = PyTuple_New(2+__pyx_t_12); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_9);
      if (__pyx_t_10) {
        __Pyx_GIVEREF(__pyx_t_10); PyTuple_SET_ITEM(__pyx_t_9, 0, __pyx_t_10); __pyx_t_10 = NULL;
      }
      __Pyx_GIVEREF(__pyx_t_6);
      PyTuple_SET_ITEM(__pyx_t_9, 0+__pyx_t_12, __pyx_t_6);
      __Pyx_GIVEREF(__pyx_t_8);
      PyTuple_SET_ITEM(__pyx_t_9, 1+__pyx_t_12, __pyx_t_8);
      __pyx_t_6 = 0;
      __pyx_t_8 = 0;
      __pyx_t_5 = __Pyx_PyObject_Call(__pyx_t_7, __pyx_t_9, NULL); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 12, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_5);
      __Pyx_DECREF(__pyx_t_9); __pyx_t_9 = 0;
    }
    __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    __pyx_t_7 = PyNumber_InPlaceAdd(__pyx_t_4, __pyx_t_5); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_7);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
    __pyx_t_13 = __Pyx_PyComplex_As___pyx_t_double_complex(__pyx_t_7); if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 12, __pyx_L1_error)
    __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    __pyx_v_Xest = __pyx_t_13;
  }
+13:     return Xest
  __pyx_r = __pyx_v_Xest;
  goto __pyx_L0;
 14: 
 15: 
 16: @cython.boundscheck(False) # turn off bounds-checking for entire function
 17: @cython.wraparound(False)  # turn off negative index wrapping for entire function
+18: def cma_pyx1(np.ndarray[np.complex128_t, ndim=2] E, np.ndarray[np.complex128_t, ndim=3] wxy, double mu, double R, int os):
/* Python wrapper */
static PyObject *__pyx_pw_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_1cma_pyx1(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static PyMethodDef __pyx_mdef_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_1cma_pyx1 = {"cma_pyx1", (PyCFunction)(void*)(PyCFunctionWithKeywords)__pyx_pw_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_1cma_pyx1, METH_VARARGS|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_1cma_pyx1(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyArrayObject *__pyx_v_E = 0;
  PyArrayObject *__pyx_v_wxy = 0;
  double __pyx_v_mu;
  double __pyx_v_R;
  int __pyx_v_os;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("cma_pyx1 (wrapper)", 0);
  {
    static PyObject **__pyx_pyargnames[] = {&__pyx_n_s_E,&__pyx_n_s_wxy,&__pyx_n_s_mu,&__pyx_n_s_R,&__pyx_n_s_os,0};
    PyObject* values[5] = {0,0,0,0,0};
    if (unlikely(__pyx_kwds)) {
      Py_ssize_t kw_args;
      const Py_ssize_t pos_args = PyTuple_GET_SIZE(__pyx_args);
      switch (pos_args) {
        case  5: values[4] = PyTuple_GET_ITEM(__pyx_args, 4);
        CYTHON_FALLTHROUGH;
        case  4: values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
        CYTHON_FALLTHROUGH;
        case  3: values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
        CYTHON_FALLTHROUGH;
        case  2: values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = PyDict_Size(__pyx_kwds);
      switch (pos_args) {
        case  0:
        if (likely((values[0] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_E)) != 0)) kw_args--;
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (likely((values[1] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_wxy)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 5, 5, 1); __PYX_ERR(0, 18, __pyx_L3_error)
        }
        CYTHON_FALLTHROUGH;
        case  2:
        if (likely((values[2] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_mu)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 5, 5, 2); __PYX_ERR(0, 18, __pyx_L3_error)
        }
        CYTHON_FALLTHROUGH;
        case  3:
        if (likely((values[3] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_R)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 5, 5, 3); __PYX_ERR(0, 18, __pyx_L3_error)
        }
        CYTHON_FALLTHROUGH;
        case  4:
        if (likely((values[4] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_os)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 5, 5, 4); __PYX_ERR(0, 18, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_pyargnames, 0, values, pos_args, "cma_pyx1") < 0)) __PYX_ERR(0, 18, __pyx_L3_error)
      }
    } else if (PyTuple_GET_SIZE(__pyx_args) != 5) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
      values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
      values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
      values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
      values[4] = PyTuple_GET_ITEM(__pyx_args, 4);
    }
    __pyx_v_E = ((PyArrayObject *)values[0]);
    __pyx_v_wxy = ((PyArrayObject *)values[1]);
    __pyx_v_mu = __pyx_PyFloat_AsDouble(values[2]); if (unlikely((__pyx_v_mu == (double)-1) && PyErr_Occurred())) __PYX_ERR(0, 18, __pyx_L3_error)
    __pyx_v_R = __pyx_PyFloat_AsDouble(values[3]); if (unlikely((__pyx_v_R == (double)-1) && PyErr_Occurred())) __PYX_ERR(0, 18, __pyx_L3_error)
    __pyx_v_os = __Pyx_PyInt_As_int(values[4]); if (unlikely((__pyx_v_os == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 18, __pyx_L3_error)
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 5, 5, PyTuple_GET_SIZE(__pyx_args)); __PYX_ERR(0, 18, __pyx_L3_error)
  __pyx_L3_error:;
  __Pyx_AddTraceback("_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec.cma_pyx1", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_E), __pyx_ptype_5numpy_ndarray, 1, "E", 0))) __PYX_ERR(0, 18, __pyx_L1_error)
  if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_wxy), __pyx_ptype_5numpy_ndarray, 1, "wxy", 0))) __PYX_ERR(0, 18, __pyx_L1_error)
  __pyx_r = __pyx_pf_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_cma_pyx1(__pyx_self, __pyx_v_E, __pyx_v_wxy, __pyx_v_mu, __pyx_v_R, __pyx_v_os);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_cma_pyx1(CYTHON_UNUSED PyObject *__pyx_self, PyArrayObject *__pyx_v_E, PyArrayObject *__pyx_v_wxy, double __pyx_v_mu, double __pyx_v_R, int __pyx_v_os) {
  int __pyx_v_i;
  int __pyx_v_k;
  int __pyx_v_pols;
  int __pyx_v_N;
  int __pyx_v_ntaps;
  PyArrayObject *__pyx_v_err = 0;
  PyArrayObject *__pyx_v_X = 0;
  __pyx_t_double_complex __pyx_v_Xest;
  PyObject *__pyx_v_L = NULL;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_E;
  __Pyx_Buffer __pyx_pybuffer_E;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_X;
  __Pyx_Buffer __pyx_pybuffer_X;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_err;
  __Pyx_Buffer __pyx_pybuffer_err;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_wxy;
  __Pyx_Buffer __pyx_pybuffer_wxy;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("cma_pyx1", 0);
  __pyx_pybuffer_err.pybuffer.buf = NULL;
  __pyx_pybuffer_err.refcount = 0;
  __pyx_pybuffernd_err.data = NULL;
  __pyx_pybuffernd_err.rcbuffer = &__pyx_pybuffer_err;
  __pyx_pybuffer_X.pybuffer.buf = NULL;
  __pyx_pybuffer_X.refcount = 0;
  __pyx_pybuffernd_X.data = NULL;
  __pyx_pybuffernd_X.rcbuffer = &__pyx_pybuffer_X;
  __pyx_pybuffer_E.pybuffer.buf = NULL;
  __pyx_pybuffer_E.refcount = 0;
  __pyx_pybuffernd_E.data = NULL;
  __pyx_pybuffernd_E.rcbuffer = &__pyx_pybuffer_E;
  __pyx_pybuffer_wxy.pybuffer.buf = NULL;
  __pyx_pybuffer_wxy.refcount = 0;
  __pyx_pybuffernd_wxy.data = NULL;
  __pyx_pybuffernd_wxy.rcbuffer = &__pyx_pybuffer_wxy;
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_E.rcbuffer->pybuffer, (PyObject*)__pyx_v_E, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 2, 0, __pyx_stack) == -1)) __PYX_ERR(0, 18, __pyx_L1_error)
  }
  __pyx_pybuffernd_E.diminfo[0].strides = __pyx_pybuffernd_E.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_E.diminfo[0].shape = __pyx_pybuffernd_E.rcbuffer->pybuffer.shape[0]; __pyx_pybuffernd_E.diminfo[1].strides = __pyx_pybuffernd_E.rcbuffer->pybuffer.strides[1]; __pyx_pybuffernd_E.diminfo[1].shape = __pyx_pybuffernd_E.rcbuffer->pybuffer.shape[1];
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_wxy.rcbuffer->pybuffer, (PyObject*)__pyx_v_wxy, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 3, 0, __pyx_stack) == -1)) __PYX_ERR(0, 18, __pyx_L1_error)
  }
  __pyx_pybuffernd_wxy.diminfo[0].strides = __pyx_pybuffernd_wxy.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_wxy.diminfo[0].shape = __pyx_pybuffernd_wxy.rcbuffer->pybuffer.shape[0]; __pyx_pybuffernd_wxy.diminfo[1].strides = __pyx_pybuffernd_wxy.rcbuffer->pybuffer.strides[1]; __pyx_pybuffernd_wxy.diminfo[1].shape = __pyx_pybuffernd_wxy.rcbuffer->pybuffer.shape[1]; __pyx_pybuffernd_wxy.diminfo[2].strides = __pyx_pybuffernd_wxy.rcbuffer->pybuffer.strides[2]; __pyx_pybuffernd_wxy.diminfo[2].shape = __pyx_pybuffernd_wxy.rcbuffer->pybuffer.shape[2];
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_XDECREF(__pyx_t_6);
  __Pyx_XDECREF(__pyx_t_23);
  { PyObject *__pyx_type, *__pyx_value, *__pyx_tb;
    __Pyx_PyThreadState_declare
    __Pyx_PyThreadState_assign
    __Pyx_ErrFetch(&__pyx_type, &__pyx_value, &__pyx_tb);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_E.rcbuffer->pybuffer);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_X.rcbuffer->pybuffer);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_err.rcbuffer->pybuffer);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_wxy.rcbuffer->pybuffer);
  __Pyx_ErrRestore(__pyx_type, __pyx_value, __pyx_tb);}
  __Pyx_AddTraceback("_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec.cma_pyx1", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  goto __pyx_L2;
  __pyx_L0:;
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_E.rcbuffer->pybuffer);
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_X.rcbuffer->pybuffer);
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_err.rcbuffer->pybuffer);
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_wxy.rcbuffer->pybuffer);
  __pyx_L2:;
  __Pyx_XDECREF((PyObject *)__pyx_v_err);
  __Pyx_XDECREF((PyObject *)__pyx_v_X);
  __Pyx_XDECREF(__pyx_v_L);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__9 = PyTuple_Pack(15, __pyx_n_s_E, __pyx_n_s_wxy, __pyx_n_s_mu, __pyx_n_s_R, __pyx_n_s_os, __pyx_n_s_i, __pyx_n_s_j, __pyx_n_s_k, __pyx_n_s_pols, __pyx_n_s_N, __pyx_n_s_ntaps, __pyx_n_s_err, __pyx_n_s_X, __pyx_n_s_Xest, __pyx_n_s_L); if (unlikely(!__pyx_tuple__9)) __PYX_ERR(0, 18, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__9);
  __Pyx_GIVEREF(__pyx_tuple__9);
/* … */
  __pyx_t_1 = PyCFunction_NewEx(&__pyx_mdef_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_1cma_pyx1, NULL, __pyx_n_s_cython_magic_29b52b9ef7f7fc3ece); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 18, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_cma_pyx1, __pyx_t_1) < 0) __PYX_ERR(0, 18, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 19:     cdef int i,j,k,pols, N, ntaps
 20:     cdef np.ndarray[np.complex128_t, ndim=2] err
 21:     cdef np.ndarray[np.complex128_t, ndim=2] X
 22:     cdef double complex Xest
+23:     ntaps = wxy.shape[2]
  __pyx_v_ntaps = (__pyx_v_wxy->dimensions[2]);
+24:     pols = E.shape[0]
  __pyx_v_pols = (__pyx_v_E->dimensions[0]);
+25:     L = E.shape[1]
  __pyx_t_1 = __Pyx_PyInt_From_Py_intptr_t((__pyx_v_E->dimensions[1])); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 25, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_v_L = __pyx_t_1;
  __pyx_t_1 = 0;
+26:     N = (L//os//ntaps-1)*ntaps
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_os); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_2 = PyNumber_FloorDivide(__pyx_v_L, __pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_ntaps); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_3 = PyNumber_FloorDivide(__pyx_t_2, __pyx_t_1); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyInt_SubtractObjC(__pyx_t_3, __pyx_int_1, 1, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = __Pyx_PyInt_From_int(__pyx_v_ntaps); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_2 = PyNumber_Multiply(__pyx_t_1, __pyx_t_3); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_4 = __Pyx_PyInt_As_int(__pyx_t_2); if (unlikely((__pyx_t_4 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_v_N = __pyx_t_4;
+27:     err = np.zeros((pols, L), dtype=np.complex128)
  __Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_zeros); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_2 = __Pyx_PyInt_From_int(__pyx_v_pols); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_1 = PyTuple_New(2); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_GIVEREF(__pyx_t_2);
  PyTuple_SET_ITEM(__pyx_t_1, 0, __pyx_t_2);
  __Pyx_INCREF(__pyx_v_L);
  __Pyx_GIVEREF(__pyx_v_L);
  PyTuple_SET_ITEM(__pyx_t_1, 1, __pyx_v_L);
  __pyx_t_2 = 0;
  __pyx_t_2 = PyTuple_New(1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_GIVEREF(__pyx_t_1);
  PyTuple_SET_ITEM(__pyx_t_2, 0, __pyx_t_1);
  __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyDict_NewPresized(1); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_GetModuleGlobalName(__pyx_t_5, __pyx_n_s_np); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  __pyx_t_6 = __Pyx_PyObject_GetAttrStr(__pyx_t_5, __pyx_n_s_complex128); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
  if (PyDict_SetItem(__pyx_t_1, __pyx_n_s_dtype, __pyx_t_6) < 0) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
  __pyx_t_6 = __Pyx_PyObject_Call(__pyx_t_3, __pyx_t_2, __pyx_t_1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 27, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  if (!(likely(((__pyx_t_6) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_6, __pyx_ptype_5numpy_ndarray))))) __PYX_ERR(0, 27, __pyx_L1_error)
  __pyx_t_7 = ((PyArrayObject *)__pyx_t_6);
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_err.rcbuffer->pybuffer);
    __pyx_t_4 = __Pyx_GetBufferAndValidate(&__pyx_pybuffernd_err.rcbuffer->pybuffer, (PyObject*)__pyx_t_7, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES| PyBUF_WRITABLE, 2, 0, __pyx_stack);
    if (unlikely(__pyx_t_4 < 0)) {
      PyErr_Fetch(&__pyx_t_8, &__pyx_t_9, &__pyx_t_10);
      if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_err.rcbuffer->pybuffer, (PyObject*)__pyx_v_err, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES| PyBUF_WRITABLE, 2, 0, __pyx_stack) == -1)) {
        Py_XDECREF(__pyx_t_8); Py_XDECREF(__pyx_t_9); Py_XDECREF(__pyx_t_10);
        __Pyx_RaiseBufferFallbackError();
      } else {
        PyErr_Restore(__pyx_t_8, __pyx_t_9, __pyx_t_10);
      }
      __pyx_t_8 = __pyx_t_9 = __pyx_t_10 = 0;
    }
    __pyx_pybuffernd_err.diminfo[0].strides = __pyx_pybuffernd_err.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_err.diminfo[0].shape = __pyx_pybuffernd_err.rcbuffer->pybuffer.shape[0]; __pyx_pybuffernd_err.diminfo[1].strides = __pyx_pybuffernd_err.rcbuffer->pybuffer.strides[1]; __pyx_pybuffernd_err.diminfo[1].shape = __pyx_pybuffernd_err.rcbuffer->pybuffer.shape[1];
    if (unlikely(__pyx_t_4 < 0)) __PYX_ERR(0, 27, __pyx_L1_error)
  }
  __pyx_t_7 = 0;
  __pyx_v_err = ((PyArrayObject *)__pyx_t_6);
  __pyx_t_6 = 0;
+28:     for k in range(pols):
  __pyx_t_4 = __pyx_v_pols;
  __pyx_t_11 = __pyx_t_4;
  for (__pyx_t_12 = 0; __pyx_t_12 < __pyx_t_11; __pyx_t_12+=1) {
    __pyx_v_k = __pyx_t_12;
+29:         for i in range(N):
    __pyx_t_13 = __pyx_v_N;
    __pyx_t_14 = __pyx_t_13;
    for (__pyx_t_15 = 0; __pyx_t_15 < __pyx_t_14; __pyx_t_15+=1) {
      __pyx_v_i = __pyx_t_15;
+30:             X = E[:, i*os:i*os+ntaps]
      __pyx_t_6 = __Pyx_PyInt_From_int((__pyx_v_i * __pyx_v_os)); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 30, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_6);
      __pyx_t_1 = __Pyx_PyInt_From_int(((__pyx_v_i * __pyx_v_os) + __pyx_v_ntaps)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 30, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_1);
      __pyx_t_2 = PySlice_New(__pyx_t_6, __pyx_t_1, Py_None); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 30, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_2);
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
      __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
      __pyx_t_1 = PyTuple_New(2); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 30, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_1);
      __Pyx_INCREF(__pyx_slice_);
      __Pyx_GIVEREF(__pyx_slice_);
      PyTuple_SET_ITEM(__pyx_t_1, 0, __pyx_slice_);
      __Pyx_GIVEREF(__pyx_t_2);
      PyTuple_SET_ITEM(__pyx_t_1, 1, __pyx_t_2);
      __pyx_t_2 = 0;
      __pyx_t_2 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_E), __pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 30, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_2);
      __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
      if (!(likely(((__pyx_t_2) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_2, __pyx_ptype_5numpy_ndarray))))) __PYX_ERR(0, 30, __pyx_L1_error)
      __pyx_t_16 = ((PyArrayObject *)__pyx_t_2);
      {
        __Pyx_BufFmt_StackElem __pyx_stack[1];
        __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_X.rcbuffer->pybuffer);
        __pyx_t_17 = __Pyx_GetBufferAndValidate(&__pyx_pybuffernd_X.rcbuffer->pybuffer, (PyObject*)__pyx_t_16, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 2, 0, __pyx_stack);
        if (unlikely(__pyx_t_17 < 0)) {
          PyErr_Fetch(&__pyx_t_10, &__pyx_t_9, &__pyx_t_8);
          if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_X.rcbuffer->pybuffer, (PyObject*)__pyx_v_X, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 2, 0, __pyx_stack) == -1)) {
            Py_XDECREF(__pyx_t_10); Py_XDECREF(__pyx_t_9); Py_XDECREF(__pyx_t_8);
            __Pyx_RaiseBufferFallbackError();
          } else {
            PyErr_Restore(__pyx_t_10, __pyx_t_9, __pyx_t_8);
          }
          __pyx_t_10 = __pyx_t_9 = __pyx_t_8 = 0;
        }
        __pyx_pybuffernd_X.diminfo[0].strides = __pyx_pybuffernd_X.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_X.diminfo[0].shape = __pyx_pybuffernd_X.rcbuffer->pybuffer.shape[0]; __pyx_pybuffernd_X.diminfo[1].strides = __pyx_pybuffernd_X.rcbuffer->pybuffer.strides[1]; __pyx_pybuffernd_X.diminfo[1].shape = __pyx_pybuffernd_X.rcbuffer->pybuffer.shape[1];
        if (unlikely(__pyx_t_17 < 0)) __PYX_ERR(0, 30, __pyx_L1_error)
      }
      __pyx_t_16 = 0;
      __Pyx_XDECREF_SET(__pyx_v_X, ((PyArrayObject *)__pyx_t_2));
      __pyx_t_2 = 0;
/* … */
  __pyx_slice_ = PySlice_New(Py_None, Py_None, Py_None); if (unlikely(!__pyx_slice_)) __PYX_ERR(0, 30, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_slice_);
  __Pyx_GIVEREF(__pyx_slice_);
+31:             Xest = apply_filter_pyx(X, wxy[k])
      __pyx_t_2 = __Pyx_GetItemInt(((PyObject *)__pyx_v_wxy), __pyx_v_k, int, 1, __Pyx_PyInt_From_int, 0, 0, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 31, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_2);
      if (!(likely(((__pyx_t_2) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_2, __pyx_ptype_5numpy_ndarray))))) __PYX_ERR(0, 31, __pyx_L1_error)
      __pyx_v_Xest = __pyx_f_46_cython_magic_29b52b9ef7f7fc3ece7fd1d8438167ec_apply_filter_pyx(((PyArrayObject *)__pyx_v_X), ((PyArrayObject *)__pyx_t_2));
      __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
+32:             err[k,i] = (R-abs(Xest)**2)*Xest
      __pyx_t_18 = __pyx_v_k;
      __pyx_t_19 = __pyx_v_i;
      *__Pyx_BufPtrStrided2d(__pyx_t_double_complex *, __pyx_pybuffernd_err.rcbuffer->pybuffer.buf, __pyx_t_18, __pyx_pybuffernd_err.diminfo[0].strides, __pyx_t_19, __pyx_pybuffernd_err.diminfo[1].strides) = __Pyx_c_prod_double(__pyx_t_double_complex_from_parts((__pyx_v_R - pow(__Pyx_c_abs_double(__pyx_v_Xest), 2.0)), 0), __pyx_v_Xest);
+33:             wxy[k] += mu*np.conj(err[k,i])*X
      __pyx_t_17 = __pyx_v_k;
      __pyx_t_2 = __Pyx_GetItemInt(((PyObject *)__pyx_v_wxy), __pyx_t_17, int, 1, __Pyx_PyInt_From_int, 0, 0, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_2);
      __pyx_t_1 = PyFloat_FromDouble(__pyx_v_mu); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_1);
      __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_3);
      __pyx_t_5 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_conj); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_5);
      __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
      __pyx_t_20 = __pyx_v_k;
      __pyx_t_21 = __pyx_v_i;
      __pyx_t_22 = (*__Pyx_BufPtrStrided2d(__pyx_t_double_complex *, __pyx_pybuffernd_err.rcbuffer->pybuffer.buf, __pyx_t_20, __pyx_pybuffernd_err.diminfo[0].strides, __pyx_t_21, __pyx_pybuffernd_err.diminfo[1].strides));
      __pyx_t_3 = __pyx_PyComplex_FromComplex(__pyx_t_22); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_3);
      __pyx_t_23 = NULL;
      if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_5))) {
        __pyx_t_23 = PyMethod_GET_SELF(__pyx_t_5);
        if (likely(__pyx_t_23)) {
          PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_5);
          __Pyx_INCREF(__pyx_t_23);
          __Pyx_INCREF(function);
          __Pyx_DECREF_SET(__pyx_t_5, function);
        }
      }
      __pyx_t_6 = (__pyx_t_23) ? __Pyx_PyObject_Call2Args(__pyx_t_5, __pyx_t_23, __pyx_t_3) : __Pyx_PyObject_CallOneArg(__pyx_t_5, __pyx_t_3);
      __Pyx_XDECREF(__pyx_t_23); __pyx_t_23 = 0;
      __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
      if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_6);
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      __pyx_t_5 = PyNumber_Multiply(__pyx_t_1, __pyx_t_6); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_5);
      __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
      __pyx_t_6 = PyNumber_Multiply(__pyx_t_5, ((PyObject *)__pyx_v_X)); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_6);
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      __pyx_t_5 = PyNumber_InPlaceAdd(__pyx_t_2, __pyx_t_6); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_5);
      __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
      if (unlikely(__Pyx_SetItemInt(((PyObject *)__pyx_v_wxy), __pyx_t_17, __pyx_t_5, int, 1, __Pyx_PyInt_From_int, 0, 0, 0) < 0)) __PYX_ERR(0, 33, __pyx_L1_error)
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
    }
  }
+34:     return err,wxy
  __Pyx_XDECREF(__pyx_r);
  __pyx_t_5 = PyTuple_New(2); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 34, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  __Pyx_INCREF(((PyObject *)__pyx_v_err));
  __Pyx_GIVEREF(((PyObject *)__pyx_v_err));
  PyTuple_SET_ITEM(__pyx_t_5, 0, ((PyObject *)__pyx_v_err));
  __Pyx_INCREF(((PyObject *)__pyx_v_wxy));
  __Pyx_GIVEREF(((PyObject *)__pyx_v_wxy));
  PyTuple_SET_ITEM(__pyx_t_5, 1, ((PyObject *)__pyx_v_wxy));
  __pyx_r = __pyx_t_5;
  __pyx_t_5 = 0;
  goto __pyx_L0;

As you can see there are still yellow sections insight the tight loops. To really optimize for speed we want to eliminate those. Let me show you the fully optimized code first and then explain the different bits.

Optimized cython code

In [13]:
%%cython -c=-Ofast -c=-ffast-math -c=-mfpmath=sse -c=-funroll-loops -c=-march=native
import numpy as np
cimport numpy as cnp
import cython

cdef extern from "complex.h" nogil:
    double complex conj(double complex)

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
cdef double complex apply_filter_pyx(double complex[:,:] E, double complex[:,:] wx) nogil:
    cdef int j, k, M, N
    cdef double complex Xest = 0
    M = E.shape[0]
    N = E.shape[1]
    for i in range(M):
        for j in range(N):
            Xest += E[i,j] * conj(wx[i,j])
    return Xest


@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def cma_pyx1(double complex[:,:] E, double complex[:,:,:] wxy, double mu, double R, int os):
    cdef int i,j,l,k,pols, N, ntaps
    cdef double complex[:,:] err
    cdef double complex[:,:] X
    cdef double complex Xest
    ntaps = wxy.shape[2]
    pols = E.shape[0]
    L = E.shape[1]
    N = (L//os//ntaps-1)*ntaps
    err = np.zeros((pols, L), dtype=np.complex128)
    for k in range(pols):
        for i in range(N):
            X = E[:, i*os:i*os+ntaps]
            Xest = apply_filter_pyx(X, wxy[k])
            err[k,i] = (R-abs(Xest)**2)*Xest
            for j in range(pols):
                for l in range(ntaps):
                    wxy[k, j, l] = wxy[k, j, l] + mu * conj(err[k, i])*X[j,l]
    return err,wxy
In [14]:
start = timer()
err, wxy = cma_pyx1(s3, wxyin, 1e-3, 1, s3.os)
end=timer()
t_cython1 = end-start
print("Cython initial time = {:.3f}".format(t_cython1))
out = equalisation.apply_filter(s3, wxy)
print("BER = {}".format(out.cal_ber()))
Cython initial time = 0.029
BER = [0.00140514 0.00122512]

Now this took only about 0.03 seconds on my PC. A speed-up of more than a factor 100. That’s quite sizable! Let’s have a look again at the annotated output.

In [15]:
%%cython -a
import numpy as np
cimport numpy as cnp
import cython

cdef extern from "complex.h" nogil:
    double complex conj(double complex)

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
cdef double complex apply_filter_pyx(double complex[:,:] E, double complex[:,:] wx):
    cdef int j, k, M, N
    cdef double complex Xest =0
    M = E.shape[0]
    N = E.shape[1]
    for i in range(M):
        for j in range(N):
            Xest += E[i,j] * conj(wx[i,j])
    return Xest


@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def cma_pyx1(double complex[:,:] E, double complex[:,:,:] wxy, double mu, double R):
    cdef int i,j,k,pols, N, ntaps,os
    cdef double complex[:,:] err
    cdef double complex[:,:] X
    cdef double complex Xest
    os = 2
    ntaps = wxy.shape[2]
    pols = E.shape[0]
    L = E.shape[1]
    N = (L//os//ntaps-1)*ntaps
    err = np.zeros((pols, L), dtype=np.complex128)
    for k in range(pols):
        for i in range(N):
            X = E[:, i*os:i*os+ntaps]
            Xest = apply_filter_pyx(X, wxy[k])
            err[k,i] = (R-abs(Xest)**2)*Xest
            for j in range(pols):
                for l in range(ntaps):
                    wxy[k, j, l] = wxy[k, j, l] + mu * conj(err[k, i])*X[j,l]
    return err,wxy
Out[15]:
Cython: _cython_magic_fa443b0c003eba56aae88c12b70aa4bf.pyx

Generated by Cython 0.29.20

Yellow lines hint at Python interaction.
Click on a line that starts with a “+” to see the C code that Cython generated for it.

+01: import numpy as np
  __pyx_t_1 = __Pyx_Import(__pyx_n_s_numpy, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_1) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
/* … */
  __pyx_t_1 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_1) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 02: cimport numpy as cnp
 03: import cython
 04: 
 05: cdef extern from "complex.h" nogil:
 06:     double complex conj(double complex)
 07: 
 08: @cython.boundscheck(False) # turn off bounds-checking for entire function
 09: @cython.wraparound(False)  # turn off negative index wrapping for entire function
+10: cdef double complex apply_filter_pyx(double complex[:,:] E, double complex[:,:] wx):
static __pyx_t_double_complex __pyx_f_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_apply_filter_pyx(__Pyx_memviewslice __pyx_v_E, __Pyx_memviewslice __pyx_v_wx) {
  int __pyx_v_j;
  int __pyx_v_M;
  int __pyx_v_N;
  __pyx_t_double_complex __pyx_v_Xest;
  int __pyx_v_i;
  __pyx_t_double_complex __pyx_r;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("apply_filter_pyx", 0);
/* … */
  /* function exit code */
  __pyx_L0:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
 11:     cdef int j, k, M, N
+12:     cdef double complex Xest =0
  __pyx_v_Xest = __pyx_t_double_complex_from_parts(0, 0);
+13:     M = E.shape[0]
  __pyx_v_M = (__pyx_v_E.shape[0]);
+14:     N = E.shape[1]
  __pyx_v_N = (__pyx_v_E.shape[1]);
+15:     for i in range(M):
  __pyx_t_1 = __pyx_v_M;
  __pyx_t_2 = __pyx_t_1;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_i = __pyx_t_3;
+16:         for j in range(N):
    __pyx_t_4 = __pyx_v_N;
    __pyx_t_5 = __pyx_t_4;
    for (__pyx_t_6 = 0; __pyx_t_6 < __pyx_t_5; __pyx_t_6+=1) {
      __pyx_v_j = __pyx_t_6;
+17:             Xest += E[i,j] * conj(wx[i,j])
      __pyx_t_7 = __pyx_v_i;
      __pyx_t_8 = __pyx_v_j;
      __pyx_t_9 = __pyx_v_i;
      __pyx_t_10 = __pyx_v_j;
      __pyx_v_Xest = __Pyx_c_sum_double(__pyx_v_Xest, __Pyx_c_prod_double((*((__pyx_t_double_complex *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_E.data + __pyx_t_7 * __pyx_v_E.strides[0]) ) + __pyx_t_8 * __pyx_v_E.strides[1]) ))), conj((*((__pyx_t_double_complex *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_wx.data + __pyx_t_9 * __pyx_v_wx.strides[0]) ) + __pyx_t_10 * __pyx_v_wx.strides[1]) ))))));
    }
  }
+18:     return Xest
  __pyx_r = __pyx_v_Xest;
  goto __pyx_L0;
 19: 
 20: 
 21: @cython.boundscheck(False) # turn off bounds-checking for entire function
 22: @cython.wraparound(False)  # turn off negative index wrapping for entire function
+23: def cma_pyx1(double complex[:,:] E, double complex[:,:,:] wxy, double mu, double R):
/* Python wrapper */
static PyObject *__pyx_pw_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_1cma_pyx1(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static PyMethodDef __pyx_mdef_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_1cma_pyx1 = {"cma_pyx1", (PyCFunction)(void*)(PyCFunctionWithKeywords)__pyx_pw_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_1cma_pyx1, METH_VARARGS|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_1cma_pyx1(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  __Pyx_memviewslice __pyx_v_E = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_wxy = { 0, 0, { 0 }, { 0 }, { 0 } };
  double __pyx_v_mu;
  double __pyx_v_R;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("cma_pyx1 (wrapper)", 0);
  {
    static PyObject **__pyx_pyargnames[] = {&__pyx_n_s_E,&__pyx_n_s_wxy,&__pyx_n_s_mu,&__pyx_n_s_R,0};
    PyObject* values[4] = {0,0,0,0};
    if (unlikely(__pyx_kwds)) {
      Py_ssize_t kw_args;
      const Py_ssize_t pos_args = PyTuple_GET_SIZE(__pyx_args);
      switch (pos_args) {
        case  4: values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
        CYTHON_FALLTHROUGH;
        case  3: values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
        CYTHON_FALLTHROUGH;
        case  2: values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = PyDict_Size(__pyx_kwds);
      switch (pos_args) {
        case  0:
        if (likely((values[0] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_E)) != 0)) kw_args--;
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (likely((values[1] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_wxy)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 4, 4, 1); __PYX_ERR(0, 23, __pyx_L3_error)
        }
        CYTHON_FALLTHROUGH;
        case  2:
        if (likely((values[2] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_mu)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 4, 4, 2); __PYX_ERR(0, 23, __pyx_L3_error)
        }
        CYTHON_FALLTHROUGH;
        case  3:
        if (likely((values[3] = __Pyx_PyDict_GetItemStr(__pyx_kwds, __pyx_n_s_R)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 4, 4, 3); __PYX_ERR(0, 23, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_pyargnames, 0, values, pos_args, "cma_pyx1") < 0)) __PYX_ERR(0, 23, __pyx_L3_error)
      }
    } else if (PyTuple_GET_SIZE(__pyx_args) != 4) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
      values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
      values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
      values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
    }
    __pyx_v_E = __Pyx_PyObject_to_MemoryviewSlice_dsds___pyx_t_double_complex(values[0], PyBUF_WRITABLE); if (unlikely(!__pyx_v_E.memview)) __PYX_ERR(0, 23, __pyx_L3_error)
    __pyx_v_wxy = __Pyx_PyObject_to_MemoryviewSlice_dsdsds___pyx_t_double_complex(values[1], PyBUF_WRITABLE); if (unlikely(!__pyx_v_wxy.memview)) __PYX_ERR(0, 23, __pyx_L3_error)
    __pyx_v_mu = __pyx_PyFloat_AsDouble(values[2]); if (unlikely((__pyx_v_mu == (double)-1) && PyErr_Occurred())) __PYX_ERR(0, 23, __pyx_L3_error)
    __pyx_v_R = __pyx_PyFloat_AsDouble(values[3]); if (unlikely((__pyx_v_R == (double)-1) && PyErr_Occurred())) __PYX_ERR(0, 23, __pyx_L3_error)
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("cma_pyx1", 1, 4, 4, PyTuple_GET_SIZE(__pyx_args)); __PYX_ERR(0, 23, __pyx_L3_error)
  __pyx_L3_error:;
  __Pyx_AddTraceback("_cython_magic_fa443b0c003eba56aae88c12b70aa4bf.cma_pyx1", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_cma_pyx1(__pyx_self, __pyx_v_E, __pyx_v_wxy, __pyx_v_mu, __pyx_v_R);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_cma_pyx1(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_E, __Pyx_memviewslice __pyx_v_wxy, double __pyx_v_mu, double __pyx_v_R) {
  int __pyx_v_i;
  int __pyx_v_j;
  int __pyx_v_k;
  int __pyx_v_pols;
  int __pyx_v_N;
  int __pyx_v_ntaps;
  int __pyx_v_os;
  __Pyx_memviewslice __pyx_v_err = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_X = { 0, 0, { 0 }, { 0 }, { 0 } };
  __pyx_t_double_complex __pyx_v_Xest;
  PyObject *__pyx_v_L = NULL;
  int __pyx_v_l;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("cma_pyx1", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_XDECREF(__pyx_t_6);
  __PYX_XDEC_MEMVIEW(&__pyx_t_7, 1);
  __Pyx_AddTraceback("_cython_magic_fa443b0c003eba56aae88c12b70aa4bf.cma_pyx1", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XDEC_MEMVIEW(&__pyx_v_err, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_X, 1);
  __Pyx_XDECREF(__pyx_v_L);
  __PYX_XDEC_MEMVIEW(&__pyx_v_E, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_wxy, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__26 = PyTuple_Pack(16, __pyx_n_s_E, __pyx_n_s_wxy, __pyx_n_s_mu, __pyx_n_s_R, __pyx_n_s_i, __pyx_n_s_j, __pyx_n_s_k, __pyx_n_s_pols, __pyx_n_s_N, __pyx_n_s_ntaps, __pyx_n_s_os, __pyx_n_s_err, __pyx_n_s_X, __pyx_n_s_Xest, __pyx_n_s_L, __pyx_n_s_l); if (unlikely(!__pyx_tuple__26)) __PYX_ERR(0, 23, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__26);
  __Pyx_GIVEREF(__pyx_tuple__26);
/* … */
  __pyx_t_1 = PyCFunction_NewEx(&__pyx_mdef_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_1cma_pyx1, NULL, __pyx_n_s_cython_magic_fa443b0c003eba56aa); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 23, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_cma_pyx1, __pyx_t_1) < 0) __PYX_ERR(0, 23, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_codeobj__27 = (PyObject*)__Pyx_PyCode_New(4, 0, 16, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__26, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_home_jschrod_cache_ipython_cyth, __pyx_n_s_cma_pyx1, 23, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__27)) __PYX_ERR(0, 23, __pyx_L1_error)
 24:     cdef int i,j,k,pols, N, ntaps,os
 25:     cdef double complex[:,:] err
 26:     cdef double complex[:,:] X
 27:     cdef double complex Xest
+28:     os = 2
  __pyx_v_os = 2;
+29:     ntaps = wxy.shape[2]
  __pyx_v_ntaps = (__pyx_v_wxy.shape[2]);
+30:     pols = E.shape[0]
  __pyx_v_pols = (__pyx_v_E.shape[0]);
+31:     L = E.shape[1]
  __pyx_t_1 = PyInt_FromSsize_t((__pyx_v_E.shape[1])); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 31, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_v_L = __pyx_t_1;
  __pyx_t_1 = 0;
+32:     N = (L//os//ntaps-1)*ntaps
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_os); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_2 = PyNumber_FloorDivide(__pyx_v_L, __pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_ntaps); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_3 = PyNumber_FloorDivide(__pyx_t_2, __pyx_t_1); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyInt_SubtractObjC(__pyx_t_3, __pyx_int_1, 1, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = __Pyx_PyInt_From_int(__pyx_v_ntaps); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_2 = PyNumber_Multiply(__pyx_t_1, __pyx_t_3); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_4 = __Pyx_PyInt_As_int(__pyx_t_2); if (unlikely((__pyx_t_4 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 32, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_v_N = __pyx_t_4;
+33:     err = np.zeros((pols, L), dtype=np.complex128)
  __Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_zeros); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_2 = __Pyx_PyInt_From_int(__pyx_v_pols); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_1 = PyTuple_New(2); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_GIVEREF(__pyx_t_2);
  PyTuple_SET_ITEM(__pyx_t_1, 0, __pyx_t_2);
  __Pyx_INCREF(__pyx_v_L);
  __Pyx_GIVEREF(__pyx_v_L);
  PyTuple_SET_ITEM(__pyx_t_1, 1, __pyx_v_L);
  __pyx_t_2 = 0;
  __pyx_t_2 = PyTuple_New(1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_GIVEREF(__pyx_t_1);
  PyTuple_SET_ITEM(__pyx_t_2, 0, __pyx_t_1);
  __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyDict_NewPresized(1); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_GetModuleGlobalName(__pyx_t_5, __pyx_n_s_np); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_5);
  __pyx_t_6 = __Pyx_PyObject_GetAttrStr(__pyx_t_5, __pyx_n_s_complex128); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
  if (PyDict_SetItem(__pyx_t_1, __pyx_n_s_dtype, __pyx_t_6) < 0) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
  __pyx_t_6 = __Pyx_PyObject_Call(__pyx_t_3, __pyx_t_2, __pyx_t_1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_7 = __Pyx_PyObject_to_MemoryviewSlice_dsds___pyx_t_double_complex(__pyx_t_6, PyBUF_WRITABLE); if (unlikely(!__pyx_t_7.memview)) __PYX_ERR(0, 33, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
  __pyx_v_err = __pyx_t_7;
  __pyx_t_7.memview = NULL;
  __pyx_t_7.data = NULL;
+34:     for k in range(pols):
  __pyx_t_4 = __pyx_v_pols;
  __pyx_t_8 = __pyx_t_4;
  for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_8; __pyx_t_9+=1) {
    __pyx_v_k = __pyx_t_9;
+35:         for i in range(N):
    __pyx_t_10 = __pyx_v_N;
    __pyx_t_11 = __pyx_t_10;
    for (__pyx_t_12 = 0; __pyx_t_12 < __pyx_t_11; __pyx_t_12+=1) {
      __pyx_v_i = __pyx_t_12;
+36:             X = E[:, i*os:i*os+ntaps]
      __pyx_t_7.data = __pyx_v_E.data;
      __pyx_t_7.memview = __pyx_v_E.memview;
      __PYX_INC_MEMVIEW(&__pyx_t_7, 0);
      __pyx_t_7.shape[0] = __pyx_v_E.shape[0];
__pyx_t_7.strides[0] = __pyx_v_E.strides[0];
    __pyx_t_7.suboffsets[0] = -1;

__pyx_t_13 = -1;
      if (unlikely(__pyx_memoryview_slice_memviewslice(
    &__pyx_t_7,
    __pyx_v_E.shape[1], __pyx_v_E.strides[1], __pyx_v_E.suboffsets[1],
    1,
    1,
    &__pyx_t_13,
    (__pyx_v_i * __pyx_v_os),
    ((__pyx_v_i * __pyx_v_os) + __pyx_v_ntaps),
    0,
    1,
    1,
    0,
    1) < 0))
{
    __PYX_ERR(0, 36, __pyx_L1_error)
}

__PYX_XDEC_MEMVIEW(&__pyx_v_X, 1);
      __pyx_v_X = __pyx_t_7;
      __pyx_t_7.memview = NULL;
      __pyx_t_7.data = NULL;
+37:             Xest = apply_filter_pyx(X, wxy[k])
      __pyx_t_7.data = __pyx_v_wxy.data;
      __pyx_t_7.memview = __pyx_v_wxy.memview;
      __PYX_INC_MEMVIEW(&__pyx_t_7, 0);
      {
    Py_ssize_t __pyx_tmp_idx = __pyx_v_k;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_wxy.strides[0];
        __pyx_t_7.data += __pyx_tmp_idx * __pyx_tmp_stride;
}

__pyx_t_7.shape[0] = __pyx_v_wxy.shape[1];
__pyx_t_7.strides[0] = __pyx_v_wxy.strides[1];
    __pyx_t_7.suboffsets[0] = -1;

__pyx_t_7.shape[1] = __pyx_v_wxy.shape[2];
__pyx_t_7.strides[1] = __pyx_v_wxy.strides[2];
    __pyx_t_7.suboffsets[1] = -1;

__pyx_v_Xest = __pyx_f_46_cython_magic_fa443b0c003eba56aae88c12b70aa4bf_apply_filter_pyx(__pyx_v_X, __pyx_t_7);
      __PYX_XDEC_MEMVIEW(&__pyx_t_7, 1);
      __pyx_t_7.memview = NULL;
      __pyx_t_7.data = NULL;
+38:             err[k,i] = (R-abs(Xest)**2)*Xest
      __pyx_t_14 = __pyx_v_k;
      __pyx_t_15 = __pyx_v_i;
      *((__pyx_t_double_complex *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_err.data + __pyx_t_14 * __pyx_v_err.strides[0]) ) + __pyx_t_15 * __pyx_v_err.strides[1]) )) = __Pyx_c_prod_double(__pyx_t_double_complex_from_parts((__pyx_v_R - pow(__Pyx_c_abs_double(__pyx_v_Xest), 2.0)), 0), __pyx_v_Xest);
+39:             for j in range(pols):
      __pyx_t_13 = __pyx_v_pols;
      __pyx_t_16 = __pyx_t_13;
      for (__pyx_t_17 = 0; __pyx_t_17 < __pyx_t_16; __pyx_t_17+=1) {
        __pyx_v_j = __pyx_t_17;
+40:                 for l in range(ntaps):
        __pyx_t_18 = __pyx_v_ntaps;
        __pyx_t_19 = __pyx_t_18;
        for (__pyx_t_20 = 0; __pyx_t_20 < __pyx_t_19; __pyx_t_20+=1) {
          __pyx_v_l = __pyx_t_20;
+41:                     wxy[k, j, l] = wxy[k, j, l] + mu * conj(err[k, i])*X[j,l]
          __pyx_t_21 = __pyx_v_k;
          __pyx_t_22 = __pyx_v_j;
          __pyx_t_23 = __pyx_v_l;
          __pyx_t_24 = __pyx_v_k;
          __pyx_t_25 = __pyx_v_i;
          __pyx_t_26 = __pyx_v_j;
          __pyx_t_27 = __pyx_v_l;
          __pyx_t_28 = __pyx_v_k;
          __pyx_t_29 = __pyx_v_j;
          __pyx_t_30 = __pyx_v_l;
          *((__pyx_t_double_complex *) ( /* dim=2 */ (( /* dim=1 */ (( /* dim=0 */ (__pyx_v_wxy.data + __pyx_t_28 * __pyx_v_wxy.strides[0]) ) + __pyx_t_29 * __pyx_v_wxy.strides[1]) ) + __pyx_t_30 * __pyx_v_wxy.strides[2]) )) = __Pyx_c_sum_double((*((__pyx_t_double_complex *) ( /* dim=2 */ (( /* dim=1 */ (( /* dim=0 */ (__pyx_v_wxy.data + __pyx_t_21 * __pyx_v_wxy.strides[0]) ) + __pyx_t_22 * __pyx_v_wxy.strides[1]) ) + __pyx_t_23 * __pyx_v_wxy.strides[2]) ))), __Pyx_c_prod_double(__Pyx_c_prod_double(__pyx_t_double_complex_from_parts(__pyx_v_mu, 0), conj((*((__pyx_t_double_complex *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_err.data + __pyx_t_24 * __pyx_v_err.strides[0]) ) + __pyx_t_25 * __pyx_v_err.strides[1]) ))))), (*((__pyx_t_double_complex *) ( /* dim=1 */ (( /* dim=0 */ (__pyx_v_X.data + __pyx_t_26 * __pyx_v_X.strides[0]) ) + __pyx_t_27 * __pyx_v_X.strides[1]) )))));
        }
      }
    }
  }
+42:     return err,wxy
  __Pyx_XDECREF(__pyx_r);
  __pyx_t_6 = __pyx_memoryview_fromslice(__pyx_v_err, 2, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 42, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __pyx_t_1 = __pyx_memoryview_fromslice(__pyx_v_wxy, 3, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 42, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_2 = PyTuple_New(2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 42, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_GIVEREF(__pyx_t_6);
  PyTuple_SET_ITEM(__pyx_t_2, 0, __pyx_t_6);
  __Pyx_GIVEREF(__pyx_t_1);
  PyTuple_SET_ITEM(__pyx_t_2, 1, __pyx_t_1);
  __pyx_t_6 = 0;
  __pyx_t_1 = 0;
  __pyx_r = __pyx_t_2;
  __pyx_t_2 = 0;
  goto __pyx_L0;
 43: 

As you can see we have eliminated all the yellow sections inside the loops. So let’s step through the optimizations

Unroll loops

The first thing you want to do is unroll the loops. Therefore in apply_filter

for i in range(M):
    Xest += np.dot(E[i], np.conj(wx[i]))

becomes:

for i in range(M):
    for j in range(N):
        Xest += E[i,j] * np.conj(wx[i,j])

and in cma_pyx1

wxy[k] += mu*np.conj(err[k,i])*X

becomes:

for j in range(pols):
    for l in range(ntaps):
        wxy[k, j, l] = wxy[k, j, l] + mu * np.conj(err[k, i])*X[j,l]

Use C-functions in loops

The problem with the above code is that we still call np.conj inside the tight loop and that’s a python function. So to eliminate that we have to import the c-function and use that like so:

cdef extern from "complex.h" nogil:
    double complex conj(double complex)

This imports the complex conjugate function conj from the standard library and we then use that instead inside the loops.

Compiler flags

You might have also noticed the compiler flags. I did quite a bit of testing and this set did yield a significant speed gain, however the gains can be application specific and you should test yourself. A word of caution -ffast-math enables transformations that are not strictly IEEE compliant. However, this is not an issue for our use-case. Generally, in the processing here we have SNRs which are way below floating point precision. In fact we could (and often do) run code in single precision and see absolutely no difference.

Parallelisation

It would also be possible to parallise the loop over the two polarisations. Interestingly enough that did reduce my performance slightly so I did not do it here.

Conclusion

Using Cython we can achieve a more than 100 times speed-up compared to the numpy version. That’s quite a sizable performance gain and for us it’s the difference between being almost “real-time” why we are adjusting our experiment and it being completely unusable. However, it comes at a cost. I would argue the optimized cython code is significantly more complex with a lot of boilerplate. So it somewhat defeats the purpose of using Python in the first place. In a later post I will explore other methods for speeding up the code.