Cython, Python, Fortran, and all that

If you spend any amount of time writing numerically intensive code, either as a hobby or for work, chances are you'll eventually run into a performance problem. That is, some calculation just takes way too long to complete, and you feel it's time to optimize things.

Obviously, in such a scenario, the first thing to consider is whether the algorithm itself is optimal. Optimizing at the language level is essentially a total waste of time if you're using an \( O(n^2) \) algorithm where an \( O(n) \) algorithm is available. Nevertheless, sometimes the algorithm is more or less optimal and you still need more speed.
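To make that point concrete with a toy example (unrelated to the physics problem below): checking a list for duplicates by comparing all pairs is \( O(n^2) \), while using a set is \( O(n) \), and for large inputs no compiler will close that gap.

    # Toy illustration of algorithmic complexity, unrelated to the solver below
    def has_duplicates_quadratic(items):
        # Compare every pair: O(n^2)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if items[i] == items[j]:
                    return True
        return False

    def has_duplicates_linear(items):
        # Track seen values in a set: O(n), since set lookup is O(1) on average
        seen = set()
        for item in items:
            if item in seen:
                return True
            seen.add(item)
        return False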

If you're using Python with NumPy and SciPy, you should be aware that the functions in both of those libraries are written in C and Fortran behind the scenes. So if what you're doing is, say, diagonalizing a matrix, chances are you're already running Fortran under the hood, and you're unlikely to squeeze more performance out of an eigenvalue calculation by rewriting it yourself.
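For instance, here's a minimal sketch of what I mean (the random matrix is just a stand-in): SciPy's eigh hands the matrix straight to LAPACK's compiled symmetric eigensolvers, so there is no Python-level loop left to optimize.

    import numpy as np
    from scipy.linalg import eigh

    # A random symmetric matrix as a stand-in for a real problem
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    A = (A + A.T) / 2

    # eigh dispatches to LAPACK's symmetric eigensolver, so the
    # heavy lifting is already done by compiled Fortran code
    eigenvalues, eigenvectors = eigh(A)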

Yet, in other problems, performance can genuinely be improved. Consider the following physics problem. You have some charge distributed in space and want to find the electric potential. This obeys the equation

\begin{align} \nabla ^2 V (\vec {x}) = - \rho (\vec{x})/\epsilon _0 \end{align}

We can discretize this equation using finite differences (if you don't know how to do that, see here).
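The key ingredient is the second-order central-difference approximation of the second derivative; in the \( x \) direction,

\begin{align} \frac{\partial ^2 V}{\partial x^2} \approx \frac{V(x+a,y) - 2V(x,y) + V(x-a,y)}{a^2} \end{align}

Writing the analogous expression for \( y \), substituting both into Poisson's equation, and solving for \( V(x,y) \) gives the update rule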

\begin{align} V(x,y) = \frac{1}{4}\left(V(x+a,y)+V(x-a,y)+V(x,y+a)+V(x,y-a)\right) + \frac{a^2}{4\epsilon _0}\rho (x,y) \end{align}

Here, we've used a two-dimensional square grid with cells of width and height \( a \). We then set the charge density \( \rho \) to some predetermined value; I choose \( \rho(x,y)/\epsilon _0 = -(6xy(1-y) - 2x^3) \). I'll now look at several implementations: Python, Cython, Fortran, and C. The equation is solved using the relaxation method in every case, so the algorithms should be identical. (Actually, this example is cheating a little, since better methods exist as well!) Such problems also come with "boundary conditions", meaning that the potential is set to some predetermined value at the edges, or "boundaries", of the system. Here, we set the boundaries to zero, except at \( x=1 \), where we set the potential to \( y(1-y) \). We do this because this particular problem then has an analytical solution, \( V(x,y) = y(1-y)x^3 \), so we can be sure our codes work correctly by comparing against it.
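As a quick sanity check on the source term, differentiating the analytical solution twice gives

\begin{align} \nabla ^2 \left[ y(1-y)x^3 \right] = 6xy(1-y) - 2x^3 \end{align}

so \( \rho(x,y)/\epsilon _0 = -(6xy(1-y) - 2x^3) \), which is exactly what the source-term function ro returns in the code below.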

Python

I used the following Python code (partially adapted from Mark Newman's computational physics book, which I can heartily recommend):

    import numpy as np
    import datetime


    M = 100      
    a = 1/(M-1)
    target = 1e-8
    analytical_tol = 5.0e-3

    # Create arrays to hold potential values
    phi = np.zeros([M,M],float)
    phiprime = np.zeros([M,M],float)

    #Analytical solution to the problem
    sol = np.zeros([M,M],float)
    begin = datetime.datetime.now()
    def realsol(x,y):
        return y*(1-y)*x**3
    for i in range(M):
        for j in range(M):
            sol[i,j] = realsol(a*i,a*j)

    #Boundary conditions
    for i in range(M):
        phi[M-1,i] = i*a*(1-i*a)
        phiprime[M-1,i] = i*a*(1-i*a)

    #Source term
    def ro(x,y):
        return -(6*x*y*(1-y) - 2*x**3)


    # Main loop
    delta = 1.0
    a2 = a**2
    while delta>target:

        # Calculate new values of the potential
        for i in range(1,M-1):
            for j in range(1,M-1):

                phiprime[i,j] = (phi[i+1,j] + phi[i-1,j] \
                                + phi[i,j+1] + phi[i,j-1])/4 \
                        + a2*ro(i*a,j*a)/4

        # Calculate maximum difference from old values
        delta = np.max(abs(phi-phiprime))

        # Swap the two arrays around
        phi,phiprime = phiprime,phi
    maxsol = np.max(abs(sol))
    maxphi = np.max(abs(phi))
    sol = sol/maxsol
    phi = phi/maxphi
    assert np.sum(np.abs(sol-phi))/(M*M) < analytical_tol
    end = datetime.datetime.now()
    dif = end-begin
    print(dif.total_seconds())
    

Essentially, what is happening is that we first guess V (called phi in the code), then calculate the new one from our discrete equation, and iterate until the previous solution and the solution from this iteration are close enough to each other.

On my machine, this code takes almost 200 seconds to run. Not a huge deal on its own, but you can easily imagine adding more grid points, or a third dimension, at which point the run time quickly becomes unmanageable.
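As an aside, and outside the comparison in this post: the same update can also be written with NumPy array slicing, which pushes the loops down into NumPy's compiled routines. A minimal sketch of the idea (the variable names here are mine, not from the code above):

    import numpy as np

    M = 100
    a = 1/(M-1)
    target = 1e-8

    y = np.arange(M)*a
    # Precompute the source term ro(x,y) on the whole grid; X varies along rows
    X, Y = np.meshgrid(np.arange(M)*a, y, indexing='ij')
    source = -(6*X*Y*(1-Y) - 2*X**3)

    phi = np.zeros((M, M))
    phi[M-1, :] = y*(1-y)   # boundary condition at x = 1

    delta = 1.0
    while delta > target:
        phinew = phi.copy()
        # Update all interior points at once using shifted slices
        phinew[1:-1, 1:-1] = (phi[2:, 1:-1] + phi[:-2, 1:-1]
                              + phi[1:-1, 2:] + phi[1:-1, :-2])/4 \
                             + a**2*source[1:-1, 1:-1]/4
        delta = np.max(np.abs(phinew - phi))
        phi = phinew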

Cython

Cython is a Python-to-C compiler: any valid Python code is also valid Cython. Since it translates Python to C, which is then compiled, you can get significant speedups.

There are a few reasons why Python is slow, one of them being that Python is dynamically typed and interpreted, so it has to do a lot of work at run time to figure out what the user wants. Because of this, Cython lets the programmer declare C types for variables, so that the compiler can better optimize the code. It also works readily with NumPy arrays. Here's the Cython code:


    import numpy as np
    cimport numpy as np
    cimport cython
    from libc.math cimport pow
    import datetime
    
    
    cdef double ro(double x,double y):
        cdef double res
        res = -(6.0*x*y*(1.0-y)-2.0*pow(x,3.0))
        return res
    cdef double boundary(double y):
        return y*(1-y)
    cdef double analytical_solution(double x, double y):
        return y*(1-y)*x**3
    @cython.cdivision(True)
    @cython.boundscheck(False)
    cdef void run_calculation():
        cdef int M, counter
        cdef double target, a, analytical_tol
        M = 100         # Grid squares on a side
        target = 1e-8   # Target accuracy
        a = 1.0/(M-1)
        analytical_tol = 5.0e-3
    
        counter = 0
        cdef np.ndarray[np.double_t,ndim=2] phi = \
            np.zeros([M,M],dtype=np.double)
    
        cdef np.ndarray[np.double_t,ndim=2] phiprime = \
            np.zeros([M,M],np.double)
        cdef np.ndarray[np.double_t,ndim=2] roa = \
            np.zeros([M,M],np.double)
        cdef np.ndarray[np.double_t,ndim=2] analytical_sol = \
            np.zeros([M,M],np.double)
        cdef int i,j
        cdef double delta
        cdef double a2
        a2 = a**2
        delta = 1.0
    
        for i in range(M):
            for j in range(M):
                roa[i,j] = ro(i*a, j*a)
        for i in range(M):
            phiprime[M-1,i] = boundary(i*a)
            phi[M-1,i] = boundary(i*a)
        for i in range(M):
            for j in range(M):
                analytical_sol[i,j] = analytical_solution(i*a,j*a)
    
        while delta>target:
            counter += 1
            for i in range(1,M-1):
                for j in range(1,M-1):
                    phiprime[i,j] = (phi[i+1,j] + phi[i-1,j] \
                                     + phi[i,j+1] + phi[i,j-1])/4 \
                            + a2*roa[i,j]/4.0
    
            # Calculate maximum difference from old values
            delta = np.max(abs(phi-phiprime))
            # Swap the two arrays around
            phi,phiprime = phiprime,phi
        assert np.sum(np.abs(analytical_sol-phi))/(M*M) < analytical_tol
        
        print(counter)
        
    begin = datetime.datetime.now()
    run_calculation()
    end = datetime.datetime.now()
    dif = end-begin
    print(dif.total_seconds())

The code is otherwise the same, but I've wrapped everything in a function to make better use of Cython's compiler directives. cdivision turns off Python's division semantics (such as checking for division by zero) in favor of plain C division, and boundscheck turns off bounds checking on array access: with it off, you can accidentally read past the end of an array, for example by asking for element 10 of a 10-item array (whose valid indices are 0 through 9). I've also declared C types for our variables to help the Cython compiler. You can read more at cython.org.
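For reference, one common way to compile a Cython file is with a small setup script. A minimal sketch, assuming the code above is saved as poisson.pyx (the filename is my choice, not from the original):

    # setup.py -- minimal build script for the Cython module
    from setuptools import setup
    from Cython.Build import cythonize
    import numpy as np

    setup(
        ext_modules=cythonize("poisson.pyx",
                              compiler_directives={"language_level": "3"}),
        include_dirs=[np.get_include()],  # needed because of "cimport numpy"
    )

Running python setup.py build_ext --inplace then produces an extension module you can import like any other Python module.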

The difference is remarkable. This code takes 1.02 seconds to run, an insane performance boost. This is possible because Cython is particularly good at optimizing loops with heavy numerical operations. When Python accesses a NumPy array element, it does a lot of relatively complicated work: figuring out the types involved, checking array bounds, going through Python class methods and possibly overloaded operators, and so on. Array access is therefore fairly slow in Python, but extremely fast in Cython, which translates the whole thing into a plain C array access.

Fortran

Fortran is the grandfather of scientific programming, and was designed specifically with scientific computing in mind in the 1950s. Let's see how the old beast does in solving this problem.


    module naiveimpl
    implicit none
    public
    integer, parameter :: dp=kind(0.d0)
    contains
    
    pure real(dp) function rho(x,y)
        real(dp), intent(in) :: x,y
        rho = -(6.0_dp*x*y*(1.0_dp-y)-2.0_dp*x**3)
    end function
    
    pure elemental real(dp) function boundary(y)
        real(dp), intent(in) :: y
    
        boundary = y*(1-y)
    end function
    
    pure real(dp) function analytical_sol(x,y)
            real(dp), intent(in) :: x,y
            analytical_sol = y*(1-y)*x**3
    end function
    
    integer function run_naive(M)
        integer, intent(in):: M
        real(dp)           :: phiprime(0:M-1,0:M-1), phi(0:M-1,0:M-1), a2, analytical(0:M-1,0:M-1), temp(0:M-1,0:M-1), delta, a, avgerror
        real(dp),parameter :: target=1E-8_dp, analytical_tol=5.0e-3_dp
        integer            :: i,j, iter
        a = 1.0_dp/(M-1)
        delta = 1.0_dp
        iter = 0
        phi = 0
        phiprime = 0
        phiprime(M-1,:) = boundary([(i*a,i=0,M-1)])
        phi(M-1,:) = boundary([(i*a,i=0,M-1)])
        do i=0,M-1
            do j=0, M-1
                analytical(i,j) = analytical_sol(i*a,j*a)
            end do
        end do
        a2 = a**2
        do while (delta > target)
            iter = iter + 1
            do i=1, M-2
                do j=1, M-2
                    phiprime(i,j) = (phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4.0_dp &
                    + a2*rho(i*a,j*a)/4.0_dp
                end do
            end do
            delta = maxval(abs(phiprime - phi))
            temp = phi
            phi = phiprime 
            phiprime = temp
        
        end do
        phi = phi/maxval(abs(phi))
        analytical = analytical/maxval(abs(analytical))
        avgerror = sum(abs(phi-analytical))/(M*M)
        print *, avgerror
        if(avgerror > analytical_tol) then
            print *, "Failed to agree with analytical solution", avgerror
        end if
        run_naive = iter
    end function
    end module naiveimpl
    
    
    program poisson
    use naiveimpl, only: run_naive
    implicit none
    integer, parameter :: dp=kind(0.d0), M=100 
    integer            :: iter
    real(dp)           :: b,e
    
    
    call cpu_time(b)
    iter = run_naive(M)
    call cpu_time(e)
    print *, e-b, iter
    end program

Even better, the code now takes around 0.34 seconds to run. Increasing the size of the grid to 200x200 gives timings of 6.25 seconds for Fortran and 32.77 seconds for Cython, so Fortran is considerably faster.

C, optimized

To squeeze more performance out of this code, I think we can say Python and Cython are out as top contenders, but C is always a good candidate for speed. We can also optimize the code somewhat. For example, the charge density array should be precomputed instead of calling a function at every grid point on every iteration. The swap can also be simplified by removing the temp variable: it's enough to copy the new values over the old ones, because every interior point of the new array is recomputed from scratch on the next sweep anyway. With these things in mind, the optimized C code is:


    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <time.h>
    #define M 200
    #define analytical_tol 5.0e-3

    /*
     *
     * Code by ARC, @arunningcroc
     *
     */
    double rho(double x, double y)
    {
      return -(6.0*x*y*(1.0-y) - 2.0*pow(x,3));
    }
    double boundary(double y)
    {
      return y*(1.0-y);
    }
    double analytical_solution(double x, double y)
    {
      return y*(1.0-y)*pow(x,3);
    }
    /* Copy b into a; a full swap through a temporary is unnecessary here,
       because every interior point of the new array is recomputed on the
       next sweep anyway. */
    void swap(double (*a)[M], double (*b)[M], int n)
    {
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          a[i][j] = b[i][j];
        }
      }
    }
    double mmax(double (*phi)[M], double (*phip)[M], int n)
    {
      double max = 0.0;
      double diff = 0.0;
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          diff = fabs(phi[i][j]-phip[i][j]);
          if (diff > max)
            max = diff;
        }
      }
      return max;
    }
    void normalize(double (*phi)[M])
    {
      double max = -50000;
      for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) {
          if (phi[i][j] > max) max = phi[i][j];
        }
      }
      for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) {
          phi[i][j] /= max;
        }
      }
    }
    double get_average_error(double (*phi)[M], double (*analytical)[M])
    {
      double sum = 0.0;
      for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) {
          sum += fabs(phi[i][j]-analytical[i][j]);
        }
      }
      return sum/(M*M);
    }
    int run(double toler, double a)
    {
      /* static arrays are zero-initialized and avoid large stack frames */
      static double phi[M][M], phip[M][M], rhoa[M][M], analytical[M][M];
      double delta = 1.0;
      double a2 = a*a;
      int iter = 0;

      /* Precompute the charge density and the analytical solution */
      for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) {
          rhoa[i][j] = rho(i*a, j*a);
          analytical[i][j] = analytical_solution(i*a, j*a);
        }
      }
      /* Boundary condition at x = 1 */
      for (int i = 0; i < M; i++) {
        phi[M-1][i] = boundary(i*a);
        phip[M-1][i] = boundary(i*a);
      }
      while (delta > toler) {
        iter += 1;

        for (int i = 1; i < M-1; i++) {
          for (int j = 1; j < M-1; j++) {
            phip[i][j] = (phi[i+1][j] + phi[i-1][j] +
                          phi[i][j+1] + phi[i][j-1])/4.0 +
                          a2*rhoa[i][j]/4.0;
          }
        }
        delta = mmax(phi, phip, M);
        swap(phi, phip, M);
      }
      normalize(phi);
      normalize(analytical);
      if (get_average_error(phi, analytical) > analytical_tol) {
        printf("Failed to reach analytical solution, error %f\n",
               get_average_error(phi, analytical));
      }
      return iter;
    }

    int main(int argc, char *argv[])
    {
      int iter;
      clock_t start = clock();
      iter = run(1e-8, 1.0/(M-1));
      clock_t end = clock();
      double total = ((double)(end - start)) / CLOCKS_PER_SEC;
      printf("Execution time %f, iters: %d \n", total, iter);
    }

Similar modifications can also be made to the Fortran program. They improve the speed of both the C and the Fortran code, but the improvement is bigger on the Fortran side.

Fortran can be improved even further by taking into account modern features of the language. Damian Rouson provided a modernized Fortran version of a Jacobi solver, which I modified to work with this problem. See the end result here, if you're curious.

Summary of results

Here are the best results for each language and style:

    Language                                       M=100 (seconds)   M=300 (seconds)
    Python                                         196               n/a
    Cython                                         0.46              43.6
    Fortran (old style)                            0.23              18.3
    Fortran (modern, courtesy of Damian Rouson)    0.18              8.08
    C (optimized)                                  0.13              8.26

Some notes:

Obviously, this is a small and "non-scientific" case study of a single problem, and I can't claim to be an expert on any of these languages when it comes to performance. I'm sure there are C gurus who could improve on my code, and likewise for Cython and Fortran.

I would nevertheless suggest that Fortran, even in 2021, offers a good alternative. The code was easier to write than C, since Fortran handles memory allocation automatically, even for large arrays, and deallocates allocatable arrays when they go out of scope. Fortran also has many optimized intrinsic functions for mathematical operations on arrays, as well as nicer array notation than C. In C, arrays are essentially pointers to memory; Fortran, by contrast, is practically built around arrays. So when your problem deals with numerical arrays, Fortran feels like a more natural choice than C. There's now a community for those interested in Fortran, so check that out.

If you have existing Python code that needs speeding up, Cython is a great choice, as we can see here. Massive performance gains can sometimes be achieved with very little effort.