## Cython, Python, Fortran, and all that

If you spend any amount of time writing numerically intensive code, either as a hobby or for work, chances are you'll eventually run into a performance problem: some calculation just takes way too long to complete, and you feel like it's time to optimize things.

Obviously, in such a scenario, the first thing to consider is whether the algorithm is optimal. Tuning the implementation is essentially a waste of time if you're using an \( O(n^2) \) algorithm where an \( O(n) \) algorithm is available. Nevertheless, sometimes the algorithm is more or less optimal and you still need more speed.
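As a toy illustration of the point (unrelated to the physics problem below): checking a list for duplicates by comparing every pair is \( O(n^2) \), while a set-based scan is \( O(n) \). No amount of low-level tuning of the first version will beat the second for large inputs.

```python
def has_duplicates_quadratic(items):
    # O(n^2): compares every pair of elements.
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    # O(n): each membership test against a set is (amortized) constant time.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```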

If you're using Python with NumPy and SciPy, you should be aware that the functions in both of those libraries are written in C and Fortran behind the scenes. So, if what you're doing is, say, diagonalizing a matrix, chances are you're already using Fortran under the hood. You're unlikely to find more performance in some eigenvalue calculation this way.
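For instance, a matrix diagonalization in NumPy already runs in compiled LAPACK code; the exact routine chosen is an implementation detail:

```python
import numpy as np

# np.linalg.eigh dispatches to a compiled LAPACK routine for symmetric
# (Hermitian) matrices -- the heavy loops run in Fortran/C, not in Python.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(H)
print(eigenvalues)  # ascending order
```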

Yet, in other problems, performance can genuinely be improved. Consider the following physics problem. You have some charge distributed in space and want to find the electric potential. This obeys the equation

\begin{align} \nabla ^2 V (\vec {x}) = - \rho (\vec{x})/\epsilon _0 \end{align}By discretizing this equation (if you don't know how to do that, see here), we arrive at the finite-difference version

\begin{align} V(x,y) = \frac{V(x-a,y)+V(x+a,y)+V(x,y-a)+V(x,y+a)}{4} + \frac{a^2}{4\epsilon _0}\rho (x,y) \end{align}Here, we've used a 2-dimensional rectangular grid with each cell of width and length \( a \). We then set the charge density \( \rho \) to some predetermined value; I choose \( \rho (x,y)/\epsilon _0 = -(6xy(1-y)-2x^3) \). I'll now write this code in four languages: Python, Cython, Fortran, and C. The equation is solved using the relaxation method in all four cases, so the algorithms should be identical. (Actually, this example is cheating, since better methods exist as well!) Also, such problems often have "boundary conditions", meaning that the potential is set to some predetermined value at the edges or "boundaries" of the system. Here, we choose to set the boundaries to zero, except at \( x=1 \), where we set the potential to \( y(1-y) \). We do this because this particular problem then has an analytical solution, \( V(x,y) = y(1-y)x^3 \). We can therefore be sure our codes work correctly by comparing with this solution.
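For the record, the update rule comes from the standard second-order central differences

\begin{align} \frac{\partial ^2 V}{\partial x^2} \approx \frac{V(x+a,y) - 2V(x,y) + V(x-a,y)}{a^2}, \qquad \frac{\partial ^2 V}{\partial y^2} \approx \frac{V(x,y+a) - 2V(x,y) + V(x,y-a)}{a^2}. \end{align}Substituting these into \( \nabla ^2 V = -\rho /\epsilon _0 \) and solving for \( V(x,y) \) yields the update rule implemented in the codes below.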

### Python

I used the following python code (partially taken out of Mark Newman's computational physics book - a book which I can heartily recommend):

```
import numpy as np
import matplotlib.pyplot as plt
import datetime

M = 100
a = 1/(M-1)
target = 1e-8
analytical_tol = 5.0e-3

# Create arrays to hold potential values
phi = np.zeros([M,M],float)
phiprime = np.zeros([M,M],float)
# Analytical solution to the problem
sol = np.zeros([M,M],float)

begin = datetime.datetime.now()

def realsol(x,y):
    return y*(1-y)*x**3

for i in range(M):
    for j in range(M):
        sol[i,j] = realsol(a*i,a*j)

# Boundary conditions
for i in range(M):
    phi[M-1,i] = i*a*(1-i*a)
    phiprime[M-1,i] = i*a*(1-i*a)

# Source term
def ro(x,y):
    return -(6*x*y*(1-y) - 2*x**3)

# Main loop
delta = 1.0
while delta>target:
    # Calculate new values of the potential
    a2 = a**2
    for i in range(1,M-1):
        for j in range(1,M-1):
            phiprime[i,j] = (phi[i+1,j] + phi[i-1,j] \
                             + phi[i,j+1] + phi[i,j-1])/4 \
                             + a2*ro(i*a,j*a)/4
    # Calculate maximum difference from old values
    delta = np.max(abs(phi-phiprime))
    # Swap the two arrays around
    phi,phiprime = phiprime,phi

maxsol = np.max(abs(sol))
maxphi = np.max(abs(phi))
sol = sol/maxsol
phi = phi/maxphi
assert np.sum(np.abs(sol-phi))/(M*M) < analytical_tol

end = datetime.datetime.now()
dif = end-begin
print(dif.total_seconds())
```

Essentially, we first guess V (called phi in the code), then calculate a new estimate from our discrete equation, and iterate until the previous solution and the solution from the current iteration are close enough to each other.

On my machine, this code takes almost 200 seconds to run. Not that big of a deal, but you can easily imagine adding more grid points, or a third dimension. In that case, scaling is sure to be bad.
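As an aside, a good chunk of that time is the explicit Python loops themselves: the same Jacobi sweep can be written with NumPy array slicing, so the inner loops run in compiled code. A sketch (same grid, source term, and boundary condition as above), though this isn't one of the contenders timed below:

```python
import numpy as np

M = 100
a = 1/(M - 1)
target = 1e-8

x = np.arange(M)*a
X, Y = np.meshgrid(x, x, indexing="ij")
rho = -(6*X*Y*(1 - Y) - 2*X**3)   # same source term as the loop version
phi = np.zeros((M, M))
phi[M-1, :] = x*(1 - x)           # boundary condition at x = 1

a2 = a**2
delta = 1.0
while delta > target:
    phinew = phi.copy()
    # Whole-array Jacobi sweep: each interior point becomes the average of
    # its four neighbours plus the source contribution.
    phinew[1:-1, 1:-1] = (phi[2:, 1:-1] + phi[:-2, 1:-1]
                          + phi[1:-1, 2:] + phi[1:-1, :-2])/4 \
                         + a2*rho[1:-1, 1:-1]/4
    delta = np.max(np.abs(phinew - phi))
    phi = phinew
```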

### Cython

Cython is a Python-to-C compiler. Any valid Python code is also valid Cython. However, since it translates Python to C, you can get performance speedups, as C is a rather fast compiled language.

There are a few reasons why Python is slow. One is that Python is dynamically typed and interpreted, so it has to do a lot of work during the execution of the program to figure out what the user wants. Because of this, Cython offers the programmer the ability to declare C types for variables, so that the compiler can better optimize the code. It also works readily with NumPy arrays. Here's the Cython code:

```
import numpy as np
cimport numpy as np
import matplotlib.pyplot as plt
cimport cython
from libc.math cimport pow
import time
import datetime

cdef double ro(double x,double y):
    cdef double res
    res = -(6.0*x*y*(1.0-y)-2.0*pow(x,3.0))
    return res

cdef double boundary(double y):
    return y*(1-y)

cdef double analytical_solution(double x, double y):
    return y*(1-y)*x**3

@cython.cdivision(True)
@cython.boundscheck(False)
cdef void run_calculation():
    cdef int M, counter
    cdef double V,target,a, analytical_tol
    M = 100 # Grid squares on a side
    target = 1e-8 # Target accuracy
    a = 1.0/(M-1)
    analytical_tol = 5.0e-3
    counter = 0
    cdef np.ndarray[np.double_t,ndim=2] phi = \
        np.zeros([M,M],dtype=np.double)
    cdef np.ndarray[np.double_t,ndim=2] phiprime = \
        np.zeros([M,M],np.double)
    cdef np.ndarray[np.double_t,ndim=2] roa = \
        np.zeros([M,M],np.double)
    cdef np.ndarray[np.double_t,ndim=2] analytical_sol = \
        np.zeros([M,M],np.double)
    cdef int i,j
    cdef double delta
    cdef double a2
    a2 = a**2
    delta = 1.0
    for i in range(M):
        for j in range(M):
            roa[i,j] = ro(i*a, j*a)
    for i in range(M):
        phiprime[M-1,i] = boundary(i*a)
        phi[M-1,i] = boundary(i*a)
    for i in range(M):
        for j in range(M):
            analytical_sol[i,j] = analytical_solution(i*a,j*a)
    while delta>target:
        counter += 1
        for i in range(1,M-1):
            for j in range(1,M-1):
                phiprime[i,j] = (phi[i+1,j] + phi[i-1,j] \
                                 + phi[i,j+1] + phi[i,j-1])/4 \
                                 + a2*roa[i,j]/4.0
        # Calculate maximum difference from old values
        delta = np.max(abs(phi-phiprime))
        # Swap the two arrays around
        phi,phiprime = phiprime,phi
    assert np.sum(np.abs(analytical_sol-phi))/(M*M) < analytical_tol
    print(counter)

begin = datetime.datetime.now()
run_calculation()
end = datetime.datetime.now()
dif = end-begin
print(dif.total_seconds())
```

The code is otherwise the same, but I've wrapped everything in a function to make better use of Cython's compiler directives. cdivision turns off some safety checks relating to division, and boundscheck turns off bounds checking on arrays (you can now accidentally read or write past the end of an array: for example, with an array of 10 items, you could try to access element 11). I've also declared C types for our variables to help the Cython compiler. You can read more at cython.org.
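For reference, one common way to compile such a module is a minimal setup.py; this is a sketch, and the filename poisson.pyx is my assumption:

```python
# Build config sketch; assumes the Cython source above is saved as poisson.pyx.
# Build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("poisson.pyx", language_level=3),
    include_dirs=[np.get_include()],  # needed because the code cimports numpy
)
```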

The difference is remarkable: this code takes 1.02 seconds to run, a speedup of roughly 200x. This is possible because Cython is particularly good at optimizing loops with heavy numerical operations. When Python accesses a NumPy array element, it does all sorts of relatively complicated work: figuring out the types involved, checking array bounds, going through Python class methods and possibly overloaded operators, and so on. Element access is therefore fairly slow in Python, but extremely fast in Cython, which translates the whole thing into direct C array indexing.
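You can see this per-element overhead directly in plain Python with NumPy; a small comparison sketch (timings will vary by machine):

```python
import time
import numpy as np

n = 1000
x = np.random.rand(n, n)
y = np.random.rand(n, n)

# Element-by-element: every x[i, j] access goes through Python's generic
# indexing machinery (type checks, bounds checks, object creation).
start = time.perf_counter()
out = np.empty_like(x)
for i in range(n):
    for j in range(n):
        out[i, j] = x[i, j] + y[i, j]
loop_time = time.perf_counter() - start

# Vectorized: one call, and the loop runs inside NumPy's compiled C code.
start = time.perf_counter()
out2 = x + y
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f} s, vectorized: {vec_time:.5f} s")
```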

### Fortran

Fortran is the grandfather of scientific programming, and was designed specifically with scientific computing in mind in the 1950s. Let's see how the old beast does in solving this problem.

```
module naiveimpl
  implicit none
  public
  integer, parameter :: dp=kind(0.d0)
contains
  pure real(dp) function rho(x,y)
    real(dp), intent(in) :: x,y
    rho = -(6.0_dp*x*y*(1.0_dp-y)-2.0_dp*x**3.0_dp)
  end function
  pure elemental real(dp) function boundary(y)
    real(dp), intent(in) :: y
    boundary = y*(1-y)
  end function
  pure real(dp) function analytical_sol(x,y)
    real(dp), intent(in) :: x,y
    analytical_sol = y*(1-y)*x**3
  end function
  integer function run_naive(M)
    integer, intent(in) :: M
    real(dp) :: phiprime(0:M-1,0:M-1), phi(0:M-1,0:M-1), a2, &
                analytical(0:M-1,0:M-1), temp(0:M-1,0:M-1), delta, a, avgerror
    real(dp), parameter :: target=1E-8_dp, analytical_tol=5.0e-3_dp
    integer :: i,j, iter
    a = 1.0_dp/(M-1)
    delta = 1.0_dp
    iter = 0
    phi = 0
    phiprime = 0
    phiprime(M-1,:) = boundary([(i*a,i=0,M-1)])
    phi(M-1,:) = boundary([(i*a,i=0,M-1)])
    do i=0,M-1
      do j=0,M-1
        analytical(i,j) = analytical_sol(i*a,j*a)
      end do
    end do
    do while (delta > target)
      iter = iter + 1
      a2 = a**2.0_dp
      do i=1,M-2
        do j=1,M-2
          phiprime(i,j) = (phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4.0_dp &
                          + a2*rho(i*a,j*a)/4.0_dp
        end do
      end do
      delta = maxval(abs(phiprime - phi))
      temp = phi
      phi = phiprime
      phiprime = temp
    end do
    phi = phi/maxval(abs(phi))
    analytical = analytical/maxval(abs(analytical))
    avgerror = sum(abs(phi-analytical))/(M*M)
    print *, avgerror
    if (avgerror > analytical_tol) then
      print *, "Failed to agree with analytical solution", avgerror
    end if
    run_naive = iter
  end function
end module naiveimpl

program poisson
  use naiveimpl, only: run_naive
  implicit none
  integer, parameter :: dp=kind(0.d0), M=100
  integer :: iter
  real(dp) :: b,e
  call cpu_time(b)
  iter = run_naive(M)
  call cpu_time(e)
  print *, e-b, iter
end program
```

Even better, the code now takes around 0.34 seconds to run. Increasing the size of the grid to 200x200 gives timings of 6.25 seconds for Fortran and 32.77 seconds for Cython, so Fortran is considerably faster.

### C, optimized

To squeeze more performance out of this code, I think we can say Python and Cython are out as top contenders, but C is always a good candidate for speed. We can also optimize the code somewhat. For example, the charge density array should be precomputed instead of calling a function at every grid point on every iteration. The swap can also be improved by removing the temp variable, which is not necessary in this particular case. With these things in mind, the optimized C code is:

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define M 200
#define analytical_tol 5.0e-3
/*
 *
 * Code by ARC, @arunningcroc
 *
 */
void swap(double (*a)[M], double (*b)[M], int n)
{
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      a[i][j] = b[i][j];
    }
  }
}
double mmax(double (*phi)[M], double (*phip)[M], int n)
{
  double max = 0.0;
  double diff = 0.0;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      diff = fabs(phi[i][j]-phip[i][j]);
      if (diff > max)
        max = diff;
    }
  }
  return max;
}
void normalize(double (*phi)[M])
{
  double max = -50000;
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < M; j++) {
      if (phi[i][j] > max) max = phi[i][j];
    }
  }
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < M; j++) {
      phi[i][j] /= max;
    }
  }
}
double rho(double x, double y)
{
  return -(6.0*x*y*(1.0-y) - 2.0*pow(x,3.0));
}
double boundary(double y)
{
  return y*(1.0-y);
}
double analytical_solution(double x, double y)
{
  return y*(1.0-y)*pow(x,3.0);
}
double get_average_error(double (*phi)[M], double (*analytical)[M])
{
  double sum = 0.0;
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < M; j++) {
      sum += fabs(phi[i][j]-analytical[i][j]);
    }
  }
  return sum/(M*M);
}
int run(double toler, double a)
{
  double phi[M][M] = {0}, phip[M][M] = {0};
  double rhoa[M][M], analytical[M][M];
  double delta = 1.0;
  double a2 = a*a;
  int iter = 0;
  /* Precompute the charge density and the analytical solution */
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < M; j++) {
      rhoa[i][j] = rho(i*a, j*a);
      analytical[i][j] = analytical_solution(i*a, j*a);
    }
  }
  /* Boundary condition at x = 1 */
  for (int i = 0; i < M; i++) {
    phi[M-1][i] = boundary(i*a);
    phip[M-1][i] = boundary(i*a);
  }
  while (delta > toler) {
    iter += 1;
    for (int i = 1; i < M-1; i++) {
      for (int j = 1; j < M-1; j++) {
        phip[i][j] = (phi[i+1][j] + phi[i-1][j] +
                      phi[i][j+1] + phi[i][j-1])/4.0 +
                      a2*rhoa[i][j]/4.0;
      }
    }
    delta = mmax(phi, phip, M);
    swap(phi, phip, M);
  }
  normalize(phi);
  normalize(analytical);
  if (get_average_error(phi, analytical) > analytical_tol) {
    printf("Failed to reach analytical solution, error %f\n",
           get_average_error(phi, analytical));
  }
  return iter;
}
int main(int argc, char *argv[])
{
  int iter;
  clock_t start = clock();
  iter = run(1e-8, 1.0/(M-1));
  clock_t end = clock();
  double total = ((double)(end - start)) / CLOCKS_PER_SEC;
  printf("Execution time %f, iters: %d \n", total, iter);
}
```

Similar modifications can also be made to the Fortran program. This improves speed in both C and Fortran codes, but the improvement is bigger in the Fortran code.

Fortran can be improved even further by taking into account modern features of the language. Damian Rouson provided a modernized Fortran version of a Jacobi solver, which I modified to work with this problem. See the end result here, if you're curious.

### Summary of results

Here are the best results for each language and style:

Language | M=100 (seconds) | M=300 (seconds) |
---|---|---|

Some notes:

- Utilizing modern features of Fortran improves performance significantly. This is probably because giving more information to the compiler with things like "do concurrent" allows the compiler to optimize more efficiently.
- The C code is considerably more verbose than either Fortran or Cython, because Fortran contains more intrinsic functions for dealing with arrays. The Fortran code can even be made more compact by array-slicing.

Obviously this is a small and "non-scientific" case study of a single problem, and I can't claim to be an expert on any of these languages when it comes to performance. I'm sure there are C gurus who could improve on my code, likewise for Cython and Fortran.

I would nevertheless suggest that Fortran, even in 2021, offers a good alternative. The code was easier to write than C, as Fortran handles memory allocation even for large arrays automatically, and also frees allocatable arrays when they go out of scope. Fortran also contains many optimized intrinsic functions for mathematical operations on arrays, as well as better array notation than C. In C, arrays are essentially pointers to memory; Fortran, by contrast, is practically built on arrays. So when your problem deals with numerical arrays, Fortran feels like a more natural choice than C. There's now a community for those interested in Fortran, so check that out.

If you have existing Python code that needs speeding up, Cython is a great choice, as we can see here. Massive performance gains can sometimes be achieved with very little effort.