Apr 19 2008
A quick Cython introduction
I love python for its beautiful code and practicality. But it’s not going to win a pure speed race with most languages. Most people think of speed and ease-of-use as polar opposites - probably because they remember the pain of writing C. Cython tries to eliminate that duality and lets you have python syntax with C data types and functions - the best of both worlds. Keeping in mind that I’m by no means an expert at this, here are my notes based on my first real experiment with Cython:
EDIT: Based on some feedback I’ve received there seems to be some confusion - Cython is for generating C extensions to Python not standalone programs. The whole point is to speed up an existing python app one function at a time. No rewriting the whole application in C or Lisp. No writing C extensions by hand. Just an easy way to get C speed and C data types into your slow python functions.
So lets say we want to make this function faster. It is the “great circle” calculation, a quick spherical trig problem to calculate distance along the earth’s surface between two points:
p1.py
import math
def great_circle(lon1,lat1,lon2,lat2):
radius = 3956 #miles
x = math.pi/180.0
a = (90.0-lat1)*(x)
b = (90.0-lat2)*(x)
theta = (lon2-lon1)*(x)
c = math.acos((math.cos(a)*math.cos(b)) +
(math.sin(a)*math.sin(b)*math.cos(theta)))
return radius*c
Lets try it out and time it over 1/2 million function calls:
import timeit
lon1, lat1, lon2, lat2 = -72.345, 34.323, -61.823, 54.826
num = 500000
t = timeit.Timer("p1.great_circle(%f,%f,%f,%f)" % (lon1,lat1,lon2,lat2),
"import p1")
print "Pure python function", t.timeit(num), "sec"
About 2.2 seconds. Too slow!
Lets try a quick rewrite in Cython and see if that makes a difference:
c1.pyx
import math
def great_circle(float lon1,float lat1,float lon2,float lat2):
cdef float radius = 3956.0
cdef float pi = 3.14159265
cdef float x = pi/180.0
cdef float a,b,theta,c
a = (90.0-lat1)*(x)
b = (90.0-lat2)*(x)
theta = (lon2-lon1)*(x)
c = math.acos((math.cos(a)*math.cos(b)) + (math.sin(a)*math.sin(b)*math.cos(theta)))
return radius*c
Notice that we still import math - cython lets you mix and match python and C data types to some extent. The conversion is handled automatically though not without cost. In this example all we’ve done is define a python function, declare its input parameters to be floats, and declare a static C float data type for all the variables. It still uses the python math module to do the calcs.
Now we need to convert this to C code and compile the python extension. The best way to do this is through a setup.py distutils script. But we’ll do it the manual way for now to see what’s happening:
# this will create a c1.c file - the C source code to build a python extension cython c1.pyx # Compile the object file gcc -c -fPIC -I/usr/include/python2.5/ c1.c # Link it into a shared library gcc -shared c1.o -o c1.so
Now you should have a c1.so (or .dll) file which can be imported in python. Lets give it a run:
t = timeit.Timer("c1.great_circle(%f,%f,%f,%f)" % (lon1,lat1,lon2,lat2),
"import c1")
print "Cython function (still using python math)", t.timeit(num), "sec"
About 1.8 seconds. Not the kind of speedup we were hoping for but its a start. The bottleneck must be in the usage of the python math module. Lets use the C standard library trig functions instead:
c2.pyx
cdef extern from "math.h":
float cosf(float theta)
float sinf(float theta)
float acosf(float theta)
def great_circle(float lon1,float lat1,float lon2,float lat2):
cdef float radius = 3956.0
cdef float pi = 3.14159265
cdef float x = pi/180.0
cdef float a,b,theta,c
a = (90.0-lat1)*(x)
b = (90.0-lat2)*(x)
theta = (lon2-lon1)*(x)
c = acosf((cosf(a)*cosf(b)) + (sinf(a)*sinf(b)*cosf(theta)))
return radius*c
Instead of importing the math module, we use cdef extern which uses the C function declarations from the specified include header (in this case math.h from the C standard library). We’ve replaced the calls to some of the expensive python functions and are ready to build the new shared library and re-test:
t = timeit.Timer("c2.great_circle(%f,%f,%f,%f)" % (lon1,lat1,lon2,lat2),
"import c2")
print "Cython function (using trig function from math.h)", t.timeit(num), "sec"
Now that’s a bit more like it. 0.4 seconds - a 5x speed increase over the pure python function. What else can we do to speed things up? Well c2.great_circle() is still a python function which means that calling it incurs the overhead of the python API, constructing the argument tuple, etc. If we could write it as a pure C function, we might be able to speed things up a bit.
c3.pyx
cdef extern from "math.h":
float cosf(float theta)
float sinf(float theta)
float acosf(float theta)
cdef float _great_circle(float lon1,float lat1,float lon2,float lat2):
cdef float radius = 3956.0
cdef float pi = 3.14159265
cdef float x = pi/180.0
cdef float a,b,theta,c
a = (90.0-lat1)*(x)
b = (90.0-lat2)*(x)
theta = (lon2-lon1)*(x)
c = acosf((cosf(a)*cosf(b)) + (sinf(a)*sinf(b)*cosf(theta)))
return radius*c
def great_circle(float lon1,float lat1,float lon2,float lat2,int num):
cdef int i
cdef float x
for i from 0 < = i < num:
x = _great_circle(lon1,lat1,lon2,lat2)
return x
Notice that we still have a python function wrapper (def) which takes an extra argument, num. The looping is done inside this function with for i from 0 < = i < num: instead of the more pythonic but slower for i in range(num):. The actual work is done in a C function (cdef) which returns float type. This runs in 0.2 seconds - a 10x speed boost over the original python function.
Just to confirm that we’re doing things optimally, lets write a little app in pure C and time it:
#include <math .h>
#include <stdio .h>
#define NUM 500000
float great_circle(float lon1, float lat1, float lon2, float lat2){
float radius = 3956.0;
float pi = 3.14159265;
float x = pi/180.0;
float a,b,theta,c;
a = (90.0-lat1)*(x);
b = (90.0-lat2)*(x);
theta = (lon2-lon1)*(x);
c = acos((cos(a)*cos(b)) + (sin(a)*sin(b)*cos(theta)));
return radius*c;
}
int main() {
int i;
float x;
for (i=0; i < = NUM; i++)
x = great_circle(-72.345, 34.323, -61.823, 54.826);
printf("%f", x);
}
Now compile it with gcc -lm -o ctest ctest.c and test it with time ./ctest… about 0.2 seconds as well. This gives me confidence that my Cython extension is at least as efficient as my C code (which probably isn’t saying much as my C skills are weak).
Some cases will be more or less optimal for cython depending on how much looping, number-crunching and python-function-calling are slowing you down. In some cases people have reported 100 to 1000x speed boosts. For other tasks it might not be so helpful. Before going crazy rewriting your python code in Cython, keep this in mind:
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” — Donald Knuth
In other words, write your program in python first and see if it works alright. Most of the time it will… some times it will bog down. Use a profiler to find the slow functions and re-implement them in cython and you should see a quick return on investment.
Links:
WorldMill - a python module by Sean Gillies which uses Cython to provide a fast, clean python interface to the libgdal library for handling vector geospatial data.
Writing Fast Pyrex code (Pyrex is the predecessor of Cython with similar goals and syntax)
If you’re compiling with the commands you’ve shown, you’re compiling the C code without optimization enabled, which means that the code runs much slower than it ought to. Try adding -O3 to the gcc command lines.
Both instance of the word “it’s” in this post should be “its”. You didn’t mean “it is”. You meant the possessive pronoun.
Hi
Good examples and measurements. Have you tried psyco (http://psyco.sourceforge.net/)? It is usually significantly faster for repetetive calculations than regular Python, and only requries an import to work (less work than rewriting Python to Cython.)
Thank you Grammar Police. In an effort to learn Cython, I seem to have forgotten English.
Henrik, Good point. Psyco would be a decent quick fix. For this example, simply doing an import psyco; psyco.full() increased the speed of the pure python function by 2x to 1.1 seconds. Still not the kind of gain you see with C but it’s better than nothing. And it looks like the quick psyco fix would be better than a poorly implemented Cython solution.
You forgot something pretty crucial, the gcc optimization flags; instead of compiling the library with `gcc -c -fPIC -I/usr/include/python2.2/ mymodule.c`, compile it with `gcc -O2 -c -fPIC -I/usr/include/python2.2/ mymodule.c`, and instead of `gcc -lm -o ctest ctest.c` use `gcc -O2 -lm -o ctest ctest.c`. It’s quite a noticeable difference on my machine, a speedup of around 6x.
Without psyco:
bulla@badoo:~/code/python> python p1.py
Pure python function Pure python function 3.09896183014 sec
2.43880605698 sec
With psyco:
bulla@badoo:~/code/python> python p1.py
Pure python function Pure python function 0.152606964111 sec
1.23215317726 sec
Yeah but, if you ask me, by the time you get to your last example you might as well just write c and be done with it. right?
Jay,
That would be the case if this is a standalone program. But the whole point was to speed up an existing python codebase one function at a time by writing C extensions. This is one a small part of the overall program and it would be non-trivial to rewrite the entire thing in C from scratch. Which means you need to make a C extension to python - and Cython seems the best way to do that.
More on psyco:
normal version:
Pure python function 3.91737318039 sec
with psyco:
Pure python function 1.15499091148 sec
with psyco and changing all math.xxx to xxx
(from math import acos, cos, sin, pi)
Pure python function 0.049054145813 sec
not bad, a 100x improvement without writing any code
Thanks a lot for this posting. I’m working on a program involving real-time knot simulation and had just re-implemented part of it in Cython. That produced great results (brought a calculation down from 5 to 0.2), but I was still using Python’s sqrt function. Taking your advice, I moved over to C’s sqrtf, which brought it down to 0.04!
I have to say I’m glad for Cython’s existance. I had rewrite a lot of things (much less elegant.. grr) to make it efficient, but it still beats writing in pure C. Knuth’s quote is a good reminder, though. I may go back and make a couple of things clearer again now that I have a bit more of a margin to work with.
BTW, what editor are you using for your Python/Cython coding?
Also, I noticed your post on parallel programming a while back. Another thing you may want to check out is Clojure, a relatively new language built on the JVM. Two of it’s strong points is concurrency and interoperability with Java libraries. The creator gave a talk here recently and I was pretty impressed, though I haven’t had a chance to fuss with it myself yet.
@Shawn
I use vim for almost everything these days. Thanks for the heads up on Clojure; i’ll have to check that out.
@psyco fans
but psyco only works on x86, not even x86 64bit.
Great writeup would you be willing to have it included in the Cython docs? We can add you as an author
Gabriel
[…] Using Python to calculate great_circle circumference, then optimize the same function with Cython […]
[…] Using Python to calculate great_circle circumference, then optimize the same function with Cython […]
Hi , perrygeo. How about your learning of cython? I wanna ask u a question..
I noticed that the type of value that your c function returns to your .pyx def function is basic type(like int,double,etc.),and it works well.
However,now my c function returns a pointer p with the type (double *),which points to a list,and the p’s memory is allocated by malloc function. If I want to use the data stored in list p in my .pyx def function ,what should i do? Now I’ve tried to add the data in list p into a python list,but it doesn’t work and give the following error information:can’t convert to python object.. Anything can i do?
@ssam
Actually, Psyco has been working on intel/mac since 2007 and works just fine on x86_64 as long as the python install is 32 bit.
(Always google before you post.)