Sep 18 2007
Parallel python and GIS
Let’s face it - processing speeds aren’t going to be increasing according to Moore’s Law anymore. Instead of faster CPUs, we’ll be getting more of them. The future of programming, it seems to me, lies in the ability to leverage multiple processors. In other words, we have to write parallel code. Until I read Sean’s post, I was unaware that there was a viable Python solution. I had been growing quite disillusioned by Python’s dreaded Global Interpreter Lock, which confines Python to a single processing core. I had even started learning Erlang to leverage SMP processing (until I realized that Erlang and its standard libraries are virtually useless for anything that needs to handle geospatial data).
So I gave Parallel Python (pp) a shot. Since Sean also offered up a bounty for the first GIS application that used pp, I thought it might be a good time to try it.
A good candidate for parallel processing is any application that has to crunch away on lists/arrays of data whose individual members can be handled independently (see pmap in Erlang). I have been working on an application to smooth linework using Bezier curves. It’s not quite polished yet, but the image below shows the before and after:

… but Bezier curves aren’t quite the subject of this post. Let’s just say the algorithm takes some time to compute (if you’re using a high density of vertices) and can be handled one LineString feature at a time. This makes it a prime candidate for parallelization.
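For readers who haven't met cubic Bezier curves, here is a minimal sketch of the standard formula for a point on one. The names `cubic_bezier_point` and `sample_curve` are hypothetical, and this is not the smoothing code from this post. In particular, it says nothing about how control points are chosen for each line segment. It only shows why the cost grows with the number of vertices added per segment.

```python
def cubic_bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].

    p0 and p3 are the end points, p1 and p2 the control points,
    each given as an (x, y) tuple.
    """
    u = 1.0 - t
    # Bernstein basis weights for a cubic curve; they always sum to 1.
    w0 = u * u * u
    w1 = 3.0 * u * u * t
    w2 = 3.0 * u * t * t
    w3 = t * t * t
    x = w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0]
    y = w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1]
    return (x, y)


def sample_curve(p0, p1, p2, p3, num_pts):
    """Return num_pts evenly spaced (in t) interior points of the curve."""
    return [cubic_bezier_point(p0, p1, p2, p3, i / (num_pts + 1.0))
            for i in range(1, num_pts + 1)]
```

Each segment costs a handful of multiplications per added vertex, and no segment depends on any other line's result, which is what makes the work easy to split up.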
Given a list of input LineStrings, I could process them the sequential way:
    smooth_lines = []
    for line in lines:
        smooth_lines.append(calcBezierFromLine(line, num_bezpts, beztype, t))
Or I could use pp to start up a “job server” which doles the tasks out to as many “workers” as I ask for. A busy worker utilizes a single processing core, so a good rule of thumb is to start up as many workers as you have CPU cores:
    numworkers = 2  # dual-core machine
    job_server = pp.Server(numworkers, ppservers=ppservers)
    smooth_lines = []
    jobs = [(line, job_server.submit(calcBezierFromLine,
                                     (line, num_bezpts, beztype, t),
                                     (computeBezier, getPointOnCubicBezier),
                                     ("numpy",)))
            for line in lines]
    for input, job in jobs:
        smooth_lines.append(job())
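If you can't install pp, the same submit-then-collect pattern can be sketched with the `concurrent.futures` module from the Python 3 standard library. To be clear, this swaps in a different library and is not how the script in this post works. `densify` is a hypothetical stand-in for `calcBezierFromLine` (it only adds evenly spaced vertices), and `process_lines` and `process_lines_multicore` are names made up for the example.

```python
import concurrent.futures as cf


def densify(line, extra_pts):
    """Stand-in for the real smoothing function (an assumption).

    Inserts extra_pts evenly spaced vertices into every segment of a
    non-empty line given as a list of (x, y) tuples.
    """
    out = []
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        for i in range(extra_pts + 1):
            f = i / (extra_pts + 1)
            out.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    out.append(line[-1])
    return out


def process_lines(lines, extra_pts, executor):
    """Submit one job per line, then collect results in input order."""
    jobs = [executor.submit(densify, line, extra_pts) for line in lines]
    # result() blocks until that particular job has finished.
    return [job.result() for job in jobs]


def process_lines_multicore(lines, extra_pts, workers=2):
    """Run the jobs in separate processes, one per requested worker."""
    with cf.ProcessPoolExecutor(max_workers=workers) as pool:
        return process_lines(lines, extra_pts, pool)
```

Because `process_lines` accepts any executor, you can pass it a `cf.ThreadPoolExecutor` while debugging, though threads will not get around the Global Interpreter Lock for CPU-bound work. The process-based version should be called from a script guarded by `if __name__ == "__main__":` so the worker processes can import the functions cleanly.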
Theoretically, the parallelized version should run twice as fast as the sequential version on my Core 2 Duo machine. And reality was pretty darn close to that:
    $ time python bezier_smooth_pp.py 2
    Shapefile contains 1114 lines
    Starting pp with 2 workers
    Completed 1114 new lines with 8 additional verticies for each line segment along a cubic bezier curve
    real 0m10.908s
    …
    $ time python bezier_smooth_pp.py 1
    Shapefile contains 1114 lines
    Starting pp with 1 workers
    Completed 1114 new lines with 8 additional verticies for each line segment along a cubic bezier curve
    real 0m20.007s
    …
Just think of the possibilities. In the foreseeable future, the average computer might have 8+ cores to work with. This could mean that your app will run up to 8x faster if you parallelize the code (assuming there are no IO or bandwidth bottlenecks). I’d love to test it out on a system with more than 2 processing cores but, unfortunately, I don’t have access to any Beowulf clusters, Sun UltraSPARC servers, or 8-core Xeon Mac Pros. This is what I really need to complete my research.
So if anyone wants to donate to the cause, send me an email!
And to answer Sean’s bounty, I don’t consider this an actual application (yet), but I hope it can spur some interest and move things in that direction. But if you feel the need to send me some New Belgium swag (or one of the machines listed above), feel free.
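As a rough sanity check on that 8x figure, the two timings above can be plugged into Amdahl's law, S = 1 / ((1 - p) + p / n), where p is the fraction of the run that parallelizes and n is the number of cores. This is only a back-of-the-envelope sketch. It assumes the whole wall-clock time (startup and shapefile IO included) follows that simple model, and it rests on just two measurements. The function names are made up for the example.

```python
def speedup(t_serial, t_parallel):
    """Observed speedup from two wall-clock timings."""
    return t_serial / t_parallel


def parallel_fraction(s, n):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for p."""
    return (1.0 - 1.0 / s) * n / (n - 1.0)


def amdahl_speedup(p, n):
    """Predicted speedup on n cores when a fraction p is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


# Timings from the two runs above (1 worker vs. 2 workers).
s2 = speedup(20.007, 10.908)
p = parallel_fraction(s2, 2)
s8 = amdahl_speedup(p, 8)

print("speedup on 2 cores:      %.2fx" % s2)
print("implied parallel share:  %.1f%%" % (100 * p))
print("projected on 8 cores:    %.2fx" % s8)
```

Under those assumptions the measured speedup is roughly 1.83x, which implies about 91% of the run is parallel work and projects to something closer to 5x than 8x on eight cores. The real number would depend on how much of the remaining time is fixed overhead.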
Have you thought of trying Amazon Web Services EC2?
If you are using multiple machines, consider using MPI or PVM to distribute the processing. Yeah, there are even Python interfaces (e.g. http://sourceforge.net/projects/pympi/). Trying out Amazon’s EC2 Web Services for this is certainly an interesting option.
There are *many* possibilities for taking advantage of multi-core/multi-processor computing as long as you keep an open mind and don’t restrict yourself to using only Python.
Well, C/C++, C#/.NET, Python, Perl, Ruby and Java are really the only languages that have the libraries I need in the geospatial realm. And they’re all adequate at best with regard to concurrency. Personally, I think that threads and shared memory are not the best way to go, so I’m looking at a message-passing paradigm. Unfortunately, Erlang and other functional languages that excel at this type of concurrency have very little ability to process spatial data sources.
MPI looks good as well, though it is not quite as easy to set up as I would hope. PP really shines at ease of installation, which is a must if you’re going to distribute the apps to less technically-oriented users.
EC2 looks like a neat option. If I get some free time, I’ll have to try it out!