gotoh2
A lightweight Python/C module for pairwise alignment of genetic sequences.
Motivation
Why did I bother writing yet another pairwise alignment program?
- This is, in part, a hobby-coding project that I started to keep my C coding skills up (since I do everything in Python and R these days).
- There was a major issue where I used to work: the lab relied heavily on an old in-house C program for pairwise alignment that was integrated into most of the bioinformatic pipelines. The code was difficult to read, poorly documented, and occasionally yielded inconsistent results (enough to become a problem).
- Over the years I’ve tried other pairwise alignment modules in Python and none1 were as fast, customizable and easy to use as calling an alignment function from the HyPhy shared library. However, calling HyPhy – an extensive software package for phylogenetic sequence analysis – is kind of overkill.
- Even though there are plenty of powerful multiple sequence alignment programs available, pairwise alignment can still be a useful tool – particularly when processing a very large number of fairly long sequences.
1 Other alternatives have come up since I started working on this in my spare time, so this is probably no longer true.
Objectives
- It should be maintainable. I’ve tried to write accessible C code.
- It should be correct. I’ve been implementing a bunch of unit tests to validate the implementation of Altschul and Eriksson’s modification of the Gotoh algorithm.
- It should be customizable. I’ve incorporated a basic interface at the Python level for importing different residue scoring matrices and for modifying gap penalties.
- It should be fast. The C should help here. Ideally, we want to be faster or similar to calling out from Python to another program.
Usage example
Aligning two sequences under default settings:
>>> from gotoh2 import Aligner
>>> g2 = Aligner()
>>> g2.align('ACGT', 'ACT')
('ACGT', 'AC-T', 4)
Benchmarks
Here I report some results for this module and other implementations of pairwise alignment algorithms as I could find for Python. I also did some subprocess
calls to the alignment program MUSCLE
for comparison. For each program, I performed pairwise alignment of 10 HIV-1 integrase sequences against the standard HXB2 reference sequence.
In summary, gotoh2
is a lot faster than the pairwise sequence alignment algorithm implemented in Biopython, and also seems to be substantially faster than other implementations in Python. It even compares somewhat favorably against calling out to MUSCLE
via subprocess.
Requirements
- Python - not sure about version support yet; this was developed in Python 2.7.
- ~NumPy - my next objective is to eliminate this requirement~
- A build environment including gcc, at least until I figure out how to distribute binaries