Bioinformatics with Python Cookbook
上QQ阅读APP看书,第一时间看更新

Getting ready

Organism genomes come in widely different sizes, ranging from viruses such as HIV, which is 9.7 kbp, to bacteria such as E. coli, to protozoans like Plasmodium falciparum, with a 22 Mbp spread across 14 chromosomes, mitochondrion, and apicoplast, to the fruit fly with three autosomes, a mitochondrion, and X/Y sex chromosomes, to humans with their three Gbp pairs spread across 22 autosomes, X/Y chromosomes, and mitochondria, all the way up to Paris japonica, a plant with 150 Gbp of genome. Along the way, you have different ploidy and different sex chromosome organizations.

As you can see, different organisms have very different genome sizes. This difference can be of several orders of magnitude. This can have significant implications for your programming style. Working with a large genome will require you to be more conservative with the usage of memory. Unfortunately, larger genomes would benefit from more speed-efficient programming techniques (as you have much more data to analyze); these are conflicting requirements. The general rule is that you have to be much more careful with efficiency (both speed and memory) with larger genomes.

To make this recipe less of a burden, we will use a small eukaryotic genome from P. falciparum. This genome still has many typical features of larger genomes (for example, multiple chromosomes). Therefore, it's a good compromise between complexity and size. Note that with a genome of the size of P. falciparum, it will be possible to perform many operations by loading the whole genome in-memory. However, we opted for a programming style that can be used with bigger genomes (for example, mammals) so that you can use this recipe in a more general way, but feel free to use more memory-intensive approaches with small genomes like this.

We will use Biopython, which you installed in Chapter 1, Python and the Surrounding Software Ecology. As usual, this recipe is available for the Jupyter Notebook at Chapter03/Reference_Genome.ipynb in the code bundle of this book.