Bioinformatics with Python Cookbook

There's more...

The purpose of this recipe is to get you up to speed with the PyVCF module. At this stage, you should be comfortable with the API. We will not spend too much time on usage details because this will be the main purpose of the next recipe: using the VCF module to study the quality of your variant calls.

It will probably not be a shocking revelation that PyVCF is not the fastest module on earth. VCF is a heavily text-based format, so every record has to be parsed from text, which makes processing time-consuming. There are two main strategies for dealing with this problem. One strategy is parallel processing, which we will discuss in the last chapter, Chapter 9, Python for Big Genomics Datasets. The second strategy is to convert to a more efficient format; we will provide an example of this in Chapter 4, Population Genetics. Note that VCF developers are working on a binary (BCF) version to address some of these problems (http://www.1000genomes.org/wiki/analysis/variant-call-format/bcf-binary-vcf-version-2).
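To make the second strategy concrete, here is a minimal sketch of the idea: pay the text-parsing cost once and keep only the fields you need in compact arrays. It is not the conversion used in Chapter 4, and it deliberately uses only the standard library rather than PyVCF. The function name vcf_to_arrays, the MISSING sentinel, and the tiny in-memory VCF are all hypothetical, and the sketch assumes every record has a GT entry in its FORMAT column.

```python
import io
from array import array

MISSING = 255  # hypothetical sentinel for a missing genotype call


def vcf_to_arrays(handle):
    """Parse a VCF text stream once, keeping only positions and
    per-sample counts of non-reference alleles."""
    samples = []
    positions = array("I")   # one unsigned int per record
    alt_counts = array("B")  # one unsigned byte per record per sample
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        positions.append(int(fields[1]))
        gt_index = fields[8].split(":").index("GT")  # assumes GT is present
        for sample_field in fields[9:]:
            genotype = sample_field.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            if "." in alleles:
                alt_counts.append(MISSING)
            else:
                alt_counts.append(sum(1 for a in alleles if a != "0"))
    return samples, positions, alt_counts


# A tiny made-up VCF, built with explicit tabs so the example is self-contained.
DEMO_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2"]),
    "\t".join(["20", "14370", ".", "G", "A", "29", "PASS", ".",
               "GT:DP", "0|0:1", "1|0:8"]),
    "\t".join(["20", "17330", ".", "T", "A", "3", "q10", ".",
               "GT:DP", "0/1:3", "1/1:5"]),
    "\t".join(["20", "1110696", ".", "A", "G,T", "67", "PASS", ".",
               "GT:DP", "./.:0", "1|2:6"]),
]) + "\n"

samples, positions, alt_counts = vcf_to_arrays(io.StringIO(DEMO_VCF))
print(samples, list(positions), list(alt_counts))
```

The arrays are flat: the count for record i and sample j sits at index i * len(samples) + j. Once built, they can be written to disk and reloaded without touching the text file again, which is where the saving comes from when the same data is analyzed repeatedly.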