
How to do it...
Take a look at the following steps:
- Let's start by inspecting the information that we can get per record:
import vcf
v = vcf.Reader(filename='genotypes.vcf.gz')
print('Variant Level information')
infos = v.infos
for info in infos:
print(info)
print('Sample Level information')
fmts = v.formats
for fmt in fmts:
print(fmt)
We start by inspecting the annotations that are available for each record (remember that each record encodes a variant, such as SNP, CNV, INDELs, and so on, and the state of that variant per sample). At the variant (record) level, we find AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, AN: the total number of alleles in called genotypes, and DP: the total read depth, among others. There are others, but they are mostly specific to the 1,000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).
At the sample level, there are only two annotations in this file: GT—genotype, and DP—the per sample read depth. You have the per variant (total) read depth and the per sample read depth; be sure not to confuse both.
- Now that we know which information is available, let's inspect a single VCF record:
v = vcf.Reader(filename='genotypes.vcf.gz')
rec = next(v)
print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER)
print(rec.INFO)
print(rec.FORMAT)
samples = rec.samples
print(len(samples))
sample = samples[0]
print(sample.called, sample.gt_alleles, sample.is_het, sample.phased)
print(int(sample['DP']))
We will start by retrieving the standard information: the chromosome, position, identifier, reference base, (typically just one), alternative bases (you can have more than one, but it's not uncommon as a first filtering approach to only accept a single ALT, for example, only accept biallelic SNPs), quality (as you might expect, Phred-scaled), and filter status. Regarding the filter status, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).
We then print the additional variant-level information (AC, AS, AF, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we count the number of samples and inspect a single sample to check whether it was called for this variant. Also, the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common) are included.
- Let's check the type of variant and the number of nonbiallelic SNPs in a single pass:
from collections import defaultdict
f = vcf.Reader(filename='genotypes.vcf.gz')
my_type = defaultdict(int)
num_alts = defaultdict(int)
for rec in f:
my_type[rec.var_type, rec.var_subtype] += 1
if rec.is_snp:
num_alts[len(rec.ALT)] += 1
print(num_alts)
print(my_type)
We will now use the now common Python default dictionary. We find that this dataset has INDELs (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-thirds being transitions with one-third transversions). There is a residual number (79) of tri-allelic SNPs.