Working with BAM/CRAM/SAM-formatted files
Opening a file
To begin with, import the pysam module and open apysam.AlignmentFile:
importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")
The above command opens the fileex1.bam for reading.Theb qualifier indicates that this is aBAM file.To open aSAM file, type:
importpysamsamfile=pysam.AlignmentFile("ex1.sam","r")
CRAM files are identified by ac qualifier:
importpysamsamfile=pysam.AlignmentFile("ex1.cram","rc")
Fetching reads mapped to aregion
Reads are obtained through a call to thepysam.AlignmentFile.fetch() method which returns an iterator.Each call to the iterator will returns apysam.AlignedSegmentobject:
iter=samfile.fetch("seq1",10,20)forxiniter:print(str(x))
pysam.AlignmentFile.fetch() returns all reads overlapping aregion sorted by the first aligned base in thereferencesequence. Note that it will also return reads that are only partiallyoverlapping with theregion. Thus the reads returned mightspan a region that is larger than the one queried.
Using the pileup-engine
In contrast tofetching, thepileup engine returns foreach base in thereference sequence the reads that map to thatparticular position. In the typical view of reads stacking verticallyon top of the reference sequence similar to a multiple alignment,fetching iterates over the rows of this implied multiplealignment while apileup iterates over thecolumns.
Callingpileup() will return an iteratorover eachcolumn (reference base) of a specifiedregion. Each call to the iterator returns an object of thetypepysam.PileupColumn that provides access to all thereads aligned to that particular reference position as well assome additional information:
iter=samfile.pileup('seq1',10,20)forxiniter:print(str(x))
Creating BAM/CRAM/SAM files from scratch
The following example shows how a newBAM file is constructedfrom scratch. The important part here is that thepysam.AlignmentFile class needs to receive the sequenceidentifiers. These can be given either as a dictionary in a headerstructure, as lists of names and sizes, or from a template file.Here, we use a header dictionary:
header={'HD':{'VN':'1.0'},'SQ':[{'LN':1575,'SN':'chr1'},{'LN':1584,'SN':'chr2'}]}withpysam.AlignmentFile(tmpfilename,"wb",header=header)asoutf:a=pysam.AlignedSegment()a.query_name="read_28833_29006_6945"a.query_sequence="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"a.flag=99a.reference_id=0a.reference_start=32a.mapping_quality=20a.cigar=((0,10),(2,1),(0,25))a.next_reference_id=0a.next_reference_start=199a.template_length=167a.query_qualities=pysam.qualitystring_to_array("<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<")a.tags=(("NM",1),("RG","L1"))outf.write(a)
Using streams
Pysam does not support reading and writing from true python fileobjects, but it does support reading and writing from stdin andstdout. The following example reads from stdin and writes to stdout:
infile=pysam.AlignmentFile("-","r")outfile=pysam.AlignmentFile("-","w",template=infile)forsininfile:outfile.write(s)
It will also work withBAM files. The following scriptconverts aBAM formatted file on stdin to aSAMformatted file on stdout:
infile=pysam.AlignmentFile("-","rb")outfile=pysam.AlignmentFile("-","w",template=infile)forsininfile:outfile.write(s)
Note that the file open mode needs to changed fromr torb.
Using samtools and bcftools commands within Python
Commands available insamtools andbcftools are available as simplefunction calls, with command line options provided as arguments. Forexample:
importpysam.samtoolspysam.samtools.sort("-o","output.bam","ex1.bam",catch_stdout=False)importpysam.bcftoolspysam.bcftools.index("--csi","ex2.vcf.gz")
corresponds to the command lines:
samtoolssort-ooutput.bamex1.bambcftoolsindex--csiex2.vcf.gz
Samtools commands are also imported into the mainpysam namespace.For example:
pysam.sort("-m","1000000","-o","output.bam","ex1.bam",catch_stdout=False)
To make them valid Python identifiers, the functionscram_size()andfqimport() are spelt thus, differently from theircorresponding commands.
In order to get usage information, try:
print(pysam.sort.usage())
Argument errors raise apysam.SamtoolsError:
pysam.sort()Traceback(mostrecentcalllast):File"x.py",line12,in<module>pysam.sort()File"/build/lib.linux-x86_64-2.6/pysam/__init__.py",line37,in__call__ifretval:raiseSamtoolsError("\n".join(stderr))pysam.SamtoolsError:'Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>\n'
Messages fromsamtools on stderr are captured and areavailable using theget_messages() method:
pysam.sort.get_messages()
By default, pysam captures the samtools command’s standard output and returns itas the function’s return value. To redirect stdout to a file instead, either usethesave_stdout keyword argument, or use"-o","filename" in the argumentsand also usecatch_stdout=False to prevent pysam’s capturing from overridingyour redirection. Finally,catch_stdout=False by itself discards standard output,which may help resolve problems in environments such as IPython notebooks:
# Return valuepileup_text=pysam.samtools.mpileup("in.bam")# Save to filepysam.samtools.mpileup("in.bam",save_stdout=pileup_filename)pysam.samtools.mpileup("-o",pileup_filename,"in.bam",catch_stdout=False)# Discard standard outputpysam.samtools.mpileup("in.bam",catch_stdout=False)# Returns None
For eachcommand available as asamtools subcommand,the following functions are provided:
- pysam.samtools.command(args,*,catch_stdout=True,save_stdout=None,split_lines=False)
- Parameters:
- Returns:
Standard output if it was caught, otherwise None.
Ifsave_stdout is not None, the command’s standard ouput is written to thefile specified and the function returns None.
Otherwise, ifcatch_stdout is true, the command’s standard output is capturedand used as the function’s return value — either as a single
stror aslist[str]according tosplit_lines. Ifcatch_stdout is false,the command’s standard output is discarded and the function returns None.The command’s standard error is always captured and made available via
get_messages().
- pysam.samtools.command.get_messages()
Returns the standard error from the most recent invocation of the particular
command, either as a singlestror aslist[str]according tosplit_lines as specified in that invocation.
For eachcommand available as abcftools subcommand, thepysam.bcftools.command(),pysam.bcftools.command.get_messages(),andpysam.bcftools.command.usage() functions operate similarly.
Working with tabix-indexed files
To open a tabular file that has been indexed withtabix, useTabixFile:
importpysamtbx=pysam.TabixFile("example.bed.gz")
Similar tofetch, intervals within aregion can be retrieved by callingfetch():
forrowintbx.fetch("chr1",1000,2000):print(str(row))
This will return a tuple-like data structure in which columns canbe retrieved by numeric index:
forrowintbx.fetch("chr1",1000,2000):print("chromosome is",row[0])
By providing a parser tofetchorTabixFile, the data will we presented in parsedform:
forrowintbx.fetch("chr1",1000,2000,parser=pysam.asTuple()):print("chromosome is",row.contig)print("first field (chrom)=",row[0])
Pre-built parsers are available forbed(asBed) formatted files andgtf(asGTF) formatted files. Thus, additional fieldsbecome available through named access, for example:
forrowintbx.fetch("chr1",1000,2000,parser=pysam.asBed()):print("name is",row.name)
Working with VCF/BCF formatted files
To iterate through a VCF/BCF formatted file useVariantFile:
frompysamimportVariantFilebcf_in=VariantFile("test.bcf")# auto-detect input formatbcf_out=VariantFile('-','w',header=bcf_in.header)forrecinbcf_in.fetch('chr1',100000,200000):bcf_out.write(rec)
_pysam.VariantFile.fetch() iterates overVariantRecord objects which provides access tosimple variant attributes such ascontig,pos,ref:
forrecinbcf_in.fetch():print(rec.pos)
but also to complex attributes such as the contents to theinfo,formatandgenotype columns. Thesecomplex attributes are views on the underlying htslib data structuresand provide dictionary-like access to the data:
forrecinbcf_in.fetch():print(rec.info)print(rec.info.keys())print(rec.info["DP"])
Theheader attribute(VariantHeader) provides access informationstored in thevcf header. The complete header can be printed:
>>>print(bcf_in.header)##fileformat=VCFv4.2##FILTER=<ID=PASS,Description="All filters passed">##fileDate=20090805##source=myImputationProgramV3.1##reference=1000GenomesPilot-NCBI36##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of SamplesWith Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">##contig=<ID=M>##contig=<ID=17>##contig=<ID=20>##bcftools_viewVersion=1.3+htslib-1.3##bcftools_viewCommand=view -O b -o example_vcf42.bcfexample_vcf42.vcf.gz#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000
Individual contents such as contigs, info fields, samples, formats canbe retrieved as attributes fromheader:
>>>print(bcf_in.header.contigs)<pysam.cbcf.VariantHeaderContigs object at 0xf250f8>
To convert these views to native python types, iterate through the views:
>>>print(list((bcf_in.header.contigs)))['M', '17', '20']>>>print(list((bcf_in.header.filters)))['PASS', 'q10', 's50']>>>print(list((bcf_in.header.info)))['NS', 'DP', 'AF', 'AA', 'DB', 'H2']>>>print(list((bcf_in.header.samples)))['NA00001', 'NA00002', 'NA00003']
Alternatively, it is possible to iterate through all records in theheader returning objects of typeVariantHeaderRecord::
>>>forxinbcf_in.header.records:>>>print(x)>>>print(x.type,x.key)GENERIC fileformatFILTER FILTERGENERIC fileDateGENERIC sourceGENERIC referenceGENERIC phasingINFO INFOINFO INFOINFO INFOINFO INFOINFO INFOINFO INFOFILTER FILTERFILTER FILTERFORMAT FORMATFORMAT FORMATFORMAT FORMATFORMAT FORMATCONTIG contigCONTIG contigCONTIG contigGENERIC bcftools_viewVersionGENERIC bcftools_viewCommand
Extending pysam
Usingpyximport, it is (relatively) straight-forward to access pysaminternals and the underlyingsamtools library. An example is providedin thetests directory. The example emulates the samtoolsflagstat command and consists of three files:
The main script
pysam_flagstat.py. The important lines inthis script are:importpyximportpyximport.install()import_pysam_flagstat...flag_counts=_pysam_flagstat.count(pysam_in)
The first part imports, sets uppyximport and imports the cythonmodule
_pysam_flagstat. The second part calls thecountmethod in_pysam_flagstat.The cython implementation
_pysam_flagstat.pyx. This scriptimports the pysam API via:frompysam.libcalignmentfilecimportAlignmentFile,AlignedSegment
This statement imports, amongst others,
AlignedSegmentinto the namespace. Speed can be gained from declaringvariables. For example, to efficiently iterate over a file, anAlignedSegmentobject is declared:# loop over samfilecdefAlignedSegmentreadforreadinsamfile:...
A
pyxbldprovidingpyximport with build information.Required are the locations of the samtools and pysam headerlibraries of a source installation of pysam plus thecsamtools.soshared library. For example:defmake_ext(modname,pyxfilename):fromdistutils.extensionimportExtensionimportpysamreturnExtension(name=modname,sources=[pyxfilename],extra_link_args=pysam.get_libraries(),include_dirs=pysam.get_include(),define_macros=pysam.get_defines())
If the scriptpysam_flagstat.py is called the first time,pyximport will compile thecython extension_pysam_flagstat.pyx and make it available to thescript. Compilation requires a working compiler andcythoninstallation. Each time_pysam_flagstat.pyx is modified, anew compilation will take place.