Introduction

Pysam is a python module that makes it easy to read and manipulatemapped short read sequence data stored in SAM/BAM files. It is alightweight wrapper of thehtslib C-API.

This page provides a quick introduction in using pysam followed by theAPI. SeeWorking with BAM/CRAM/SAM-formatted files for more detailed usage instructions.

To use the module to read a file in BAM format, create aAlignmentFile object:

importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")

Once a file is opened you can iterate over all of the reads mapping toa specified region usingfetch(). Eachiteration returns aAlignedSegment object whichrepresents a single read along with its fields and optional tags:

forreadinsamfile.fetch('chr1',100,120):print(read)samfile.close()

To give:

EAS56_57:6:190:289:82099<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;69CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA01921EAS56_57:6:190:289:82099<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;137AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC73641EAS51_64:3:190:727:3080102<<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<84499GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG99181...

You can also write to aAlignmentFile:

importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")pairedreads=pysam.AlignmentFile("allpaired.bam","wb",template=samfile)forreadinsamfile.fetch():ifread.is_paired:pairedreads.write(read)pairedreads.close()samfile.close()

An alternative way of accessing the data in a SAM file is by iteratingover each base of a specified region using thepileup() method. Each iteration returns aPileupColumn which represents all the reads in the SAMfile that map to a single base in the reference sequence. The list ofreads are represented asPileupRead objects in thePileupColumn.pileups property:

importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")forpileupcolumninsamfile.pileup("chr1",100,120):print("\ncoverage at base%s =%s"%(pileupcolumn.pos,pileupcolumn.n))forpileupreadinpileupcolumn.pileups:ifnotpileupread.is_delandnotpileupread.is_refskip:# query position is None if is_del or is_refskip is set.print('\tbase in read%s =%s'%(pileupread.alignment.query_name,pileupread.alignment.query_sequence[pileupread.query_position]))samfile.close()

The above code outputs:

coverageatbase99=1baseinreadEAS56_57:6:190:289:82=Acoverageatbase100=1baseinreadEAS56_57:6:190:289:82=Gcoverageatbase101=1baseinreadEAS56_57:6:190:289:82=Gcoverageatbase102=2baseinreadEAS56_57:6:190:289:82=GbaseinreadEAS51_64:3:190:727:308=G...

Commands available insamtools are available as simplefunction calls. For example:

pysam.sort("-o","output.bam","ex1.bam")

corresponds to the command line:

samtoolssort-ooutput.bamex1.bam

Analogous toAlignmentFile, aTabixFile allows fast random access to compressed andtabix indexed tab-separated file formats with genomic data:

importpysamtabixfile=pysam.TabixFile("example.gtf.gz")forgtfintabixfile.fetch("chr1",1000,2000):print(gtf.contig,gtf.start,gtf.end,gtf.gene_id)

TabixFile implements lazy parsing in order to iterateover large tables efficiently.

More detailed usage instructions are available atWorking with BAM/CRAM/SAM-formatted files.

Note

Coordinates in pysam are always 0-based (following the pythonconvention). SAM text files use 1-based coordinates.

Note

The above examples can be run in thetests directory of the

installation directory. Type ‘make’ before running them.

See also

https://github.com/pysam-developers/pysam

The pysam code repository, containing source code and downloadinstructions

http://pysam.readthedocs.org/en/latest/

The pysam website containing documentation

API

SAM/BAM/CRAM files

Objects of typeAlignmentFile allow working withBAM/SAM formatted files.

classpysam.AlignmentFile

AlignmentFile(filepath_or_object, mode=None, template=None,reference_names=None, reference_lengths=None, text=NULL,header=None, add_sq_text=False, check_header=True, check_sq=True,reference_filename=None, filename=None, index_filename=None,filepath_index=None, require_index=False, duplicate_filehandle=True,ignore_truncation=False, threads=1)

ASAM/BAM/CRAM formatted file.

Iffilepath_or_object is a string, the file is automaticallyopened. Iffilepath_or_object is a python File object, thealready opened file will be used.

If the file is opened for reading and an index exists (if file is BAM, a.bai file or if CRAM a .crai file), it will be opened automatically.index_filename may be specified explicitly. If the index is not namedin the standard manner, not located in the same directory as theBAM/CRAM file, or is remote. Without an index, random access viafetch() andpileup()is disabled.

For writing, the header of aSAM file/BAM file canbe constituted from several sources (see also the samtools formatspecification):

  1. Iftemplate is given, the header is copied from anotherAlignmentFile (template must be aAlignmentFile).

  2. Ifheader is given, the header is built from amulti-level dictionary.

  3. Iftext is given, new header text is copied from rawtext.

  4. The names (reference_names) and lengths(reference_lengths) are supplied directly as lists.

When reading or writing a CRAM file, the filename of a FASTA-formattedreference can be specified withreference_filename.

By default, if a file is opened in mode ‘r’, it is checkedfor a valid header (check_header = True) and a definition ofchromosome names (check_sq = True).

Parameters:
  • mode (string) –

    mode should ber for reading orw for writing. Thedefault is text mode (SAM). For binary (BAM)I/O you should appendb for compressed oru foruncompressedBAM output. Useh to output headerinformation in text (TAM) mode. Usec forCRAM formatted files.

    Ifb is present, it must immediately followr orw. Valid modes arer,w,wh,rb,wb,wbu,wb0,rc andwc. For instance, to open aBAM formatted file for reading, type:

    f=pysam.AlignmentFile('ex1.bam','rb')

    If mode is not specified, the method will try to auto-detectin the order ‘rb’, ‘r’, thus both the following should work:

    f1=pysam.AlignmentFile('ex1.bam')f2=pysam.AlignmentFile('ex1.sam')

  • template (AlignmentFile) – when writing, copy header from filetemplate.

  • header (dict orAlignmentHeader) – when writing, build header from a multi-level dictionary. Thefirst level are the four types (‘HD’, ‘SQ’, …). The secondlevel are a list of lines, with each line being a list oftag-value pairs. The header is constructed first from all thedefined fields, followed by user tags in alphabeticalorder. Alternatively, anAlignmentHeaderobject can be passed directly.

  • text (string) – when writing, use the string provided as the header

  • reference_names (list) – see reference_lengths

  • reference_lengths (list) – when writing or opening a SAM file without header build headerfrom list of chromosome names and lengths. By default, ‘SQ’and ‘LN’ tags will be added to the header text. This optioncan be changed by unsetting the flagadd_sq_text.

  • add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permitsconstructionSAM formatted files without a header.

  • add_sam_header (bool) – when outputting SAM the default is to output a header. This isequivalent to opening the file in ‘wh’ mode. If this option isset to False, no header will be output. To read such a file,setcheck_header=False.

  • check_header (bool) – obsolete: when reading a SAM file, check if header is present(default=True)

  • check_sq (bool) – when reading, check if SQ entries are present in header(default=True)

  • reference_filename (string) – Path to a FASTA-formatted reference file. Valid only for CRAM files.When reading a CRAM file, this overrides both$REF_PATH and the URLspecified in the header (UR tag), which are normally used to findthe reference.

  • index_filename (string) – Explicit path to the index file. Only needed if the index is notnamed in the standard manner, not located in the same directory asthe BAM/CRAM file, or is remote. An IOError is raised if the indexcannot be found or is invalid.

  • filepath_index (string) – Alias forindex_filename.

  • require_index (bool) – When reading, require that an index file is present and is valid orraise an IOError. (default=False)

  • filename (string) – Alternative to filepath_or_object. Filename of the fileto be opened.

  • duplicate_filehandle (bool) – By default, file handles passed either directly or throughFile-like objects will be duplicated before passing them tohtslib. The duplication prevents issues where the same streamwill be closed by htslib and through destruction of thehigh-level python object. Set to False to turn offduplication.

  • ignore_truncation (bool) – Issue a warning, instead of raising an error if the current fileappears to be truncated due to a missing EOF marker. Only appliesto bgzipped formats. (Default=False)

  • format_options (list) – A list of key=value strings, as accepted by –input-fmt-option and–output-fmt-option in samtools.

  • threads (integer) – Number of threads to use for compressing/decompressing BAM/CRAM files.Setting threads to > 1 cannot be combined withignore_truncation.(Default=1)

check_index(self)

return True if index is present.

Raises:
close(self)

closes thepysam.AlignmentFile.

count(self,contig=None,start=None,stop=None,region=None,until_eof=False,read_callback='nofilter',reference=None,end=None)

count the number of reads inregion

The region is specified bycontig,start andstop.reference andend are also accepted for backwardcompatibility as synonyms forcontig andstop,respectively. Alternatively, asamtoolsregionstring can be supplied.

ASAM file does not allow random access and ifregion orcontig are given, an exception is raised.

Parameters:
  • contig (string) – reference_name of the genomic region (chromosome)

  • start (int) – start of the genomic region (0-based inclusive)

  • stop (int) – end of the genomic region (0-based exclusive)

  • region (string) – a region string in samtools format.

  • until_eof (bool) – count until the end of the file, possibly includingunmapped reads as well.

  • read_callback (string orfunction) –

    select a call-back to ignore reads when counting. It canbe either a string with the following values:

    all

    skip reads in which any of the followingflags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,BAM_FDUP

    nofilter

    uses every single read

    Alternatively,read_callback can be a functioncheck_read(read) that should return True only forthose reads that shall be included in the counting.

  • reference (string) – backward compatible synonym forcontig

  • end (int) – backward compatible synonym forstop

Raises:

ValueError – if the genomic coordinates are out of range or invalid.

count_coverage(self,contig,start=None,stop=None,region=None,quality_threshold=15,read_callback='all',reference=None,end=None)

count the coverage of genomic positions by reads inregion.

The region is specified bycontig,start andstop.reference andend are also accepted for backwardcompatibility as synonyms forcontig andstop,respectively. Alternatively, asamtoolsregionstring can be supplied. The coverage is computed per-base [ACGT].

Parameters:
  • contig (string) – reference_name of the genomic region (chromosome)

  • start (int) – start of the genomic region (0-based inclusive). If notgiven, count from the start of the chromosome.

  • stop (int) – end of the genomic region (0-based exclusive). If not given,count to the end of the chromosome.

  • region (string) – a region string.

  • quality_threshold (int) – quality_threshold is the minimum quality score (in phred) abase has to reach to be counted.

  • read_callback (string orfunction) –

    select a call-back to ignore reads when counting. It canbe either a string with the following values:

    all

    skip reads in which any of the followingflags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,BAM_FDUP

    nofilter

    uses every single read

    Alternatively,read_callback can be a functioncheck_read(read) that should return True only forthose reads that shall be included in the counting.

  • reference (string) – backward compatible synonym forcontig

  • end (int) – backward compatible synonym forstop

Raises:

ValueError – if the genomic coordinates are out of range or invalid.

Returns:

four array.arrays of the same length in order A C G T

Return type:

tuple

fetch(self,contig=None,start=None,stop=None,region=None,tid=None,until_eof=False,multiple_iterators=False,reference=None,end=None)

fetch reads aligned in aregion.

Seeparse_region() for more informationon how genomic regions can be specified.reference andend are also accepted for backward compatibility as synonymsforcontig andstop, respectively.

Without acontig orregion all mapped reads in the filewill be fetched. The reads will be returned ordered by referencesequence, which will not necessarily be the order within thefile. This mode of iteration still requires an index. If there isno index, useuntil_eof=True.

If onlycontig is set, all reads aligned tocontigwill be fetched.

ASAM file does not allow random access. Ifregionorcontig are given, an exception is raised.

Parameters:
  • until_eof (bool) – Ifuntil_eof is True, all reads from the current fileposition will be returned in order as they are within thefile. Using this option will also fetch unmapped reads.

  • multiple_iterators (bool) – Ifmultiple_iterators is True, multipleiterators on the same file can be used at the same time. Theiterator returned will receive its own copy of a filehandle tothe file effectively re-opening the file. Re-opening a filecreates some overhead, so beware.

Returns:

An iterator over a collection of reads.

Return type:

IteratorRow

Raises:

ValueError – if the genomic coordinates are out of range or invalid or the file does not permit random access to genomic coordinates.

find_introns(self,read_iterator)

Return a dictionary {(start, stop): count}Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings),and their support ( = number of reads ).

read_iterator can be the result of a .fetch(…) call.Or it can be a generator filtering such reads. Examplesamfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)

find_introns_slow(self,read_iterator)

Return a dictionary {(start, stop): count}Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings),and their support ( = number of reads ).

read_iterator can be the result of a .fetch(…) call.Or it can be a generator filtering such reads. Examplesamfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)

get_index_statistics(self)

return statistics about mapped/unmapped reads per chromosome asthey are stored in the index, similarly to the statistics printedby thesamtoolsidxstats command.

CRAI indexes do not record these statistics, so for a CRAM filewith a .crai index the returned statistics will all be 0.

Returns:

a list of records for each chromosome. Each record has theattributes ‘contig’, ‘mapped’, ‘unmapped’ and ‘total’.

Return type:

list

get_reference_length(self,reference)

returnreference length corresponding to numericaltid

get_reference_name(self,tid)

returnreference name corresponding to numericaltid

get_tid(self,reference)

return the numericaltid corresponding toreference

returns -1 if reference is not known.

getrname(self,tid)

deprecated, useget_reference_name() instead

gettid(self,reference)

deprecated, useget_tid() instead

has_index(self)

return true if htsfile has an existing (and opened) index.

head(self,n,multiple_iterators=True)

return an iterator over the first n alignments.

This iterator is is useful for inspecting the bam-file.

Parameters:

multiple_iterators (bool) – is set to True by default in order toavoid changing the current file position.

Returns:

an iterator over a collection of reads

Return type:

IteratorRowHead

is_valid_tid(self,inttid)

return True if the numericaltid is valid; False otherwise.

Note that the unmapped tid code (-1) counts as an invalid.

lengths

tuple of the lengths of thereference sequences. This is aread-only attribute. The lengths are in the same order aspysam.AlignmentFile.references

mapped

int with total number of mapped alignments according to thestatistics recorded in the index. This is a read-onlyattribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)

mate(self,AlignedSegmentread)

return the mate ofpysam.AlignedSegmentread.

Note

Calling this method will change the file position.This might interfere with any iterators that havenot re-opened the file.

Note

This method is too slow for high-throughput processing.If a read needs to be processed with its mate, workfrom a read name sorted file or, better, cache reads.

Returns:

the mate

Return type:

AlignedSegment

Raises:

ValueError – if the read is unpaired or the mate is unmapped

nocoordinate

int with total number of reads without coordinates according to thestatistics recorded in the index, i.e., the statistic printed for “*”by thesamtoolsidxstats command. This is a read-only attribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)

nreferences

int with the number ofreference sequences in the file.This is a read-only attribute.

pileup(self,contig=None,start=None,stop=None,region=None,reference=None,end=None,**kwargs)

perform apileup within aregion. The region isspecified bycontig,start andstop (using0-based indexing).reference andend are also accepted forbackward compatibility as synonyms forcontig andstop,respectively. Alternatively, a samtools ‘region’ stringcan be supplied.

Without ‘contig’ or ‘region’ all reads will be used for thepileup. The reads will be returned ordered bycontig sequence, which will not necessarily be theorder within the file.

Note thatSAM formatted files do not allow randomaccess. In these files, if a ‘region’ or ‘contig’ aregiven an exception is raised.

Note

‘all’ reads which overlap the region are returned. Thefirst base returned will be the first base of the firstread ‘not’ necessarily the first base of the region usedin the query.

Parameters:
  • truncate (bool) – By default, the samtools pileup engine outputs all readsoverlapping a region. If truncate is True and a region isgiven, only columns in the exact region specified arereturned.

  • max_depth (int) – Maximum read depth permitted. The default limit is ‘8000’.

  • stepper (string) –

    The stepper controls how the iterator advances.Possible options for the stepper are

    all

    skip reads in which any of the following flags are set:BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP

    nofilter

    uses every single read turning off any filtering.

    samtools

    same filter and read processing as in samtoolspileup. For full compatibility, this requires a‘fastafile’ to be given. The following options all pertainto filtering of thesamtools stepper.

  • fastafile (FastaFile object.) – This is required for some of the steppers.

  • ignore_overlaps (bool) – If set to True, detect if read pairs overlap and only takethe higher quality base. This is the default.

  • flag_filter (int) – ignore reads where any of the bits in the flag are set. The default isBAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP.

  • flag_require (int) – only use reads where certain flags are set. The default is 0.

  • ignore_orphans (bool) – ignore orphans (paired reads that are not in a proper pair).The default is to ignore orphans.

  • min_base_quality (int) – Minimum base quality. Bases below the minimum quality willnot be output. The default is 13.

  • adjust_capq_threshold (int) – adjust mapping quality. The default is 0 for noadjustment. The recommended value for adjustment is 50.

  • min_mapping_quality (int) – only use reads above a minimum mapping quality. The default is 0.

  • compute_baq (bool) – re-alignment computing per-Base Alignment Qualities (BAQ). Thedefault is to do re-alignment. Realignment requires a referencesequence. If none is present, no realignment will be performed.

  • redo_baq (bool) – recompute per-Base Alignment Quality on the fly ignoringexisting base qualities. The default is False (use existingbase qualities).

Returns:

an iterator over genomic positions.

Return type:

IteratorColumn

references

tuple with the names ofreference sequences. This is aread-only attribute

text

deprecated, usereferences andlengths instead

unmapped

int with total number of unmapped reads according to the statisticsrecorded in the index. This number of reads includes the number of readswithout coordinates. This is a read-only attribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)

write(self,AlignedSegmentread)int

write a singlepysam.AlignedSegment to disk.

Raises:

ValueError – if the writing failed

Returns:

the number of bytes written. If the file is closed,this will be 0.

Return type:

int

classpysam.AlignmentHeader

header information for aAlignmentFile object

Parameters:
  • header_dict (dict) – build header from a multi-level dictionary. Thefirst level are the four types (‘HD’, ‘SQ’, …). The secondlevel are a list of lines, with each line being a list oftag-value pairs. The header is constructed first from all thedefined fields, followed by user tags in alphabeticalorder. Alternatively, anAlignmentHeaderobject can be passed directly.

  • text (string) – use the string provided as the header

  • reference_names (list) – see reference_lengths

  • reference_lengths (list) – build header from list of chromosome names and lengths. Bydefault, ‘SQ’ and ‘LN’ tags will be added to the headertext. This option can be changed by unsetting the flagadd_sq_text.

  • add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permitsconstructionSAM formatted files without a header.

as_dict(self)

deprecated, useto_dict() instead

copy(self)
classmethodfrom_dict(cls,header_dict)
classmethodfrom_references(cls,reference_names,reference_lengths,text=None,add_sq_text=True)
classmethodfrom_text(cls,text)
get(self,*args)
get_reference_length(self,reference)
get_reference_name(self,tid)
get_tid(self,reference)

return the numericaltid corresponding toreference

returns -1 if reference is not known.

is_valid_tid(self,inttid)

return True if the numericaltid is valid; False otherwise.

Note that the unmapped tid code (-1) counts as an invalid.

items(self)
iteritems(self)
keys(self)
lengths

tuple of the lengths of thereference sequences. This is aread-only attribute. The lengths are in the same order aspysam.AlignmentFile.references

nreferences

int with the number ofreference sequences in the file.

This is a read-only attribute.

references

tuple with the names ofreference sequences. This is aread-only attribute

to_dict(self)

return two-level dictionary with header information from the file.

The first level contains the record (HD,SQ, etc) andthe second level contains the fields (VN,LN, etc).

The parser is validating and will raise an AssertionError ifif encounters any record or field tags that are not part ofthe SAM specification. Use thepysam.AlignmentFile.text attribute to get the unparsedheader.

The parsing follows the SAM format specification with theexception of theCL field. This option will consume therest of a header line irrespective of any additional fields.This behaviour has been added to accommodate command lineoptions that contain characters that are not valid fieldseparators.

If no @SQ entries are within the text section of the header,this will be automatically added from the reference names andlengths stored in the binary part of the header.

values(self)

AnAlignedSegment represents an aligned segment withina SAM/BAM file.

classpysam.AlignedSegment(AlignmentHeaderheader=None)

Class representing an aligned segment.

This class stores a handle to the samtools C-structure representingan aligned read. Member read access is forwarded to the C-structureand converted into python objects. This implementation should be fast,as only the data needed is converted.

For write access, the C-structure is updated in-place. This isnot the most efficient way to build BAM entries, as the variablelength data is concatenated and thus needs to be resized ifa field is updated. Furthermore, the BAM entry might bein an inconsistent state.

One issue to look out for is that the sequence should alwaysbe setbefore the quality scores. Setting the sequence willalso erase any quality scores that were set previously.

Parameters:

headerAlignmentHeader object to map numericalidentifiers to chromosome names. If not given, an emptyheader is created.

aend

deprecated, usereference_end instead.

alen

deprecated, usereference_length instead.

aligned_pairs

deprecated, useget_aligned_pairs() instead.

bin

properties bin

blocks

deprecated, useget_blocks() instead.

cigar

deprecated, usecigarstring orcigartuples instead.

cigarstring

thecigar alignment as a string.

The cigar string is a string of alternating integersand characters denoting the length and the type ofan operation.

Note

The order length,operation is specified in theSAM format. It is different from the order ofthecigar property.

Returns None if not present.

To unset the cigarstring, assign None or theempty string.

cigartuples

thecigar alignment. The alignmentis returned as a list of tuples of (operation, length).

If the alignment is not present, None is returned.

The operations are:

M

pysam.CIGAR_OPS.CMATCH

0

I

pysam.CIGAR_OPS.CINS

1

D

pysam.CIGAR_OPS.CDEL

2

N

pysam.CIGAR_OPS.CREF_SKIP

3

S

pysam.CIGAR_OPS.CSOFT_CLIP

4

H

pysam.CIGAR_OPS.CHARD_CLIP

5

P

pysam.CIGAR_OPS.CPAD

6

=

pysam.CIGAR_OPS.CEQUAL

7

X

pysam.CIGAR_OPS.CDIFF

8

B

pysam.CIGAR_OPS.CBACK

9

Note

The output is a list of (operation, length) tuples, such as[(0,30)].This is different from the SAM specification andthecigarstring property, which uses a(length, operation) order, for example:30M.

To unset the cigar property, assign an empty listor None.

compare(self,AlignedSegmentother)

return -1,0,1, if contents in this are binary<,=,> toother

flag

properties flag

classmethodfrom_dict(cls,sam_dict,AlignmentHeaderheader)

parses a dictionary representation of the aligned segment.

Parameters:

sam_dict – dictionary of alignment values, keys corresponding to output fromtodict().

classmethodfromstring(cls,sam,AlignmentHeaderheader)

parses a string representation of the aligned segment.

The input format should be valid SAM format.

Parameters:

samSAM formatted string

get_aligned_pairs(self,matches_only=False,with_seq=False,with_cigar=False)

a list of aligned read (query) and reference positions.

Each item in the returned list is a tuple consisting ofthe 0-based offset from the start of the read sequencefollowed by the 0-based reference position.

For inserts, deletions, skipping either query or referenceposition may be None.

For padding in the reference, the reference position willalways be None.

Parameters:
  • matches_only (bool) – If True, only matched bases are returned — no None on eitherside.

  • with_seq (bool) – If True, return a third element in the tuple containing thereference sequence. For CIGAR ‘P’ (padding in the reference)operations, the third tuple element will be None. Substitutionsare lower-case. This option requires an MD tag to be present.

  • with_cigar (bool) – If True, return an extra element in the tuple containing theCIGAR operator corresponding to this position tuple.

Returns:

aligned_pairs

Return type:

list of tuples

get_blocks(self)

a list of start and end positions ofaligned gapless blocks.

The start and end positions are in genomiccoordinates.

Blocks are not normalized, i.e. two blocksmight be directly adjacent. This happens ifthe two blocks are separated by an insertionin the read.

get_cigar_stats(self)

summary of operations in cigar string.

The output order in the array is “MIDNSHP=X” followed by afield for the NM tag. If the NM tag is not present, thisfield will always be 0. (Accessing this field via index -1avoids changes if more CIGAR operators are added in future.)

M

pysam.CIGAR_OPS.CMATCH

0

I

pysam.CIGAR_OPS.CINS

1

D

pysam.CIGAR_OPS.CDEL

2

N

pysam.CIGAR_OPS.CREF_SKIP

3

S

pysam.CIGAR_OPS.CSOFT_CLIP

4

H

pysam.CIGAR_OPS.CHARD_CLIP

5

P

pysam.CIGAR_OPS.CPAD

6

=

pysam.CIGAR_OPS.CEQUAL

7

X

pysam.CIGAR_OPS.CDIFF

8

B

pysam.CIGAR_OPS.CBACK

9

NM

NM tag

10 or -1

If no cigar string is present, empty arrays will be returned.

Returns:

two arrays. The first contains the nucleotide counts withineach cigar operation, the second contains the number of blocksfor each cigar operation.

Return type:

arrays

get_forward_qualities(self)

return the original base qualities of the read sequence,in the same format as thequery_qualities property.

Reads mapped to the reverse strand have their base qualities storedreversed in the BAM file. This method returns such reads’ base qualitiesreversed back to their original orientation.

get_forward_sequence(self)

return the original read sequence.

Reads mapped to the reverse strand are stored reverse complemented inthe BAM file. This method returns such reads reverse complemented backto their original orientation.

Returns None if the record has no query sequence.

get_overlap(self,uint32_tstart,uint32_tend)

return number of aligned bases of read overlapping the intervalstart andend on the reference sequence.

Return None if cigar alignment is not available.

get_reference_positions(self,full_length=False)

a list of reference positions that this read aligns to.

By default, this method returns the (0-based) positions on thereference that are within the read’s alignment, leaving gapscorresponding to deletions and other reference skips.

Whenfull_length is True, the returned list is the same lengthas the read and additionally includes None values correspondingto insertions or soft-clipping, i.e., to bases of the read thatare not aligned to a reference position.(See alsoget_aligned_pairs() which additionally returnsthe corresponding positions along the read.)

get_reference_sequence(self)

return the reference sequence in the region that is covered by thealignment of the read to the reference.

This method requires the MD tag to be set.

get_tag(self,tag,with_value_type=False)

retrieves data from the optional alignment sectiongiven a two-lettertag denoting the field.

The returned value is cast into an appropriate python type.

This method is the fastest way to access the optionalalignment section if only few tags need to be retrieved.

Possible value types are “AcCsSiIfZHB” (see BAM formatspecification) as well as additional value type ‘d’ asimplemented in htslib.

Parameters:
  • tag – data tag.

  • with_value_type – Optional[bool]if set to True, the return value is a tuple of (tag value, typecode). (default False)

Returns:

A python object with the value of thetag. The type of theobject depends on the data type in the data record.

Raises:

KeyError – Iftag is not present, a KeyError is raised.

get_tags(self,with_value_type=False)

the fields in the optional alignment section.

Returns a list of all fields in the optionalalignment section. Values are converted to appropriate pythonvalues. For example:[(NM,2),(RG,"GJP00TM04")]

Ifwith_value_type is set, the value type as encode inthe AlignedSegment record will be returned as well:

[(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]

This method will convert all values in the optional alignmentsection. When getting only one or few tags, please seeget_tag() for a quicker way to achieve this.

has_tag(self,tag)

returns true if the optional alignment sectioncontains a giventag.

infer_query_length(self,always=False)

infer query length from CIGAR alignment.

This method deduces the query length from the CIGAR alignmentbut does not include hard-clipped bases.

Returns None if CIGAR alignment is not present.

Ifalways is set to True,infer_read_length is used instead.This is deprecated and only present for backward compatibility.

infer_read_length(self)

infer read length from CIGAR alignment.

This method deduces the read length from the CIGAR alignmentincluding hard-clipped bases.

Returns None if CIGAR alignment is not present.

inferred_length

deprecated, useinfer_query_length() instead.

is_duplicate

true if optical or PCR duplicate

is_forward

true if read is mapped to forward strand(implemented in terms ofis_reverse)

is_mapped

true if read itself is mapped(implemented in terms ofis_unmapped)

is_paired

true if read is paired in sequencing

is_proper_pair

true if read is mapped in a proper pair

is_qcfail

true if QC failure

is_read1

true if this is read1

is_read2

true if this is read2

is_reverse

true if read is mapped to reverse strand

is_secondary

true if not primary alignment

is_supplementary

true if this is a supplementary alignment

is_unmapped

true if read itself is unmapped

isize

deprecated, usetemplate_length instead.

mapping_quality

mapping quality

mapq

deprecated, usemapping_quality instead.

mate_is_forward

true if the mate is mapped to forward strand(implemented in terms ofmate_is_reverse)

mate_is_mapped

true if the mate is mapped(implemented in terms ofmate_is_unmapped)

mate_is_reverse

true if the mate is mapped to reverse strand

mate_is_unmapped

true if the mate is unmapped

modified_bases

Modified bases annotations from MM/ML tags. The output isDict[(canonical base, strand, modification)] -> [ (pos,qual), …]with qual being (256*probability), or -1 if unknown.Strand==0 for forward and 1 for reverse strand modification

modified_bases_forward

Modified bases annotations from MM/ML tags. The output isDict[(canonical base, strand, modification)] -> [ (pos,qual), …]with qual being (256*probability), or -1 if unknown.Strand==0 for forward and 1 for reverse strand modification.The positions are with respect to the original sequence from get_forward_sequence()

mpos

deprecated, usenext_reference_startinstead.

mrnm

deprecated, usenext_reference_id instead.

next_reference_id

thereference id of the mate/next read.

next_reference_name

reference name of the mate/next read (None if noAlignmentFile is associated)

next_reference_start

the position of the mate/next read.

opt(self,tag)

deprecated, useget_tag() instead.

overlap(self)

deprecated, useget_overlap() instead.

pnext

deprecated, usenext_reference_start instead.

pos

deprecated, usereference_start instead.

positions

deprecated, useget_reference_positions() instead.

qend

deprecated, usequery_alignment_end instead.

qlen

deprecated, usequery_alignment_lengthinstead.

qname

deprecated, usequery_name instead.

qqual

deprecated, usequery_alignment_qualities orquery_alignment_qualities_strinstead.

qstart

deprecated, usequery_alignment_start instead.

qual

deprecated, usequery_qualities orquery_qualities_str instead.

query

deprecated, usequery_alignment_sequenceinstead.

query_alignment_end

end index of the aligned query portion of the sequence (0-based,exclusive).

This the index just past the last base inquery_sequencethat is not soft-clipped. (For unmapped reads and when CIGAR isunavailable, this will be the length of the query/read.)

query_alignment_length

length of the aligned query sequence.

This is the number of bases inquery_sequence that are notsoft-clipped. (For unmapped reads and when CIGAR is unavailable, thiswill equal the length of the query/read.)

query_alignment_qualities

aligned query sequence quality values (None if not present). Theseare the quality values that correspond toquery_alignment_sequence, that is, they exclude qualities ofsoft clipped bases. This is equal toquery_qualities[query_alignment_start:query_alignment_end].

Quality scores are returned as a python array of unsignedchars. Note that this is not the ASCII-encoded value typicallyseen in FASTQ or SAM formatted files. Thus, no offset of 33needs to be subtracted.

This property is read-only.

query_alignment_qualities_str

aligned query sequence quality values, returned as an ASCII-encoded stringsimilar to that in FASTQ or SAM files, or None if base qualities are not present.These are the quality values that correspond toquery_alignment_sequence,i.e., excluding qualities corresponding to soft-clipped bases.

This property is read-only.

query_alignment_sequence

aligned portion of the read.

This is a substring ofquery_sequence that excludes flankingbases that weresoft clipped (None if not present). Itis equal toquery_sequence[query_alignment_start:query_alignment_end].

SAM/BAM files may include extra flanking bases that are notpart of the alignment. These bases may be the result of theSmith-Waterman or other algorithms, which may not requirealignments that begin at the first residue or end at the last.In addition, extra sequencing adapters, multiplex identifiers,and low-quality bases that were not considered for alignmentmay have been retained.

query_alignment_start

start index of the aligned query portion of the sequence (0-based,inclusive).

This the index of the first base inquery_sequence that is notsoft-clipped. (For unmapped reads and when CIGAR is unavailable, thiswill be zero.)

query_length

the length of the query/read.

This value corresponds to the length of the sequence suppliedin the BAM/SAM file. The length of a query is 0 if there is nosequence in the BAM/SAM file. In those cases, the read lengthcan be inferred from the CIGAR alignment, seepysam.AlignedSegment.infer_query_length().

The length includes soft-clipped bases and is equal tolen(query_sequence).

This property is read-only but is updated when a new querysequence is assigned to this AlignedSegment.

Returns 0 if not available.

query_name

the query template name (None if not present)

query_qualities

read sequence base qualities, includingsoft clipped bases(None if not present).

Quality scores are returned as a python array of unsignedchars. Note that this is not the ASCII-encoded value typicallyseen in FASTQ or SAM formatted files. Thus, no offset of 33needs to be subtracted.

Note that to set quality scores the sequence has to be setbeforehand as this will determine the expected length of thequality score array. Setting will raise a ValueError if thelength of the new quality scores is not the same as thelength of the existing sequence.

Quality scores to be set may be specified as a Python arrayor other iterable of ints, or as a string of ASCII-encoodedFASTQ/SAM-style base quality characters.

query_qualities_str

read sequence base qualities, includingsoft clipped bases,returned as an ASCII-encoded string similar to that in FASTQ or SAM files,or None if base qualities are not present.

Note that to set quality scores the sequence has to be set beforehandas this will determine the expected length of the quality score string.Setting will raise a ValueError if the length of the new quality scoresis not the same as the length of the existing sequence.

query_sequence

read sequence bases, includingsoft clipped bases(None if not present).

Assigning to this attribute will invalidate any quality scores.Thus, to in-place edit the sequence and quality scores, copies ofthe quality scores need to be taken. Consider trimming for example:

q=read.query_qualitiesread.query_sequence=read.query_sequence[5:10]read.query_qualities=q[5:10]

The sequence is returned as it is stored in the BAM file. (This willbe the reverse complement of the original read sequence if the mapperhas aligned the read to the reverse strand.)

reference_end

aligned reference position of the read on the reference genome.

reference_end points to one past the last aligned residue.Returns None if not available (read is unmapped or no cigaralignment present).

reference_id

reference ID

Note

This field contains the index of the reference sequence inthe sequence dictionary. To obtain the name of thereference sequence, useget_reference_name()

reference_length

aligned length of the read on the reference genome.

This is equal toreference_end - reference_start.Returns None if not available.

reference_name

reference name

reference_start

0-based leftmost coordinate

rlen

deprecated, usequery_length instead.

rname

deprecated, usereference_id instead.

rnext

deprecated, usenext_reference_id instead.

seq

deprecated, usequery_sequence instead.

setTag(self,tag,value,value_type=None,replace=True)

deprecated, useset_tag() instead.

set_tag(self,tag,value,value_type=None,replace=True)

sets a particular fieldtag tovalue in the optional alignmentsection.

value_type describes the type ofvalue that is to enteredinto the alignment record. It can be set explicitly to one ofthe valid one-letter type codes. If unset, an appropriate typewill be chosen automatically based on the python type ofvalue.

An existing value of the sametag will be overwritten unlessreplace is set to False. This is usually not recommended as atag may only appear once in the optional alignment section.

Ifvalue isNone, the tag will be deleted.

This method accepts valid SAM specification value types, whichare:

A:printablechari:signedintf:floatZ:printablestringH:BytearrayinhexformatB:Integerornumericarray

Additionally, it will accept the integer BAM types (‘cCsSI’)

For htslib compatibility, ‘a’ is synonymous with ‘A’ and themethod accepts a ‘d’ type code for a double precision float.

When deducing the type code by the python type ofvalue, thefollowing mapping is applied:

i:pythonintf:pythonfloatZ:pythonstrorbytesB:pythonarray.array,listortuple

Note that a single character string will be output as ‘Z’ andnot ‘A’ as the former is the more general type.

set_tags(self,tags)

sets the fields in the optional alignment section witha list of (tag, value) tuples.

The value type of the values is determined from thepython type. Optionally, a type may be given explicitly asa third value in the tuple, For example:

x.set_tags([(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]

This method will not enforce the rule that the same tag may appearonly once in the optional alignment section.

tags

deprecated, useget_tags() instead.

template_length

the observed query template length

tid

deprecated, usereference_id instead.

tlen

deprecated, usetemplate_length instead.

to_dict(self)

returns a json representation of the aligned segment.

Field names are abbreviated versions of the class attributes.

to_string(self)

returns a string representation of the aligned segment.

The output format is valid SAM format if a header is associatedwith the AlignedSegment.

tostring(self,htsfile=None)

deprecated, useto_string() instead.

Parameters:

htsfile – (deprecated) AlignmentFile object to map numericalidentifiers to chromosome names. This parameter is presentfor backwards compatibility and ignored.

classpysam.PileupColumn

A pileup of reads at a particular reference sequence position(column). A pileup column contains all the reads that mapto a certain target base.

This class is a proxy for results returned by the samtools pileupengine. If the underlying engine iterator advances, the resultsof this column will change.

get_mapping_qualities(self)

query mapping quality scores at pileup column position.

Returns:

a list of quality scores

Return type:

list

get_num_aligned(self)

return number of aligned bases at pileup column position.

This method applies a base quality filter and the number isequal to the size ofget_query_sequences(),get_mapping_qualities(), etc.

get_query_names(self)

query/read names aligned at pileup column position.

Returns:

a list of query names at pileup column position.

Return type:

list

get_query_positions(self)

positions in read at pileup column position.

Returns:

a list of read positions

Return type:

list

get_query_qualities(self)

query base quality scores at pileup column position.

Returns:

a list of quality scores

Return type:

list

get_query_sequences(self,boolmark_matches=False,boolmark_ends=False,booladd_indels=False)

query bases/sequences at pileup column position.

Optionally, the bases/sequences can be annotated according to the samtoolsmpileup format. This is the format description from the samtools mpileup tool:

Information on match, mismatch, indel, strand, mappingquality and start and end of a read are all encoded at theread base column. At this column, a dot stands for a matchto the reference base on the forward strand, a comma for amatch on the reverse strand, a '>' or '<' for a referenceskip, `ACGTN' for a mismatch on the forward strand and`acgtn' for a mismatch on the reverse strand. A pattern`\+[0-9]+[ACGTNacgtn]+' indicates there is an insertionbetween this reference position and the next referenceposition. The length of the insertion is given by theinteger in the pattern, followed by the insertedsequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'represents a deletion from the reference. The deleted baseswill be presented as `*' in the following lines. Also atthe read base column, a symbol `^' marks the start of aread. The ASCII of the character following `^' minus 33gives the mapping quality. A symbol `$' marks the end of aread segment

To reproduce samtools mpileup format, set all of mark_matches,mark_ends and add_indels to True.

Parameters:
  • mark_matches (bool) – If True, output bases matching the reference as “.” or “,”for forward and reverse strand, respectively. This markrequires the reference sequence. If no reference ispresent, this option is ignored.

  • mark_ends (bool) – If True, add markers “^” and “$” for read start and end, respectively.

  • add_indels (bool) – If True, add bases for bases inserted into or skipped from thereference. The latter requires a reference sequence file to havebeen given, e.g. viapileup(fastafile = …). If no referencesequence is available, skipped bases are represented as ‘N’s.

Returns:

a list of bases/sequences per read at pileup column position.

Return type:

list

n

deprecated, usensegments instead.

nsegments

number of reads mapping to this column.

Note that this number ignores the base quality filter.

pileups

list of reads (pysam.PileupRead) aligned to this column

pos

deprecated, usereference_pos instead.

reference_id

the reference sequence number as defined in the header

reference_name

reference name (None if no AlignmentFile is associated)

reference_pos

the position in the reference sequence (0-based).

set_min_base_quality(self,min_base_quality)

set the minimum base quality for this pileup column.

tid

deprecated, usereference_id instead.

classpysam.PileupRead

Representation of a read aligned to a particular position in thereference sequence.

alignment

apysam.AlignedSegment object of the aligned read

indel

indel length for the position following the current pileup site.

This quantity peeks ahead to the next cigar operation in thisalignment. If the next operation is an insertion, indel willbe positive. If the next operation is a deletion, it will benegation. 0 if the next operation is not an indel.

is_del

1 iff the base on the padded read is a deletion

is_head

1 iff the base on the padded read is the left-most base.

is_refskip

1 iff the base on the padded read is part of CIGAR N op.

is_tail

1 iff the base on the padded read is the right-most base.

level

the level of the read in the “viewer” mode. Note that this valueis currently not computed.

query_position

position of the read base at the pileup site, 0-based.None ifis_del oris_refskip is set.

query_position_or_next

position of the read base at the pileup site, 0-based.

If the current position is a deletion, returns the nextaligned base.

classpysam.IndexedReads(AlignmentFilesamfile,intmultiple_iterators=True)

Index a Sam/BAM-file by query name while keeping theoriginal sort order intact.

The index is kept in memory and can be substantial.

By default, the file is re-opened to avoid conflicts if multipleoperators work on the same file. Setmultiple_iterators = Falseto not re-opensamfile.

Parameters:
  • samfile (AlignmentFile) – File to be indexed.

  • multiple_iterators (bool) – Flag indicating whether the file should be reopened. Reopening preventsexisting iterators being affected by the indexing.

build(self)

build the index.

find(self,query_name)

findquery_name in index.

Returns:

Returns an iterator over all reads with query_name.

Return type:

IteratorRowSelection

Raises:

KeyError – if thequery_name is not in the index.

Tabix files

TabixFile opens tabular files that have beenindexed withtabix.

classpysam.TabixFile

Random access to bgzf formatted files thathave been indexed bytabix.

The file is automatically opened. The index file of file<filename> is expected to be called<filename>.tbiby default (see parameterindex).

Parameters:
  • filename (string) – Filename of bgzf file to be opened.

  • index (string) – The filename of the index. If not set, the default is toassume that the index is calledfilename.tbi

  • mode (char) – The file opening mode. Currently, onlyr is permitted.

  • parser (pysam.Parser) – sets the default parser for this tabix file. Ifparseris None, the results are returned as an unparsed string.Otherwise,parser is assumed to be a functor that will returnparsed data (see for exampleasTuple andasGTF).

  • encoding (string) – The encoding passed to the parser

  • threads (integer) – Number of threads to use for decompressing Tabix files.(Default=1)

Raises:
close(self)

closes thepysam.TabixFile.

contigs

list of chromosome names

fetch(self,reference=None,start=None,end=None,region=None,parser=None,multiple_iterators=False)

fetch one or more rows in aregion using 0-basedindexing. The region is specified byreference,start andend. Alternatively, a samtoolsregionstring can be supplied.

Withoutreference orregion all entries will be fetched.

If onlyreference is set, all reads matching onreferencewill be fetched.

Ifparser is None, the default parser will be used forparsing.

Setmultiple_iterators to true if you will be using multipleiterators on the same file at the same time. The iteratorreturned will receive its own copy of a filehandle to the fileeffectively re-opening the file. Re-opening a file createssome overhead, so beware.

header

the file header.

The file header consists of the lines at the beginning of afile that are prefixed by the comment character#.

Note

The header is returned as an iterator presenting lineswithout the newline character.

To iterate over tabix files, usetabix_iterator():

pysam.tabix_iterator(infile,parser)

return an iterator over all entries in a file.

Results are returned parsed as specified by theparser. Ifparser is None, the results are returned as an unparsed string.Otherwise,parser is assumed to be a functor that will returnparsed data (see for exampleasTuple andasGTF).

pysam.tabix_compress(filename_in,filename_out,force=False)

compressfilename_in writing the output tofilename_out.

Raise an IOError iffilename_out already exists, unlessforceis set.

pysam.tabix_index(filename,force=False,seq_col=None,start_col=None,end_col=None,preset=None,meta_char='#',intline_skip=0,zerobased=False,intmin_shift=-1,index=None,keep_original=False,csi=False)

index tab-separatedfilename using tabix.

An existing index will not be overwritten unlessforce is set.

The index will be built from coordinates in columnsseq_col,start_col andend_col.

The contents offilename have to be sorted by contig andposition - the method does not check if the file is sorted.

Column indices are 0-based. Note that this is different from thetabix command line utility where column indices start at 1.

Coordinates in the file are assumed to be 1-based unlesszerobased is set.

Ifpreset is provided, the column coordinates are taken from apreset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”,psltbl”, “pileup”.

Lines beginning withmeta_char and the firstline_skip lineswill be skipped.

Iffilename is not detected as a gzip file it will be automaticallycompressed. The original file will be removed and only the compressedfile will be retained.

By default or whenmin_shift is 0, creates a TBI index. Ifmin_shiftis greater than zero and/orcsi is True, creates a CSI index with aminimal interval size of 1<<min_shift (1<<14 if onlycsi is set).

index controls the filename which should be used for creating the index.If not set, the default is to append.tbi tofilename.

When automatically compressing files, ifkeep_original is set theuncompressed file will not be deleted.

returns the filename of the compressed data

classpysam.asTuple

converts atabix row into a python tuple.

A field in a row is accessed by numeric index.

classpysam.asVCF

converts atabix row into a VCF record withthe following fields:

Column

Field

Contents

1

contig

chromosome

2

pos

chromosomal position, zero-based

3

id

id

4

ref

reference allele

5

alt

alternate alleles

6

qual

quality

7

filter

filter

8

info

info

9

format

format specifier.

Access to genotypes is via index:

contig=vcf.contigfirst_sample_genotype=vcf[0]second_sample_genotype=vcf[1]
classpysam.asBed

converts atabix row into a bed recordwith the following fields:

Column

Field

Contents

1

contig

contig

2

start

genomic start coordinate (zero-based)

3

end

genomic end coordinate plus one(zero-based)

4

name

name of feature.

5

score

score of feature

6

strand

strand of feature

7

thickStart

thickStart

8

thickEnd

thickEnd

9

itemRGB

itemRGB

10

blockCount

number of bocks

11

blockSizes

‘,’ separated string of block sizes

12

blockStarts

‘,’ separated string of block genomicstart positions

Only the first three fields are required. Additionalfields are optional, but if one is defined, all the precedingneed to be defined as well.

classpysam.asGTF

converts atabix row into a GTF record with the followingfields:

Column

Name

Content

1

contig

the chromosome name

2

feature

The feature type

3

source

The feature source

4

start

genomic start coordinate(0-based)

5

end

genomic end coordinate(0-based)

6

score

feature score

7

strand

strand

8

frame

frame

9

attributes

the attribute field

GTF formatted entries also define the following fields thatare derived from the attributes field:

Name

Content

gene_id

the gene identifier

transcript_id

the transcript identifier

FASTA files

classpysam.FastaFile

Random access to fasta formatted files thathave been indexed byfaidx.

The file is automatically opened. The index file of file<filename> is expected to be called<filename>.fai.

Parameters:
  • filename (string) – Filename of fasta file to be opened.

  • filepath_index (string) – Optional, filename of the index. By default this isthe filename + “.fai”.

  • filepath_index_compressed (string) – Optional, filename of the index if fasta file is. By default this isthe filename + “.gzi”.

Raises:
close(self)

close the file.

closed

bool indicating the current state of the file object.This is a read-only attribute; the close() method changes the value.

fetch(self,reference=None,start=None,end=None,region=None)

fetch sequences in aregion.

A region caneither be specified byreference,start andend.start andend denote 0-based, half-openintervals.

Alternatively, a samtoolsregion string can besupplied.

If any of the coordinates are missing they will be replaced by theminimum (start) or maximum (end) coordinate.

Note that region strings are 1-based, whilestart andend denotean interval in python coordinates.The region is specified byreference,start andend.

Returns:

string

Return type:

a string with the sequence specified by the region.

Raises:
filename

filename associated with this object. This is a read-only attribute.

get_reference_length(self,reference)

return the length of reference.

is_open(self)

return true if samfile has been opened.

lengths

tuple with the lengths ofreference sequences.

nreferences

int with the number ofreference sequences in the file.This is a read-only attribute.

references

tuple with the names ofreference sequences.

FASTQ files

classpysam.FastxFile

Stream access tofasta orfastq formatted files.

The file is automatically opened.

Entries in the file can be both fastq or fasta formatted or even amixture of the two.

This file object permits iterating over all entries in thefile. Random access is not implemented. The iteration returnsobjects of typeFastqProxy

Parameters:
  • filename (string) – Filename of fasta/fastq file to be opened.

  • persist (bool) – If True (default) make a copy of the entry in the file duringiteration. If set to False, no copy will be made. This willpermit much faster iteration, but an entry will not persistwhen the iteration continues and an entry is read-only.

Notes

Prior to version 0.8.2, this class was called FastqFile.

Raises:

IOError – if file could not be opened

Examples

>>>withpysam.FastxFile(filename)asfh:...forentryinfh:...print(entry.name)...print(entry.sequence)...print(entry.comment)...print(entry.quality)>>>withpysam.FastxFile(filename)asfin,open(out_filename,mode='w')asfout:...forentryinfin:...fout.write(str(entry)+'\n')
close(self)

close the file.

closed

bool indicating the current state of the file object.This is a read-only attribute; the close() method changes the value.

filename

string with the filename associated with this object.

is_open(self)

return true if samfile has been opened.

classpysam.FastqProxy

A single entry in a fastq file.

get_quality_array(self,intoffset=33)array

return quality values as integer array after subtracting offset.

name

The name of each entry in the fastq file.

quality

The quality score of each entry in the fastq file, represented as a string.

sequence

The sequence of each entry in the fastq file.

VCF/BCF files

classpysam.VariantFile(*args,**kwargs)

(filename, mode=None, index_filename=None, header=None, drop_samples=False,duplicate_filehandle=True, ignore_truncation=False, threads=1)

AVCF/BCF formatted file. The file is automaticallyopened.

If an index for a variant file exists (.csi or .tbi), it will beopened automatically. Without an index random access to recordsviafetch() is disabled.

For writing, aVariantHeader object must be provided,typically obtained from anotherVCF file/BCFfile.

Parameters:
  • mode (string) –

    mode should ber for reading orw for writing. The default istext mode (VCF). For binary (BCF) I/O you should appendb for compressed oru for uncompressedBCF output.

    Ifb is present, it must immediately followr orw. Validmodes arer,w,wh,rb,wb,wbu andwb0.For instance, to open aBCF formatted file for reading, type:

    f=pysam.VariantFile('ex1.bcf','r')

    If mode is not specified, we will try to auto-detect the file type. Allof the following should work:

    f1=pysam.VariantFile('ex1.bcf')f2=pysam.VariantFile('ex1.vcf')f3=pysam.VariantFile('ex1.vcf.gz')

  • index_filename (string) – Explicit path to an index file.

  • header (VariantHeader) –VariantHeader object required for writing.

  • drop_samples (bool) – Ignore sample information when reading.

  • duplicate_filehandle (bool) – By default, file handles passed either directly or throughFile-like objects will be duplicated before passing them tohtslib. The duplication prevents issues where the same streamwill be closed by htslib and through destruction of thehigh-level python object. Set to False to turn offduplication.

  • ignore_truncation (bool) – Issue a warning, instead of raising an error if the current fileappears to be truncated due to a missing EOF marker. Only appliesto bgzipped formats. (Default=False)

  • threads (integer) – Number of threads to use for compressing/decompressing VCF/BCF files.Setting threads to > 1 cannot be combined withignore_truncation.(Default=1)

close(self)

closes thepysam.VariantFile.

copy(self)
fetch(self,contig=None,start=None,stop=None,region=None,reopen=False,end=None,reference=None)

fetch records in aregion, specified either bycontig,start, andend (which are 0-based, half-open);or alternatively by a samtoolsregion string (which is1-based inclusive).

Withoutcontig orregion all mapped records will be fetched. Therecords will be returned ordered by contig, which will not necessarilybe the order within the file.

Setreopen to true if you will be using multiple iterators on thesame file at the same time. The iterator returned will receive itsown copy of a filehandle to the file effectively re-opening thefile. Re-opening a file incurrs some overhead, so use with care.

If onlycontig is set, all records oncontig will be fetched.If bothregion andcontig are given, an exception is raised.

Note that a bgzippedVCF.gz file without a tabix/CSI index(.tbi/.csi) or aBCF file without a CSI index can only beread sequentially.

get_reference_name(self,tid)

returnreference name corresponding to numericaltid

get_tid(self,reference)

return the numericaltid corresponding toreference

returns -1 if reference is not known.

is_valid_tid(self,tid)

return True if the numericaltid is valid; False otherwise.

returns -1 if reference is not known.

new_record(self,*args,**kwargs)

Create a new emptyVariantRecord.

SeeVariantHeader.new_record()

open(self,filename,mode='r',index_filename=None,VariantHeaderheader=None,drop_samples=False,duplicate_filehandle=True,ignore_truncation=False,threads=1)

open a vcf/bcf file.

If open is called on an existing VariantFile, the current file will beclosed and a new file will be opened.

reset(self)

reset file position to beginning of file just after the header.

subset_samples(self,include_samples)

Read only a subset of samples to reduce processing time and memory.Must be called prior to retrieving records.

write(self,VariantRecordrecord)int

write a singlepysam.VariantRecord to disk.

returns the number of bytes written.

classpysam.VariantHeader

header information for aVariantFile object

add_line(self,line)

Add a metadata line to this header

add_meta(self,key,value=None,items=None)

Add metadata to this header

add_record(self,VariantHeaderRecordrecord)

Add an existingVariantHeaderRecord to this header

add_sample(self,name)

Add a new sample to this header

add_samples(self,*args)

Add several new samples to this header.This function takes multiple arguments, each of which maybe either a sample name or an iterable returning sample names(e.g., a list of sample names).

alts

alt metadata (dict ID->record).

The data returned just a snapshot of alt records, is createdevery time the property is requested, and modifications willnot be reflected in the header metadata and vice versa.

i.e. it is just a dict that reflects the state of alt recordsat the time it is created.

contigs

contig information (VariantHeaderContigs)

copy(self)
filters

filter metadata (VariantHeaderMetadata)

formats

format metadata (VariantHeaderMetadata)

info

info metadata (VariantHeaderMetadata)

merge(self,VariantHeaderheader)
new_record(self,contig=None,start=0,stop=0,alleles=None,id=None,qual=None,filter=None,info=None,samples=None,**kwargs)

Create a new empty VariantRecord.

Arguments are currently experimental. Use with caution and expectchanges in upcoming releases.

records

header records (VariantHeaderRecords)

samples
version

VCF version

classpysam.VariantHeaderRecord(*args,**kwargs)

header record from aVariantHeader object

attrs

sequence of additional header attributes

get(self,key,default=None)

D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.

items(self)

D.items() -> list of D’s (key, value) pairs, as 2-tuples

iteritems(self)

D.iteritems() -> an iterator over the (key, value) items of D

iterkeys(self)

D.iterkeys() -> an iterator over the keys of D

itervalues(self)

D.itervalues() -> an iterator over the values of D

key

header key (the part before ‘=’, in FILTER/INFO/FORMAT/contig/fileformat etc.)

keys(self)

D.keys() -> list of D’s keys

pop(self,key,default=_nothing)
remove(self)
type

FILTER, INFO, FORMAT, CONTIG, STRUCTURED, or GENERIC

Type:

header type

update(self,items=None,**kwargs)

D.update([E, ]**F) -> None.

Update D from dict/iterable E and F.

value

header value. Set only for generic lines, None for FILTER/INFO, etc.

values(self)

D.values() -> list of D’s values

classpysam.VariantRecord(*args,**kwargs)

Variant record

alleles

tuple of reference allele followed by alt alleles

alts

tuple of alt alleles

chrom

chromosome/contig name

contig

chromosome/contig name

copy(self)

return a copy of this VariantRecord object

filter

filter information (seeVariantRecordFilter)

format

sample format metadata (seeVariantRecordFormat)

id

record identifier or None if not available

info

info data (seeVariantRecordInfo)

pos

record start position on chrom/contig (1-based inclusive)

qual

phred scaled quality score or None if not available

ref

reference allele

rid

internal reference id number

rlen

record length on chrom/contig (aka rec.stop - rec.start)

samples

sample data (seeVariantRecordSamples)

start

record start position on chrom/contig (0-based inclusive)

stop

record stop position on chrom/contig (0-based exclusive)

translate(self,VariantHeaderdst_header)

HTSFile

HTSFile is the base class forpysam.AlignmentFile andpysam.VariantFile.

classpysam.HTSFile

Base class for HTS file types

add_hts_options(self,format_options=None)

Given a list of key=value format option strings, add them to an open htsFile

category

General file format category. One of UNKNOWN, ALIGNMENTS,VARIANTS, INDEX, REGIONS

check_truncation(self,ignore_truncation=False)

Check if file is truncated.

close(self)
closed

return True if HTSFile is closed.

compression

File compression.

One of NONE, GZIP, BGZF, CUSTOM.

description

Vaguely human readable description of the file format

flush(self)

Flush any buffered data to the underlying output stream.

format

File format.

One of UNKNOWN, BINARY_FORMAT, TEXT_FORMAT, SAM, BAM,BAI, CRAM, CRAI, VCF, BCF, CSI, GZI, TBI, BED.

get_reference_name(self,tid)

returncontig name corresponding to numericaltid

get_tid(self,contig)

return the numericaltid corresponding tocontig

returns -1 if contig is not known.

is_bam

return True if HTSFile is reading or writing a BAM alignment file

is_bcf

return True if HTSFile is reading or writing a BCF variant file

is_closed

return True if HTSFile is closed.

is_cram

return True if HTSFile is reading or writing a BAM alignment file

is_open

return True if HTSFile is open and in a valid state.

is_read

return True if HTSFile is open for reading

is_sam

return True if HTSFile is reading or writing a SAM alignment file

is_valid_reference_name(self,contig)

return True if the contig namecontig is valid; False otherwise.

is_valid_tid(self,tid)

return True if the numericaltid is valid; False otherwise.

returns -1 if contig is not known.

is_vcf

return True if HTSFile is reading or writing a VCF variant file

is_write

return True if HTSFile is open for writing

parse_region(self,contig=None,start=None,stop=None,region=None,tid=None,reference=None,end=None)

parse alternative ways to specify a genomic region. A region caneither be specified bycontig,start andstop.start andstop denote 0-based, half-openintervals.reference andend are also accepted forbackward compatibility as synonyms forcontig andstop, respectively.

Alternatively, a samtoolsregion string can besupplied.

If any of the coordinates are missing they will be replaced bythe minimum (start) or maximum (stop) coordinate.

Note that region strings are 1-based inclusive, whilestartandstop denote an interval in 0-based, half-opencoordinates (like BED files and Python slices).

Ifcontig orregion or are*, unmapped reads at the endof a BAM file will be returned. Setting either to. williterate from the beginning of the file.

Returns:

a tuple offlag,tid,start andstop. The flag indicates whether no coordinates weresupplied and the genomic region is the complete genomic space.

Return type:

tuple

Raises:

ValueError – for invalid or out of bounds regions.

reset(self)

reset file position to beginning of file just after the header.

Returns:

The file position after moving the file pointer.

Return type:

pointer

seek(self,uint64_toffset,intwhence=io.SEEK_SET)

move file pointer to positionoffset, seepysam.HTSFile.tell().

tell(self)

return current file position, seepysam.HTSFile.seek().

version

Tuple of file format version numbers (major, minor)