Introduction

Pysam is a python module that makes it easy to read and manipulatemapped short read sequence data stored in SAM/BAM files. It is alightweight wrapper of thehtslib C-API.

This page provides a quick introduction in using pysam followed by theAPI. SeeWorking with BAM/CRAM/SAM-formatted files for more detailed usage instructions.

To use the module to read a file in BAM format, create aAlignmentFile object:

importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")

Once a file is opened you can iterate over all of the reads mapping toa specified region usingfetch(). Eachiteration returns aAlignedSegment object whichrepresents a single read along with its fields and optional tags:

forreadinsamfile.fetch('chr1',100,120):print(read)samfile.close()

To give:

EAS56_57:6:190:289:82099<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;69CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA01921EAS56_57:6:190:289:82099<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;137AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC73641EAS51_64:3:190:727:3080102<<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<84499GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG99181...

You can also write to aAlignmentFile:

importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")pairedreads=pysam.AlignmentFile("allpaired.bam","wb",template=samfile)forreadinsamfile.fetch():ifread.is_paired:pairedreads.write(read)pairedreads.close()samfile.close()

An alternative way of accessing the data in a SAM file is by iteratingover each base of a specified region using thepileup() method. Each iteration returns aPileupColumn which represents all the reads in the SAMfile that map to a single base in the reference sequence. The list ofreads are represented asPileupRead objects in thePileupColumn.pileups property:

importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")forpileupcolumninsamfile.pileup("chr1",100,120):print("\ncoverage at base%s =%s"%(pileupcolumn.pos,pileupcolumn.n))forpileupreadinpileupcolumn.pileups:ifnotpileupread.is_delandnotpileupread.is_refskip:# query position is None if is_del or is_refskip is set.print('\tbase in read%s =%s'%(pileupread.alignment.query_name,pileupread.alignment.query_sequence[pileupread.query_position]))samfile.close()

The above code outputs:

coverageatbase99=1baseinreadEAS56_57:6:190:289:82=Acoverageatbase100=1baseinreadEAS56_57:6:190:289:82=Gcoverageatbase101=1baseinreadEAS56_57:6:190:289:82=Gcoverageatbase102=2baseinreadEAS56_57:6:190:289:82=GbaseinreadEAS51_64:3:190:727:308=G...

Commands available insamtools are available as simplefunction calls. For example:

pysam.sort("-o","output.bam","ex1.bam")

corresponds to the command line:

samtoolssort-ooutput.bamex1.bam

Analogous toAlignmentFile, aTabixFile allows fast random access to compressed andtabix indexed tab-separated file formats with genomic data:

importpysamtabixfile=pysam.TabixFile("example.gtf.gz")forgtfintabixfile.fetch("chr1",1000,2000):print(gtf.contig,gtf.start,gtf.end,gtf.gene_id)

TabixFile implements lazy parsing in order to iterateover large tables efficiently.

More detailed usage instructions are available atWorking with BAM/CRAM/SAM-formatted files.

Note

Coordinates in pysam are always 0-based (following the pythonconvention). SAM text files use 1-based coordinates.

Note

The above examples can be run in thetests directory of the: installation directory. Type ‘make’ before running them.

API

SAM/BAM/CRAM files

Objects of typeAlignmentFile allow working withBAM/SAM formatted files.

classpysam.AlignmentFile

AlignmentFile(filepath_or_object, mode=None, template=None,reference_names=None, reference_lengths=None, text=NULL,header=None, add_sq_text=False, check_header=True, check_sq=True,reference_filename=None, filename=None, index_filename=None,filepath_index=None, require_index=False, duplicate_filehandle=True,ignore_truncation=False, threads=1)

ASAM/BAM/CRAM formatted file.

Iffilepath_or_object is a string, the file is automaticallyopened. Iffilepath_or_object is a python File object, thealready opened file will be used.

If the file is opened for reading and an index exists (if file is BAM, a.bai file or if CRAM a .crai file), it will be opened automatically.index_filename may be specified explicitly. If the index is not namedin the standard manner, not located in the same directory as theBAM/CRAM file, or is remote. Without an index, random access viafetch() andpileup()is disabled.

For writing, the header of aSAM file/BAM file canbe constituted from several sources (see also the samtools formatspecification):

Iftemplate is given, the header is copied from anotherAlignmentFile (template must be aAlignmentFile).
Ifheader is given, the header is built from amulti-level dictionary.
Iftext is given, new header text is copied from rawtext.
The names (reference_names) and lengths(reference_lengths) are supplied directly as lists.

When reading or writing a CRAM file, the filename of a FASTA-formattedreference can be specified withreference_filename.

By default, if a file is opened in mode ‘r’, it is checkedfor a valid header (check_header = True) and a definition ofchromosome names (check_sq = True).

Parameters:

mode (string) –
mode should ber for reading orw for writing. Thedefault is text mode (SAM). For binary (BAM)I/O you should appendb for compressed oru foruncompressedBAM output. Useh to output headerinformation in text (TAM) mode. Usec forCRAM formatted files.
Ifb is present, it must immediately followr orw. Valid modes arer,w,wh,rb,wb,wbu,wb0,rc andwc. For instance, to open aBAM formatted file for reading, type:
```
f=pysam.AlignmentFile('ex1.bam','rb')
```
If mode is not specified, the method will try to auto-detectin the order ‘rb’, ‘r’, thus both the following should work:
```
f1=pysam.AlignmentFile('ex1.bam')f2=pysam.AlignmentFile('ex1.sam')
```
template (AlignmentFile) – when writing, copy header from filetemplate.
header (dict orAlignmentHeader) – when writing, build header from a multi-level dictionary. Thefirst level are the four types (‘HD’, ‘SQ’, …). The secondlevel are a list of lines, with each line being a list oftag-value pairs. The header is constructed first from all thedefined fields, followed by user tags in alphabeticalorder. Alternatively, anAlignmentHeaderobject can be passed directly.
text (string) – when writing, use the string provided as the header
reference_names (list) – see reference_lengths
reference_lengths (list) – when writing or opening a SAM file without header build headerfrom list of chromosome names and lengths. By default, ‘SQ’and ‘LN’ tags will be added to the header text. This optioncan be changed by unsetting the flagadd_sq_text.
add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permitsconstructionSAM formatted files without a header.
add_sam_header (bool) – when outputting SAM the default is to output a header. This isequivalent to opening the file in ‘wh’ mode. If this option isset to False, no header will be output. To read such a file,setcheck_header=False.
check_header (bool) – obsolete: when reading a SAM file, check if header is present(default=True)
check_sq (bool) – when reading, check if SQ entries are present in header(default=True)
reference_filename (string) – Path to a FASTA-formatted reference file. Valid only for CRAM files.When reading a CRAM file, this overrides both$REF_PATH and the URLspecified in the header (UR tag), which are normally used to findthe reference.
index_filename (string) – Explicit path to the index file. Only needed if the index is notnamed in the standard manner, not located in the same directory asthe BAM/CRAM file, or is remote. An IOError is raised if the indexcannot be found or is invalid.
filepath_index (string) – Alias forindex_filename.
require_index (bool) – When reading, require that an index file is present and is valid orraise an IOError. (default=False)
filename (string) – Alternative to filepath_or_object. Filename of the fileto be opened.
duplicate_filehandle (bool) – By default, file handles passed either directly or throughFile-like objects will be duplicated before passing them tohtslib. The duplication prevents issues where the same streamwill be closed by htslib and through destruction of thehigh-level python object. Set to False to turn offduplication.
ignore_truncation (bool) – Issue a warning, instead of raising an error if the current fileappears to be truncated due to a missing EOF marker. Only appliesto bgzipped formats. (Default=False)
format_options (list) – A list of key=value strings, as accepted by –input-fmt-option and–output-fmt-option in samtools.
threads (integer) – Number of threads to use for compressing/decompressing BAM/CRAM files.Setting threads to > 1 cannot be combined withignore_truncation.(Default=1)

check_index(self)

return True if index is present.

Raises:

AttributeError – if htsfile isSAM formatted and thus has no index.
ValueError – if htsfile is closed or index could not be opened.

close(self): closes thepysam.AlignmentFile.

count(self,contig=None,start=None,stop=None,region=None,until_eof=False,read_callback='nofilter',reference=None,end=None)

count the number of reads inregion

The region is specified bycontig,start andstop.reference andend are also accepted for backwardcompatibility as synonyms forcontig andstop,respectively. Alternatively, asamtools regionstring can be supplied.

ASAM file does not allow random access and ifregion orcontig are given, an exception is raised.

Parameters:

contig (string) – reference_name of the genomic region (chromosome)
start (int) – start of the genomic region (0-based inclusive)
stop (int) – end of the genomic region (0-based exclusive)
region (string) – a region string in samtools format.
until_eof (bool) – count until the end of the file, possibly includingunmapped reads as well.
read_callback (string orfunction) –
select a call-back to ignore reads when counting. It canbe either a string with the following values:
all
skip reads in which any of the followingflags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,BAM_FDUP
nofilter
uses every single read
Alternatively,read_callback can be a functioncheck_read(read) that should return True only forthose reads that shall be included in the counting.
reference (string) – backward compatible synonym forcontig
end (int) – backward compatible synonym forstop

Raises:

ValueError – if the genomic coordinates are out of range or invalid.

count_coverage(self,contig,start=None,stop=None,region=None,quality_threshold=15,read_callback='all',reference=None,end=None)

count the coverage of genomic positions by reads inregion.

Parameters:

contig (string) – reference_name of the genomic region (chromosome)
start (int) – start of the genomic region (0-based inclusive). If notgiven, count from the start of the chromosome.
stop (int) – end of the genomic region (0-based exclusive). If not given,count to the end of the chromosome.
region (string) – a region string.
quality_threshold (int) – quality_threshold is the minimum quality score (in phred) abase has to reach to be counted.
read_callback (string orfunction) –
select a call-back to ignore reads when counting. It canbe either a string with the following values:
all
skip reads in which any of the followingflags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,BAM_FDUP
nofilter
uses every single read
Alternatively,read_callback can be a functioncheck_read(read) that should return True only forthose reads that shall be included in the counting.
reference (string) – backward compatible synonym forcontig
end (int) – backward compatible synonym forstop

Raises:

ValueError – if the genomic coordinates are out of range or invalid.

Returns:

four array.arrays of the same length in order A C G T

Return type:

tuple

fetch(self,contig=None,start=None,stop=None,region=None,tid=None,until_eof=False,multiple_iterators=False,reference=None,end=None)

fetch reads aligned in aregion.

Seeparse_region() for more informationon how genomic regions can be specified.reference andend are also accepted for backward compatibility as synonymsforcontig andstop, respectively.

Without acontig orregion all mapped reads in the filewill be fetched. The reads will be returned ordered by referencesequence, which will not necessarily be the order within thefile. This mode of iteration still requires an index. If there isno index, useuntil_eof=True.

If onlycontig is set, all reads aligned tocontigwill be fetched.

ASAM file does not allow random access. Ifregionorcontig are given, an exception is raised.

Parameters:

until_eof (bool) – Ifuntil_eof is True, all reads from the current fileposition will be returned in order as they are within thefile. Using this option will also fetch unmapped reads.
multiple_iterators (bool) – Ifmultiple_iterators is True, multipleiterators on the same file can be used at the same time. Theiterator returned will receive its own copy of a filehandle tothe file effectively re-opening the file. Re-opening a filecreates some overhead, so beware.

Returns:

An iterator over a collection of reads.

Return type:

IteratorRow

Raises:

ValueError – if the genomic coordinates are out of range or invalid or the file does not permit random access to genomic coordinates.

find_introns(self,read_iterator)

Return a dictionary {(start, stop): count}Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings),and their support ( = number of reads ).

read_iterator can be the result of a .fetch(…) call.Or it can be a generator filtering such reads. Examplesamfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)

find_introns_slow(self,read_iterator)

Return a dictionary {(start, stop): count}Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings),and their support ( = number of reads ).

read_iterator can be the result of a .fetch(…) call.Or it can be a generator filtering such reads. Examplesamfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)

get_index_statistics(self)

return statistics about mapped/unmapped reads per chromosome asthey are stored in the index, similarly to the statistics printedby thesamtoolsidxstats command.

CRAI indexes do not record these statistics, so for a CRAM filewith a .crai index the returned statistics will all be 0.

Returns:: a list of records for each chromosome. Each record has theattributes ‘contig’, ‘mapped’, ‘unmapped’ and ‘total’.
Return type:: list

get_reference_length(self,reference): returnreference length corresponding to numericaltid

get_reference_name(self,tid): returnreference name corresponding to numericaltid

get_tid(self,reference)

return the numericaltid corresponding toreference

returns -1 if reference is not known.

getrname(self,tid): deprecated, useget_reference_name() instead

gettid(self,reference): deprecated, useget_tid() instead

has_index(self): return true if htsfile has an existing (and opened) index.

head(self,n,multiple_iterators=True)

return an iterator over the first n alignments.

This iterator is is useful for inspecting the bam-file.

Parameters:: multiple_iterators (bool) – is set to True by default in order toavoid changing the current file position.
Returns:: an iterator over a collection of reads
Return type:: IteratorRowHead

is_valid_tid(self,inttid)

return True if the numericaltid is valid; False otherwise.

Note that the unmapped tid code (-1) counts as an invalid.

lengths: tuple of the lengths of thereference sequences. This is aread-only attribute. The lengths are in the same order aspysam.AlignmentFile.references

mapped: int with total number of mapped alignments according to thestatistics recorded in the index. This is a read-onlyattribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)

mate(self,AlignedSegmentread)

return the mate ofpysam.AlignedSegmentread.

Note

Calling this method will change the file position.This might interfere with any iterators that havenot re-opened the file.

Note

This method is too slow for high-throughput processing.If a read needs to be processed with its mate, workfrom a read name sorted file or, better, cache reads.

Returns:: the mate
Return type:: AlignedSegment
Raises:: ValueError – if the read is unpaired or the mate is unmapped

nocoordinate: int with total number of reads without coordinates according to thestatistics recorded in the index, i.e., the statistic printed for “*”by thesamtoolsidxstats command. This is a read-only attribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)

nreferences: int with the number ofreference sequences in the file.This is a read-only attribute.

pileup(self,contig=None,start=None,stop=None,region=None,reference=None,end=None,**kwargs)

perform apileup within aregion. The region isspecified bycontig,start andstop (using0-based indexing).reference andend are also accepted forbackward compatibility as synonyms forcontig andstop,respectively. Alternatively, a samtools ‘region’ stringcan be supplied.

Without ‘contig’ or ‘region’ all reads will be used for thepileup. The reads will be returned ordered bycontig sequence, which will not necessarily be theorder within the file.

Note thatSAM formatted files do not allow randomaccess. In these files, if a ‘region’ or ‘contig’ aregiven an exception is raised.

Note

‘all’ reads which overlap the region are returned. Thefirst base returned will be the first base of the firstread ‘not’ necessarily the first base of the region usedin the query.

Parameters:

truncate (bool) – By default, the samtools pileup engine outputs all readsoverlapping a region. If truncate is True and a region isgiven, only columns in the exact region specified arereturned.
max_depth (int) – Maximum read depth permitted. The default limit is ‘8000’.
stepper (string) –
The stepper controls how the iterator advances.Possible options for the stepper are
all
skip reads in which any of the following flags are set:BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilter
uses every single read turning off any filtering.
samtools
same filter and read processing as in samtoolspileup. For full compatibility, this requires a‘fastafile’ to be given. The following options all pertainto filtering of thesamtools stepper.
fastafile (FastaFile object.) – This is required for some of the steppers.
ignore_overlaps (bool) – If set to True, detect if read pairs overlap and only takethe higher quality base. This is the default.
flag_filter (int) – ignore reads where any of the bits in the flag are set. The default isBAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP.
flag_require (int) – only use reads where certain flags are set. The default is 0.
ignore_orphans (bool) – ignore orphans (paired reads that are not in a proper pair).The default is to ignore orphans.
min_base_quality (int) – Minimum base quality. Bases below the minimum quality willnot be output. The default is 13.
adjust_capq_threshold (int) – adjust mapping quality. The default is 0 for noadjustment. The recommended value for adjustment is 50.
min_mapping_quality (int) – only use reads above a minimum mapping quality. The default is 0.
compute_baq (bool) – re-alignment computing per-Base Alignment Qualities (BAQ). Thedefault is to do re-alignment. Realignment requires a referencesequence. If none is present, no realignment will be performed.
redo_baq (bool) – recompute per-Base Alignment Quality on the fly ignoringexisting base qualities. The default is False (use existingbase qualities).

Returns:

an iterator over genomic positions.

Return type:

IteratorColumn

references: tuple with the names ofreference sequences. This is aread-only attribute

text: deprecated, usereferences andlengths instead

unmapped: int with total number of unmapped reads according to the statisticsrecorded in the index. This number of reads includes the number of readswithout coordinates. This is a read-only attribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)

write(self,AlignedSegmentread)→int

write a singlepysam.AlignedSegment to disk.

Raises:: ValueError – if the writing failed
Returns:: the number of bytes written. If the file is closed,this will be 0.
Return type:: int

classpysam.AlignmentHeader

header information for aAlignmentFile object

Parameters:

header_dict (dict) – build header from a multi-level dictionary. Thefirst level are the four types (‘HD’, ‘SQ’, …). The secondlevel are a list of lines, with each line being a list oftag-value pairs. The header is constructed first from all thedefined fields, followed by user tags in alphabeticalorder. Alternatively, anAlignmentHeaderobject can be passed directly.
text (string) – use the string provided as the header
reference_names (list) – see reference_lengths
reference_lengths (list) – build header from list of chromosome names and lengths. Bydefault, ‘SQ’ and ‘LN’ tags will be added to the headertext. This option can be changed by unsetting the flagadd_sq_text.
add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permitsconstructionSAM formatted files without a header.

as_dict(self): deprecated, useto_dict() instead

copy(self)

classmethodfrom_dict(cls,header_dict)

classmethodfrom_references(cls,reference_names,reference_lengths,text=None,add_sq_text=True)

classmethodfrom_text(cls,text)

get(self,*args)

get_reference_length(self,reference)

get_reference_name(self,tid)

get_tid(self,reference)

return the numericaltid corresponding toreference

returns -1 if reference is not known.

is_valid_tid(self,inttid)

return True if the numericaltid is valid; False otherwise.

Note that the unmapped tid code (-1) counts as an invalid.

items(self)

iteritems(self)

keys(self)

lengths: tuple of the lengths of thereference sequences. This is aread-only attribute. The lengths are in the same order aspysam.AlignmentFile.references

nreferences

int with the number ofreference sequences in the file.

This is a read-only attribute.

references: tuple with the names ofreference sequences. This is aread-only attribute

to_dict(self)

return two-level dictionary with header information from the file.

The first level contains the record (HD,SQ, etc) andthe second level contains the fields (VN,LN, etc).

The parser is validating and will raise an AssertionError ifif encounters any record or field tags that are not part ofthe SAM specification. Use thepysam.AlignmentFile.text attribute to get the unparsedheader.

The parsing follows the SAM format specification with theexception of theCL field. This option will consume therest of a header line irrespective of any additional fields.This behaviour has been added to accommodate command lineoptions that contain characters that are not valid fieldseparators.

If no @SQ entries are within the text section of the header,this will be automatically added from the reference names andlengths stored in the binary part of the header.

values(self)

AnAlignedSegment represents an aligned segment withina SAM/BAM file.

classpysam.AlignedSegment(AlignmentHeaderheader=None)

Class representing an aligned segment.

This class stores a handle to the samtools C-structure representingan aligned read. Member read access is forwarded to the C-structureand converted into python objects. This implementation should be fast,as only the data needed is converted.

For write access, the C-structure is updated in-place. This isnot the most efficient way to build BAM entries, as the variablelength data is concatenated and thus needs to be resized ifa field is updated. Furthermore, the BAM entry might bein an inconsistent state.

One issue to look out for is that the sequence should alwaysbe setbefore the quality scores. Setting the sequence willalso erase any quality scores that were set previously.

Parameters:: header –AlignmentHeader object to map numericalidentifiers to chromosome names. If not given, an emptyheader is created.

aend: deprecated, usereference_end instead.

alen: deprecated, usereference_length instead.

aligned_pairs: deprecated, useget_aligned_pairs() instead.

bin: properties bin

blocks: deprecated, useget_blocks() instead.

cigar: deprecated, usecigarstring orcigartuples instead.

cigarstring

thecigar alignment as a string.

The cigar string is a string of alternating integersand characters denoting the length and the type ofan operation.

Note

The order length,operation is specified in theSAM format. It is different from the order ofthecigar property.

Returns None if not present.

To unset the cigarstring, assign None or theempty string.

cigartuples

thecigar alignment. The alignmentis returned as a list of tuples of (operation, length).

If the alignment is not present, None is returned.

The operations are:

M	pysam.CIGAR_OPS.CMATCH	0
I	pysam.CIGAR_OPS.CINS	1
D	pysam.CIGAR_OPS.CDEL	2
N	pysam.CIGAR_OPS.CREF_SKIP	3
S	pysam.CIGAR_OPS.CSOFT_CLIP	4
H	pysam.CIGAR_OPS.CHARD_CLIP	5
P	pysam.CIGAR_OPS.CPAD	6
=	pysam.CIGAR_OPS.CEQUAL	7
X	pysam.CIGAR_OPS.CDIFF	8
B	pysam.CIGAR_OPS.CBACK	9

Note

The output is a list of (operation, length) tuples, such as[(0,30)].This is different from the SAM specification andthecigarstring property, which uses a(length, operation) order, for example:30M.

To unset the cigar property, assign an empty listor None.

compare(self,AlignedSegmentother): return -1,0,1, if contents in this are binary<,=,> toother

flag: properties flag

classmethodfrom_dict(cls,sam_dict,AlignmentHeaderheader)

parses a dictionary representation of the aligned segment.

Parameters:: sam_dict – dictionary of alignment values, keys corresponding to output fromtodict().

classmethodfromstring(cls,sam,AlignmentHeaderheader)

parses a string representation of the aligned segment.

The input format should be valid SAM format.

Parameters:: sam –SAM formatted string

get_aligned_pairs(self,matches_only=False,with_seq=False,with_cigar=False)

a list of aligned read (query) and reference positions.

Each item in the returned list is a tuple consisting ofthe 0-based offset from the start of the read sequencefollowed by the 0-based reference position.

For inserts, deletions, skipping either query or referenceposition may be None.

For padding in the reference, the reference position willalways be None.

Parameters:

matches_only (bool) – If True, only matched bases are returned — no None on eitherside.
with_seq (bool) – If True, return a third element in the tuple containing thereference sequence. For CIGAR ‘P’ (padding in the reference)operations, the third tuple element will be None. Substitutionsare lower-case. This option requires an MD tag to be present.
with_cigar (bool) – If True, return an extra element in the tuple containing theCIGAR operator corresponding to this position tuple.

Returns:

aligned_pairs

Return type:

list of tuples

get_blocks(self)

a list of start and end positions ofaligned gapless blocks.

The start and end positions are in genomiccoordinates.

Blocks are not normalized, i.e. two blocksmight be directly adjacent. This happens ifthe two blocks are separated by an insertionin the read.

get_cigar_stats(self)

summary of operations in cigar string.

The output order in the array is “MIDNSHP=X” followed by afield for the NM tag. If the NM tag is not present, thisfield will always be 0. (Accessing this field via index -1avoids changes if more CIGAR operators are added in future.)

M	pysam.CIGAR_OPS.CMATCH	0
I	pysam.CIGAR_OPS.CINS	1
D	pysam.CIGAR_OPS.CDEL	2
N	pysam.CIGAR_OPS.CREF_SKIP	3
S	pysam.CIGAR_OPS.CSOFT_CLIP	4
H	pysam.CIGAR_OPS.CHARD_CLIP	5
P	pysam.CIGAR_OPS.CPAD	6
=	pysam.CIGAR_OPS.CEQUAL	7
X	pysam.CIGAR_OPS.CDIFF	8
B	pysam.CIGAR_OPS.CBACK	9
NM	NM tag	10 or -1

If no cigar string is present, empty arrays will be returned.

Returns:: two arrays. The first contains the nucleotide counts withineach cigar operation, the second contains the number of blocksfor each cigar operation.
Return type:: arrays

get_forward_qualities(self)

return the original base qualities of the read sequence,in the same format as thequery_qualities property.

Reads mapped to the reverse strand have their base qualities storedreversed in the BAM file. This method returns such reads’ base qualitiesreversed back to their original orientation.

get_forward_sequence(self)

return the original read sequence.

Reads mapped to the reverse strand are stored reverse complemented inthe BAM file. This method returns such reads reverse complemented backto their original orientation.

Returns None if the record has no query sequence.

get_overlap(self,uint32_tstart,uint32_tend)

return number of aligned bases of read overlapping the intervalstart andend on the reference sequence.

Return None if cigar alignment is not available.

get_reference_positions(self,full_length=False)

a list of reference positions that this read aligns to.

By default, this method returns the (0-based) positions on thereference that are within the read’s alignment, leaving gapscorresponding to deletions and other reference skips.

Whenfull_length is True, the returned list is the same lengthas the read and additionally includes None values correspondingto insertions or soft-clipping, i.e., to bases of the read thatare not aligned to a reference position.(See alsoget_aligned_pairs() which additionally returnsthe corresponding positions along the read.)

get_reference_sequence(self)

return the reference sequence in the region that is covered by thealignment of the read to the reference.

This method requires the MD tag to be set.

get_tag(self,tag,with_value_type=False)

retrieves data from the optional alignment sectiongiven a two-lettertag denoting the field.

The returned value is cast into an appropriate python type.

This method is the fastest way to access the optionalalignment section if only few tags need to be retrieved.

Possible value types are “AcCsSiIfZHB” (see BAM formatspecification) as well as additional value type ‘d’ asimplemented in htslib.

Parameters:

tag – data tag.
with_value_type – Optional[bool]if set to True, the return value is a tuple of (tag value, typecode). (default False)

Returns:

A python object with the value of thetag. The type of theobject depends on the data type in the data record.

Raises:

KeyError – Iftag is not present, a KeyError is raised.

get_tags(self,with_value_type=False)

the fields in the optional alignment section.

Returns a list of all fields in the optionalalignment section. Values are converted to appropriate pythonvalues. For example:[(NM,2),(RG,"GJP00TM04")]

Ifwith_value_type is set, the value type as encode inthe AlignedSegment record will be returned as well:

[(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]

This method will convert all values in the optional alignmentsection. When getting only one or few tags, please seeget_tag() for a quicker way to achieve this.

has_tag(self,tag): returns true if the optional alignment sectioncontains a giventag.

infer_query_length(self,always=False)

infer query length from CIGAR alignment.

This method deduces the query length from the CIGAR alignmentbut does not include hard-clipped bases.

Returns None if CIGAR alignment is not present.

Ifalways is set to True,infer_read_length is used instead.This is deprecated and only present for backward compatibility.

infer_read_length(self)

infer read length from CIGAR alignment.

This method deduces the read length from the CIGAR alignmentincluding hard-clipped bases.

Returns None if CIGAR alignment is not present.

inferred_length: deprecated, useinfer_query_length() instead.

is_duplicate: true if optical or PCR duplicate

is_forward: true if read is mapped to forward strand(implemented in terms ofis_reverse)

is_mapped: true if read itself is mapped(implemented in terms ofis_unmapped)

is_paired: true if read is paired in sequencing

is_proper_pair: true if read is mapped in a proper pair

is_qcfail: true if QC failure

is_read1: true if this is read1

is_read2: true if this is read2

is_reverse: true if read is mapped to reverse strand

is_secondary: true if not primary alignment

is_supplementary: true if this is a supplementary alignment

is_unmapped: true if read itself is unmapped

isize: deprecated, usetemplate_length instead.

mapping_quality: mapping quality

mapq: deprecated, usemapping_quality instead.

mate_is_forward: true if the mate is mapped to forward strand(implemented in terms ofmate_is_reverse)

mate_is_mapped: true if the mate is mapped(implemented in terms ofmate_is_unmapped)

mate_is_reverse: true if the mate is mapped to reverse strand

mate_is_unmapped: true if the mate is unmapped

modified_bases: Modified bases annotations from MM/ML tags. The output isDict[(canonical base, strand, modification)] -> [ (pos,qual), …]with qual being (256*probability), or -1 if unknown.Strand==0 for forward and 1 for reverse strand modification

modified_bases_forward: Modified bases annotations from MM/ML tags. The output isDict[(canonical base, strand, modification)] -> [ (pos,qual), …]with qual being (256*probability), or -1 if unknown.Strand==0 for forward and 1 for reverse strand modification.The positions are with respect to the original sequence from get_forward_sequence()

mpos: deprecated, usenext_reference_startinstead.

mrnm: deprecated, usenext_reference_id instead.

next_reference_id: thereference id of the mate/next read.

next_reference_name: reference name of the mate/next read (None if noAlignmentFile is associated)

next_reference_start: the position of the mate/next read.

opt(self,tag): deprecated, useget_tag() instead.

overlap(self): deprecated, useget_overlap() instead.

pnext: deprecated, usenext_reference_start instead.

pos: deprecated, usereference_start instead.

positions: deprecated, useget_reference_positions() instead.

qend: deprecated, usequery_alignment_end instead.

qlen: deprecated, usequery_alignment_lengthinstead.

qname: deprecated, usequery_name instead.

qqual: deprecated, usequery_alignment_qualities orquery_alignment_qualities_strinstead.

qstart: deprecated, usequery_alignment_start instead.

qual: deprecated, usequery_qualities orquery_qualities_str instead.

query: deprecated, usequery_alignment_sequenceinstead.

query_alignment_end

end index of the aligned query portion of the sequence (0-based,exclusive).

This the index just past the last base inquery_sequencethat is not soft-clipped. (For unmapped reads and when CIGAR isunavailable, this will be the length of the query/read.)

query_alignment_length

length of the aligned query sequence.

This is the number of bases inquery_sequence that are notsoft-clipped. (For unmapped reads and when CIGAR is unavailable, thiswill equal the length of the query/read.)

query_alignment_qualities

aligned query sequence quality values (None if not present). Theseare the quality values that correspond toquery_alignment_sequence, that is, they exclude qualities ofsoft clipped bases. This is equal toquery_qualities[query_alignment_start:query_alignment_end].

Quality scores are returned as a python array of unsignedchars. Note that this is not the ASCII-encoded value typicallyseen in FASTQ or SAM formatted files. Thus, no offset of 33needs to be subtracted.

This property is read-only.

query_alignment_qualities_str

aligned query sequence quality values, returned as an ASCII-encoded stringsimilar to that in FASTQ or SAM files, or None if base qualities are not present.These are the quality values that correspond toquery_alignment_sequence,i.e., excluding qualities corresponding to soft-clipped bases.

This property is read-only.

query_alignment_sequence

aligned portion of the read.

This is a substring ofquery_sequence that excludes flankingbases that weresoft clipped (None if not present). Itis equal toquery_sequence[query_alignment_start:query_alignment_end].

SAM/BAM files may include extra flanking bases that are notpart of the alignment. These bases may be the result of theSmith-Waterman or other algorithms, which may not requirealignments that begin at the first residue or end at the last.In addition, extra sequencing adapters, multiplex identifiers,and low-quality bases that were not considered for alignmentmay have been retained.

query_alignment_start

start index of the aligned query portion of the sequence (0-based,inclusive).

This the index of the first base inquery_sequence that is notsoft-clipped. (For unmapped reads and when CIGAR is unavailable, thiswill be zero.)

query_length

the length of the query/read.

This value corresponds to the length of the sequence suppliedin the BAM/SAM file. The length of a query is 0 if there is nosequence in the BAM/SAM file. In those cases, the read lengthcan be inferred from the CIGAR alignment, seepysam.AlignedSegment.infer_query_length().

The length includes soft-clipped bases and is equal tolen(query_sequence).

This property is read-only but is updated when a new querysequence is assigned to this AlignedSegment.

Returns 0 if not available.

query_name: the query template name (None if not present)

query_qualities

read sequence base qualities, includingsoft clipped bases(None if not present).

Note that to set quality scores the sequence has to be setbeforehand as this will determine the expected length of thequality score array. Setting will raise a ValueError if thelength of the new quality scores is not the same as thelength of the existing sequence.

Quality scores to be set may be specified as a Python arrayor other iterable of ints, or as a string of ASCII-encoodedFASTQ/SAM-style base quality characters.

query_qualities_str

read sequence base qualities, includingsoft clipped bases,returned as an ASCII-encoded string similar to that in FASTQ or SAM files,or None if base qualities are not present.

Note that to set quality scores the sequence has to be set beforehandas this will determine the expected length of the quality score string.Setting will raise a ValueError if the length of the new quality scoresis not the same as the length of the existing sequence.

query_sequence

read sequence bases, includingsoft clipped bases(None if not present).

Assigning to this attribute will invalidate any quality scores.Thus, to in-place edit the sequence and quality scores, copies ofthe quality scores need to be taken. Consider trimming for example:

q=read.query_qualitiesread.query_sequence=read.query_sequence[5:10]read.query_qualities=q[5:10]

The sequence is returned as it is stored in the BAM file. (This willbe the reverse complement of the original read sequence if the mapperhas aligned the read to the reverse strand.)

reference_end

aligned reference position of the read on the reference genome.

reference_end points to one past the last aligned residue.Returns None if not available (read is unmapped or no cigaralignment present).

reference_id: reference ID
Note
This field contains the index of the reference sequence inthe sequence dictionary. To obtain the name of thereference sequence, useget_reference_name()

reference_length

aligned length of the read on the reference genome.

This is equal toreference_end - reference_start.Returns None if not available.

reference_name: reference name

reference_start: 0-based leftmost coordinate

rlen: deprecated, usequery_length instead.

rname: deprecated, usereference_id instead.

rnext: deprecated, usenext_reference_id instead.

seq: deprecated, usequery_sequence instead.

setTag(self,tag,value,value_type=None,replace=True): deprecated, useset_tag() instead.

set_tag(self,tag,value,value_type=None,replace=True)

sets a particular fieldtag tovalue in the optional alignmentsection.

value_type describes the type ofvalue that is to enteredinto the alignment record. It can be set explicitly to one ofthe valid one-letter type codes. If unset, an appropriate typewill be chosen automatically based on the python type ofvalue.

An existing value of the sametag will be overwritten unlessreplace is set to False. This is usually not recommended as atag may only appear once in the optional alignment section.

Ifvalue isNone, the tag will be deleted.

This method accepts valid SAM specification value types, whichare:

A:printablechari:signedintf:floatZ:printablestringH:BytearrayinhexformatB:Integerornumericarray

Additionally, it will accept the integer BAM types (‘cCsSI’)

For htslib compatibility, ‘a’ is synonymous with ‘A’ and themethod accepts a ‘d’ type code for a double precision float.

When deducing the type code by the python type ofvalue, thefollowing mapping is applied:

i:pythonintf:pythonfloatZ:pythonstrorbytesB:pythonarray.array,listortuple

Note that a single character string will be output as ‘Z’ andnot ‘A’ as the former is the more general type.

set_tags(self,tags)

sets the fields in the optional alignment section witha list of (tag, value) tuples.

The value type of the values is determined from thepython type. Optionally, a type may be given explicitly asa third value in the tuple, For example:

x.set_tags([(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]

This method will not enforce the rule that the same tag may appearonly once in the optional alignment section.

tags: deprecated, useget_tags() instead.

template_length: the observed query template length

tid: deprecated, usereference_id instead.

tlen: deprecated, usetemplate_length instead.

to_dict(self)

returns a json representation of the aligned segment.

Field names are abbreviated versions of the class attributes.

to_string(self)

returns a string representation of the aligned segment.

The output format is valid SAM format if a header is associatedwith the AlignedSegment.

tostring(self,htsfile=None)

deprecated, useto_string() instead.

Parameters:: htsfile – (deprecated) AlignmentFile object to map numericalidentifiers to chromosome names. This parameter is presentfor backwards compatibility and ignored.

classpysam.PileupColumn

A pileup of reads at a particular reference sequence position(column). A pileup column contains all the reads that mapto a certain target base.

This class is a proxy for results returned by the samtools pileupengine. If the underlying engine iterator advances, the resultsof this column will change.

get_mapping_qualities(self)

query mapping quality scores at pileup column position.

Returns:: a list of quality scores
Return type:: list

get_num_aligned(self)

return number of aligned bases at pileup column position.

This method applies a base quality filter and the number isequal to the size ofget_query_sequences(),get_mapping_qualities(), etc.

get_query_names(self)

query/read names aligned at pileup column position.

Returns:: a list of query names at pileup column position.
Return type:: list

get_query_positions(self)

positions in read at pileup column position.

Returns:: a list of read positions
Return type:: list

get_query_qualities(self)

query base quality scores at pileup column position.

Returns:: a list of quality scores
Return type:: list

get_query_sequences(self,boolmark_matches=False,boolmark_ends=False,booladd_indels=False)

query bases/sequences at pileup column position.

Optionally, the bases/sequences can be annotated according to the samtoolsmpileup format. This is the format description from the samtools mpileup tool:

Information on match, mismatch, indel, strand, mappingquality and start and end of a read are all encoded at theread base column. At this column, a dot stands for a matchto the reference base on the forward strand, a comma for amatch on the reverse strand, a '>' or '<' for a referenceskip, `ACGTN' for a mismatch on the forward strand and`acgtn' for a mismatch on the reverse strand. A pattern`\+[0-9]+[ACGTNacgtn]+' indicates there is an insertionbetween this reference position and the next referenceposition. The length of the insertion is given by theinteger in the pattern, followed by the insertedsequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'represents a deletion from the reference. The deleted baseswill be presented as `*' in the following lines. Also atthe read base column, a symbol `^' marks the start of aread. The ASCII of the character following `^' minus 33gives the mapping quality. A symbol `$' marks the end of aread segment

To reproduce samtools mpileup format, set all of mark_matches,mark_ends and add_indels to True.

Parameters:

mark_matches (bool) – If True, output bases matching the reference as “.” or “,”for forward and reverse strand, respectively. This markrequires the reference sequence. If no reference ispresent, this option is ignored.
mark_ends (bool) – If True, add markers “^” and “$” for read start and end, respectively.
add_indels (bool) – If True, add bases for bases inserted into or skipped from thereference. The latter requires a reference sequence file to havebeen given, e.g. viapileup(fastafile = …). If no referencesequence is available, skipped bases are represented as ‘N’s.

Returns:

a list of bases/sequences per read at pileup column position.

Return type:

list

n: deprecated, usensegments instead.

nsegments

number of reads mapping to this column.

Note that this number ignores the base quality filter.

pileups: list of reads (pysam.PileupRead) aligned to this column

pos: deprecated, usereference_pos instead.

reference_id: the reference sequence number as defined in the header

reference_name: reference name (None if no AlignmentFile is associated)

reference_pos: the position in the reference sequence (0-based).

set_min_base_quality(self,min_base_quality): set the minimum base quality for this pileup column.

tid: deprecated, usereference_id instead.

classpysam.PileupRead

Representation of a read aligned to a particular position in thereference sequence.

alignment: apysam.AlignedSegment object of the aligned read

indel

indel length for the position following the current pileup site.

This quantity peeks ahead to the next cigar operation in thisalignment. If the next operation is an insertion, indel willbe positive. If the next operation is a deletion, it will benegation. 0 if the next operation is not an indel.

is_del: 1 iff the base on the padded read is a deletion

is_head: 1 iff the base on the padded read is the left-most base.

is_refskip: 1 iff the base on the padded read is part of CIGAR N op.

is_tail: 1 iff the base on the padded read is the right-most base.

level: the level of the read in the “viewer” mode. Note that this valueis currently not computed.

query_position: position of the read base at the pileup site, 0-based.None ifis_del oris_refskip is set.

query_position_or_next

position of the read base at the pileup site, 0-based.

If the current position is a deletion, returns the nextaligned base.

classpysam.IndexedReads(AlignmentFilesamfile,intmultiple_iterators=True)

Index a Sam/BAM-file by query name while keeping theoriginal sort order intact.

The index is kept in memory and can be substantial.

By default, the file is re-opened to avoid conflicts if multipleoperators work on the same file. Setmultiple_iterators = Falseto not re-opensamfile.

Parameters:

samfile (AlignmentFile) – File to be indexed.
multiple_iterators (bool) – Flag indicating whether the file should be reopened. Reopening preventsexisting iterators being affected by the indexing.

build(self): build the index.

find(self,query_name)

findquery_name in index.

Returns:: Returns an iterator over all reads with query_name.
Return type:: IteratorRowSelection
Raises:: KeyError – if thequery_name is not in the index.

Tabix files

TabixFile opens tabular files that have beenindexed withtabix.

classpysam.TabixFile

Random access to bgzf formatted files thathave been indexed bytabix.

The file is automatically opened. The index file of file<filename> is expected to be called<filename>.tbiby default (see parameterindex).

Parameters:

filename (string) – Filename of bgzf file to be opened.
index (string) – The filename of the index. If not set, the default is toassume that the index is calledfilename.tbi
mode (char) – The file opening mode. Currently, onlyr is permitted.
parser (pysam.Parser) – sets the default parser for this tabix file. Ifparseris None, the results are returned as an unparsed string.Otherwise,parser is assumed to be a functor that will returnparsed data (see for exampleasTuple andasGTF).
encoding (string) – The encoding passed to the parser
threads (integer) – Number of threads to use for decompressing Tabix files.(Default=1)

Raises:

ValueError – if index file is missing.
IOError – if file could not be opened

close(self): closes thepysam.TabixFile.

contigs: list of chromosome names

fetch(self,reference=None,start=None,end=None,region=None,parser=None,multiple_iterators=False)

fetch one or more rows in aregion using 0-basedindexing. The region is specified byreference,start andend. Alternatively, a samtoolsregionstring can be supplied.

Withoutreference orregion all entries will be fetched.

If onlyreference is set, all reads matching onreferencewill be fetched.

Ifparser is None, the default parser will be used forparsing.

Setmultiple_iterators to true if you will be using multipleiterators on the same file at the same time. The iteratorreturned will receive its own copy of a filehandle to the fileeffectively re-opening the file. Re-opening a file createssome overhead, so beware.

header

the file header.

The file header consists of the lines at the beginning of afile that are prefixed by the comment character#.

Note

The header is returned as an iterator presenting lineswithout the newline character.

To iterate over tabix files, usetabix_iterator():

pysam.tabix_iterator(infile,parser)

return an iterator over all entries in a file.

Results are returned parsed as specified by theparser. Ifparser is None, the results are returned as an unparsed string.Otherwise,parser is assumed to be a functor that will returnparsed data (see for exampleasTuple andasGTF).

pysam.tabix_compress(filename_in,filename_out,force=False)

compressfilename_in writing the output tofilename_out.

Raise an IOError iffilename_out already exists, unlessforceis set.

pysam.tabix_index(filename,force=False,seq_col=None,start_col=None,end_col=None,preset=None,meta_char='#',intline_skip=0,zerobased=False,intmin_shift=-1,index=None,keep_original=False,csi=False)

index tab-separatedfilename using tabix.

An existing index will not be overwritten unlessforce is set.

The index will be built from coordinates in columnsseq_col,start_col andend_col.

The contents offilename have to be sorted by contig andposition - the method does not check if the file is sorted.

Column indices are 0-based. Note that this is different from thetabix command line utility where column indices start at 1.

Coordinates in the file are assumed to be 1-based unlesszerobased is set.

Ifpreset is provided, the column coordinates are taken from apreset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”,psltbl”, “pileup”.

Lines beginning withmeta_char and the firstline_skip lineswill be skipped.

Iffilename is not detected as a gzip file it will be automaticallycompressed. The original file will be removed and only the compressedfile will be retained.

By default or whenmin_shift is 0, creates a TBI index. Ifmin_shiftis greater than zero and/orcsi is True, creates a CSI index with aminimal interval size of 1<<min_shift (1<<14 if onlycsi is set).

index controls the filename which should be used for creating the index.If not set, the default is to append.tbi tofilename.

When automatically compressing files, ifkeep_original is set theuncompressed file will not be deleted.

returns the filename of the compressed data

classpysam.asTuple

converts atabix row into a python tuple.

A field in a row is accessed by numeric index.

classpysam.asVCF

converts atabix row into a VCF record withthe following fields:

Column	Field	Contents
1	contig	chromosome
2	pos	chromosomal position, zero-based
3	id	id
4	ref	reference allele
5	alt	alternate alleles
6	qual	quality
7	filter	filter
8	info	info
9	format	format specifier.

Access to genotypes is via index:

contig=vcf.contigfirst_sample_genotype=vcf[0]second_sample_genotype=vcf[1]

classpysam.asBed

converts atabix row into a bed recordwith the following fields:

Column	Field	Contents
1	contig	contig
2	start	genomic start coordinate (zero-based)
3	end	genomic end coordinate plus one(zero-based)
4	name	name of feature.
5	score	score of feature
6	strand	strand of feature
7	thickStart	thickStart
8	thickEnd	thickEnd
9	itemRGB	itemRGB
10	blockCount	number of bocks
11	blockSizes	‘,’ separated string of block sizes
12	blockStarts	‘,’ separated string of block genomicstart positions

Only the first three fields are required. Additionalfields are optional, but if one is defined, all the precedingneed to be defined as well.

classpysam.asGTF

converts atabix row into a GTF record with the followingfields:

Column	Name	Content
1	contig	the chromosome name
2	feature	The feature type
3	source	The feature source
4	start	genomic start coordinate(0-based)
5	end	genomic end coordinate(0-based)
6	score	feature score
7	strand	strand
8	frame	frame
9	attributes	the attribute field

GTF formatted entries also define the following fields thatare derived from the attributes field:

Name	Content
gene_id	the gene identifier
transcript_id	the transcript identifier

FASTA files

classpysam.FastaFile

Random access to fasta formatted files thathave been indexed byfaidx.

The file is automatically opened. The index file of file<filename> is expected to be called<filename>.fai.

Parameters:

filename (string) – Filename of fasta file to be opened.
filepath_index (string) – Optional, filename of the index. By default this isthe filename + “.fai”.
filepath_index_compressed (string) – Optional, filename of the index if fasta file is. By default this isthe filename + “.gzi”.

Raises:

ValueError – if index file is missing
IOError – if file could not be opened

close(self): close the file.

closed: bool indicating the current state of the file object.This is a read-only attribute; the close() method changes the value.

fetch(self,reference=None,start=None,end=None,region=None)

fetch sequences in aregion.

A region caneither be specified byreference,start andend.start andend denote 0-based, half-openintervals.

Alternatively, a samtoolsregion string can besupplied.

If any of the coordinates are missing they will be replaced by theminimum (start) or maximum (end) coordinate.

Note that region strings are 1-based, whilestart andend denotean interval in python coordinates.The region is specified byreference,start andend.

Returns:

string

Return type:

a string with the sequence specified by the region.

Raises:

IndexError – if the coordinates are out of range
ValueError – if the region is invalid

filename: filename associated with this object. This is a read-only attribute.

get_reference_length(self,reference): return the length of reference.

is_open(self): return true if samfile has been opened.

lengths: tuple with the lengths ofreference sequences.

nreferences: int with the number ofreference sequences in the file.This is a read-only attribute.

references: tuple with the names ofreference sequences.

FASTQ files

classpysam.FastxFile

Stream access tofasta orfastq formatted files.

The file is automatically opened.

Entries in the file can be both fastq or fasta formatted or even amixture of the two.

This file object permits iterating over all entries in thefile. Random access is not implemented. The iteration returnsobjects of typeFastqProxy

Parameters:

filename (string) – Filename of fasta/fastq file to be opened.
persist (bool) – If True (default) make a copy of the entry in the file duringiteration. If set to False, no copy will be made. This willpermit much faster iteration, but an entry will not persistwhen the iteration continues and an entry is read-only.

Notes

Prior to version 0.8.2, this class was called FastqFile.

Raises:: IOError – if file could not be opened

Examples

>>>withpysam.FastxFile(filename)asfh:...forentryinfh:...print(entry.name)...print(entry.sequence)...print(entry.comment)...print(entry.quality)>>>withpysam.FastxFile(filename)asfin,open(out_filename,mode='w')asfout:...forentryinfin:...fout.write(str(entry)+'\n')

close(self): close the file.

closed: bool indicating the current state of the file object.This is a read-only attribute; the close() method changes the value.

filename: string with the filename associated with this object.

is_open(self): return true if samfile has been opened.

classpysam.FastqProxy

A single entry in a fastq file.

get_quality_array(self,intoffset=33)→array: return quality values as integer array after subtracting offset.

name: The name of each entry in the fastq file.

quality: The quality score of each entry in the fastq file, represented as a string.

sequence: The sequence of each entry in the fastq file.

VCF/BCF files

classpysam.VariantFile(*args,**kwargs)

(filename, mode=None, index_filename=None, header=None, drop_samples=False,duplicate_filehandle=True, ignore_truncation=False, threads=1)

AVCF/BCF formatted file. The file is automaticallyopened.

If an index for a variant file exists (.csi or .tbi), it will beopened automatically. Without an index random access to recordsviafetch() is disabled.

For writing, aVariantHeader object must be provided,typically obtained from anotherVCF file/BCFfile.

Parameters:

mode (string) –
mode should ber for reading orw for writing. The default istext mode (VCF). For binary (BCF) I/O you should appendb for compressed oru for uncompressedBCF output.
Ifb is present, it must immediately followr orw. Validmodes arer,w,wh,rb,wb,wbu andwb0.For instance, to open aBCF formatted file for reading, type:
```
f=pysam.VariantFile('ex1.bcf','r')
```
If mode is not specified, we will try to auto-detect the file type. Allof the following should work:
```
f1=pysam.VariantFile('ex1.bcf')f2=pysam.VariantFile('ex1.vcf')f3=pysam.VariantFile('ex1.vcf.gz')
```
index_filename (string) – Explicit path to an index file.
header (VariantHeader) –VariantHeader object required for writing.
drop_samples (bool) – Ignore sample information when reading.
duplicate_filehandle (bool) – By default, file handles passed either directly or throughFile-like objects will be duplicated before passing them tohtslib. The duplication prevents issues where the same streamwill be closed by htslib and through destruction of thehigh-level python object. Set to False to turn offduplication.
ignore_truncation (bool) – Issue a warning, instead of raising an error if the current fileappears to be truncated due to a missing EOF marker. Only appliesto bgzipped formats. (Default=False)
threads (integer) – Number of threads to use for compressing/decompressing VCF/BCF files.Setting threads to > 1 cannot be combined withignore_truncation.(Default=1)

close(self): closes thepysam.VariantFile.

copy(self)

fetch(self,contig=None,start=None,stop=None,region=None,reopen=False,end=None,reference=None)

fetch records in aregion, specified either bycontig,start, andend (which are 0-based, half-open);or alternatively by a samtoolsregion string (which is1-based inclusive).

Withoutcontig orregion all mapped records will be fetched. Therecords will be returned ordered by contig, which will not necessarilybe the order within the file.

Setreopen to true if you will be using multiple iterators on thesame file at the same time. The iterator returned will receive itsown copy of a filehandle to the file effectively re-opening thefile. Re-opening a file incurrs some overhead, so use with care.

If onlycontig is set, all records oncontig will be fetched.If bothregion andcontig are given, an exception is raised.

Note that a bgzippedVCF.gz file without a tabix/CSI index(.tbi/.csi) or aBCF file without a CSI index can only beread sequentially.

get_reference_name(self,tid): returnreference name corresponding to numericaltid

get_tid(self,reference)

return the numericaltid corresponding toreference

returns -1 if reference is not known.

is_valid_tid(self,tid)

return True if the numericaltid is valid; False otherwise.

returns -1 if reference is not known.

new_record(self,*args,**kwargs)

Create a new emptyVariantRecord.

SeeVariantHeader.new_record()

open(self,filename,mode='r',index_filename=None,VariantHeaderheader=None,drop_samples=False,duplicate_filehandle=True,ignore_truncation=False,threads=1)

open a vcf/bcf file.

If open is called on an existing VariantFile, the current file will beclosed and a new file will be opened.

reset(self): reset file position to beginning of file just after the header.

subset_samples(self,include_samples): Read only a subset of samples to reduce processing time and memory.Must be called prior to retrieving records.

write(self,VariantRecordrecord)→int

write a singlepysam.VariantRecord to disk.

returns the number of bytes written.

classpysam.VariantHeader

header information for aVariantFile object

add_line(self,line): Add a metadata line to this header

add_meta(self,key,value=None,items=None): Add metadata to this header

add_record(self,VariantHeaderRecordrecord): Add an existingVariantHeaderRecord to this header

add_sample(self,name): Add a new sample to this header

add_samples(self,*args): Add several new samples to this header.This function takes multiple arguments, each of which maybe either a sample name or an iterable returning sample names(e.g., a list of sample names).

alts

alt metadata (dict ID->record).

The data returned just a snapshot of alt records, is createdevery time the property is requested, and modifications willnot be reflected in the header metadata and vice versa.

i.e. it is just a dict that reflects the state of alt recordsat the time it is created.

contigs: contig information (VariantHeaderContigs)

copy(self)

filters: filter metadata (VariantHeaderMetadata)

formats: format metadata (VariantHeaderMetadata)

info: info metadata (VariantHeaderMetadata)

merge(self,VariantHeaderheader)

new_record(self,contig=None,start=0,stop=0,alleles=None,id=None,qual=None,filter=None,info=None,samples=None,**kwargs)

Create a new empty VariantRecord.

Arguments are currently experimental. Use with caution and expectchanges in upcoming releases.

records: header records (VariantHeaderRecords)

samples

version: VCF version

classpysam.VariantHeaderRecord(*args,**kwargs)

header record from aVariantHeader object

attrs: sequence of additional header attributes

get(self,key,default=None): D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.

items(self): D.items() -> list of D’s (key, value) pairs, as 2-tuples

iteritems(self): D.iteritems() -> an iterator over the (key, value) items of D

iterkeys(self): D.iterkeys() -> an iterator over the keys of D

itervalues(self): D.itervalues() -> an iterator over the values of D

key: header key (the part before ‘=’, in FILTER/INFO/FORMAT/contig/fileformat etc.)

keys(self): D.keys() -> list of D’s keys

pop(self,key,default=_nothing)

remove(self)

type

FILTER, INFO, FORMAT, CONTIG, STRUCTURED, or GENERIC

Type:: header type

update(self,items=None,**kwargs)

D.update([E, ]**F) -> None.

Update D from dict/iterable E and F.

value: header value. Set only for generic lines, None for FILTER/INFO, etc.

values(self): D.values() -> list of D’s values

classpysam.VariantRecord(*args,**kwargs)

Variant record

alleles: tuple of reference allele followed by alt alleles

alts: tuple of alt alleles

chrom: chromosome/contig name

contig: chromosome/contig name

copy(self): return a copy of this VariantRecord object

filter: filter information (seeVariantRecordFilter)

format: sample format metadata (seeVariantRecordFormat)

id: record identifier or None if not available

info: info data (seeVariantRecordInfo)

pos: record start position on chrom/contig (1-based inclusive)

qual: phred scaled quality score or None if not available

ref: reference allele

rid: internal reference id number

rlen: record length on chrom/contig (aka rec.stop - rec.start)

samples: sample data (seeVariantRecordSamples)

start: record start position on chrom/contig (0-based inclusive)

stop: record stop position on chrom/contig (0-based exclusive)

translate(self,VariantHeaderdst_header)

HTSFile

HTSFile is the base class forpysam.AlignmentFile andpysam.VariantFile.

classpysam.HTSFile

Base class for HTS file types

add_hts_options(self,format_options=None): Given a list of key=value format option strings, add them to an open htsFile

category: General file format category. One of UNKNOWN, ALIGNMENTS,VARIANTS, INDEX, REGIONS

check_truncation(self,ignore_truncation=False): Check if file is truncated.

close(self)

closed: return True if HTSFile is closed.

compression

File compression.

One of NONE, GZIP, BGZF, CUSTOM.

description: Vaguely human readable description of the file format

flush(self): Flush any buffered data to the underlying output stream.

format

File format.

One of UNKNOWN, BINARY_FORMAT, TEXT_FORMAT, SAM, BAM,BAI, CRAM, CRAI, VCF, BCF, CSI, GZI, TBI, BED.

get_reference_name(self,tid): returncontig name corresponding to numericaltid

get_tid(self,contig)

return the numericaltid corresponding tocontig

returns -1 if contig is not known.

is_bam: return True if HTSFile is reading or writing a BAM alignment file

is_bcf: return True if HTSFile is reading or writing a BCF variant file

is_closed: return True if HTSFile is closed.

is_cram: return True if HTSFile is reading or writing a BAM alignment file

is_open: return True if HTSFile is open and in a valid state.

is_read: return True if HTSFile is open for reading

is_sam: return True if HTSFile is reading or writing a SAM alignment file

is_valid_reference_name(self,contig): return True if the contig namecontig is valid; False otherwise.

is_valid_tid(self,tid)

return True if the numericaltid is valid; False otherwise.

returns -1 if contig is not known.

is_vcf: return True if HTSFile is reading or writing a VCF variant file

is_write: return True if HTSFile is open for writing

parse_region(self,contig=None,start=None,stop=None,region=None,tid=None,reference=None,end=None)

parse alternative ways to specify a genomic region. A region caneither be specified bycontig,start andstop.start andstop denote 0-based, half-openintervals.reference andend are also accepted forbackward compatibility as synonyms forcontig andstop, respectively.

Alternatively, a samtoolsregion string can besupplied.

If any of the coordinates are missing they will be replaced bythe minimum (start) or maximum (stop) coordinate.

Note that region strings are 1-based inclusive, whilestartandstop denote an interval in 0-based, half-opencoordinates (like BED files and Python slices).

Ifcontig orregion or are*, unmapped reads at the endof a BAM file will be returned. Setting either to. williterate from the beginning of the file.

Returns:: a tuple offlag,tid,start andstop. The flag indicates whether no coordinates weresupplied and the genomic region is the complete genomic space.
Return type:: tuple
Raises:: ValueError – for invalid or out of bounds regions.

reset(self)

reset file position to beginning of file just after the header.

Returns:: The file position after moving the file pointer.
Return type:: pointer

seek(self,uint64_toffset,intwhence=io.SEEK_SET): move file pointer to positionoffset, seepysam.HTSFile.tell().

tell(self): return current file position, seepysam.HTSFile.seek().

version: Tuple of file format version numbers (major, minor)

Movatterモバイル変換

Introduction

API

SAM/BAM/CRAM files

Tabix files

FASTA files

FASTQ files

VCF/BCF files

HTSFile