Introduction
Pysam is a python module that makes it easy to read and manipulatemapped short read sequence data stored in SAM/BAM files. It is alightweight wrapper of thehtslib C-API.
This page provides a quick introduction in using pysam followed by theAPI. SeeWorking with BAM/CRAM/SAM-formatted files for more detailed usage instructions.
To use the module to read a file in BAM format, create aAlignmentFile object:
importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")
Once a file is opened you can iterate over all of the reads mapping toa specified region usingfetch(). Eachiteration returns aAlignedSegment object whichrepresents a single read along with its fields and optional tags:
forreadinsamfile.fetch('chr1',100,120):print(read)samfile.close()
To give:
EAS56_57:6:190:289:82099<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;69CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA01921EAS56_57:6:190:289:82099<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;137AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC73641EAS51_64:3:190:727:3080102<<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<84499GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG99181...
You can also write to aAlignmentFile:
importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")pairedreads=pysam.AlignmentFile("allpaired.bam","wb",template=samfile)forreadinsamfile.fetch():ifread.is_paired:pairedreads.write(read)pairedreads.close()samfile.close()
An alternative way of accessing the data in a SAM file is by iteratingover each base of a specified region using thepileup() method. Each iteration returns aPileupColumn which represents all the reads in the SAMfile that map to a single base in the reference sequence. The list ofreads are represented asPileupRead objects in thePileupColumn.pileups property:
importpysamsamfile=pysam.AlignmentFile("ex1.bam","rb")forpileupcolumninsamfile.pileup("chr1",100,120):print("\ncoverage at base%s =%s"%(pileupcolumn.pos,pileupcolumn.n))forpileupreadinpileupcolumn.pileups:ifnotpileupread.is_delandnotpileupread.is_refskip:# query position is None if is_del or is_refskip is set.print('\tbase in read%s =%s'%(pileupread.alignment.query_name,pileupread.alignment.query_sequence[pileupread.query_position]))samfile.close()
The above code outputs:
coverageatbase99=1baseinreadEAS56_57:6:190:289:82=Acoverageatbase100=1baseinreadEAS56_57:6:190:289:82=Gcoverageatbase101=1baseinreadEAS56_57:6:190:289:82=Gcoverageatbase102=2baseinreadEAS56_57:6:190:289:82=GbaseinreadEAS51_64:3:190:727:308=G...
Commands available insamtools are available as simplefunction calls. For example:
pysam.sort("-o","output.bam","ex1.bam")
corresponds to the command line:
samtoolssort-ooutput.bamex1.bam
Analogous toAlignmentFile, aTabixFile allows fast random access to compressed andtabix indexed tab-separated file formats with genomic data:
importpysamtabixfile=pysam.TabixFile("example.gtf.gz")forgtfintabixfile.fetch("chr1",1000,2000):print(gtf.contig,gtf.start,gtf.end,gtf.gene_id)
TabixFile implements lazy parsing in order to iterateover large tables efficiently.
More detailed usage instructions are available atWorking with BAM/CRAM/SAM-formatted files.
Note
Coordinates in pysam are always 0-based (following the pythonconvention). SAM text files use 1-based coordinates.
Note
- The above examples can be run in the
testsdirectory of the installation directory. Type ‘make’ before running them.
See also
https://github.com/pysam-developers/pysam
The pysam code repository, containing source code and downloadinstructions
http://pysam.readthedocs.org/en/latest/
The pysam website containing documentation
API
SAM/BAM/CRAM files
Objects of typeAlignmentFile allow working withBAM/SAM formatted files.
- classpysam.AlignmentFile
AlignmentFile(filepath_or_object, mode=None, template=None,reference_names=None, reference_lengths=None, text=NULL,header=None, add_sq_text=False, check_header=True, check_sq=True,reference_filename=None, filename=None, index_filename=None,filepath_index=None, require_index=False, duplicate_filehandle=True,ignore_truncation=False, threads=1)
Iffilepath_or_object is a string, the file is automaticallyopened. Iffilepath_or_object is a python File object, thealready opened file will be used.
If the file is opened for reading and an index exists (if file is BAM, a.bai file or if CRAM a .crai file), it will be opened automatically.index_filename may be specified explicitly. If the index is not namedin the standard manner, not located in the same directory as theBAM/CRAM file, or is remote. Without an index, random access via
fetch()andpileup()is disabled.For writing, the header of aSAM file/BAM file canbe constituted from several sources (see also the samtools formatspecification):
Iftemplate is given, the header is copied from anotherAlignmentFile (template must be a
AlignmentFile).Ifheader is given, the header is built from amulti-level dictionary.
Iftext is given, new header text is copied from rawtext.
The names (reference_names) and lengths(reference_lengths) are supplied directly as lists.
When reading or writing a CRAM file, the filename of a FASTA-formattedreference can be specified withreference_filename.
By default, if a file is opened in mode ‘r’, it is checkedfor a valid header (check_header = True) and a definition ofchromosome names (check_sq = True).
- Parameters:
mode (string) –
mode should be
rfor reading orwfor writing. Thedefault is text mode (SAM). For binary (BAM)I/O you should appendbfor compressed oruforuncompressedBAM output. Usehto output headerinformation in text (TAM) mode. UsecforCRAM formatted files.If
bis present, it must immediately followrorw. Valid modes arer,w,wh,rb,wb,wbu,wb0,rcandwc. For instance, to open aBAM formatted file for reading, type:f=pysam.AlignmentFile('ex1.bam','rb')
If mode is not specified, the method will try to auto-detectin the order ‘rb’, ‘r’, thus both the following should work:
f1=pysam.AlignmentFile('ex1.bam')f2=pysam.AlignmentFile('ex1.sam')
template (AlignmentFile) – when writing, copy header from filetemplate.
header (dict orAlignmentHeader) – when writing, build header from a multi-level dictionary. Thefirst level are the four types (‘HD’, ‘SQ’, …). The secondlevel are a list of lines, with each line being a list oftag-value pairs. The header is constructed first from all thedefined fields, followed by user tags in alphabeticalorder. Alternatively, an
AlignmentHeaderobject can be passed directly.text (string) – when writing, use the string provided as the header
reference_names (list) – see reference_lengths
reference_lengths (list) – when writing or opening a SAM file without header build headerfrom list of chromosome names and lengths. By default, ‘SQ’and ‘LN’ tags will be added to the header text. This optioncan be changed by unsetting the flagadd_sq_text.
add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permitsconstructionSAM formatted files without a header.
add_sam_header (bool) – when outputting SAM the default is to output a header. This isequivalent to opening the file in ‘wh’ mode. If this option isset to False, no header will be output. To read such a file,setcheck_header=False.
check_header (bool) – obsolete: when reading a SAM file, check if header is present(default=True)
check_sq (bool) – when reading, check if SQ entries are present in header(default=True)
reference_filename (string) – Path to a FASTA-formatted reference file. Valid only for CRAM files.When reading a CRAM file, this overrides both
$REF_PATHand the URLspecified in the header (URtag), which are normally used to findthe reference.index_filename (string) – Explicit path to the index file. Only needed if the index is notnamed in the standard manner, not located in the same directory asthe BAM/CRAM file, or is remote. An IOError is raised if the indexcannot be found or is invalid.
filepath_index (string) – Alias forindex_filename.
require_index (bool) – When reading, require that an index file is present and is valid orraise an IOError. (default=False)
filename (string) – Alternative to filepath_or_object. Filename of the fileto be opened.
duplicate_filehandle (bool) – By default, file handles passed either directly or throughFile-like objects will be duplicated before passing them tohtslib. The duplication prevents issues where the same streamwill be closed by htslib and through destruction of thehigh-level python object. Set to False to turn offduplication.
ignore_truncation (bool) – Issue a warning, instead of raising an error if the current fileappears to be truncated due to a missing EOF marker. Only appliesto bgzipped formats. (Default=False)
format_options (list) – A list of key=value strings, as accepted by –input-fmt-option and–output-fmt-option in samtools.
threads (integer) – Number of threads to use for compressing/decompressing BAM/CRAM files.Setting threads to > 1 cannot be combined withignore_truncation.(Default=1)
- check_index(self)
return True if index is present.
- Raises:
AttributeError – if htsfile isSAM formatted and thus has no index.
ValueError – if htsfile is closed or index could not be opened.
- close(self)
closes the
pysam.AlignmentFile.
- count(self,contig=None,start=None,stop=None,region=None,until_eof=False,read_callback='nofilter',reference=None,end=None)
count the number of reads inregion
The region is specified bycontig,start andstop.reference andend are also accepted for backwardcompatibility as synonyms forcontig andstop,respectively. Alternatively, asamtoolsregionstring can be supplied.
ASAM file does not allow random access and ifregion orcontig are given, an exception is raised.
- Parameters:
contig (string) – reference_name of the genomic region (chromosome)
start (int) – start of the genomic region (0-based inclusive)
stop (int) – end of the genomic region (0-based exclusive)
region (string) – a region string in samtools format.
until_eof (bool) – count until the end of the file, possibly includingunmapped reads as well.
read_callback (string orfunction) –
select a call-back to ignore reads when counting. It canbe either a string with the following values:
allskip reads in which any of the followingflags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,BAM_FDUP
nofilteruses every single read
Alternatively,read_callback can be a function
check_read(read)that should return True only forthose reads that shall be included in the counting.reference (string) – backward compatible synonym forcontig
end (int) – backward compatible synonym forstop
- Raises:
ValueError – if the genomic coordinates are out of range or invalid.
- count_coverage(self,contig,start=None,stop=None,region=None,quality_threshold=15,read_callback='all',reference=None,end=None)
count the coverage of genomic positions by reads inregion.
The region is specified bycontig,start andstop.reference andend are also accepted for backwardcompatibility as synonyms forcontig andstop,respectively. Alternatively, asamtoolsregionstring can be supplied. The coverage is computed per-base [ACGT].
- Parameters:
contig (string) – reference_name of the genomic region (chromosome)
start (int) – start of the genomic region (0-based inclusive). If notgiven, count from the start of the chromosome.
stop (int) – end of the genomic region (0-based exclusive). If not given,count to the end of the chromosome.
region (string) – a region string.
quality_threshold (int) – quality_threshold is the minimum quality score (in phred) abase has to reach to be counted.
read_callback (string orfunction) –
select a call-back to ignore reads when counting. It canbe either a string with the following values:
allskip reads in which any of the followingflags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,BAM_FDUP
nofilteruses every single read
Alternatively,read_callback can be a function
check_read(read)that should return True only forthose reads that shall be included in the counting.reference (string) – backward compatible synonym forcontig
end (int) – backward compatible synonym forstop
- Raises:
ValueError – if the genomic coordinates are out of range or invalid.
- Returns:
four array.arrays of the same length in order A C G T
- Return type:
- fetch(self,contig=None,start=None,stop=None,region=None,tid=None,until_eof=False,multiple_iterators=False,reference=None,end=None)
fetch reads aligned in aregion.
See
parse_region()for more informationon how genomic regions can be specified.reference andend are also accepted for backward compatibility as synonymsforcontig andstop, respectively.Without acontig orregion all mapped reads in the filewill be fetched. The reads will be returned ordered by referencesequence, which will not necessarily be the order within thefile. This mode of iteration still requires an index. If there isno index, useuntil_eof=True.
If onlycontig is set, all reads aligned tocontigwill be fetched.
ASAM file does not allow random access. Ifregionorcontig are given, an exception is raised.
- Parameters:
until_eof (bool) – Ifuntil_eof is True, all reads from the current fileposition will be returned in order as they are within thefile. Using this option will also fetch unmapped reads.
multiple_iterators (bool) – Ifmultiple_iterators is True, multipleiterators on the same file can be used at the same time. Theiterator returned will receive its own copy of a filehandle tothe file effectively re-opening the file. Re-opening a filecreates some overhead, so beware.
- Returns:
An iterator over a collection of reads.
- Return type:
IteratorRow
- Raises:
ValueError – if the genomic coordinates are out of range or invalid or the file does not permit random access to genomic coordinates.
- find_introns(self,read_iterator)
Return a dictionary {(start, stop): count}Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings),and their support ( = number of reads ).
read_iterator can be the result of a .fetch(…) call.Or it can be a generator filtering such reads. Examplesamfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)
- find_introns_slow(self,read_iterator)
Return a dictionary {(start, stop): count}Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings),and their support ( = number of reads ).
read_iterator can be the result of a .fetch(…) call.Or it can be a generator filtering such reads. Examplesamfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse)
- get_index_statistics(self)
return statistics about mapped/unmapped reads per chromosome asthey are stored in the index, similarly to the statistics printedby the
samtoolsidxstatscommand.CRAI indexes do not record these statistics, so for a CRAM filewith a .crai index the returned statistics will all be 0.
- Returns:
a list of records for each chromosome. Each record has theattributes ‘contig’, ‘mapped’, ‘unmapped’ and ‘total’.
- Return type:
- get_tid(self,reference)
return the numericaltid corresponding toreference
returns -1 if reference is not known.
- getrname(self,tid)
deprecated, use
get_reference_name()instead
- has_index(self)
return true if htsfile has an existing (and opened) index.
- head(self,n,multiple_iterators=True)
return an iterator over the first n alignments.
This iterator is is useful for inspecting the bam-file.
- Parameters:
multiple_iterators (bool) – is set to True by default in order toavoid changing the current file position.
- Returns:
an iterator over a collection of reads
- Return type:
IteratorRowHead
- is_valid_tid(self,inttid)
return True if the numericaltid is valid; False otherwise.
Note that the unmapped tid code (-1) counts as an invalid.
- lengths
tuple of the lengths of thereference sequences. This is aread-only attribute. The lengths are in the same order as
pysam.AlignmentFile.references
- mapped
int with total number of mapped alignments according to thestatistics recorded in the index. This is a read-onlyattribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)
- mate(self,AlignedSegmentread)
return the mate of
pysam.AlignedSegmentread.Note
Calling this method will change the file position.This might interfere with any iterators that havenot re-opened the file.
Note
This method is too slow for high-throughput processing.If a read needs to be processed with its mate, workfrom a read name sorted file or, better, cache reads.
- Returns:
the mate
- Return type:
- Raises:
ValueError – if the read is unpaired or the mate is unmapped
- nocoordinate
int with total number of reads without coordinates according to thestatistics recorded in the index, i.e., the statistic printed for “*”by the
samtoolsidxstatscommand. This is a read-only attribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)
- pileup(self,contig=None,start=None,stop=None,region=None,reference=None,end=None,**kwargs)
perform apileup within aregion. The region isspecified bycontig,start andstop (using0-based indexing).reference andend are also accepted forbackward compatibility as synonyms forcontig andstop,respectively. Alternatively, a samtools ‘region’ stringcan be supplied.
Without ‘contig’ or ‘region’ all reads will be used for thepileup. The reads will be returned ordered bycontig sequence, which will not necessarily be theorder within the file.
Note thatSAM formatted files do not allow randomaccess. In these files, if a ‘region’ or ‘contig’ aregiven an exception is raised.
Note
‘all’ reads which overlap the region are returned. Thefirst base returned will be the first base of the firstread ‘not’ necessarily the first base of the region usedin the query.
- Parameters:
truncate (bool) – By default, the samtools pileup engine outputs all readsoverlapping a region. If truncate is True and a region isgiven, only columns in the exact region specified arereturned.
max_depth (int) – Maximum read depth permitted. The default limit is ‘8000’.
stepper (string) –
The stepper controls how the iterator advances.Possible options for the stepper are
allskip reads in which any of the following flags are set:BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilteruses every single read turning off any filtering.
samtoolssame filter and read processing as in samtoolspileup. For full compatibility, this requires a‘fastafile’ to be given. The following options all pertainto filtering of the
samtoolsstepper.
fastafile (
FastaFileobject.) – This is required for some of the steppers.ignore_overlaps (bool) – If set to True, detect if read pairs overlap and only takethe higher quality base. This is the default.
flag_filter (int) – ignore reads where any of the bits in the flag are set. The default isBAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP.
flag_require (int) – only use reads where certain flags are set. The default is 0.
ignore_orphans (bool) – ignore orphans (paired reads that are not in a proper pair).The default is to ignore orphans.
min_base_quality (int) – Minimum base quality. Bases below the minimum quality willnot be output. The default is 13.
adjust_capq_threshold (int) – adjust mapping quality. The default is 0 for noadjustment. The recommended value for adjustment is 50.
min_mapping_quality (int) – only use reads above a minimum mapping quality. The default is 0.
compute_baq (bool) – re-alignment computing per-Base Alignment Qualities (BAQ). Thedefault is to do re-alignment. Realignment requires a referencesequence. If none is present, no realignment will be performed.
redo_baq (bool) – recompute per-Base Alignment Quality on the fly ignoringexisting base qualities. The default is False (use existingbase qualities).
- Returns:
an iterator over genomic positions.
- Return type:
IteratorColumn
- text
deprecated, use
referencesandlengthsinstead
- unmapped
int with total number of unmapped reads according to the statisticsrecorded in the index. This number of reads includes the number of readswithout coordinates. This is a read-only attribute.(This will be 0 for a CRAM file indexed by a .crai index, as thatindex format does not record these statistics.)
- write(self,AlignedSegmentread)→int
write a single
pysam.AlignedSegmentto disk.- Raises:
ValueError – if the writing failed
- Returns:
the number of bytes written. If the file is closed,this will be 0.
- Return type:
- classpysam.AlignmentHeader
header information for a
AlignmentFileobject- Parameters:
header_dict (dict) – build header from a multi-level dictionary. Thefirst level are the four types (‘HD’, ‘SQ’, …). The secondlevel are a list of lines, with each line being a list oftag-value pairs. The header is constructed first from all thedefined fields, followed by user tags in alphabeticalorder. Alternatively, an
AlignmentHeaderobject can be passed directly.text (string) – use the string provided as the header
reference_names (list) – see reference_lengths
reference_lengths (list) – build header from list of chromosome names and lengths. Bydefault, ‘SQ’ and ‘LN’ tags will be added to the headertext. This option can be changed by unsetting the flagadd_sq_text.
add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permitsconstructionSAM formatted files without a header.
- copy(self)
- classmethodfrom_dict(cls,header_dict)
- classmethodfrom_references(cls,reference_names,reference_lengths,text=None,add_sq_text=True)
- classmethodfrom_text(cls,text)
- get(self,*args)
- get_reference_length(self,reference)
- get_reference_name(self,tid)
- get_tid(self,reference)
return the numericaltid corresponding toreference
returns -1 if reference is not known.
- is_valid_tid(self,inttid)
return True if the numericaltid is valid; False otherwise.
Note that the unmapped tid code (-1) counts as an invalid.
- items(self)
- iteritems(self)
- keys(self)
- lengths
tuple of the lengths of thereference sequences. This is aread-only attribute. The lengths are in the same order as
pysam.AlignmentFile.references
- to_dict(self)
return two-level dictionary with header information from the file.
The first level contains the record (
HD,SQ, etc) andthe second level contains the fields (VN,LN, etc).The parser is validating and will raise an AssertionError ifif encounters any record or field tags that are not part ofthe SAM specification. Use the
pysam.AlignmentFile.textattribute to get the unparsedheader.The parsing follows the SAM format specification with theexception of the
CLfield. This option will consume therest of a header line irrespective of any additional fields.This behaviour has been added to accommodate command lineoptions that contain characters that are not valid fieldseparators.If no @SQ entries are within the text section of the header,this will be automatically added from the reference names andlengths stored in the binary part of the header.
- values(self)
AnAlignedSegment represents an aligned segment withina SAM/BAM file.
- classpysam.AlignedSegment(AlignmentHeaderheader=None)
Class representing an aligned segment.
This class stores a handle to the samtools C-structure representingan aligned read. Member read access is forwarded to the C-structureand converted into python objects. This implementation should be fast,as only the data needed is converted.
For write access, the C-structure is updated in-place. This isnot the most efficient way to build BAM entries, as the variablelength data is concatenated and thus needs to be resized ifa field is updated. Furthermore, the BAM entry might bein an inconsistent state.
One issue to look out for is that the sequence should alwaysbe setbefore the quality scores. Setting the sequence willalso erase any quality scores that were set previously.
- Parameters:
header –
AlignmentHeaderobject to map numericalidentifiers to chromosome names. If not given, an emptyheader is created.
- aend
deprecated, use
reference_endinstead.
- alen
deprecated, use
reference_lengthinstead.
- aligned_pairs
deprecated, use
get_aligned_pairs()instead.
- bin
properties bin
- blocks
deprecated, use
get_blocks()instead.
- cigar
deprecated, use
cigarstringorcigartuplesinstead.
- cigarstring
thecigar alignment as a string.
The cigar string is a string of alternating integersand characters denoting the length and the type ofan operation.
Note
The order length,operation is specified in theSAM format. It is different from the order ofthe
cigarproperty.Returns None if not present.
To unset the cigarstring, assign None or theempty string.
- cigartuples
thecigar alignment. The alignmentis returned as a list of tuples of (operation, length).
If the alignment is not present, None is returned.
The operations are:
M
pysam.CIGAR_OPS.CMATCH
0
I
pysam.CIGAR_OPS.CINS
1
D
pysam.CIGAR_OPS.CDEL
2
N
pysam.CIGAR_OPS.CREF_SKIP
3
S
pysam.CIGAR_OPS.CSOFT_CLIP
4
H
pysam.CIGAR_OPS.CHARD_CLIP
5
P
pysam.CIGAR_OPS.CPAD
6
=
pysam.CIGAR_OPS.CEQUAL
7
X
pysam.CIGAR_OPS.CDIFF
8
B
pysam.CIGAR_OPS.CBACK
9
Note
The output is a list of (operation, length) tuples, such as
[(0,30)].This is different from the SAM specification andthecigarstringproperty, which uses a(length, operation) order, for example:30M.To unset the cigar property, assign an empty listor None.
- compare(self,AlignedSegmentother)
return -1,0,1, if contents in this are binary<,=,> toother
- flag
properties flag
- classmethodfrom_dict(cls,sam_dict,AlignmentHeaderheader)
parses a dictionary representation of the aligned segment.
- Parameters:
sam_dict – dictionary of alignment values, keys corresponding to output from
todict().
- classmethodfromstring(cls,sam,AlignmentHeaderheader)
parses a string representation of the aligned segment.
The input format should be valid SAM format.
- Parameters:
sam –SAM formatted string
- get_aligned_pairs(self,matches_only=False,with_seq=False,with_cigar=False)
a list of aligned read (query) and reference positions.
Each item in the returned list is a tuple consisting ofthe 0-based offset from the start of the read sequencefollowed by the 0-based reference position.
For inserts, deletions, skipping either query or referenceposition may be None.
For padding in the reference, the reference position willalways be None.
- Parameters:
matches_only (bool) – If True, only matched bases are returned — no None on eitherside.
with_seq (bool) – If True, return a third element in the tuple containing thereference sequence. For CIGAR ‘P’ (padding in the reference)operations, the third tuple element will be None. Substitutionsare lower-case. This option requires an MD tag to be present.
with_cigar (bool) – If True, return an extra element in the tuple containing theCIGAR operator corresponding to this position tuple.
- Returns:
aligned_pairs
- Return type:
list of tuples
- get_blocks(self)
a list of start and end positions ofaligned gapless blocks.
The start and end positions are in genomiccoordinates.
Blocks are not normalized, i.e. two blocksmight be directly adjacent. This happens ifthe two blocks are separated by an insertionin the read.
- get_cigar_stats(self)
summary of operations in cigar string.
The output order in the array is “MIDNSHP=X” followed by afield for the NM tag. If the NM tag is not present, thisfield will always be 0. (Accessing this field via index -1avoids changes if more CIGAR operators are added in future.)
M
pysam.CIGAR_OPS.CMATCH
0
I
pysam.CIGAR_OPS.CINS
1
D
pysam.CIGAR_OPS.CDEL
2
N
pysam.CIGAR_OPS.CREF_SKIP
3
S
pysam.CIGAR_OPS.CSOFT_CLIP
4
H
pysam.CIGAR_OPS.CHARD_CLIP
5
P
pysam.CIGAR_OPS.CPAD
6
=
pysam.CIGAR_OPS.CEQUAL
7
X
pysam.CIGAR_OPS.CDIFF
8
B
pysam.CIGAR_OPS.CBACK
9
NM
NM tag
10 or -1
If no cigar string is present, empty arrays will be returned.
- Returns:
two arrays. The first contains the nucleotide counts withineach cigar operation, the second contains the number of blocksfor each cigar operation.
- Return type:
arrays
- get_forward_qualities(self)
return the original base qualities of the read sequence,in the same format as the
query_qualitiesproperty.Reads mapped to the reverse strand have their base qualities storedreversed in the BAM file. This method returns such reads’ base qualitiesreversed back to their original orientation.
- get_forward_sequence(self)
return the original read sequence.
Reads mapped to the reverse strand are stored reverse complemented inthe BAM file. This method returns such reads reverse complemented backto their original orientation.
Returns None if the record has no query sequence.
- get_overlap(self,uint32_tstart,uint32_tend)
return number of aligned bases of read overlapping the intervalstart andend on the reference sequence.
Return None if cigar alignment is not available.
- get_reference_positions(self,full_length=False)
a list of reference positions that this read aligns to.
By default, this method returns the (0-based) positions on thereference that are within the read’s alignment, leaving gapscorresponding to deletions and other reference skips.
Whenfull_length is True, the returned list is the same lengthas the read and additionally includes None values correspondingto insertions or soft-clipping, i.e., to bases of the read thatare not aligned to a reference position.(See also
get_aligned_pairs()which additionally returnsthe corresponding positions along the read.)
- get_reference_sequence(self)
return the reference sequence in the region that is covered by thealignment of the read to the reference.
This method requires the MD tag to be set.
- get_tag(self,tag,with_value_type=False)
retrieves data from the optional alignment sectiongiven a two-lettertag denoting the field.
The returned value is cast into an appropriate python type.
This method is the fastest way to access the optionalalignment section if only few tags need to be retrieved.
Possible value types are “AcCsSiIfZHB” (see BAM formatspecification) as well as additional value type ‘d’ asimplemented in htslib.
- Parameters:
tag – data tag.
with_value_type – Optional[bool]if set to True, the return value is a tuple of (tag value, typecode). (default False)
- Returns:
A python object with the value of thetag. The type of theobject depends on the data type in the data record.
- Raises:
KeyError – Iftag is not present, a KeyError is raised.
- get_tags(self,with_value_type=False)
the fields in the optional alignment section.
Returns a list of all fields in the optionalalignment section. Values are converted to appropriate pythonvalues. For example:
[(NM,2),(RG,"GJP00TM04")]Ifwith_value_type is set, the value type as encode inthe AlignedSegment record will be returned as well:
[(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]
This method will convert all values in the optional alignmentsection. When getting only one or few tags, please see
get_tag()for a quicker way to achieve this.
- has_tag(self,tag)
returns true if the optional alignment sectioncontains a giventag.
- infer_query_length(self,always=False)
infer query length from CIGAR alignment.
This method deduces the query length from the CIGAR alignmentbut does not include hard-clipped bases.
Returns None if CIGAR alignment is not present.
Ifalways is set to True,infer_read_length is used instead.This is deprecated and only present for backward compatibility.
- infer_read_length(self)
infer read length from CIGAR alignment.
This method deduces the read length from the CIGAR alignmentincluding hard-clipped bases.
Returns None if CIGAR alignment is not present.
- inferred_length
deprecated, use
infer_query_length()instead.
- is_duplicate
true if optical or PCR duplicate
- is_forward
true if read is mapped to forward strand(implemented in terms of
is_reverse)
- is_mapped
true if read itself is mapped(implemented in terms of
is_unmapped)
- is_paired
true if read is paired in sequencing
- is_proper_pair
true if read is mapped in a proper pair
- is_qcfail
true if QC failure
- is_read1
true if this is read1
- is_read2
true if this is read2
- is_reverse
true if read is mapped to reverse strand
- is_secondary
true if not primary alignment
- is_supplementary
true if this is a supplementary alignment
- is_unmapped
true if read itself is unmapped
- isize
deprecated, use
template_lengthinstead.
- mapping_quality
mapping quality
- mapq
deprecated, use
mapping_qualityinstead.
- mate_is_forward
true if the mate is mapped to forward strand(implemented in terms of
mate_is_reverse)
- mate_is_mapped
true if the mate is mapped(implemented in terms of
mate_is_unmapped)
- mate_is_reverse
true if the mate is mapped to reverse strand
- mate_is_unmapped
true if the mate is unmapped
- modified_bases
Modified bases annotations from MM/ML tags. The output isDict[(canonical base, strand, modification)] -> [ (pos,qual), …]with qual being (256*probability), or -1 if unknown.Strand==0 for forward and 1 for reverse strand modification
- modified_bases_forward
Modified bases annotations from MM/ML tags. The output isDict[(canonical base, strand, modification)] -> [ (pos,qual), …]with qual being (256*probability), or -1 if unknown.Strand==0 for forward and 1 for reverse strand modification.The positions are with respect to the original sequence from get_forward_sequence()
- mpos
deprecated, use
next_reference_startinstead.
- mrnm
deprecated, use
next_reference_idinstead.
- next_reference_start
the position of the mate/next read.
- overlap(self)
deprecated, use
get_overlap()instead.
- pnext
deprecated, use
next_reference_startinstead.
- pos
deprecated, use
reference_startinstead.
- positions
deprecated, use
get_reference_positions()instead.
- qend
deprecated, use
query_alignment_endinstead.
- qlen
deprecated, use
query_alignment_lengthinstead.
- qname
deprecated, use
query_nameinstead.
- qqual
deprecated, use
query_alignment_qualitiesorquery_alignment_qualities_strinstead.
- qstart
deprecated, use
query_alignment_startinstead.
- qual
deprecated, use
query_qualitiesorquery_qualities_strinstead.
- query
deprecated, use
query_alignment_sequenceinstead.
- query_alignment_end
end index of the aligned query portion of the sequence (0-based,exclusive).
This the index just past the last base in
query_sequencethat is not soft-clipped. (For unmapped reads and when CIGAR isunavailable, this will be the length of the query/read.)
- query_alignment_length
length of the aligned query sequence.
This is the number of bases in
query_sequencethat are notsoft-clipped. (For unmapped reads and when CIGAR is unavailable, thiswill equal the length of the query/read.)
- query_alignment_qualities
aligned query sequence quality values (None if not present). Theseare the quality values that correspond to
query_alignment_sequence, that is, they exclude qualities ofsoft clipped bases. This is equal toquery_qualities[query_alignment_start:query_alignment_end].Quality scores are returned as a python array of unsignedchars. Note that this is not the ASCII-encoded value typicallyseen in FASTQ or SAM formatted files. Thus, no offset of 33needs to be subtracted.
This property is read-only.
- query_alignment_qualities_str
aligned query sequence quality values, returned as an ASCII-encoded stringsimilar to that in FASTQ or SAM files, or None if base qualities are not present.These are the quality values that correspond to
query_alignment_sequence,i.e., excluding qualities corresponding to soft-clipped bases.This property is read-only.
- query_alignment_sequence
aligned portion of the read.
This is a substring of
query_sequencethat excludes flankingbases that weresoft clipped (None if not present). Itis equal toquery_sequence[query_alignment_start:query_alignment_end].SAM/BAM files may include extra flanking bases that are notpart of the alignment. These bases may be the result of theSmith-Waterman or other algorithms, which may not requirealignments that begin at the first residue or end at the last.In addition, extra sequencing adapters, multiplex identifiers,and low-quality bases that were not considered for alignmentmay have been retained.
- query_alignment_start
start index of the aligned query portion of the sequence (0-based,inclusive).
This the index of the first base in
query_sequencethat is notsoft-clipped. (For unmapped reads and when CIGAR is unavailable, thiswill be zero.)
- query_length
the length of the query/read.
This value corresponds to the length of the sequence suppliedin the BAM/SAM file. The length of a query is 0 if there is nosequence in the BAM/SAM file. In those cases, the read lengthcan be inferred from the CIGAR alignment, see
pysam.AlignedSegment.infer_query_length().The length includes soft-clipped bases and is equal to
len(query_sequence).This property is read-only but is updated when a new querysequence is assigned to this AlignedSegment.
Returns 0 if not available.
- query_name
the query template name (None if not present)
- query_qualities
read sequence base qualities, includingsoft clipped bases(None if not present).
Quality scores are returned as a python array of unsignedchars. Note that this is not the ASCII-encoded value typicallyseen in FASTQ or SAM formatted files. Thus, no offset of 33needs to be subtracted.
Note that to set quality scores the sequence has to be setbeforehand as this will determine the expected length of thequality score array. Setting will raise a ValueError if thelength of the new quality scores is not the same as thelength of the existing sequence.
Quality scores to be set may be specified as a Python arrayor other iterable of ints, or as a string of ASCII-encoodedFASTQ/SAM-style base quality characters.
- query_qualities_str
read sequence base qualities, includingsoft clipped bases,returned as an ASCII-encoded string similar to that in FASTQ or SAM files,or None if base qualities are not present.
Note that to set quality scores the sequence has to be set beforehandas this will determine the expected length of the quality score string.Setting will raise a ValueError if the length of the new quality scoresis not the same as the length of the existing sequence.
- query_sequence
read sequence bases, includingsoft clipped bases(None if not present).
Assigning to this attribute will invalidate any quality scores.Thus, to in-place edit the sequence and quality scores, copies ofthe quality scores need to be taken. Consider trimming for example:
q=read.query_qualitiesread.query_sequence=read.query_sequence[5:10]read.query_qualities=q[5:10]
The sequence is returned as it is stored in the BAM file. (This willbe the reverse complement of the original read sequence if the mapperhas aligned the read to the reverse strand.)
- reference_end
aligned reference position of the read on the reference genome.
reference_end points to one past the last aligned residue.Returns None if not available (read is unmapped or no cigaralignment present).
- reference_id
reference ID
Note
This field contains the index of the reference sequence inthe sequence dictionary. To obtain the name of thereference sequence, use
get_reference_name()
- reference_length
aligned length of the read on the reference genome.
This is equal toreference_end - reference_start.Returns None if not available.
- reference_start
0-based leftmost coordinate
- rlen
deprecated, use
query_lengthinstead.
- rname
deprecated, use
reference_idinstead.
- rnext
deprecated, use
next_reference_idinstead.
- seq
deprecated, use
query_sequenceinstead.
- set_tag(self,tag,value,value_type=None,replace=True)
sets a particular fieldtag tovalue in the optional alignmentsection.
value_type describes the type ofvalue that is to enteredinto the alignment record. It can be set explicitly to one ofthe valid one-letter type codes. If unset, an appropriate typewill be chosen automatically based on the python type ofvalue.
An existing value of the sametag will be overwritten unlessreplace is set to False. This is usually not recommended as atag may only appear once in the optional alignment section.
Ifvalue isNone, the tag will be deleted.
This method accepts valid SAM specification value types, whichare:
A:printablechari:signedintf:floatZ:printablestringH:BytearrayinhexformatB:Integerornumericarray
Additionally, it will accept the integer BAM types (‘cCsSI’)
For htslib compatibility, ‘a’ is synonymous with ‘A’ and themethod accepts a ‘d’ type code for a double precision float.
When deducing the type code by the python type ofvalue, thefollowing mapping is applied:
i:pythonintf:pythonfloatZ:pythonstrorbytesB:pythonarray.array,listortuple
Note that a single character string will be output as ‘Z’ andnot ‘A’ as the former is the more general type.
- set_tags(self,tags)
sets the fields in the optional alignment section witha list of (tag, value) tuples.
The value type of the values is determined from thepython type. Optionally, a type may be given explicitly asa third value in the tuple, For example:
x.set_tags([(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]
This method will not enforce the rule that the same tag may appearonly once in the optional alignment section.
- tags
deprecated, use
get_tags()instead.
- template_length
the observed query template length
- tid
deprecated, use
reference_idinstead.
- tlen
deprecated, use
template_lengthinstead.
- to_dict(self)
returns a json representation of the aligned segment.
Field names are abbreviated versions of the class attributes.
- to_string(self)
returns a string representation of the aligned segment.
The output format is valid SAM format if a header is associatedwith the AlignedSegment.
- tostring(self,htsfile=None)
deprecated, use
to_string()instead.- Parameters:
htsfile – (deprecated) AlignmentFile object to map numericalidentifiers to chromosome names. This parameter is presentfor backwards compatibility and ignored.
- classpysam.PileupColumn
A pileup of reads at a particular reference sequence position(column). A pileup column contains all the reads that mapto a certain target base.
This class is a proxy for results returned by the samtools pileupengine. If the underlying engine iterator advances, the resultsof this column will change.
- get_mapping_qualities(self)
query mapping quality scores at pileup column position.
- Returns:
a list of quality scores
- Return type:
- get_num_aligned(self)
return number of aligned bases at pileup column position.
This method applies a base quality filter and the number isequal to the size of
get_query_sequences(),get_mapping_qualities(), etc.
- get_query_names(self)
query/read names aligned at pileup column position.
- Returns:
a list of query names at pileup column position.
- Return type:
- get_query_positions(self)
positions in read at pileup column position.
- Returns:
a list of read positions
- Return type:
- get_query_qualities(self)
query base quality scores at pileup column position.
- Returns:
a list of quality scores
- Return type:
- get_query_sequences(self,boolmark_matches=False,boolmark_ends=False,booladd_indels=False)
query bases/sequences at pileup column position.
Optionally, the bases/sequences can be annotated according to the samtoolsmpileup format. This is the format description from the samtools mpileup tool:
Information on match, mismatch, indel, strand, mappingquality and start and end of a read are all encoded at theread base column. At this column, a dot stands for a matchto the reference base on the forward strand, a comma for amatch on the reverse strand, a '>' or '<' for a referenceskip, `ACGTN' for a mismatch on the forward strand and`acgtn' for a mismatch on the reverse strand. A pattern`\+[0-9]+[ACGTNacgtn]+' indicates there is an insertionbetween this reference position and the next referenceposition. The length of the insertion is given by theinteger in the pattern, followed by the insertedsequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'represents a deletion from the reference. The deleted baseswill be presented as `*' in the following lines. Also atthe read base column, a symbol `^' marks the start of aread. The ASCII of the character following `^' minus 33gives the mapping quality. A symbol `$' marks the end of aread segment
To reproduce samtools mpileup format, set all of mark_matches,mark_ends and add_indels to True.
- Parameters:
mark_matches (bool) – If True, output bases matching the reference as “.” or “,”for forward and reverse strand, respectively. This markrequires the reference sequence. If no reference ispresent, this option is ignored.
mark_ends (bool) – If True, add markers “^” and “$” for read start and end, respectively.
add_indels (bool) – If True, add bases for bases inserted into or skipped from thereference. The latter requires a reference sequence file to havebeen given, e.g. viapileup(fastafile = …). If no referencesequence is available, skipped bases are represented as ‘N’s.
- Returns:
a list of bases/sequences per read at pileup column position.
- Return type:
- nsegments
number of reads mapping to this column.
Note that this number ignores the base quality filter.
- pileups
list of reads (
pysam.PileupRead) aligned to this column
- pos
deprecated, use
reference_posinstead.
- reference_id
the reference sequence number as defined in the header
- reference_pos
the position in the reference sequence (0-based).
- set_min_base_quality(self,min_base_quality)
set the minimum base quality for this pileup column.
- tid
deprecated, use
reference_idinstead.
- classpysam.PileupRead
Representation of a read aligned to a particular position in thereference sequence.
- alignment
a
pysam.AlignedSegmentobject of the aligned read
- indel
indel length for the position following the current pileup site.
This quantity peeks ahead to the next cigar operation in thisalignment. If the next operation is an insertion, indel willbe positive. If the next operation is a deletion, it will benegation. 0 if the next operation is not an indel.
- is_del
1 iff the base on the padded read is a deletion
- is_head
1 iff the base on the padded read is the left-most base.
- is_refskip
1 iff the base on the padded read is part of CIGAR N op.
- is_tail
1 iff the base on the padded read is the right-most base.
- level
the level of the read in the “viewer” mode. Note that this valueis currently not computed.
- query_position
position of the read base at the pileup site, 0-based.None if
is_deloris_refskipis set.
- query_position_or_next
position of the read base at the pileup site, 0-based.
If the current position is a deletion, returns the nextaligned base.
- classpysam.IndexedReads(AlignmentFilesamfile,intmultiple_iterators=True)
Index a Sam/BAM-file by query name while keeping theoriginal sort order intact.
The index is kept in memory and can be substantial.
By default, the file is re-opened to avoid conflicts if multipleoperators work on the same file. Setmultiple_iterators = Falseto not re-opensamfile.
- Parameters:
samfile (AlignmentFile) – File to be indexed.
multiple_iterators (bool) – Flag indicating whether the file should be reopened. Reopening preventsexisting iterators being affected by the indexing.
- build(self)
build the index.
Tabix files
TabixFile opens tabular files that have beenindexed withtabix.
- classpysam.TabixFile
Random access to bgzf formatted files thathave been indexed bytabix.
The file is automatically opened. The index file of file
<filename>is expected to be called<filename>.tbiby default (see parameterindex).- Parameters:
filename (string) – Filename of bgzf file to be opened.
index (string) – The filename of the index. If not set, the default is toassume that the index is called
filename.tbimode (char) – The file opening mode. Currently, only
ris permitted.parser (
pysam.Parser) – sets the default parser for this tabix file. Ifparseris None, the results are returned as an unparsed string.Otherwise,parser is assumed to be a functor that will returnparsed data (see for exampleasTupleandasGTF).encoding (string) – The encoding passed to the parser
threads (integer) – Number of threads to use for decompressing Tabix files.(Default=1)
- Raises:
ValueError – if index file is missing.
IOError – if file could not be opened
- close(self)
closes the
pysam.TabixFile.
- contigs
list of chromosome names
- fetch(self,reference=None,start=None,end=None,region=None,parser=None,multiple_iterators=False)
fetch one or more rows in aregion using 0-basedindexing. The region is specified byreference,start andend. Alternatively, a samtoolsregionstring can be supplied.
Withoutreference orregion all entries will be fetched.
If onlyreference is set, all reads matching onreferencewill be fetched.
Ifparser is None, the default parser will be used forparsing.
Setmultiple_iterators to true if you will be using multipleiterators on the same file at the same time. The iteratorreturned will receive its own copy of a filehandle to the fileeffectively re-opening the file. Re-opening a file createssome overhead, so beware.
- header
the file header.
The file header consists of the lines at the beginning of afile that are prefixed by the comment character
#.Note
The header is returned as an iterator presenting lineswithout the newline character.
To iterate over tabix files, usetabix_iterator():
- pysam.tabix_iterator(infile,parser)
return an iterator over all entries in a file.
Results are returned parsed as specified by theparser. Ifparser is None, the results are returned as an unparsed string.Otherwise,parser is assumed to be a functor that will returnparsed data (see for example
asTupleandasGTF).
- pysam.tabix_compress(filename_in,filename_out,force=False)
compressfilename_in writing the output tofilename_out.
Raise an IOError iffilename_out already exists, unlessforceis set.
- pysam.tabix_index(filename,force=False,seq_col=None,start_col=None,end_col=None,preset=None,meta_char='#',intline_skip=0,zerobased=False,intmin_shift=-1,index=None,keep_original=False,csi=False)
index tab-separatedfilename using tabix.
An existing index will not be overwritten unlessforce is set.
The index will be built from coordinates in columnsseq_col,start_col andend_col.
The contents offilename have to be sorted by contig andposition - the method does not check if the file is sorted.
Column indices are 0-based. Note that this is different from thetabix command line utility where column indices start at 1.
Coordinates in the file are assumed to be 1-based unlesszerobased is set.
Ifpreset is provided, the column coordinates are taken from apreset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”,psltbl”, “pileup”.
Lines beginning withmeta_char and the firstline_skip lineswill be skipped.
Iffilename is not detected as a gzip file it will be automaticallycompressed. The original file will be removed and only the compressedfile will be retained.
By default or whenmin_shift is 0, creates a TBI index. Ifmin_shiftis greater than zero and/orcsi is True, creates a CSI index with aminimal interval size of 1<<min_shift (1<<14 if onlycsi is set).
index controls the filename which should be used for creating the index.If not set, the default is to append
.tbitofilename.When automatically compressing files, ifkeep_original is set theuncompressed file will not be deleted.
returns the filename of the compressed data
- classpysam.asTuple
converts atabix row into a python tuple.
A field in a row is accessed by numeric index.
- classpysam.asVCF
converts atabix row into a VCF record withthe following fields:
Column
Field
Contents
1
contig
chromosome
2
pos
chromosomal position, zero-based
3
id
id
4
ref
reference allele
5
alt
alternate alleles
6
qual
quality
7
filter
filter
8
info
info
9
format
format specifier.
Access to genotypes is via index:
contig=vcf.contigfirst_sample_genotype=vcf[0]second_sample_genotype=vcf[1]
- classpysam.asBed
converts atabix row into a bed recordwith the following fields:
Column
Field
Contents
1
contig
contig
2
start
genomic start coordinate (zero-based)
3
end
genomic end coordinate plus one(zero-based)
4
name
name of feature.
5
score
score of feature
6
strand
strand of feature
7
thickStart
thickStart
8
thickEnd
thickEnd
9
itemRGB
itemRGB
10
blockCount
number of bocks
11
blockSizes
‘,’ separated string of block sizes
12
blockStarts
‘,’ separated string of block genomicstart positions
Only the first three fields are required. Additionalfields are optional, but if one is defined, all the precedingneed to be defined as well.
- classpysam.asGTF
converts atabix row into a GTF record with the followingfields:
Column
Name
Content
1
contig
the chromosome name
2
feature
The feature type
3
source
The feature source
4
start
genomic start coordinate(0-based)
5
end
genomic end coordinate(0-based)
6
score
feature score
7
strand
strand
8
frame
frame
9
attributes
the attribute field
GTF formatted entries also define the following fields thatare derived from the attributes field:
Name
Content
gene_id
the gene identifier
transcript_id
the transcript identifier
FASTA files
- classpysam.FastaFile
Random access to fasta formatted files thathave been indexed byfaidx.
The file is automatically opened. The index file of file
<filename>is expected to be called<filename>.fai.- Parameters:
filename (string) – Filename of fasta file to be opened.
filepath_index (string) – Optional, filename of the index. By default this isthe filename + “.fai”.
filepath_index_compressed (string) – Optional, filename of the index if fasta file is. By default this isthe filename + “.gzi”.
- Raises:
ValueError – if index file is missing
IOError – if file could not be opened
- close(self)
close the file.
- closed
bool indicating the current state of the file object.This is a read-only attribute; the close() method changes the value.
- fetch(self,reference=None,start=None,end=None,region=None)
fetch sequences in aregion.
A region caneither be specified byreference,start andend.start andend denote 0-based, half-openintervals.
Alternatively, a samtoolsregion string can besupplied.
If any of the coordinates are missing they will be replaced by theminimum (start) or maximum (end) coordinate.
Note that region strings are 1-based, whilestart andend denotean interval in python coordinates.The region is specified byreference,start andend.
- Returns:
string
- Return type:
a string with the sequence specified by the region.
- Raises:
IndexError – if the coordinates are out of range
ValueError – if the region is invalid
- filename
filename associated with this object. This is a read-only attribute.
- get_reference_length(self,reference)
return the length of reference.
- is_open(self)
return true if samfile has been opened.
FASTQ files
- classpysam.FastxFile
Stream access tofasta orfastq formatted files.
The file is automatically opened.
Entries in the file can be both fastq or fasta formatted or even amixture of the two.
This file object permits iterating over all entries in thefile. Random access is not implemented. The iteration returnsobjects of type
FastqProxy- Parameters:
filename (string) – Filename of fasta/fastq file to be opened.
persist (bool) – If True (default) make a copy of the entry in the file duringiteration. If set to False, no copy will be made. This willpermit much faster iteration, but an entry will not persistwhen the iteration continues and an entry is read-only.
Notes
Prior to version 0.8.2, this class was called FastqFile.
- Raises:
IOError – if file could not be opened
Examples
>>>withpysam.FastxFile(filename)asfh:...forentryinfh:...print(entry.name)...print(entry.sequence)...print(entry.comment)...print(entry.quality)>>>withpysam.FastxFile(filename)asfin,open(out_filename,mode='w')asfout:...forentryinfin:...fout.write(str(entry)+'\n')
- close(self)
close the file.
- closed
bool indicating the current state of the file object.This is a read-only attribute; the close() method changes the value.
- filename
string with the filename associated with this object.
- is_open(self)
return true if samfile has been opened.
- classpysam.FastqProxy
A single entry in a fastq file.
- get_quality_array(self,intoffset=33)→array
return quality values as integer array after subtracting offset.
- name
The name of each entry in the fastq file.
- quality
The quality score of each entry in the fastq file, represented as a string.
- sequence
The sequence of each entry in the fastq file.
VCF/BCF files
- classpysam.VariantFile(*args,**kwargs)
(filename, mode=None, index_filename=None, header=None, drop_samples=False,duplicate_filehandle=True, ignore_truncation=False, threads=1)
AVCF/BCF formatted file. The file is automaticallyopened.
If an index for a variant file exists (.csi or .tbi), it will beopened automatically. Without an index random access to recordsvia
fetch()is disabled.For writing, a
VariantHeaderobject must be provided,typically obtained from anotherVCF file/BCFfile.- Parameters:
mode (string) –
mode should be
rfor reading orwfor writing. The default istext mode (VCF). For binary (BCF) I/O you should appendbfor compressed orufor uncompressedBCF output.If
bis present, it must immediately followrorw. Validmodes arer,w,wh,rb,wb,wbuandwb0.For instance, to open aBCF formatted file for reading, type:f=pysam.VariantFile('ex1.bcf','r')
If mode is not specified, we will try to auto-detect the file type. Allof the following should work:
f1=pysam.VariantFile('ex1.bcf')f2=pysam.VariantFile('ex1.vcf')f3=pysam.VariantFile('ex1.vcf.gz')
index_filename (string) – Explicit path to an index file.
header (VariantHeader) –
VariantHeaderobject required for writing.drop_samples (bool) – Ignore sample information when reading.
duplicate_filehandle (bool) – By default, file handles passed either directly or throughFile-like objects will be duplicated before passing them tohtslib. The duplication prevents issues where the same streamwill be closed by htslib and through destruction of thehigh-level python object. Set to False to turn offduplication.
ignore_truncation (bool) – Issue a warning, instead of raising an error if the current fileappears to be truncated due to a missing EOF marker. Only appliesto bgzipped formats. (Default=False)
threads (integer) – Number of threads to use for compressing/decompressing VCF/BCF files.Setting threads to > 1 cannot be combined withignore_truncation.(Default=1)
- close(self)
closes the
pysam.VariantFile.
- copy(self)
- fetch(self,contig=None,start=None,stop=None,region=None,reopen=False,end=None,reference=None)
fetch records in aregion, specified either bycontig,start, andend (which are 0-based, half-open);or alternatively by a samtoolsregion string (which is1-based inclusive).
Withoutcontig orregion all mapped records will be fetched. Therecords will be returned ordered by contig, which will not necessarilybe the order within the file.
Setreopen to true if you will be using multiple iterators on thesame file at the same time. The iterator returned will receive itsown copy of a filehandle to the file effectively re-opening thefile. Re-opening a file incurrs some overhead, so use with care.
If onlycontig is set, all records oncontig will be fetched.If bothregion andcontig are given, an exception is raised.
Note that a bgzippedVCF.gz file without a tabix/CSI index(.tbi/.csi) or aBCF file without a CSI index can only beread sequentially.
- get_tid(self,reference)
return the numericaltid corresponding toreference
returns -1 if reference is not known.
- is_valid_tid(self,tid)
return True if the numericaltid is valid; False otherwise.
returns -1 if reference is not known.
- new_record(self,*args,**kwargs)
Create a new empty
VariantRecord.
- open(self,filename,mode='r',index_filename=None,VariantHeaderheader=None,drop_samples=False,duplicate_filehandle=True,ignore_truncation=False,threads=1)
open a vcf/bcf file.
If open is called on an existing VariantFile, the current file will beclosed and a new file will be opened.
- reset(self)
reset file position to beginning of file just after the header.
- subset_samples(self,include_samples)
Read only a subset of samples to reduce processing time and memory.Must be called prior to retrieving records.
- write(self,VariantRecordrecord)→int
write a single
pysam.VariantRecordto disk.returns the number of bytes written.
- classpysam.VariantHeader
header information for a
VariantFileobject- add_line(self,line)
Add a metadata line to this header
- add_meta(self,key,value=None,items=None)
Add metadata to this header
- add_record(self,VariantHeaderRecordrecord)
Add an existing
VariantHeaderRecordto this header
- add_sample(self,name)
Add a new sample to this header
- add_samples(self,*args)
Add several new samples to this header.This function takes multiple arguments, each of which maybe either a sample name or an iterable returning sample names(e.g., a list of sample names).
- alts
alt metadata (
dictID->record).The data returned just a snapshot of alt records, is createdevery time the property is requested, and modifications willnot be reflected in the header metadata and vice versa.
i.e. it is just a dict that reflects the state of alt recordsat the time it is created.
- contigs
contig information (
VariantHeaderContigs)
- copy(self)
- filters
filter metadata (
VariantHeaderMetadata)
- formats
format metadata (
VariantHeaderMetadata)
- info
info metadata (
VariantHeaderMetadata)
- merge(self,VariantHeaderheader)
- new_record(self,contig=None,start=0,stop=0,alleles=None,id=None,qual=None,filter=None,info=None,samples=None,**kwargs)
Create a new empty VariantRecord.
Arguments are currently experimental. Use with caution and expectchanges in upcoming releases.
- records
header records (
VariantHeaderRecords)
- samples
- version
VCF version
- classpysam.VariantHeaderRecord(*args,**kwargs)
header record from a
VariantHeaderobject- attrs
sequence of additional header attributes
- get(self,key,default=None)
D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
- items(self)
D.items() -> list of D’s (key, value) pairs, as 2-tuples
- iteritems(self)
D.iteritems() -> an iterator over the (key, value) items of D
- iterkeys(self)
D.iterkeys() -> an iterator over the keys of D
- itervalues(self)
D.itervalues() -> an iterator over the values of D
- key
header key (the part before ‘=’, in FILTER/INFO/FORMAT/contig/fileformat etc.)
- keys(self)
D.keys() -> list of D’s keys
- pop(self,key,default=_nothing)
- remove(self)
- type
FILTER, INFO, FORMAT, CONTIG, STRUCTURED, or GENERIC
- Type:
header type
- update(self,items=None,**kwargs)
D.update([E, ]**F) -> None.
Update D from dict/iterable E and F.
- value
header value. Set only for generic lines, None for FILTER/INFO, etc.
- values(self)
D.values() -> list of D’s values
- classpysam.VariantRecord(*args,**kwargs)
Variant record
- alleles
tuple of reference allele followed by alt alleles
- alts
tuple of alt alleles
- chrom
chromosome/contig name
- contig
chromosome/contig name
- copy(self)
return a copy of this VariantRecord object
- filter
filter information (see
VariantRecordFilter)
- format
sample format metadata (see
VariantRecordFormat)
- id
record identifier or None if not available
- info
info data (see
VariantRecordInfo)
- pos
record start position on chrom/contig (1-based inclusive)
- qual
phred scaled quality score or None if not available
- ref
reference allele
- rid
internal reference id number
- rlen
record length on chrom/contig (aka rec.stop - rec.start)
- samples
sample data (see
VariantRecordSamples)
- start
record start position on chrom/contig (0-based inclusive)
- stop
record stop position on chrom/contig (0-based exclusive)
- translate(self,VariantHeaderdst_header)
HTSFile
HTSFile is the base class forpysam.AlignmentFile andpysam.VariantFile.
- classpysam.HTSFile
Base class for HTS file types
- add_hts_options(self,format_options=None)
Given a list of key=value format option strings, add them to an open htsFile
- category
General file format category. One of UNKNOWN, ALIGNMENTS,VARIANTS, INDEX, REGIONS
- check_truncation(self,ignore_truncation=False)
Check if file is truncated.
- close(self)
- closed
return True if HTSFile is closed.
- compression
File compression.
One of NONE, GZIP, BGZF, CUSTOM.
- description
Vaguely human readable description of the file format
- flush(self)
Flush any buffered data to the underlying output stream.
- format
File format.
One of UNKNOWN, BINARY_FORMAT, TEXT_FORMAT, SAM, BAM,BAI, CRAM, CRAI, VCF, BCF, CSI, GZI, TBI, BED.
- get_tid(self,contig)
return the numericaltid corresponding tocontig
returns -1 if contig is not known.
- is_bam
return True if HTSFile is reading or writing a BAM alignment file
- is_bcf
return True if HTSFile is reading or writing a BCF variant file
- is_closed
return True if HTSFile is closed.
- is_cram
return True if HTSFile is reading or writing a BAM alignment file
- is_open
return True if HTSFile is open and in a valid state.
- is_read
return True if HTSFile is open for reading
- is_sam
return True if HTSFile is reading or writing a SAM alignment file
- is_valid_reference_name(self,contig)
return True if the contig namecontig is valid; False otherwise.
- is_valid_tid(self,tid)
return True if the numericaltid is valid; False otherwise.
returns -1 if contig is not known.
- is_vcf
return True if HTSFile is reading or writing a VCF variant file
- is_write
return True if HTSFile is open for writing
- parse_region(self,contig=None,start=None,stop=None,region=None,tid=None,reference=None,end=None)
parse alternative ways to specify a genomic region. A region caneither be specified bycontig,start andstop.start andstop denote 0-based, half-openintervals.reference andend are also accepted forbackward compatibility as synonyms forcontig andstop, respectively.
Alternatively, a samtoolsregion string can besupplied.
If any of the coordinates are missing they will be replaced bythe minimum (start) or maximum (stop) coordinate.
Note that region strings are 1-based inclusive, whilestartandstop denote an interval in 0-based, half-opencoordinates (like BED files and Python slices).
Ifcontig orregion or are
*, unmapped reads at the endof a BAM file will be returned. Setting either to.williterate from the beginning of the file.- Returns:
a tuple offlag,tid,start andstop. The flag indicates whether no coordinates weresupplied and the genomic region is the complete genomic space.
- Return type:
- Raises:
ValueError – for invalid or out of bounds regions.
- reset(self)
reset file position to beginning of file just after the header.
- Returns:
The file position after moving the file pointer.
- Return type:
pointer
- seek(self,uint64_toffset,intwhence=io.SEEK_SET)
move file pointer to positionoffset, see
pysam.HTSFile.tell().
- tell(self)
return current file position, see
pysam.HTSFile.seek().
- version
Tuple of file format version numbers (major, minor)