- Notifications
You must be signed in to change notification settings - Fork96
Count bases in BAM/CRAM files
License
genome/bam-readcount
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
bam-readcount
is a utility that runs on aBAM
orCRAM
file and generates low-level information aboutsequencing data at specific nucleotide positions. Its outputs include observed bases,readcounts, summarized mapping and base qualities, strandedness information,mismatch counts, and position within the reads. (see "Output" section below)
Originally designed to help filter genomic mutation calls, the metricsbam-readcount
outputsare also useful as input for variant detection tools and for resolving ambiguity betweenvariant callers.
If you findbam-readcount
useful in your work, please cite ourpaper:
Khanna et al., (2022). Bam-readcount - rapid generation of basepair-resolution sequence metrics.Journal of Open Source Software, 7(69), 3722.https://doi.org/10.21105/joss.03722
The latest release version ofbam-readcount
is available as a Docker imageonDockerHub
docker pull mgibio/bam-readcount
For details see thedocker-bam-readcount
repository.
Requires a C++ toolchain andcmake
. For details seeBUILD.md.
git clone https://github.com/genome/bam-readcount cd bam-readcountmkdir buildcd buildcmake ..make# Executable isbin/bam-readcount
Run with no arguments for command-line help:
$ bam-readcountUsage: bam-readcount [OPTIONS] <bam_file> [region]Generate metrics for bam_file at single nucleotide positions.Example: bam-readcount -f ref.fa some.bamAvailable options: -h [ --help ] produce this message -v [ --version ] output the version number -q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used for counting. -b [ --min-base-quality ] arg (=0) minimum base quality at a position to use the read for counting. -d [ --max-count ] arg (=10000000) max depth to avoid excessive memory usage. -l [ --site-list ] arg file containing a list of regions to report readcounts within. -f [ --reference-fasta ] arg reference sequence in the fasta format. -D [ --print-individual-mapq ] arg report the mapping qualities as a comma separated list. -p [ --per-library ] report results by library. -w [ --max-warnings ] arg maximum number of warnings of each type to emit. -1 gives an unlimited number. -i [ --insertion-centric ] generate indel centric readcounts. Reads containing insertions will not be included in per-base counts
The optional[region]
should be in the same format assamtools
:
chromosome:start-stop
The optional-l
(--site-list
) file should be tab-separated, noheader, one region per line:
chromosomestartend
When using CRAM files as input, if a reference is specified with-f
, it will override whatever is inthe CRAM header. Otherwise, the reference(s) encoded in the CRAM header or a lookup byMD5 at ENA will be used.
Add bam-readcount counts to VCF
- VAtools allows you to add read-counts to VCF from modern variant callers.Additional detailsCreate csv file
- brc-parser parser to convert bam-readcount output to comma seperated long format file.
Output is tab-separated with no header toSTDOUT
, one line perposition:
chrpositionreference_basedepthbase:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end ...
There is one set of:
-separated fields for each reportedbase
withstatistics on the set of reads containing that base:
Field | Description |
---|---|
base | The base, egC |
count | Number of reads |
avg_mapping_quality | Mean mapping quality |
avg_basequality | Mean base quality |
avg_se_mapping_quality | Mean single ended mapping quality |
num_plus_strand | Number of reads on the plus/forward strand |
num_minus_strand | Number of reads on the minus/reverse strand |
avg_pos_as_fraction | Average position on the read as a fraction, calculated with respect to the length after clipping. This value is normalized to the center of the read: bases occurring strictly at the center of the read have a value of 1, those occurring strictly at the ends should approach a value of 0 |
avg_num_mismatches_as_fraction | Average number of mismatches on these reads per base |
avg_sum_mismatch_qualities | Average sum of the base qualities of mismatches in the reads |
num_q2_containing_reads | Number of reads with q2 runs at the 3’ end |
avg_distance_to_q2_start_in_q2_reads | Average distance of position (as fraction of unclipped read length) to the start of the q2 run |
avg_clipped_length | Average clipped read length |
avg_distance_to_effective_3p_end | Average distance to the 3’ prime end of the read (as fraction of unclipped read length) |
With the-p
option, each output line will have a set of{}
-delimitedresults, one for each library:
chrpositionreference_basedepthlibrary_1_name{base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end} ... library_N_name{base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end}
For those who learn best by example, abrief tutorial is available here that uses bam-readcount to identify the Omicron SARS-CoV-2 variant of concern from raw sequence data.
For support, pleasesearchbam-readcount
onBiostars as many of the most frequently askedquestions aboutbam-readcount
have been answered there. For problems not addressed there,please open an github issue or make a BioStar post.
We welcome contributions! SeeContributing for more details
About
Count bases in BAM/CRAM files