The D4 Quantitative Data Format

Synopsis

The Dense Depth Data Dump (D4) format and tool suite provide an alternative to BigWig for fast analysis and compact storage of quantitative genomics datasets (e.g., RNA-seq, ChIP-seq, WGS depths, etc.). It supports random access, multiple tracks (e.g., RNA-seq and ChIP-seq from the same sample), HTTP range requests, and statistics on arbitrary genome intervals. The D4tools software is built on a Rust crate. We provide both a C API and a Python API, with a Jupyter notebook providing examples of how to read, query, and create single-track and multi-track D4 files.

Usage examples are provided below. Also, check out the slide deck that describes the motivation, performance, and toolkits for D4.

Motivation

Modern DNA sequencing is used as a readout for diverse assays, with the count of aligned sequences, or "read depth", serving as the quantitative signal for many underlying cellular phenomena. Despite wide use and thousands of datasets, existing formats used for the storage and analysis of read depths are limited with respect to both size and speed. For example, it is faster to recalculate sequencing depth from an alignment file than it is to analyze the text output from that calculation. We sought to improve on existing formats such as BigWig and compressed BED files by creating the Dense Depth Data Dump (D4) format and tool suite. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that minimizes file size, while also enabling fast data access. We show that D4 uses less disk space for both RNA-Seq and whole-genome sequencing and offers 3 to 440 fold speed improvements over existing formats for random access, aggregation and summarization for scalable downstream analyses that would be otherwise intractable.

Manuscript

To learn more, please read the publication: https://www.nature.com/articles/s43588-021-00085-0.

Note: We ran the experiments described in the manuscript on a server with the following hardware and software:

  • Processor: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  • RAM: 376GB
  • OS: CentOS 7.6.180 w/ Linux Kernel 3.0.10
  • Rust Version: 1.47.0-nightly

Basic Usage by Examples (each should take seconds)

Create a D4 file

The d4tools create subcommand converts BAM, CRAM, BigWig, and BedGraph files to D4 files.

USAGE:
    create [FLAGS] [OPTIONS] <input-file> [output-file]

FLAGS:
    -z, --deflate      Enable the deflate compression
    -A, --dict-auto    Automatically determine the dictionary type by random sampling
        --dump-dict    Do not profile the BAM file, only dump the dictionary
    -h, --help         Prints help information
    -S, --sparse       Sparse mode, this is same as '-zR0-1', which enable secondary table compression and disable
                       primary table
    -V, --version      Prints version information

OPTIONS:
        --deflate-level <level>          Configure the deflate algorithm, default 5
    -d, --dict-file <dict_spec_file>     Provide a file that defines the values of the dictionary
    -R, --dict-range <dict_spec>         Dictionary specification, use "a-b" to specify the dictionary is encoding
                                         values from A to B(exclusively)
    -f, --filter <regex>                 A regex that matches the genome name should present in the output file
    -g, --genome <genome_file>           The genome description file (Used by BED inputs)
    -q, --mapping-qual <mapping-qual>    The minimal mapping quality (Only valid with CRAM/BAM inputs)
    -r, --ref <fai_file_path>            Reference genome file (Used by CRAM inputs)
    -t, --threads <num_of_threads>       Specify the number of threads D4 can use for encoding

ARGS:
    <input-file>     Path to the input file
    <output-file>    Path to the output file

  • From CRAM/BAM file
  d4tools create -Azr hg19.fa.gz.fai hg002.cram hg002.d4
  • From BigWig file
  d4tools create -z input.bw output.d4
  • From a BedGraph file (extension must be ".bedgraph")
  d4tools create -z -g hg19.genome input.bedgraph output.d4
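
The help text above says that -R takes a range "a-b" with an exclusive upper bound, and that sparse mode (-S) is the same as -zR0-1, which disables the primary table. The toy model below is one way to read those two statements; it is a sketch, not the D4 on-disk layout. The function names, the bit-width rule, and the run representation are all assumptions made for illustration.

```python
def parse_dict_range(spec):
    """Parse an "a-b" range as described for -R: values a..b-1 are in the dictionary."""
    lo_text, hi_text = spec.split("-", 1)
    lo, hi = int(lo_text), int(hi_text)
    if hi <= lo:
        raise ValueError(f"empty dictionary range: {spec!r}")
    return lo, hi


def bits_per_position(lo, hi):
    """Bits needed to tell apart (hi - lo) distinct values (assumed model)."""
    return (hi - lo - 1).bit_length()


def out_of_range_runs(depths, lo, hi):
    """Collect runs (start, end, depth) of values that fall outside [lo, hi).

    In this toy model these runs are what a sparse side table would hold.
    """
    runs = []
    for pos, depth in enumerate(depths):
        if lo <= depth < hi:
            continue
        if runs and runs[-1][1] == pos and runs[-1][2] == depth:
            runs[-1] = (runs[-1][0], pos + 1, depth)  # extend the current run
        else:
            runs.append((pos, pos + 1, depth))
    return runs


# Hypothetical per-base depths, not taken from any real file.
depths = [0, 0, 0, 3, 3, 70, 0]

for spec in ("0-64", "0-1"):
    lo, hi = parse_dict_range(spec)
    print(spec, bits_per_position(lo, hi), out_of_range_runs(depths, lo, hi))
```

Under this model a range of 0-1 holds a single value, so it needs zero bits per position and every non-zero depth lands in the side table, which is consistent with the description of sparse mode above.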

View a D4 File

USAGE:
    view [FLAGS] <input-file> [chr:start-end]...

FLAGS:
    -h, --help           Prints help information
    -g, --show-genome    Show the genome file instead of the file content
    -V, --version        Prints version information

ARGS:
    <input-file>          Path to the input file
    <chr:start-end>...    Regions to be viewed

  • Convert a d4 file to a bedgraph file
$ d4tools view hg002.d4 | head -n 10
chr1    0       9998    0
chr1    9998    9999    6
chr1    9999    10000   9
chr1    10000   10001   37
chr1    10001   10002   59
chr1    10002   10003   78
chr1    10003   10004   100
chr1    10004   10005   116
chr1    10005   10006   130
chr1    10006   10007   135
  • Print given regions
$ d4tools view hg002.d4 1:1234560-1234580 X:1234560-1234580
1       1234559 1234562 28
1       1234562 1234565 29
1       1234565 1234566 30
1       1234566 1234572 31
1       1234572 1234573 29
1       1234573 1234576 28
1       1234576 1234578 27
1       1234578 1234579 26
X       1234559 1234562 26
X       1234562 1234563 25
X       1234563 1234565 26
X       1234565 1234574 25
X       1234574 1234575 26
X       1234575 1234576 25
X       1234576 1234578 26
X       1234578 1234579 25
  • Print the genome layout
$ d4tools view -g hg002.d4 | head -n 10
1       249250621
2       243199373
3       198022430
4       191154276
5       180915260
6       171115067
7       159138663
8       146364022
9       141213431
10      135534747
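
The view output is a four-column, bedGraph-style listing: chromosome, start, end, and depth, with one line per run of equal depth. As a minimal sketch of how such output could be consumed downstream, the Python below parses the ten lines shown in the first example and computes a length-weighted mean depth. It assumes the intervals are half-open (end minus start bases each), and the helper names are illustrative rather than part of any D4 API.

```python
def parse_bedgraph(text):
    """Yield (chrom, start, end, depth) tuples from bedGraph-style text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or line.startswith("#"):
            continue  # skip blank lines, headers, and malformed rows
        chrom, start, end, depth = fields
        yield chrom, int(start), int(end), float(depth)


def mean_depth(records):
    """Length-weighted mean depth over all intervals."""
    total_bases = 0
    weighted_sum = 0.0
    for _chrom, start, end, depth in records:
        span = end - start
        total_bases += span
        weighted_sum += depth * span
    return weighted_sum / total_bases if total_bases else 0.0


# The first ten lines of `d4tools view hg002.d4` as shown above.
sample = """\
chr1    0       9998    0
chr1    9998    9999    6
chr1    9999    10000   9
chr1    10000   10001   37
chr1    10001   10002   59
chr1    10002   10003   78
chr1    10003   10004   100
chr1    10004   10005   116
chr1    10005   10006   130
chr1    10006   10007   135
"""

records = list(parse_bedgraph(sample))
print(len(records), "intervals, mean depth", mean_depth(records))
```

In practice the text would come from piping d4tools view into the script instead of a hard-coded string.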

Run stat on a D4 file

USAGE:
    stat [OPTIONS] <input_d4_file>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -r, --region <bed_file_path>      A bed file that describes the region we want to run the stat
    -s, --stat <stat_type>            The type of statistics we want to perform, by default average. You can specify
                                      statistic methods: perc_cov, mean, median, hist, percentile=X% (If this is not speficied
                                      d4tools will use mean by default)
    -t, --threads <num_of_threads>    Number of threads

ARGS:
    <input_d4_file>

  • Mean coverage for each chromosome
$ d4tools stat hg002.d4
chr1    0       249250621       27.075065016588262
chr10   0       135534747       31.59483947684648
chr11   0       135006516       25.970025943044114
chr11_gl000202_random   0       40103   14.47213425429519
chr12   0       133851895       25.80992053194316
chr13   0       115169878       24.18613685602758
chr14   0       107349540       24.25194093053403
chr15   0       102531392       23.04176524785697
chr16   0       90354753        28.106620932271266
chr17   0       81195210        25.58382477242192
...
  • Median coverage for each chromosome
$ d4tools stat -s median hg002.d4 | head -n 10
1       0       249250621       25
10      0       135534747       26
11      0       135006516       26
12      0       133851895       26
13      0       115169878       26
14      0       107349540       25
15      0       102531392       24
16      0       90354753        24
17      0       81195210        25
18      0       78077248        26
  • 95th percentile of depth for each region defined in a BED file
$ d4tools stat -s percentile=95 -r region.bed hg002.d4
1       2000000 3000000 33
2       0       150000000       38
  • Percent of bases at or above coverage levels (perc_cov)
$ d4tools stat -H -s perc_cov=1,2 -r data/input_10nt.multiple_ranges.bed data/input_10nt.d4
#Chr    Start   End     1x      2x
chr     0       2       0.000   0.000
chr     0       8       0.625   0.375
chr     0       10      0.600   0.300
chr     1       6       0.600   0.400
chr     3       9       1.000   0.500
chr     4       5       1.000   1.000
chr     5       10      0.800   0.400
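
To make the perc_cov columns concrete, here is a small worked example in Python. For each region it reports the fraction of positions whose depth is at or above each threshold. The per-base depth vector is hypothetical: the contents of data/input_10nt.d4 are not shown here, so the values were chosen only because they reproduce the table above when regions are treated as half-open [start, end) intervals.

```python
def perc_cov(depths, start, end, thresholds):
    """Fraction of positions in [start, end) with depth >= each threshold."""
    window = depths[start:end]
    if not window:
        return [0.0 for _ in thresholds]
    return [sum(d >= t for d in window) / len(window) for t in thresholds]


# Hypothetical depths for a 10-nt chromosome (an assumption, see lead-in).
depths = [0, 0, 0, 1, 2, 2, 2, 1, 1, 0]

# The regions listed in the perc_cov output above.
regions = [(0, 2), (0, 8), (0, 10), (1, 6), (3, 9), (4, 5), (5, 10)]

rows = []
for start, end in regions:
    one_x, two_x = perc_cov(depths, start, end, [1, 2])
    rows.append(f"chr\t{start}\t{end}\t{one_x:.3f}\t{two_x:.3f}")

print("#Chr\tStart\tEnd\t1x\t2x")
print("\n".join(rows))
```

For the region 0-8, five of the eight positions have depth of at least 1 and three have depth of at least 2, giving 0.625 and 0.375.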

Reading D4 File Served by static HTTP Server

D4 now supports viewing and running statistics on D4 files served by an HTTP server, without downloading the file locally. To print the file content, simply use the following command:

$ d4tools show https://d4-format-testing.s3.us-west-1.amazonaws.com/hg002.d4 | head -n 10
1       0       9998    0
1       9998    9999    6
1       9999    10000   10
1       10000   10001   38
1       10001   10002   55
1       10002   10003   72
1       10003   10004   93
1       10004   10005   110
1       10005   10006   126
1       10006   10007   131

To run statistics on a D4 file over the network, the D4 file must contain a data index, which avoids reading the whole file.

  • (On the server side) Prepare the D4 file that needs to be accessed over the web
d4tools index build --sum hg002.d4
  • (On the client side) Run mean depth statistics on this file
$ d4tools stat https://d4-format-testing.s3.us-west-1.amazonaws.com/hg002.d4
1       0       249250621       23.848327146193952
2       0       243199373       25.02162749408075
3       0       198022430       23.086504175309837
4       0       191154276       23.18471121200553
5       0       180915260       23.2536419094774
6       0       171115067       24.515156108374722
7       0       159138663       24.398102314080646
8       0       146364022       26.425789139628865
9       0       141213431       19.780247114029827
10      0       135534747       25.475887087464
....
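
Remote access of this kind rests on HTTP range requests: the client asks a static server for a byte span instead of the whole file. The sketch below shows that general mechanism using only the Python standard library. It is not how d4tools is implemented, and the helper names are illustrative. Nothing is fetched when the block runs; it only builds a request and prints its Range header.

```python
import urllib.request


def range_header(offset, length):
    """Build a Range header value covering `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url, offset, length):
    """Create a GET request that asks only for the given byte span."""
    return urllib.request.Request(url, headers={"Range": range_header(offset, length)})


def fetch_range(url, offset, length, timeout=30):
    """Fetch a byte span; requires a server that honours Range requests."""
    request = build_range_request(url, offset, length)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 206:  # 206 Partial Content
            raise RuntimeError(f"server ignored the Range header (status {response.status})")
        return response.read()


# Build (but do not send) a request for the first 16 bytes of the test file.
url = "https://d4-format-testing.s3.us-west-1.amazonaws.com/hg002.d4"
req = build_range_request(url, 0, 16)
print(req.get_header("Range"))  # prints "bytes=0-15"
```

Calling fetch_range(url, 0, 16) would then retrieve just those bytes, provided the server supports partial content. An index stored in the file tells a client which spans it needs, which is why the index is required for remote statistics.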

Build

Prerequisites

To build d4, the Rust toolchain is required. To install it, run the following command and follow the prompts to complete the Rust installation.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

gcc or clang is required to build the htslib embedded in the d4 library. For details, please check the htslib repository.

Build Steps

Normally, the build is straightforward. Just run:

# For Debug Build
cargo build
# For Release Build
cargo build --release

This produces the d4tools binary, which you can find at either target/debug/d4tools or target/release/d4tools, depending on which build mode you choose.

Troubleshooting

  • Compile error: asking for the -fPIC or -fPIE option

In some environments, the Rust toolchain asks for -fPIC or -fPIE when building the d4tools binary. In this case, you should be able to use the following workaround:

# To build a debug build:
cd d4tools && cargo rustc --bin d4tools -- -C relocation-model=static
# To build a release build:
cd d4tools && cargo rustc --bin d4tools --release -- -C relocation-model=static

Installation (< 2 minutes)

  • Install from bioconda

Assuming you have a bioconda environment installed and configured, you can simply install d4tools and d4binding from the bioconda repository.

conda install d4tools
  • Install from crates.io: Assuming you have the Rust compiler toolchain, you can install it from crates.io as well.
cargo install d4tools
  • Install from source code: You can install the d4tools binary from a checkout of the source code by running
cargo install --path .

Using D4 in C/C++

D4 provides a C binding that allows the D4 library to be used from C and C++. Here are the steps to build and use the D4 binding.

  1. Install or build the binding library
  • The easiest way to install the d4binding library is through bioconda.
conda install d4binding

The header file is then installed under <conda-dir>/include, and libd4binding.so or libd4binding.dylib is installed under <conda-dir>/lib.

  • Alternatively, you can build it from the source code:
# Build the D4 binding library; for a debug build, remove the "--release" argument
cargo build --package=d4binding --release

After running this command, you should be able to find the library at "target/release/libd4binding.so".

  2. Use D4 in C

Here's a small example that prints the name and size of every chromosome defined in a D4 file.

#include <stdio.h>
#include <d4.h>

int main(int argc, char** argv) {
    /* Open the D4 file for reading */
    d4_file_t* fp = d4_open("input.d4", "r");
    if (fp == NULL) {
        fprintf(stderr, "failed to open input.d4\n");
        return 1;
    }

    /* Load the chromosome names and sizes */
    d4_file_metadata_t mt = {0};
    d4_file_load_metadata(fp, &mt);

    int i;
    for (i = 0; i < mt.chrom_count; i++)
        printf("# %s %d\n", mt.chrom_name[i], mt.chrom_size[i]);

    d4_close(fp);
    return 0;
}

  3. Compile the code against the D4 binding library
gcc print-chrom-info.c -o print-chrom-info -I d4binding/include -L target/release -ld4binding

For more examples, see d4binding/examples/

Sample Data

