# Hadoop Sequence Parser (HSP)
Hadoop Sequence Parser (HSP) is a Java library for parsing DNA sequence reads from FASTQ/FASTA datasets stored in the Hadoop Distributed File System (HDFS).
HSP supports input datasets compressed with the Gzip (i.e., .gz extension) and BZip2 (i.e., .bz2 extension) codecs. However, when compressed data will later be processed by Hadoop or any other data processing engine (e.g., Spark), it is important to know whether the underlying compression format supports splitting, as many codecs need the whole input stream to decompress successfully. On the one hand, Gzip does not support splitting, so HSP will not split a gzipped input dataset. This works, but likely at the expense of lower performance. On the other hand, BZip2 compresses blocks of data that can later be decompressed independently of each other (i.e., it supports splitting). Therefore, BZip2 is the recommended codec to use with HSP for better performance.
- Make sure you have a Java Development Kit (JDK) version 1.6 or above
- Make sure you have a working Apache Maven distribution, version 3 or above
In order to download, compile, build and install the HSP library in your Maven local repository (by default ~/.m2), just execute the following commands:
```
git clone https://github.com/rreye/hsp.git
cd hsp
mvn install
```
In order to use the HSP library in your projects, add the following dependency section to your pom.xml:
```xml
<dependencies>
  ...
  <dependency>
    <groupId>es.udc.gac</groupId>
    <artifactId>hadoop-sequence-parser</artifactId>
    <version>1.2</version> <!-- or latest version -->
  </dependency>
  ...
</dependencies>
```
For single-end datasets, HSP generates <key,value> pairs of type `<LongWritable,Text>`. The key is the byte offset in the file for each read and the value is the text-based content of the read (e.g., read name, bases and qualities for FASTQ). The `Text` object representing the sequence can be converted to a `String` object using the static method `getRead()` provided by the `SingleEndSequenceRecordReader` class.
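To illustrate what the text-based value holds, the snippet below splits a hypothetical FASTQ read into its fields. The read itself is made up, and the standard four-line FASTQ layout (name, bases, separator, qualities) is assumed:

```java
public class FastqRecordDemo {
    public static void main(String[] args) {
        // Hypothetical single-end FASTQ read, as it might appear in the
        // value returned by getRead() (four-line FASTQ layout assumed)
        String read = "@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***";
        String[] fields = read.split("\n");
        System.out.println("name:      " + fields[0]); // read name
        System.out.println("bases:     " + fields[1]); // nucleotide sequence
        System.out.println("qualities: " + fields[3]); // per-base quality scores
    }
}
```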
For paired-end datasets, HSP generates <key,value> pairs of type `<LongWritable,PairText>`. The key is the byte offset in the file for each paired read and the value is a tuple containing the pair of `Text` objects that represent the paired sequence. HSP provides static methods in the `PairedEndSequenceRecordReader` class that allow obtaining the "left" and "right" reads separately as `String` objects: `getLeftRead()` and `getRightRead()`, respectively.
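As a minimal sketch of how the two reads of a pair might be consumed once extracted as `String` objects, the example below uses a plain two-field holder as a hypothetical stand-in for the `PairText` value (the reads themselves are made up):

```java
// Hypothetical stand-in for HSP's PairText tuple: holds the left and right
// reads of a pair as Strings (e.g., as returned by getLeftRead()/getRightRead())
final class ReadPair {
    final String left;
    final String right;
    ReadPair(String left, String right) { this.left = left; this.right = right; }
}

public class PairedReadDemo {
    public static void main(String[] args) {
        // Made-up paired FASTQ reads, for illustration only
        ReadPair pair = new ReadPair(
            "@SEQ_ID/1\nGATTTGGGG\n+\n!''*((((*",
            "@SEQ_ID/2\nCCCCAAATC\n+\n*((((*''!");
        // The first line of each read carries its name with the /1 or /2 suffix
        System.out.println(pair.left.split("\n")[0]);  // prints @SEQ_ID/1
        System.out.println(pair.right.split("\n")[0]); // prints @SEQ_ID/2
    }
}
```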
Setting up the input format of a Hadoop job for a single-end dataset in FASTQ format stored in "/path/to/file":
```java
Job job = Job.getInstance(new Configuration(), "Example");
Path inputFile = new Path("/path/to/file");

// Set input path and input format class for HSP
SingleEndSequenceInputFormat.addInputPath(job, inputFile);
job.setInputFormatClass(FastQInputFormat.class);
```
Setting up the input format of a Hadoop job for a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2":
```java
Job job = Job.getInstance(new Configuration(), "Example");
Path inputFile1 = new Path("/path/to/file1");
Path inputFile2 = new Path("/path/to/file2");

// Set input format class and input paths for HSP
job.setInputFormatClass(PairedEndSequenceInputFormat.class);
PairedEndSequenceInputFormat.setLeftInputPath(job, inputFile1, FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(job, inputFile2, FastQInputFormat.class);
```
Creating a Spark RDD from a single-end dataset in FASTQ format stored in "/path/to/file" using Java:
```java
SparkSession sparkSession = SparkSession.builder().config(new SparkConf()).getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Configuration config = jsc.hadoopConfiguration();

// Create RDD
JavaPairRDD<LongWritable, Text> readsRDD = jsc.newAPIHadoopFile("/path/to/file",
    FastQInputFormat.class, LongWritable.class, Text.class, config);
```
Creating a Spark RDD from a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2" using Java:
```java
SparkSession sparkSession = SparkSession.builder().config(new SparkConf()).getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Configuration config = jsc.hadoopConfiguration();

// Set left and right input paths for HSP
PairedEndSequenceInputFormat.setLeftInputPath(config, "/path/to/file1", FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(config, "/path/to/file2", FastQInputFormat.class);

// Create RDD
JavaPairRDD<LongWritable, PairText> readsRDD = jsc.newAPIHadoopFile("/path/to/file1",
    PairedEndSequenceInputFormat.class, LongWritable.class, PairText.class, config);
```
Creating a Flink DataSet from a single-end dataset in FASTQ format stored in "/path/to/file" using Java:
```java
ExecutionEnvironment flinkExecEnv = ExecutionEnvironment.getExecutionEnvironment();
Job hadoopJob = Job.getInstance();

// Set input path for HSP
SingleEndSequenceInputFormat.setInputPaths(hadoopJob, "/path/to/file");

// Create DataSet
DataSet<Tuple2<LongWritable, Text>> readsDS = flinkExecEnv.createInput(
    new HadoopInputFormat<LongWritable, Text>(new FastQInputFormat(),
        LongWritable.class, Text.class, hadoopJob));
```
Creating a Flink DataSet from a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2" using Java:
```java
ExecutionEnvironment flinkExecEnv = ExecutionEnvironment.getExecutionEnvironment();
Job hadoopJob = Job.getInstance();
Configuration config = hadoopJob.getConfiguration();

// Set left and right input paths for HSP
PairedEndSequenceInputFormat.setLeftInputPath(config, "/path/to/file1", FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(config, "/path/to/file2", FastQInputFormat.class);

// Create DataSet
DataSet<Tuple2<LongWritable, PairText>> readsDS = flinkExecEnv.createInput(
    new HadoopInputFormat<LongWritable, PairText>(new PairedEndSequenceInputFormat(),
        LongWritable.class, PairText.class, hadoopJob));
```
HSP is developed in the Computer Architecture Group at the Universidade da Coruña by:
- Roberto R. Expósito (http://gac.udc.es/~rober)
- Luis Lorenzo Mosquera (https://github.com/luislorenzom)
- Jorge González-Domínguez (http://gac.udc.es/~jgonzalezd)
This library is distributed as free software and is publicly available under the GPLv3 license (see the LICENSE file for more details).