# Hadoop Sequence Parser (HSP)
Hadoop Sequence Parser (HSP) is a Java library for parsing DNA sequence reads from FASTQ/FASTA datasets stored in the Hadoop Distributed File System (HDFS).
HSP supports input datasets compressed with the Gzip (i.e., .gz extension) and BZip2 (i.e., .bz2 extension) codecs. However, when compressed data will later be processed by Hadoop or any other data processing engine (e.g., Spark), it is important to know whether the underlying compression format supports splitting, as many codecs need the whole input stream to decompress successfully. On the one hand, Gzip does not support splitting, so HSP will not split a gzipped input dataset. This works, but likely at the expense of lower performance. On the other hand, BZip2 compresses blocks of data that can later be decompressed independently of each other (i.e., it supports splitting). Therefore, BZip2 is the recommended codec to use with HSP for better performance.
- Make sure you have a Java Development Kit (JDK) version 1.6 or above
- Make sure you have a working Apache Maven distribution, version 3 or above
In order to download, compile, build and install the HSP library in your Maven local repository (by default ~/.m2), just execute the following commands:
```
git clone https://github.com/rreye/hsp.git
cd hsp
mvn install
```
In order to use the HSP library in your projects, add the following dependency section to your pom.xml:
```xml
<dependencies>
  ...
  <dependency>
    <groupId>es.udc.gac</groupId>
    <artifactId>hadoop-sequence-parser</artifactId>
    <version>1.2</version> <!-- or latest version -->
  </dependency>
  ...
</dependencies>
```
For single-end datasets, HSP generates <key,value> pairs of type `<LongWritable,Text>`. The key is the byte offset in the file for each read and the value is the text-based content of the read (e.g., read name, bases and qualities for FASTQ). The `Text` object representing the sequence can be converted to a `String` object using the static method `getRead()` provided by the `SingleEndSequenceRecordReader` class.
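To illustrate what the text-based value holds, the snippet below splits a hypothetical FASTQ read into its fields. The read itself is made up, and the standard four-line FASTQ layout (name, bases, separator, qualities) is assumed:

```java
public class FastqRecordDemo {
    public static void main(String[] args) {
        // Hypothetical single-end FASTQ read, as it might appear in the
        // value returned by getRead() (four-line FASTQ layout assumed)
        String read = "@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***";
        String[] fields = read.split("\n");
        System.out.println("name:      " + fields[0]); // read name
        System.out.println("bases:     " + fields[1]); // nucleotide sequence
        System.out.println("qualities: " + fields[3]); // per-base quality scores
    }
}
```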
For paired-end datasets, HSP generates <key,value> pairs of type `<LongWritable,PairText>`. The key is the byte offset in the file for each paired read and the value is a tuple containing the pair of `Text` objects that represent the paired sequence. HSP provides static methods in the `PairedEndSequenceRecordReader` class that allow obtaining the "left" and "right" reads separately as `String` objects: `getLeftRead()` and `getRightRead()`, respectively.
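As a minimal sketch of how the two reads of a pair might be consumed once extracted as `String` objects, the example below uses a plain two-field holder as a hypothetical stand-in for the `PairText` value (the reads themselves are made up):

```java
// Hypothetical stand-in for HSP's PairText tuple: holds the left and right
// reads of a pair as Strings (e.g., as returned by getLeftRead()/getRightRead())
final class ReadPair {
    final String left;
    final String right;
    ReadPair(String left, String right) { this.left = left; this.right = right; }
}

public class PairedReadDemo {
    public static void main(String[] args) {
        // Made-up paired FASTQ reads, for illustration only
        ReadPair pair = new ReadPair(
            "@SEQ_ID/1\nGATTTGGGG\n+\n!''*((((*",
            "@SEQ_ID/2\nCCCCAAATC\n+\n*((((*''!");
        // The first line of each read carries its name with the /1 or /2 suffix
        System.out.println(pair.left.split("\n")[0]);  // prints @SEQ_ID/1
        System.out.println(pair.right.split("\n")[0]); // prints @SEQ_ID/2
    }
}
```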
Setting up the input format of a Hadoop job for a single-end dataset in FASTQ format stored in "/path/to/file":
```java
Job job = Job.getInstance(new Configuration(), "Example");
Path inputFile = new Path("/path/to/file");

// Set input path and input format class for HSP
SingleEndSequenceInputFormat.addInputPath(job, inputFile);
job.setInputFormatClass(FastQInputFormat.class);
```
Setting up the input format of a Hadoop job for a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2":
```java
Job job = Job.getInstance(new Configuration(), "Example");
Path inputFile1 = new Path("/path/to/file1");
Path inputFile2 = new Path("/path/to/file2");

// Set input format class and input paths for HSP
job.setInputFormatClass(PairedEndSequenceInputFormat.class);
PairedEndSequenceInputFormat.setLeftInputPath(job, inputFile1, FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(job, inputFile2, FastQInputFormat.class);
```
Creating a Spark RDD from a single-end dataset in FASTQ format stored in "/path/to/file" using Java:
```java
SparkSession sparkSession = SparkSession.builder().config(new SparkConf()).getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Configuration config = jsc.hadoopConfiguration();

// Create RDD
JavaPairRDD<LongWritable, Text> readsRDD = jsc.newAPIHadoopFile("/path/to/file",
    FastQInputFormat.class, LongWritable.class, Text.class, config);
```
Creating a Spark RDD from a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2" using Java:
```java
SparkSession sparkSession = SparkSession.builder().config(new SparkConf()).getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Configuration config = jsc.hadoopConfiguration();

// Set left and right input paths for HSP
PairedEndSequenceInputFormat.setLeftInputPath(config, "/path/to/file1", FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(config, "/path/to/file2", FastQInputFormat.class);

// Create RDD
JavaPairRDD<LongWritable, PairText> readsRDD = jsc.newAPIHadoopFile("/path/to/file1",
    PairedEndSequenceInputFormat.class, LongWritable.class, PairText.class, config);
```
Creating a Flink DataSet from a single-end dataset in FASTQ format stored in "/path/to/file" using Java:
```java
ExecutionEnvironment flinkExecEnv = ExecutionEnvironment.getExecutionEnvironment();
Job hadoopJob = Job.getInstance();

// Set input path for HSP
SingleEndSequenceInputFormat.setInputPaths(hadoopJob, "/path/to/file");

// Create DataSet
DataSet<Tuple2<LongWritable, Text>> readsDS = flinkExecEnv.createInput(
    new HadoopInputFormat<LongWritable, Text>(new FastQInputFormat(),
        LongWritable.class, Text.class, hadoopJob));
```
Creating a Flink DataSet from a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2" using Java:
```java
ExecutionEnvironment flinkExecEnv = ExecutionEnvironment.getExecutionEnvironment();
Job hadoopJob = Job.getInstance();
Configuration config = hadoopJob.getConfiguration();

// Set left and right input paths for HSP
PairedEndSequenceInputFormat.setLeftInputPath(config, "/path/to/file1", FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(config, "/path/to/file2", FastQInputFormat.class);

// Create DataSet
DataSet<Tuple2<LongWritable, PairText>> readsDS = flinkExecEnv.createInput(
    new HadoopInputFormat<LongWritable, PairText>(new PairedEndSequenceInputFormat(),
        LongWritable.class, PairText.class, hadoopJob));
```
HSP is developed in the Computer Architecture Group at the Universidade da Coruña by:
- Roberto R. Expósito (http://gac.udc.es/~rober)
- Luis Lorenzo Mosquera (https://github.com/luislorenzom)
- Jorge González-Domínguez (http://gac.udc.es/~jgonzalezd)
This library is distributed as free software and is publicly available under the GPLv3 license (see the LICENSE file for more details).