Hadoop Sequence Parser (HSP) library


Hadoop Sequence Parser (HSP) is a Java library for parsing DNA sequence reads from FASTQ/FASTA datasets stored in the Hadoop Distributed File System (HDFS).

HSP supports input datasets compressed with the Gzip (.gz extension) and BZip2 (.bz2 extension) codecs. However, when compressed data will later be processed by Hadoop or any other data processing engine (e.g., Spark), it is important to know whether the underlying compression format supports splitting, since many codecs need the whole input stream to decompress successfully. On the one hand, Gzip does not support splitting, so HSP will not split a gzipped input dataset. This works, but probably at the expense of lower performance. On the other hand, BZip2 compresses blocks of data that can later be decompressed independently of one another (i.e., it supports splitting). Therefore, BZip2 is the recommended codec for better performance with HSP.
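The splittability distinction can be checked programmatically with Hadoop's codec factory. A minimal sketch (the class name and file names are ours; it assumes hadoop-common is on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecSplitCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        for (String file : new String[] {"reads.fastq.gz", "reads.fastq.bz2"}) {
            // The factory resolves the codec from the file extension
            CompressionCodec codec = factory.getCodec(new Path(file));
            // Only codecs implementing SplittableCompressionCodec can be split by Hadoop
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(file + " -> splittable: " + splittable);
        }
    }
}
```

Gzip's codec does not implement `SplittableCompressionCodec`, while BZip2's does, which is why HSP can only split BZip2-compressed inputs.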

Getting Started

Prerequisites

  • Make sure you have a Java Development Kit (JDK) version 1.6 or above

  • Make sure you have a working Apache Maven distribution version 3 or above

Installation

In order to download, compile, build and install the HSP library in your Maven local repository (by default ~/.m2), just execute the following commands:

```
git clone https://github.com/rreye/hsp.git
cd hsp
mvn install
```

Usage

In order to use the HSP library in your projects, add the following dependency section to your pom.xml:

```xml
<dependencies>
  ...
  <dependency>
    <groupId>es.udc.gac</groupId>
    <artifactId>hadoop-sequence-parser</artifactId>
    <version>1.2</version> <!-- or latest version -->
  </dependency>
  ...
</dependencies>
```

For single-end datasets, HSP generates <key,value> pairs of type <LongWritable,Text>. The key is the byte offset in the file for each read and the value is the text-based content of the read (e.g., read name, bases and qualities for FASTQ). The Text object representing the sequence can be converted to a String object using the static method getRead() provided by the SingleEndSequenceRecordReader class.
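For FASTQ, the String form of a read carries the standard four-line record. A hypothetical, self-contained illustration of that layout (the record content and field handling below are ours, not part of the HSP API):

```java
public class FastqReadDemo {
    public static void main(String[] args) {
        // Hypothetical text content of one single-end FASTQ read
        String read = "@SRR000001.1 len=4\nACGT\n+\nIIII";
        String[] lines = read.split("\n");
        System.out.println("name: " + lines[0].substring(1)); // drop leading '@'
        System.out.println("bases: " + lines[1]);
        System.out.println("qualities: " + lines[3]);
        // prints:
        // name: SRR000001.1 len=4
        // bases: ACGT
        // qualities: IIII
    }
}
```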

For paired-end datasets, HSP generates <key,value> pairs of type <LongWritable,PairText>. The key is the byte offset in the file for each paired read and the value is a tuple containing the pair of Text objects that represent the paired sequence. HSP provides static methods in the PairedEndSequenceRecordReader class that allow obtaining the "left" and "right" reads separately as String objects: getLeftRead() and getRightRead(), respectively.
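As an illustration, a Hadoop mapper could unpack both mates of each pair like this. This is a sketch, not part of HSP: the mapper class is ours, the one-argument getLeftRead(PairText)/getRightRead(PairText) signatures are assumed from the description above, and the imports for PairText and PairedEndSequenceRecordReader are omitted because their package layout is not shown in this README.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that emits the combined length of each read pair,
// keyed by the byte offset HSP supplies as the record key.
public class PairLengthMapper
        extends Mapper<LongWritable, PairText, LongWritable, IntWritable> {

    private final IntWritable length = new IntWritable();

    @Override
    protected void map(LongWritable offset, PairText pair, Context context)
            throws IOException, InterruptedException {
        // Signatures assumed: getLeftRead(PairText) / getRightRead(PairText)
        String left = PairedEndSequenceRecordReader.getLeftRead(pair);
        String right = PairedEndSequenceRecordReader.getRightRead(pair);
        length.set(left.length() + right.length());
        context.write(offset, length);
    }
}
```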

Hadoop examples

Setting up the input format of a Hadoop job for a single-end dataset in FASTQ format stored in "/path/to/file":

```java
Job job = Job.getInstance(new Configuration(), "Example");
Path inputFile = new Path("/path/to/file");

// Set input path and input format class for HSP
SingleEndSequenceInputFormat.addInputPath(job, inputFile);
job.setInputFormatClass(FastQInputFormat.class);
```

Setting up the input format of a Hadoop job for a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2":

```java
Job job = Job.getInstance(new Configuration(), "Example");
Path inputFile1 = new Path("/path/to/file1");
Path inputFile2 = new Path("/path/to/file2");

// Set input format class and input paths for HSP
job.setInputFormatClass(PairedEndSequenceInputFormat.class);
PairedEndSequenceInputFormat.setLeftInputPath(job, inputFile1, FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(job, inputFile2, FastQInputFormat.class);
```

Spark examples

Creating a Spark RDD from a single-end dataset in FASTQ format stored in "/path/to/file" using Java:

```java
SparkSession sparkSession = SparkSession.builder().config(new SparkConf()).getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Configuration config = jsc.hadoopConfiguration();

// Create RDD
JavaPairRDD<LongWritable, Text> readsRDD = jsc.newAPIHadoopFile("/path/to/file",
    FastQInputFormat.class, LongWritable.class, Text.class, config);
```

Creating a Spark RDD from a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2" using Java:

```java
SparkSession sparkSession = SparkSession.builder().config(new SparkConf()).getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Configuration config = jsc.hadoopConfiguration();

// Set left and right input paths for HSP
PairedEndSequenceInputFormat.setLeftInputPath(config, "/path/to/file1", FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(config, "/path/to/file2", FastQInputFormat.class);

// Create RDD
JavaPairRDD<LongWritable, PairText> readsRDD = jsc.newAPIHadoopFile("/path/to/file1",
    PairedEndSequenceInputFormat.class, LongWritable.class, PairText.class, config);
```

Flink examples

Creating a Flink DataSet from a single-end dataset in FASTQ format stored in "/path/to/file" using Java:

```java
ExecutionEnvironment flinkExecEnv = ExecutionEnvironment.getExecutionEnvironment();
Job hadoopJob = Job.getInstance();

// Set input path for HSP
SingleEndSequenceInputFormat.setInputPaths(hadoopJob, "/path/to/file");

// Create DataSet
DataSet<Tuple2<LongWritable, Text>> readsDS = flinkExecEnv.createInput(
    new HadoopInputFormat<LongWritable, Text>(new FastQInputFormat(),
        LongWritable.class, Text.class, hadoopJob));
```

Creating a Flink DataSet from a paired-end dataset in FASTQ format stored in "/path/to/file1" and "/path/to/file2" using Java:

```java
ExecutionEnvironment flinkExecEnv = ExecutionEnvironment.getExecutionEnvironment();
Job hadoopJob = Job.getInstance();
Configuration config = hadoopJob.getConfiguration();

// Set left and right input paths for HSP
PairedEndSequenceInputFormat.setLeftInputPath(config, "/path/to/file1", FastQInputFormat.class);
PairedEndSequenceInputFormat.setRightInputPath(config, "/path/to/file2", FastQInputFormat.class);

// Create DataSet
DataSet<Tuple2<LongWritable, PairText>> readsDS = flinkExecEnv.createInput(
    new HadoopInputFormat<LongWritable, PairText>(new PairedEndSequenceInputFormat(),
        LongWritable.class, PairText.class, hadoopJob));
```

Projects using HSP

Authors

HSP is developed in the Computer Architecture Group at the Universidade da Coruña by:

License

This library is distributed as free software and is publicly available under the GPLv3 license (see the LICENSE file for more details).
