rdblue/parquet-mr

Forked from apache/parquet-java

Mirror of Apache Parquet

Parquet MR

Parquet-MR contains the Java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data. Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures.

You can find some details about the format and intended use cases in our Hadoop Summit 2013 presentation.

Building

Parquet-MR uses Maven to build and depends on both the thrift and protoc compilers.

Install Protobuf

To build and install the protobuf compiler, run:

    wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
    tar xzf protobuf-2.5.0.tar.gz
    cd protobuf-2.5.0
    ./configure
    make
    sudo make install
    sudo ldconfig

Install Thrift

To build and install the thrift compiler, run:

    wget -nv http://archive.apache.org/dist/thrift/0.7.0/thrift-0.7.0.tar.gz
    tar xzf thrift-0.7.0.tar.gz
    cd thrift-0.7.0
    chmod +x ./configure
    ./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang
    sudo make install

Build Parquet with Maven

Once protobuf and thrift are available in your path, you can build the project by running:

LC_ALL=C mvn clean install

Features

Parquet is a very active project, and new features are being added quickly; below is the state as of June 2013.

Feature	In trunk	Planned	Expected release
Type-specific encoding	YES		1.0
Hive integration	YES (28)		1.0
Pig integration	YES		1.0
Cascading integration	YES		1.0
Crunch integration	YES (CRUNCH-277)		1.0
Impala integration	YES (non-nested)		1.0
Java Map/Reduce API	YES		1.0
Native Avro support	YES		1.0
Native Thrift support	YES		1.0
Complex structure support	YES		1.0
Future-proofed versioning	YES		1.0
RLE	YES		1.0
Bit Packing	YES		1.0
Adaptive dictionary encoding	YES		1.0
Predicate pushdown	YES (68)		1.0
Column stats	YES		2.0
Delta encoding	YES		2.0
Native Protocol Buffers support	YES		1.0
Index pages		YES	2.0
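
To give a feel for what the RLE row in the table refers to, here is a minimal sketch of plain run-length encoding. It is an illustration only: the class `RunLengthSketch` and its methods are hypothetical names, not part of Parquet-MR, and Parquet's real encoding (a hybrid of run-length encoding and bit packing) is laid out differently on disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative run-length encoder; NOT Parquet's actual on-disk format. */
final class RunLengthSketch {

  /** Collapses consecutive equal values into {value, runLength} pairs. */
  static List<int[]> encode(int[] values) {
    List<int[]> runs = new ArrayList<>();
    for (int v : values) {
      int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
      if (last != null && last[0] == v) {
        last[1]++; // same value as the previous one: extend the current run
      } else {
        runs.add(new int[] {v, 1}); // new value: start a new run of length 1
      }
    }
    return runs;
  }

  /** Expands {value, runLength} pairs back into the original sequence. */
  static int[] decode(List<int[]> runs) {
    int total = 0;
    for (int[] run : runs) {
      total += run[1];
    }
    int[] out = new int[total];
    int i = 0;
    for (int[] run : runs) {
      for (int k = 0; k < run[1]; k++) {
        out[i++] = run[0];
      }
    }
    return out;
  }
}
```

Columns with long stretches of repeated values (for example, a sorted or low-cardinality column) shrink the most under this kind of scheme, which is one reason it pairs well with a columnar layout.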

Map/Reduce integration

Input and Output formats. Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which will implement the conversion of your object to and from a Parquet schema.

We've implemented this for 2 popular data formats to provide a clean migration path as well:

Thrift

Thrift integration is provided by the parquet-thrift sub-project. If you are using Thrift through Scala, you may be using Twitter's Scrooge. If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the parquet-scrooge sub-project.

Avro

Avro conversion is implemented via the parquet-avro sub-project.

Create your own objects

The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event-based RecordConsumer.
The ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer.
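
The event-based writing style described above can be sketched without any Parquet dependency. Everything in this block is an assumption made for illustration: `SketchConsumer` is a cut-down stand-in for the real RecordConsumer (which has more methods and different value types), and `User`, `UserWriter` and `LoggingConsumer` are hypothetical names. The point is only the shape of the interaction: the writer walks one object and emits start/value/end events per field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, cut-down stand-in for an event-based record consumer. */
interface SketchConsumer {
  void startMessage();
  void startField(String name, int index);
  void addInteger(int value);
  void addString(String value);
  void endField(String name, int index);
  void endMessage();
}

/** Hypothetical user object to be written. */
final class User {
  final int id;
  final String name; // may be null, modelling an optional field

  User(int id, String name) {
    this.id = id;
    this.name = name;
  }
}

/** Plays the role a WriteSupport plays: turns one object into events. */
final class UserWriter {
  static void write(User user, SketchConsumer consumer) {
    consumer.startMessage();
    consumer.startField("id", 0);
    consumer.addInteger(user.id);
    consumer.endField("id", 0);
    if (user.name != null) { // an absent optional field emits no events
      consumer.startField("name", 1);
      consumer.addString(user.name);
      consumer.endField("name", 1);
    }
    consumer.endMessage();
  }
}

/** Records each event as a string so the sequence can be inspected. */
final class LoggingConsumer implements SketchConsumer {
  final List<String> events = new ArrayList<>();

  @Override public void startMessage() { events.add("startMessage"); }
  @Override public void startField(String name, int index) { events.add("startField(" + name + ")"); }
  @Override public void addInteger(int value) { events.add("addInteger(" + value + ")"); }
  @Override public void addString(String value) { events.add("addString(" + value + ")"); }
  @Override public void endField(String name, int index) { events.add("endField(" + name + ")"); }
  @Override public void endMessage() { events.add("endMessage"); }
}
```

Calling `UserWriter.write(new User(1, "ada"), consumer)` with a `LoggingConsumer` yields the event sequence for one record. Reading works in the opposite direction: a materializer receives the same kind of events and assembles an object from them.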

See the APIs:

Apache Pig integration

A Loader and a Storer are provided to read and write Parquet files with Apache Pig.

Storing data into Parquet in Pig is simple:

    -- options you might want to fiddle with
    SET parquet.page.size 1048576 -- default. this is your min read/write unit.
    SET parquet.block.size 134217728 -- default. your memory budget for buffering data
    SET parquet.compression lzo -- or you can use none, gzip, snappy
    STORE mydata into '/some/path' USING parquet.pig.ParquetStorer;
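
The two default sizes in the script above are easier to reason about in MiB. The following sketch just does that arithmetic; `ParquetSizeSketch` is a hypothetical class name, and the block-to-page ratio it prints is only a rough figure, since actual page counts depend on the number of columns, the encodings and the compression codec.

```java
/** Converts the default sizes from the Pig example into MiB. */
final class ParquetSizeSketch {
  static final long PAGE_SIZE_BYTES = 1048576L;    // parquet.page.size default
  static final long BLOCK_SIZE_BYTES = 134217728L; // parquet.block.size default

  static long toMiB(long bytes) {
    return bytes / (1024L * 1024L);
  }

  /** Rough ratio of block size to page size, ignoring compression. */
  static long blockToPageRatio() {
    return BLOCK_SIZE_BYTES / PAGE_SIZE_BYTES;
  }

  public static void main(String[] args) {
    // prints "page: 1 MiB, block: 128 MiB, ratio: 128"
    System.out.println("page: " + toMiB(PAGE_SIZE_BYTES) + " MiB, block: "
        + toMiB(BLOCK_SIZE_BYTES) + " MiB, ratio: " + blockToPageRatio());
  }
}
```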

Reading in Pig is also simple:

mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader();

If the data was stored using Pig, things will "just work". If the data was stored using another method, you will need to provide the Pig schema equivalent to the data you stored (you can also write the schema to the file footer while writing it -- but that's pretty advanced). We will provide a basic automatic schema conversion soon.

Hive integration

Hive integration is provided via the parquet-hive sub-project.

Build

To run the unit tests: mvn test

To build the jars: mvn package

The build runs in Travis CI.

Add Parquet as a dependency in Maven

The current release is version 1.8.1.

    <dependencies>
      <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-common</artifactId>
        <version>1.8.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-encoding</artifactId>
        <version>1.8.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-column</artifactId>
        <version>1.8.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <version>1.8.1</version>
      </dependency>
    </dependencies>

How To Contribute

We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the github.com/apache/parquet-mr repository. If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to https://github.com/apache/parquet-mr.git

If you are looking for some ideas on what to contribute, check out JIRA issues for this project labeled "pick-me-up". Comment on the issue and/or contact dev@parquet.apache.org with your questions and ideas.

If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list dev@parquet.apache.org

To contribute a patch:

Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
Create a JIRA for your patch on the Parquet Project JIRA.
Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the JIRA name (ex: apache#240).
Make sure that your code passes the unit tests. You can run the tests with mvn test in the root directory.
Add new unit tests for your code.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:

Use 2 spaces for whitespace. Not tabs, not 4 spaces. The number of the spacing shall be 2.
Give your operators some room. Not a+b but a + b, and not foo(int a,int b) but foo(int a, int b).
Generally speaking, stick to the Sun Java Code Conventions.
Make sure tests pass!
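
As a small illustration of the whitespace rules above, the snippet below uses two-space indentation and spaces around operators and after commas. `StyleSketch` and `foo` are hypothetical names chosen only to mirror the example in the list.

```java
/** Demonstrates the preferred whitespace style; names are illustrative only. */
final class StyleSketch {

  // Preferred: two-space indent, "a + b", and "int a, int b".
  // Avoid:     tabs or four spaces, "a+b", and "int a,int b".
  static int foo(int a, int b) {
    return a + b;
  }
}
```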

Thank you for getting involved!

Authors and contributors

Code of Conduct

We hold ourselves and the Parquet developer community to two codes of conduct:

Discussions

Mailing list: dev@parquet.apache.org
Bug tracker: JIRA
Discussions also take place in github pull requests

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

See also:
