An example in Scala of reading data saved in HBase with Spark, and an example of a converter for Python.
Spark has its own example of integrating HBase and Spark in Scala, `HBaseTest.scala`, along with Python converters in `HBaseConverters.scala`. However, the Python converter `HBaseResultToStringConverter` in `HBaseConverters.scala` returns only the value of the first column in the result, and `HBaseTest.scala` stops at returning `org.apache.hadoop.hbase.client.Result` and calling `.count()`.
Here we provide a new example in Scala of transferring data saved in HBase into `String` with Spark, and a new example of a converter for Python. The Scala example, `HBaseInput.scala`, transfers the data saved in HBase into `RDD[String]`, where each record contains `columnFamily`, `qualifier`, `timestamp`, `type`, and `value`.
The Python converter example, `pythonConverters.scala`, transfers the data saved in HBase into strings containing the same information as the example above. We can use the `ast` package to easily convert such a string into a dictionary.
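For instance, a record string can be parsed like this (a minimal sketch; the record below is copied from the sample output at the end of this README):

```python
import ast

# One record string as produced by HBaseResultToStringConverter
# (note: 'columnFamliy' is the key name the converter actually emits)
record = ("{'columnFamliy': 'c1', 'timestamp': '1420329575846', "
          "'type': 'Put', 'qualifier': 'a', 'value': 'a1'}")

# ast.literal_eval safely evaluates a string containing a Python literal
cell = ast.literal_eval(record)
print(cell['value'])      # a1
print(cell['qualifier'])  # a
```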
- Make sure that you have `git` set up properly.
- Download this application:

```
$ git clone https://github.com/GenTang/spark_hbase.git
```
- Build the assembly by using sbt-assembly:

```
$ <the path to spark_hbase>/sbt/sbt clean assembly
```
Run the example Python script `hbase_input.py`, which uses the Python converters `ImmutableBytesWritableToStringConverter` and `HBaseResultToStringConverter` to convert the data in HBase to dictionaries.

If you are using `SPARK_CLASSPATH`: add

```
export SPARK_CLASSPATH=$SPARK_CLASSPATH":<the path to hbase>/lib/*:<the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar"
```

to `./conf/spark-env.sh`, then launch the script by

```
$ ./bin/spark-submit <the path to hbase_input.py> \
    <host> <table> <column>
```
You can also use `spark.executor.extraClassPath` and `--driver-class-path` (recommended): add

```
spark.executor.extraClassPath <the path to hbase>/lib/*
```

to `spark-defaults.conf`, then launch the script by

```
$ ./bin/spark-submit \
    --driver-class-path <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar \
    <the path to hbase_input.py> \
    <host> <table> <column>
```
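For reference, here is a minimal sketch of what `hbase_input.py` does with these converters. The fully qualified converter names below are an assumption; check `pythonConverters.scala` for the exact package and class names.

```python
from pyspark import SparkContext

sc = SparkContext(appName="HBaseInput")
host, table, column = "localhost", "test", "c1"

# Standard TableInputFormat configuration keys
conf = {"hbase.zookeeper.quorum": host,
        "hbase.mapreduce.inputtable": table,
        "hbase.mapreduce.scan.column.family": column}

# Read (ImmutableBytesWritable, Result) pairs from HBase; the two converters
# turn them into (row key string, record string) pairs on the Python side.
# NOTE: the converter names below are assumed -- verify them in pythonConverters.scala.
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="examples.pythonConverters.ImmutableBytesWritableToStringConverter",
    valueConverter="examples.pythonConverters.HBaseResultToStringConverter",
    conf=conf)

# hbase_input.py then parses each record string into a dictionary (e.g. with ast);
# here we simply print the raw converted pairs
for key, value in hbase_rdd.collect():
    print(key, value)
```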
Run the example Scala program `HBaseInput.scala`.
If you are using `SPARK_CLASSPATH`: add

```
export SPARK_CLASSPATH=$SPARK_CLASSPATH":<the path to hbase>/lib/*"
```

to `./conf/spark-env.sh`, then launch the program by

```
$ ./bin/spark-submit \
    --class examples.HBaseInput \
    <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar \
    <host> <table>
```
You can also use `spark.executor.extraClassPath` and `--driver-class-path` (recommended): use the same configuration as above, then launch the program by

```
$ ./bin/spark-submit \
    --driver-class-path <the path to hbase>/lib/* \
    --class examples.HBaseInput \
    <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar \
    <host> <table>
```
Assume that you already have some data in HBase, as follows:

```
hbase(main):028:0> scan "test"
ROW                  COLUMN+CELL
 r1                  column=c1:a, timestamp=1420329575846, value=a1
 r1                  column=c1:b, timestamp=1420329640962, value=b1
 r2                  column=c1:a, timestamp=1420329683843, value=a2
 r3                  column=c1:, timestamp=1420329810504, value=3
```
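If you want to reproduce this table, it can be created in the HBase shell along these lines (a sketch; your timestamps will differ):

```
hbase(main):001:0> create 'test', 'c1'
hbase(main):002:0> put 'test', 'r1', 'c1:a', 'a1'
hbase(main):003:0> put 'test', 'r1', 'c1:b', 'b1'
hbase(main):004:0> put 'test', 'r2', 'c1:a', 'a2'
hbase(main):005:0> put 'test', 'r3', 'c1', '3'
```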
By launching

```
$ ./bin/spark-submit --driver-class-path <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar <the path to hbase_input.py> localhost test c1
```

you will get

```
(u'r1', {'columnFamliy': 'c1', 'timestamp': '1420329575846', 'type': 'Put', 'qualifier': 'a', 'value': 'a1'})
(u'r1', {'columnFamliy': 'c1', 'timestamp': '1420329640962', 'type': 'Put', 'qualifier': 'b', 'value': 'b1'})
(u'r2', {'columnFamliy': 'c1', 'timestamp': '1420329683843', 'type': 'Put', 'qualifier': 'a', 'value': 'a2'})
(u'r3', {'columnFamliy': 'c1', 'timestamp': '1420329810504', 'type': 'Put', 'qualifier': '', 'value': '3'})
```