Hadrian Data Format
PFA by itself does not define a data representation, only a type system (Avro's type system). Hadrian, as a software library rather than an application, does not require data to be serialized in a particular format. Three input formats are defined so far (Avro, JSON, and CSV), but applications using the library are encouraged to use their own input formats: anything that is appropriate for the workflow that Hadrian is to be embedded in.
However, data has to be represented in some form for processing by PFA functions. The following table shows the format Hadrian uses internally.
| Avro type | Hadrian's internal format |
|---|---|
| null | null Java `Object` (`AnyRef`) |
| boolean | `java.lang.Boolean` |
| int | `java.lang.Integer` |
| long | `java.lang.Long` |
| float | `java.lang.Float` |
| double | `java.lang.Double` |
| string | Java `String` (`java.lang.String`) |
| bytes | Java array of bytes (`Array[Byte]`) |
| array | `com.opendatagroup.hadrian.data.PFAArray[T]` |
| map | `com.opendatagroup.hadrian.data.PFAMap[T]` |
| record | subclass of `com.opendatagroup.hadrian.data.PFARecord` |
| fixed | subclass of `com.opendatagroup.hadrian.data.PFAFixed` |
| enum | subclass of `com.opendatagroup.hadrian.data.PFAEnumSymbol` |
| union | Java `Object` (`AnyRef`) |
Input to a scoring engine's `action` method must be of this form, and output from that method will be of this form. This is not the format that the Avro library produces when you deserialize an Avro file (Hadrian uses a custom `org.apache.avro.specific.SpecificData` called `com.opendatagroup.hadrian.data.PFASpecificData`). However, it is a format that can be passed directly to the Avro library to serialize an Avro file.
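To make this concrete, here is a minimal sketch of building a datum in the internal format and passing it to `action`. The file name, the map-of-doubles input type, and the use of `fromJson(...).head` to obtain a single engine are assumptions for illustration, not prescribed by the library.

```scala
import com.opendatagroup.hadrian.jvmcompiler.PFAEngine
import com.opendatagroup.hadrian.data.PFAMap

// Assumed: "model.pfa" holds a PFA document whose input type is
// {"type": "map", "values": "double"} and whose output type is "double".
val pfaDocumentJson = scala.io.Source.fromFile("model.pfa").mkString
val engine = PFAEngine.fromJson(pfaDocumentJson).head

// Build the input in Hadrian's internal format: a PFAMap over boxed doubles
// (see the boxing note further down this page).
val input = PFAMap.fromMap(Map(
  "x" -> java.lang.Double.valueOf(3.14),
  "y" -> java.lang.Double.valueOf(2.72)))

// The result comes back in the same internal format (here, a java.lang.Double).
val output = engine.action(input)
```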
Three of the above, `PFARecord`, `PFAFixed`, and `PFAEnumSymbol`, are compiled specifically for each PFA engine class. (If you run the `fromJson` method of `com.opendatagroup.hadrian.jvmcompiler.PFAEngine` with `multiplicity > 1`, all of the scoring engines returned share the same class; if you run it multiple times, the scoring engines belong to different classes.) You must use the right subclass. Since these subclasses are compiled at runtime, they must be accessed through a special `java.lang.ClassLoader`.
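For example, here is a hedged sketch of that class-identity behavior, assuming `pfaDocumentJson` holds a PFA document as in the sketch above and that `fromJson` accepts a `multiplicity` keyword argument as the text suggests:

```scala
import com.opendatagroup.hadrian.jvmcompiler.PFAEngine

// One call with multiplicity = 4: four engines that share one generated class
// (and therefore one ClassLoader).
val together = PFAEngine.fromJson(pfaDocumentJson, multiplicity = 4)
val sharedClasses = together.map(_.getClass).distinct.size   // expected: 1

// Two separate calls: the engines belong to different generated classes.
val a = PFAEngine.fromJson(pfaDocumentJson).head
val b = PFAEngine.fromJson(pfaDocumentJson).head
val sameClass = a.getClass == b.getClass                      // expected: false
```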
Here is an example of creating a `PFARecord` for a given `engine` (of class `com.opendatagroup.hadrian.jvmcompiler.PFAEngine`) and a `recordType` (of class `com.opendatagroup.hadrian.datatype.AvroRecord`). Assume that the fields of this record have already been converted into the appropriate types and are stored, in field order, in an array of Objects called `fieldData`.
```scala
val recordTypeName = recordType.fullName
val classLoader = engine.classLoader
val subclass = classLoader.loadClass(recordTypeName)
val constructor = subclass.getConstructor(classOf[Array[AnyRef]])
constructor.newInstance(fieldData)
```
Only the last line needs to be executed at runtime; the rest can be saved from an initialization phase. In fact, calling `constructor.setAccessible(true)` during initialization can speed up `constructor.newInstance(fieldData)` by skipping access checks at runtime.
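Putting the two phases together, a sketch (with `engine`, `recordType`, and the per-datum `fieldData` as above):

```scala
// Initialization phase: resolve the subclass and cache its constructor once.
val constructor = engine.classLoader
  .loadClass(recordType.fullName)
  .getConstructor(classOf[Array[AnyRef]])
constructor.setAccessible(true)   // skip Java access checks on every call below

// Runtime phase: one reflective call per datum.
def makeRecord(fieldData: Array[AnyRef]): AnyRef =
  constructor.newInstance(fieldData).asInstanceOf[AnyRef]
```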
Here is an example of creating a `PFAFixed` from a given `engine` (of class `PFAEngine`) and a `fixedType` (of class `com.opendatagroup.hadrian.datatype.AvroFixed`). Assume that the data is stored as an array of byte primitives called `bytesData`.
```scala
val fixedTypeName = fixedType.fullName
val classLoader = engine.classLoader
val subclass = classLoader.loadClass(fixedTypeName)
val constructor = subclass.getConstructor(classOf[Array[Byte]])
constructor.newInstance(bytesData)
```
Here is an example of creating a `PFAEnumSymbol` from a given `engine` (of class `PFAEngine`) and an `enumType` (of class `com.opendatagroup.hadrian.datatype.AvroEnum`). Assume that the data is given as a string called `symbolName`.
```scala
val enumTypeName = enumType.fullName
val classLoader = engine.classLoader
val subclass = classLoader.loadClass(enumTypeName)
val constructor = subclass.getConstructor(classOf[org.apache.avro.Schema], classOf[String])
constructor.newInstance(enumType.schema, symbolName)
```
`PFAArray[T]` and `PFAMap[T]` are templated classes that satisfy Java's `java.util.List[T]` and `java.util.Map[String, T]` interfaces, though most methods raise `UnsupportedOperationException`. They are backed by Scala collections, `Vector[T]` and `Map[String, T]`. The normal way to create a `PFAArray[T]` or `PFAMap[T]` is from a given vector `v` or map `m`:
```scala
PFAArray.fromVector(v)
PFAMap.fromMap(m)
```
However, they can also be built up in place through the Java interfaces (`sizeHint` is an integer hint for preallocation, and `arrayType` and `mapType` are instances of `com.opendatagroup.hadrian.datatype.AvroArray` and `com.opendatagroup.hadrian.datatype.AvroMap`):
```scala
val array = PFAArray.empty(sizeHint, arrayType.schema)
array.add(value1)
array.add(value2)

val map = PFAMap.empty(sizeHint, mapType.schema)
map.put(key1, value1)
map.put(key2, value2)
```
To get a usable collection, call the `array.toVector` or `map.toMap` method. During the building phase, `PFAArray[T]` and `PFAMap[T]` are backed by `scala.collection.mutable.Builder[T, Vector[T]]` and `scala.collection.mutable.Builder[(String, T), Map[String, T]]` for performance when progressively accumulating data. Once `array.toVector` or `map.toMap` has been called, they are backed by the corresponding immutable collections. The `array.toVector` and `map.toMap` operations are fast because their results are computed once and lazily cached.
Note that `PFAArray[T]` takes primitive types `T` for booleans (`Boolean`), integers (`Int`), longs (`Long`), floats (`Float`), and doubles (`Double`), but `PFAMap[T]` takes boxed primitive types `T` for booleans (`java.lang.Boolean`), integers (`java.lang.Integer`), longs (`java.lang.Long`), floats (`java.lang.Float`), and doubles (`java.lang.Double`). These quirks were forced by the way that the Avro library loads data.
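A sketch of the difference (the type ascriptions are only there to make the boxing explicit; that `fromVector` and `fromMap` infer the type parameter is an assumption here):

```scala
import com.opendatagroup.hadrian.data.{PFAArray, PFAMap}

// PFAArray: unboxed Scala primitives as the element type.
val arr: PFAArray[Double] = PFAArray.fromVector(Vector(1.0, 2.0, 3.0))

// PFAMap: boxed Java wrappers as the value type.
val mp: PFAMap[java.lang.Double] = PFAMap.fromMap(Map(
  "x" -> java.lang.Double.valueOf(1.0),
  "y" -> java.lang.Double.valueOf(2.0)))
```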
Additionally, `PFAArray[T]` has a mutable `metadata` field (of type `Map[String, Any]`) for optimizations. Some data mining models run faster if their input data are organized differently from a flat list. For instance, `model.neighbor.nearestK` can be optimized by storing the training dataset as a KD-tree rather than a list. With the `lib.model.neighbor.nearestK.kdtree` option set to `true`, Hadrian builds the KD-tree and attaches it to the `PFAArray[T]` as `metadata`. On subsequent calls, `model.neighbor.nearestK` searches the tree rather than the array, replacing an O(n) algorithm with an O(log n) one. This is safe from inconsistencies because arrays are immutable in PFA.
Hadrian has a few built-in translator routines, which translate data from a form appropriate for one engine class to another engine class (`com.opendatagroup.hadrian.data.PFADataTranslator`), from data deserialized by the Avro library to data appropriate for an engine class (`com.opendatagroup.hadrian.data.AvroDataTranslator`), and to and from Scala code (`com.opendatagroup.hadrian.data.ScalaDataTranslator`). All three minimize the effort needed to translate at runtime by saving constructors and skipping unnecessary translations (for example, from `java.lang.Integer` to `java.lang.Integer`, or arrays of these, etc.).
Antinous also has translator routines, which translate from PFA to Jython (`com.opendatagroup.antinous.translate.PFAToJythonDataTranslator`), the reverse (`com.opendatagroup.antinous.translate.JythonToPFADataTranslator`), and from data deserialized by the Avro library to Jython (`com.opendatagroup.antinous.translate.AvroToJythonDataTranslator`). They follow the same pattern as Hadrian's translators, but additionally have to deal with the problem of grafting the Avro type system onto Python's built-in type system.
The Hadrian library provides a few data serialization/deserialization methods out of the box. Some are specific to a given `PFAEngine` class; others are generic, either deserializing data that can then be translated with `PFADataTranslator` and used as input to `action`, or serializing any data directly.
The specific methods are all member functions of the `PFAEngine` class. The results of each input method can be passed directly to `PFAEngine.action`, and the output of `PFAEngine.action` (or `emit`) can be passed directly to each output method.
- `avroInputIterator` reads a raw Avro file as a `java.io.InputStream` and yields data as a `java.util.Iterator` (used in the sketch after this list).
- `jsonInputIterator` reads a file in which each line is a complete JSON document representing one input datum, again as a `java.io.InputStream`, producing a `java.util.Iterator`. If the input is a `scala.collection.Iterator[String]`, then the output is a `scala.collection.Iterator[X]`.
- `csvInputIterator` uses Apache Commons CSV to read a CSV file as record data. The engine's input type must be a record containing only primitives, to conform to CSV's limitations.
- `jsonInput` loads one complete JSON document representing one datum. This function must be called repeatedly, since it does not operate on streams or iterators, and it is less efficient than the iterator version.
- `avroOutputDataStream` creates an Avro data sink on a given `java.io.OutputStream` that has `append` and `close` methods for writing data.
- `jsonOutputDataStream` does the same for JSON, printing one complete JSON document per line.
- `csvOutputDataStream` does the same for CSV, assuming that the engine's output type is a record containing only primitives.
- `fromPFAData` is a specialized `PFADataTranslator` attached to the `PFAEngine`. Use this to convert data from one scoring engine's output to another's input (i.e. chaining models).
- `fromGenericAvroData` is a specialized `AvroDataTranslator` attached to the `PFAEngine`. Use this to convert data deserialized by the Avro library into data that can be sent to `action`.
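Here is a hedged sketch of a batch-scoring loop built from the engine-specific methods above; the file names and the single-engine `fromJson(...).head` call are assumptions for illustration:

```scala
import java.io.{FileInputStream, FileOutputStream}
import com.opendatagroup.hadrian.jvmcompiler.PFAEngine

// Assumed: "model.pfa" holds a PFA document whose input and output are Avro records.
val engine = PFAEngine.fromJson(scala.io.Source.fromFile("model.pfa").mkString).head

// Engine-specific reader and writer: the data is already in this engine's classes.
val inputs = engine.avroInputIterator(new FileInputStream("input.avro"))
val output = engine.avroOutputDataStream(new FileOutputStream("output.avro"))

// Score every input datum and append the result to the Avro sink.
while (inputs.hasNext)
  output.append(engine.action(inputs.next()))

output.close()
```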
The following functions are generic, not associated with any PFA engine class. To use them for input, be sure to run the data through the specific PFA engine's `PFAEngine.fromPFAData` first (see the sketch after this list). Any of them can be used for output. They are all in the `com.opendatagroup.hadrian.data` package.
- `fromJson` converts one datum from a complete JSON document.
- `fromAvro` converts one datum from Avro (as part of a stream or an RPC call, not an Avro file with a header).
- `toJson` converts one datum to a complete JSON document.
- `toAvro` converts one datum to Avro (again, as part of a stream or an RPC call, not an Avro file with a header).
- `avroInputIterator` streams an Avro file like the `PFAEngine` method with the same name, but produces generic data that must be translated with `PFAEngine.fromPFAData`.
- `jsonInputIterator` streams a file of one JSON document per line like the `PFAEngine` method with the same name, but produces generic data that must be translated with `PFAEngine.fromPFAData`.
- `csvInputIterator` streams a CSV file like the `PFAEngine` method with the same name, but produces generic data that must be translated with `PFAEngine.fromPFAData`.
- `avroOutputDataStream` streams an Avro file exactly like the `PFAEngine` method with the same name.
- `jsonOutputDataStream` streams a one-JSON-per-line file exactly like the `PFAEngine` method with the same name.
- `csvOutputDataStream` streams a CSV file exactly like the `PFAEngine` method with the same name.
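As a sketch of the translation step, assume two already-instantiated engines, `engineA` and `engineB`, whose output and input types match, an `input` datum in `engineA`'s internal format, and that `fromPFAData` takes a single datum and returns its translation:

```scala
// engineA and engineB were compiled separately, so their record/fixed/enum
// subclasses differ even when the Avro types are identical.
val intermediate = engineA.action(input)

// fromPFAData rebuilds the datum using engineB's own generated classes,
// after which it can be passed to engineB's action method.
val result = engineB.action(engineB.fromPFAData(intermediate))
```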
Return to the Hadrian wiki table of contents.
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).