Go library to magically turn JSON into avro-JSON


ouzi-dev/avro-kedavro


A library to parse raw json to avro with magic!

Why avro-kedavro?

  • We want to store information about wizards in S3 to query with Athena later using this schema:
    {
      "name": "Wizard",
      "type": "record",
      "namespace": "com.avro.kedavro",
      "fields": [
        {
          "name": "name",
          "type": ["null", "string"],
          "default": null
        },
        {
          "name": "id",
          "type": "long"
        },
        {
          "name": "timestamp",
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    }
  • We have multiple sources reporting wizards and leaving the data in JSON format in a stream... the problem is that each source uses a different format, so we have reports like:

    • {"name": "Harry", "id": 12345, "timestamp": 1571128870}
    • {"name": "Ron", "id": "98765", "timestamp": 1571128870}
    • {"name": "Hermione", "id": "56784", "timestamp": "1571128870000"}

    None of these reports are valid for the schema we have:

    • All of them will fail on the name field alone, since a union in JSON-avro should be: "name": {"string": "..."}
    • Only the first record has the id as a long
    • The schema expects the timestamp as a long in milliseconds, but none of the reports is correct: either they lack milliseconds, or the value is a string instead of a long
  • We could try to implement a specific solution for each record, but what happens when we start dealing with 10 different types? Or 100? And what if we want to change some schemas? Changing a schema would mean going through all the parsers we built for specific "events". So we need some kind of magic that takes:

    • avro schema
    • JSON record
    • Some rules like: switch strings to numbers, or switch timestamps to timestamps with milliseconds, ...

    Well... that magic is avro-kedavro!
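
To make the problem concrete, here is a minimal hand-written sketch that fixes the three wizard reports above using only the standard library. It is not how avro-kedavro is implemented: normalizeWizard, toInt64 and the secondsCutoff heuristic are hypothetical names and assumptions made for this illustration, and the logic is hard-coded to the Wizard schema, which is exactly the per-record work the library is meant to avoid.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// ASSUMPTION for this sketch only: timestamps below this value are in seconds.
const secondsCutoff = int64(100000000000)

// toInt64 accepts a JSON number (decoded as float64) or a numeric string.
func toInt64(v interface{}) (int64, error) {
	switch x := v.(type) {
	case float64:
		return int64(x), nil
	case string:
		return strconv.ParseInt(x, 10, 64)
	default:
		return 0, fmt.Errorf("cannot convert %T to long", v)
	}
}

// normalizeWizard rewrites one raw report into the avro-JSON shape
// expected by the Wizard schema.
func normalizeWizard(raw []byte) (map[string]interface{}, error) {
	var in map[string]interface{}
	if err := json.Unmarshal(raw, &in); err != nil {
		return nil, err
	}
	out := map[string]interface{}{}

	// ["null", "string"] union: null stays null, a value becomes {"string": value}.
	switch name := in["name"].(type) {
	case nil:
		out["name"] = nil
	case string:
		out["name"] = map[string]interface{}{"string": name}
	default:
		return nil, fmt.Errorf("name: unexpected type %T", name)
	}

	id, err := toInt64(in["id"])
	if err != nil {
		return nil, fmt.Errorf("id: %w", err)
	}
	out["id"] = id

	ts, err := toInt64(in["timestamp"])
	if err != nil {
		return nil, fmt.Errorf("timestamp: %w", err)
	}
	if ts < secondsCutoff {
		ts *= 1000 // seconds to milliseconds
	}
	out["timestamp"] = ts
	return out, nil
}

func main() {
	reports := []string{
		`{"name": "Harry", "id": 12345, "timestamp": 1571128870}`,
		`{"name": "Ron", "id": "98765", "timestamp": 1571128870}`,
		`{"name": "Hermione", "id": "56784", "timestamp": "1571128870000"}`,
	}
	for _, r := range reports {
		rec, err := normalizeWizard([]byte(r))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		b, _ := json.Marshal(rec)
		fmt.Println(string(b))
	}
}
```

Every new record type would need another function like normalizeWizard, and every schema change would mean editing it by hand.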

How to use it

avro-kedavro is designed to work with goavro. The idea is that avro-kedavro parses your raw JSON into avro-JSON that your schema supports, so you can use goavro to generate your avro OCF files.

Example:

    import (
        "encoding/json"
        "fmt"

        "github.com/linkedin/goavro"
        "github.com/ouzi-dev/avro-kedavro/pkg/kedavro"
    )

    const schema = `{
      "name": "Wizard",
      "type": "record",
      "namespace": "com.avro.kedavro",
      "fields": [
        {
          "name": "name",
          "type": ["null", "string"],
          "default": null
        },
        {
          "name": "id",
          "type": "long"
        },
        {
          "name": "timestamp",
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    }`

    const JSONrecord = `{"name": "Voldemort", "id": "66666", "timestamp": "1571128870"}`

    func ParseToJSONAvro() error {
        p, err := kedavro.NewParser(schema, kedavro.WithStringToNumber(), kedavro.WithTimestampToMillis())
        if err != nil {
            // Error parsing schema
            return err
        }

        avroJSON, err := p.Parse([]byte(JSONrecord))
        if err != nil {
            // Error parsing record with schema
            return err
        }

        // Marshal the map to show the result from avro-kedavro
        kedavroJSONResult, err := json.Marshal(avroJSON)
        if err != nil {
            // Error marshaling kedavro result
            return err
        }

        fmt.Println(string(kedavroJSONResult))
        // this will print: {"name": {"string": "Voldemort"}, "id": 66666, "timestamp": 1571128870000}

        // use goavro to test the generated avroJSON is valid for the schema
        codec, err := goavro.NewCodec(schema)
        if err != nil {
            // Error parsing schema
            return err
        }

        textual, err := codec.TextualFromNative(nil, avroJSON)
        if err != nil {
            // Error converting avroJSON
            return err
        }

        fmt.Println(string(textual))
        // this will print: {"name": {"string": "Voldemort"}, "id": 66666, "timestamp": 1571128870000}

        return nil
    }

Options

avro-kedavro supports the following options for now:

  • WithStringToNumber() will try to parse strings as numbers: {"test": "1234.56"} => {"test": 1234.56}
  • WithStringToBool() will try to parse strings as booleans: {"test": "False"} => {"test": false}
  • WithTimestampToMillis() will add milliseconds to timestamps. It only works for logicalType="timestamp-millis" fields: {"test": 1571128870} => {"test": time.Time(1571128870000)}
  • WithTimestampToMicros() will add microseconds to timestamps. It only works for logicalType="timestamp-micros" fields: {"test": 1571128870} => {"test": time.Time(1571128870000000)}
  • WithDateTimeFormat(format string) will try to parse a string to a timestamp using the format passed as a parameter. It only works for logicalType="timestamp-millis" or logicalType="timestamp-micros" fields: {"test": "2019-10-14T12:45:18Z"} => (using time.RFC3339 as the format and logicalType="timestamp-millis") => {"test": time.Time(1571057118000)}
  • WithNowForNullTimestamp() will set time.Now() if the field is null. It only works for logicalType="timestamp-millis" or logicalType="timestamp-micros" fields.

Supported types

Not all avro types are supported by avro-kedavro yet! The currently supported types are:

Avro      Go
null      nil
boolean   bool
bytes     []byte
float     float32
double    float64
long      int64
int       int32
string    string
union     see below
record    map[string]interface{}

Unsupported types:

Avro
enum
fixed
map
array
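
As a rough illustration of the mapping above: encoding/json decodes every JSON number as float64, so a value has to be converted to the Go type that matches the avro type in the schema. The sketch below assumes a hypothetical helper named toNative and is not the library's implementation. In particular, treating a JSON string as raw bytes is an assumption of this sketch.

```go
package main

import "fmt"

// toNative converts a value decoded by encoding/json into the Go type
// listed for the given avro primitive type.
func toNative(avroType string, v interface{}) (interface{}, error) {
	switch avroType {
	case "null":
		if v == nil {
			return nil, nil
		}
	case "boolean":
		if b, ok := v.(bool); ok {
			return b, nil
		}
	case "string":
		if s, ok := v.(string); ok {
			return s, nil
		}
	case "bytes":
		// ASSUMPTION: the JSON string holds the bytes as-is.
		if s, ok := v.(string); ok {
			return []byte(s), nil
		}
	case "int", "long", "float", "double":
		if f, ok := v.(float64); ok {
			switch avroType {
			case "int":
				return int32(f), nil
			case "long":
				return int64(f), nil
			case "float":
				return float32(f), nil
			case "double":
				return f, nil
			}
		}
	default:
		// enum, fixed, map and array are listed as unsupported.
		return nil, fmt.Errorf("unsupported avro type %q", avroType)
	}
	return nil, fmt.Errorf("value %v (%T) does not fit avro type %q", v, v, avroType)
}

func main() {
	v, err := toNative("long", float64(42))
	fmt.Printf("%T %v %v\n", v, v, err) // int64 42 <nil>

	_, err = toNative("enum", "SLYTHERIN")
	fmt.Println(err)
}
```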

Supported Unions

Only unions with two elements, where the first is null and the second is a supported type other than record, are currently supported by avro-kedavro:

First field   Second field
null          boolean
null          bytes
null          float
null          double
null          long
null          int
null          string
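
In avro-JSON, a value of a ["null", T] union is written either as null or as a single-key object named after the branch type, for example {"string": "Harry"}. A minimal sketch of that wrapping, using a hypothetical helper named wrapUnion (an assumption for illustration, not a function from avro-kedavro):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wrapUnion encodes a value for a ["null", branch] union:
// nil stays nil, anything else becomes {branch: value}.
func wrapUnion(branch string, v interface{}) interface{} {
	if v == nil {
		return nil
	}
	return map[string]interface{}{branch: v}
}

func main() {
	for _, u := range []interface{}{
		wrapUnion("string", "Harry"),
		wrapUnion("long", int64(66666)),
		wrapUnion("string", nil),
	} {
		b, _ := json.Marshal(u)
		fmt.Println(string(b))
	}
	// prints:
	// {"string":"Harry"}
	// {"long":66666}
	// null
}
```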

Supported Logical Types

For now only two logical types are supported:

Avro               Go
timestamp-millis   time.Time
timestamp-micros   time.Time

About timestamps

For timestamp logical types, the field's type in the schema must always be long.

Accepted values in json for timestamps are:

  • Numeric values: for example, 1586502702 will be accepted as a timestamp. If a numeric value has decimals, those decimals will be ignored when parsing to time.Time
  • Strings: only if the WithStringToNumber() option is provided. The string will be parsed as follows:
    • If the string is a number without decimals: it will be treated as a timestamp (in seconds, milliseconds, or microseconds depending on the options provided to the parser)
    • If the string is a number with decimals: it will be treated as a timestamp where the decimals are considered fractions of a second.
      • If the selected type is timestamp-millis, the parser will keep the first three decimals.
      • If the selected type is timestamp-micros, the parser will keep the first six decimals.
    • If the string has non-numeric characters: the parser will try to parse the string to time.Time using the format provided with the WithDateTimeFormat(format string) option
  • null: only if the WithNowForNullTimestamp() option is provided. When the option is provided and a null is found for a timestamp-millis or timestamp-micros field, time.Now() will be used as the value.
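
The rule for numeric strings with decimals can be sketched as follows. This is an illustration under stated assumptions, not the library's code: parseTimestamp is a hypothetical helper, and it always treats the integer part as seconds, whereas the library decides the unit of decimal-free values from the parser options.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTimestamp converts a numeric string expressed in seconds into an
// integer timestamp. digits is 3 for timestamp-millis and 6 for
// timestamp-micros: only that many decimals are kept.
func parseTimestamp(s string, digits int) (int64, error) {
	whole, frac := s, ""
	if i := strings.IndexByte(s, '.'); i >= 0 {
		whole, frac = s[:i], s[i+1:]
	}
	// Keep the first `digits` decimals and right-pad with zeros.
	if len(frac) > digits {
		frac = frac[:digits]
	}
	frac += strings.Repeat("0", digits-len(frac))
	return strconv.ParseInt(whole+frac, 10, 64)
}

func main() {
	ms, _ := parseTimestamp("1586502702.123456789", 3)
	fmt.Println(ms) // 1586502702123

	us, _ := parseTimestamp("1586502702.123456789", 6)
	fmt.Println(us) // 1586502702123456

	plain, _ := parseTimestamp("1586502702", 3)
	fmt.Println(plain) // 1586502702000
}
```

Working on the string instead of converting to float64 first avoids rounding errors in the kept decimals.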
