Amazon CloudFront Log Parser
rkalla/cloudfront-log-parser
Changelog
---------
1.4
* Fixed Issue #11 - IllegalArgumentException while parsing newer CF log format.
* Fixed Issue #12 - Supporting new CF log fields.

1.3
* Improved performance by using a new feature in tbm-common-parser-lib that
  allows a tokenizer to re-use the same IToken for every token event instead
  of creating a new one.

1.2
* More secure exception handling inside of Parse ensures cleanup of any
  temporary input streams.
* Improved the exception semantics of the parse method. #2
* Parser is more verbose with MalformedContentException errors when the
  content of the log file doesn't match the Amazon-defined DOWNLOAD or
  STREAMING CloudFront log formats. #2
* Fixed incorrect bounds check for insertion of fields beyond what ILogEntry
  will allow. #5
* Fixed a handful of "tightening" bugs that didn't break the parser previously
  but could lead to leaks or bugs down the road.
* Additional Javadoc was added to the source to make it clearer what certain
  constructs are for.
* Added a new file to the benchmark that matches a worst-case-scenario log
  file from CloudFront (50MB uncompressed, ~174k entries).
* Added library to The Buzz Media's Maven repo.

1.1
* Initial public release.


License
-------
This library is released under the Apache 2 License. See LICENSE.


Description
-----------
CloudFront Log Parser is a Java library offering a low-complexity,
high-performance, adaptive CloudFront log parser for both the Download and
Streaming CloudFront log formats.

The library is "adaptive" in the sense that it determines the format of the
log files at parse time, so you don't need to tell it. It is also resilient:
it can skip unknown Field names and values while parsing instead of throwing
an exception.
This is important if the library is ever deployed in a production setting and
Amazon changes the CloudFront log format before a new version of the library
is ready.

CloudFront Log Parser's API was designed with the existing AWS Java SDK in
mind, so that integration would fit naturally. More specifically, the API can
consume the raw InputStream for the stored .gz log files on S3 directly from
the S3Object.getObjectContent() method.

That being said, the API is coded to accept ANY InputStream, whether you are
processing local copies and passing FileInputStreams or processing remote
copies; it just happened to be particularly easy to integrate it with the AWS
Java SDK.

CloudFront Log Parser is intended to be used in any deployment scenario, from
a desktop analysis app to a long-running server process environment.


Design
------
CloudFront Log Parser was designed, first and foremost, to be as fast as
possible with as little memory allocation as possible so as to work smoothly
in a long-running server log-processing usage scenario.

Object creation is kept to a minimum during parsing (no matter how big the
parse job) by re-using a single ILogEntry instance, per LogParser, to wrap
parsed line values and report those to the given ILogParserCallback.

The ILogEntry instance received by the callbacks is ephemeral: it is only
valid for the scope of the callback's method. Once that method returns, the
ILogEntry instance is reused and the values it holds are swapped.

*** CALLBACKS SHOULD NEVER HOLD ONTO ILOGENTRY INSTANCES! ***

However, the values reported by the callbacks are copies and can be safely
stored.

This design was chosen because over large parse jobs for busy sites, where
millions of log entries would not be uncommon, the heap memory savings and
performance improvement because of this design would be noticeable.
The VM would not be thrown into longer GC cycles as it attempted to clean up
the millions of useless objects that were so short lived.


Performance
-----------
Benchmarks can be found in the /src/test/java folder and can be run directly
from the command line (no need to set up JUnit).

[Platform]
* Java 1.6.0_24 on Windows 7 64-bit
* Dual Core Intel E6850 processor
* 8 GB of RAM

[Benchmark Results]
Parsed 100 Log Entries in 27ms (3703 entries / sec)
Parsed 100,000 Log Entries in 864ms (115740 entries / sec)
Parsed 174,200 Log Entries in 1341ms (129903 entries / sec)
Parsed 1,000,000 Log Entries in 7520ms (132978 entries / sec)

The Amazon CloudFront docs say log files are truncated at a maximum size of
50MB (uncompressed) before they are written out to the log directory. The 3rd
test, parsing the 174k log entries, is exactly 50MB uncompressed and matches
this worst-case scenario.

This means that CloudFront Log Parser can parse the largest log file that
CloudFront will write out to your S3 bucket in a little over a second on
equivalent hardware.

If you are running on beefier server hardware, you can increase that rate, and
if you are parsing logs in a multi-threaded environment (1 thread per
LogParser) you can increase that rate by magnitudes.

CloudFront Log Parser is fast.


Runtime Requirements
--------------------
1. The Buzz Media common-lib (tbm-common-lib-<VER>.jar)
2. The Buzz Media common-parser-lib (tbm-common-parser-lib-<VER>.jar)


History
-------
After deploying apps that utilized CloudFront for content delivery, I had the
need to parse the resulting access logs to get an idea of what kind of
traffic, bandwidth and access patterns the content was receiving.

Amazon promotes the use of their Map/Reduce Hadoop-based log parser multiple
times on their site, but that requires additional EC2 charges to run.

After about a week of prototyping and engineering I had initial versions of
the CloudFront Log Parser written and running.
Initial "does it work" prototypes took an afternoon, but I toyed with a
multitude of different API designs and approaches trying to best balance an
easy-to-use API with runtime performance.

Eventually settling on what was by far the cleanest and simplest API, I used
HPROF to tighten up the runtime performance and minimize object creation down
to the bare minimum.

I hope this helps folks out there.
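
The reuse rule from the Design section can be illustrated with a small,
self-contained Java sketch. The Entry and Callback interfaces, the
ReusableEntry class and the parse method below are hypothetical stand-ins
written for this example; they are not the library's ILogEntry,
ILogParserCallback or LogParser, and the sample lines are shortened, made-up
tab-separated records. The sketch only shows why a callback should copy the
values it needs rather than keep the entry it was handed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EphemeralEntrySketch {
    /** Hypothetical stand-in for an entry type; NOT the library's ILogEntry. */
    public interface Entry {
        String get(int index);
        int size();
    }

    /** Hypothetical stand-in for a callback; NOT the library's ILogParserCallback. */
    public interface Callback {
        void onEntry(Entry entry);
    }

    /** One mutable entry whose contents are swapped for every parsed line. */
    private static final class ReusableEntry implements Entry {
        private String[] values = new String[0];

        void reset(String[] newValues) { values = newValues; }

        public String get(int index) { return values[index]; }

        public int size() { return values.length; }
    }

    /** Toy parser: skips '#' directive lines and reports tab-separated fields. */
    public static void parse(Iterable<String> lines, Callback callback) {
        ReusableEntry entry = new ReusableEntry(); // allocated once per parse
        for (String line : lines) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue; // header lines such as "#Version: 1.0"
            }
            entry.reset(line.split("\t")); // same object, new values
            callback.onEntry(entry);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, far shorter than a real CloudFront record.
        List<String> lines = Arrays.asList(
                "#Version: 1.0",
                "2011-03-01\t01:13:11\tSEA4\t/a.png",
                "2011-03-01\t01:13:12\tSEA4\t/b.png");

        List<Entry> held = new ArrayList<>();  // WRONG: keeps the shared entry
        List<String> uris = new ArrayList<>(); // RIGHT: keeps copied values

        parse(lines, e -> {
            held.add(e);
            uris.add(e.get(3));
        });

        System.out.println(held.get(0) == held.get(1)); // true
        System.out.println(held.get(0).get(3));         // /b.png
        System.out.println(uris);                       // [/a.png, /b.png]
    }
}
```

Both references in the held list point at the same reused object, so they only
ever show the last line parsed, while the copied strings in uris keep the
value from each line.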