parth-p-shah/magika (Public)

forked from google/magika

Detect file content types with deep learning

google.github.io/magika/

Magika

Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1 MB and enables precise file identification within milliseconds, even when running on a single CPU.

In an evaluation with over 1M files and over 100 content types (covering both binary and textual file formats), Magika achieves 99%+ precision and recall. Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners.

You can try Magika without installing anything by using our web demo, which runs locally in your browser!

Here is an example of what Magika's command-line output looks like:

For more context, you can read our initial announcement post on Google's OSS blog.

Highlights

Available as a Python command line, a Python API, and an experimental TFJS version (which powers our web demo).
Trained on a dataset of over 25M files across more than 100 content types.
On our evaluation, Magika achieves 99%+ average precision and recall, outperforming existing approaches.
More than 100 content types (see full list).
After the model is loaded (this is a one-off overhead), the inference time is about 5ms per file.
Batching: You can pass multiple files to the command line and API at the same time, and Magika will use batching to speed up inference. You can invoke Magika with even thousands of files at once. You can also use -r to scan a directory recursively.
Near-constant inference time, independent of the file size; Magika only uses a limited subset of the file's bytes.
Magika uses a per-content-type threshold system that determines whether to "trust" the prediction of the model, or whether to return a generic label, such as "Generic text document" or "Unknown binary data".
Supports three different prediction modes, which tweak the tolerance to errors: high-confidence, medium-confidence, and best-guess.
It's open source! (And more is yet to come.)
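
The threshold system and prediction modes described above can be illustrated with a minimal sketch. This is not Magika's actual implementation: the threshold values, the `resolve_label` helper, and the exact meaning given to each mode are assumptions chosen only to show the idea of falling back to a generic label when the model's score is too low.

```python
# Illustrative sketch only. The threshold values and the mode semantics below
# are assumptions for this example, not Magika's real configuration.
GENERIC_TEXT = "Generic text document"
GENERIC_BINARY = "Unknown binary data"

# Hypothetical per-content-type thresholds.
THRESHOLDS = {"python": 0.80, "markdown": 0.75, "elf": 0.60}
# Hypothetical fixed threshold, used for unknown labels and for the
# medium-confidence mode.
MEDIUM_THRESHOLD = 0.50


def resolve_label(label, score, is_text, mode="high-confidence"):
    """Return the model's label if it is trusted, else a generic label."""
    if mode == "best-guess":
        # Always trust the model, whatever the score.
        return label
    if mode == "medium-confidence":
        threshold = MEDIUM_THRESHOLD
    elif mode == "high-confidence":
        threshold = THRESHOLDS.get(label, MEDIUM_THRESHOLD)
    else:
        raise ValueError(f"unknown prediction mode: {mode}")
    if score >= threshold:
        return label
    # Not trusted: fall back to a generic label for text or binary content.
    return GENERIC_TEXT if is_text else GENERIC_BINARY


print(resolve_label("python", 0.99, is_text=True))
print(resolve_label("python", 0.70, is_text=True))
print(resolve_label("python", 0.70, is_text=True, mode="medium-confidence"))
```

In this sketch a stricter mode turns more low-score predictions into generic labels, while best-guess never does; that is the trade-off in error tolerance that the modes expose.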

For more details, see the documentation for the python package and for the js package (dev docs).

Getting Started

Installation

Magika is available as magika on PyPI:

$ pip install magika

Running in Docker

git clone https://github.com/google/magika
cd magika/
docker build -t magika .
docker run -it --rm -v $(pwd):/magika magika -r /magika/tests_data

Usage

Python command line

Examples:

$ magika -r tests_data/
tests_data/README.md: Markdown document (text)
tests_data/basic/code.asm: Assembly (code)
tests_data/basic/code.c: C source (code)
tests_data/basic/code.css: CSS source (code)
tests_data/basic/code.js: JavaScript source (code)
tests_data/basic/code.py: Python source (code)
tests_data/basic/code.rs: Rust source (code)
...
tests_data/mitra/7-zip.7z: 7-zip archive data (archive)
tests_data/mitra/bmp.bmp: BMP image data (image)
tests_data/mitra/bzip2.bz2: bzip2 compressed data (archive)
tests_data/mitra/cab.cab: Microsoft Cabinet archive data (archive)
tests_data/mitra/elf.elf: ELF executable (executable)
tests_data/mitra/flac.flac: FLAC audio bitstream data (audio)
...

$ magika code.py --json
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]

$ cat doc.ini | magika -
-: INI configuration file (text)

$ magika -h
Usage: magika [OPTIONS] [FILE]...

  Magika - Determine type of FILEs with deep-learning.

Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use
                                  --list-output-content-types for the list of
                                  supported output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.

  Magika version: "0.5.0"

  Default model: "standard_v1"

  Send any feedback to magika-dev@google.com or via GitHub issues.

See the python documentation for details.
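
The `--json` output is convenient to consume from scripts. The sketch below parses the sample output from the `magika code.py --json` example above using only Python's standard library; the `summarize` helper is a hypothetical name introduced for illustration, and the field names are taken from that sample rather than from a formal schema.

```python
import json

# Sample copied from the `magika code.py --json` example above.
SAMPLE = """
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
"""


def summarize(json_text):
    """Return (path, label, score) tuples from the "output" field of each entry."""
    return [
        (entry["path"], entry["output"]["ct_label"], entry["output"]["score"])
        for entry in json.loads(json_text)
    ]


for path, label, score in summarize(SAMPLE):
    print(f"{path}: {label} ({score:.2f})")  # prints "code.py: python (0.99)"
```

In practice the text would come from the command's standard output (for example, redirected to a file) rather than from an embedded string.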

Python API

Examples:

>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.ct_label)
markdown

See the python documentation for details.
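
As a small sketch of how the call above might be used on several in-memory buffers, the helper below counts the predicted labels. It relies only on `identify_bytes` and `res.output.ct_label`, both shown in the example above; `count_labels` is a hypothetical helper, and it accepts the identifier as an argument so it makes no further assumptions about the Magika API.

```python
from collections import Counter


def count_labels(identifier, buffers):
    """Count predicted content-type labels for an iterable of bytes objects.

    `identifier` is any object exposing identify_bytes(data) whose result has
    an .output.ct_label attribute, such as the Magika() instance created in
    the example above.
    """
    counts = Counter()
    for data in buffers:
        res = identifier.identify_bytes(data)
        counts[res.output.ct_label] += 1
    return counts
```

With Magika installed, this could be called as `count_labels(Magika(), [b"# Title\nSome text", b"[section]\nkey = value\n"])`; the labels returned depend on the model's predictions.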

Experimental TFJS model & npm package

We also provide Magika as an experimental package for people interested in using it in a web app. Note that the Magika JS implementation is significantly slower, and you should expect to spend 100ms+ per file.

See the js documentation for details.

Development Setup

We use poetry for development and packaging:

$ git clone https://github.com/google/magika
$ cd magika/python
$ poetry shell && poetry install
$ magika -r ../tests_data

To run the tests:

$ cd magika/python
$ poetry shell
$ pytest tests/

Important Documentation

Known Limitations & Contributing

Magika significantly improves over the state of the art, but there's always room for improvement! More work can be done to increase detection accuracy, support for additional content types, bindings for more languages, etc.

This initial release is not targeting polyglot detection, and we're looking forward to seeing adversarial examples from the community. We would also love to hear from the community about problems encountered, misdetections, feature requests, the need for support for additional content types, etc.

Check our open GitHub issues to see what is on our roadmap, and please report misdetections or feature requests by either opening GitHub issues (preferred) or by emailing us at magika-dev@google.com.

When reporting misdetections, you may want to use $ magika --generate-report <path> to generate a report with debug information, which you can include in your GitHub issue.

NOTE: Do NOT send reports about files that may contain PII; the report contains a small part of the file content!

See CONTRIBUTING.md for details.

Frequently Asked Questions

We have collected a number of FAQs here.

Additional Resources

Google's OSS blog post announcing Magika.
Web demo.

Citation

If you use this software for your research, please cite it as:

@software{magika,
author = {Fratantonio, Yanick and Bursztein, Elie and Invernizzi, Luca and Zhang, Marina and Metitieri, Giancarlo and Kurt, Thomas and Galilee, Francois and Petit-Bianco, Alexandre and Farah, Loua and Albertini, Ange},
title = {{Magika content-type scanner}},
url = {https://github.com/google/magika}
}

License

Apache 2.0; see LICENSE for details.

Disclaimer

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
