An integration of JoernTI's CodeTIDAL5 neural type inference model.
joernio/joernti-codetidal5
Artifact for the Learning Type Inference for Enhanced Dataflow Analysis paper
This repository adds neural type inference to the code analysis platform Joern. The newly introduced pass runs alongside the usual post-processing passes of the jssrc2cpg language frontend and uses a Large Language Model to infer type information where it is missing.
For this process to make use of the neural type inference server, the JoernTI backend must be installed first. You can initialize the joernti submodule by running:
git submodule update --init --recursive
Before running the type inference passes with Joern, follow the JoernTI install instructions and start the backend server:
joernti codetidal5 --run-as-server
You can then proceed to use JoernTI together with Joern:
sbt stage astGenDlTask
./joernti-codetidal5 <target_source_directory> -Dlog4j.configurationFile=log4j2.xml
While the default values are usually all that is necessary, there are additional configurations available:
=== JoernTI x CodeTIDAL5 ===
Usage: joernti-codetidal5 [options] input

  --help
  input                     source code directory (JavaScript or TypeScript)
  -o, --output <value>      output path for the CPG (Default 'cpg.bin')
  -h, --hostname <value>    JoernTI server hostname (Default 'localhost')
  -p, --port <value>        JoernTI server port (Default 1337)
  --typeDeclDir <value>     the TypeScript type declaration files to improve type info of the analysis
  --logTypeInference        log the slice based type inference results (Default false for performance)
  -m, --min-calls <value>   the minimum number of calls required for a usage slice (Default 1)
  --exclude-op-calls        excludes <operator> calls from the slices, e.g. <operator>.add, <operator>.assignment, etc.
One notable configuration is to set --typeDeclDir ./type_decl_es5, which checks for type constraint violations according to the ES5 standard library types.
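Because the pass depends on a running backend, it can help to confirm that something is listening on the configured hostname and port before starting a long analysis. The following is a minimal sketch, not part of this repository: the function name joernti_server_reachable is hypothetical, the defaults mirror the localhost and 1337 values from the usage text, and the check only confirms that a TCP connection can be opened, not that the listener is actually JoernTI.

```python
import socket


def joernti_server_reachable(host: str = "localhost", port: int = 1337,
                             timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it does not speak the JoernTI protocol, it only
    tests whether some process accepts connections on the given port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False


if __name__ == "__main__":
    if joernti_server_reachable():
        print("A server is accepting connections on localhost:1337")
    else:
        print("No server found; start it with: joernti codetidal5 --run-as-server")
```

If the server runs on another machine, pass the same values you give to --hostname and --port.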
For validating this artifact with the results of the paper, a good combination would be:
./joernti-codetidal5 <target_source_directory> --logTypeInference --typeDeclDir ./type_decl_es5
The argument --logTypeInference will provide CSVs listing what was inferred and print any schema-violating inferences.
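When validating several projects, assembling the command line in a small script avoids retyping the flags. The sketch below is an illustration only: build_joernti_command and its parameter names are hypothetical, while the flags it emits are the ones listed in the usage text. It builds the argument list without running anything.

```python
import shlex
from typing import List, Optional


def build_joernti_command(
    target_dir: str,
    type_decl_dir: Optional[str] = "./type_decl_es5",
    log_type_inference: bool = True,
    hostname: Optional[str] = None,
    port: Optional[int] = None,
    launcher: str = "./joernti-codetidal5",
) -> List[str]:
    """Assemble an argument list using only flags from the usage text.

    Hypothetical helper; options left as None are omitted so that the
    tool falls back to its own defaults.
    """
    cmd = [launcher, target_dir]
    if log_type_inference:
        cmd.append("--logTypeInference")
    if type_decl_dir is not None:
        cmd += ["--typeDeclDir", type_decl_dir]
    if hostname is not None:
        cmd += ["--hostname", hostname]
    if port is not None:
        cmd += ["--port", str(port)]
    return cmd


# Show the command for the validation setup described above.
print(shlex.join(build_joernti_command("path/to/project")))
# prints: ./joernti-codetidal5 path/to/project --logTypeInference --typeDeclDir ./type_decl_es5
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root once the launcher has been built.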
Note: This demo is aimed at version v0.0.44 of JoernTI.
We make a CodeTIDAL5 checkpoint available on Hugging Face: https://huggingface.co/joernio/codetidal5
The current version is fine-tuned for 175k steps on the adjusted (cf. Experiments) ManyTypes4TypeScript dataset. We plan on uploading refined versions in the future.
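For a quick look at the checkpoint outside of Joern, it can probably be loaded with the Hugging Face transformers library. This is a sketch under assumptions, not the authors' documented procedure: it assumes the checkpoint is stored in a format that the generic Auto classes recognize as a sequence-to-sequence model, and load_codetidal5 is a hypothetical helper name. The prompt format the model expects is not described here, so the sketch stops at loading.

```python
MODEL_ID = "joernio/codetidal5"  # repository name taken from the URL above


def load_codetidal5(model_id: str = MODEL_ID):
    """Download (or load from the local cache) tokenizer and model weights.

    Assumption: the checkpoint loads through the generic seq2seq Auto
    classes of the transformers library.
    """
    # Imported lazily so that importing this module needs no ML dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling load_codetidal5() requires the transformers and torch packages and network access on first use.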
To experiment with the ML model and the datasets in ./experiments, install the dependencies, including CUDA and PyTorch 2.0 (GPU required):
cd ./experiments
./install_cuda_pytorch.sh
You can find scripts and instructions on how to generate a training dataset for type inference with a decoder model such as CodeT5 in ./experiments/training_dataset.
We also publish a dataset of object usage slices for ~300k TypeScript programs, extracted with Joern Slice. The slices have been obtained from open source programs in The Stack dataset.
An example can be found in ./testcode/test_slice.
If you use JoernTI / CodeTIDAL5 in your research or wish to refer to the baseline results, we kindly ask you to cite us:
@inproceedings{joernti2023,
  title={Learning Type Inference for Enhanced Dataflow Analysis},
  author={Seidel, Lukas and {Baker Effendi}, David and Pinho, Xavier and Rieck, Konrad and {van der Merwe}, Brink and Yamaguchi, Fabian},
  booktitle={28th European Symposium on Research in Computer Security (ESORICS)},
  year={2023}
}
Some code and graphics in this repository are part of the work first published in the 28th European Symposium on Research in Computer Security by Springer Nature.
ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference, International Conference on Mining Software Repositories (MSR) 2022
Deep Learning Type Inference, ACM ESEC/FSE '18
Probabilistic Type Inference by Optimising Logical and Natural Constraints, arXiv preprint '20
Advanced Graph-Based Deep Learning for Probabilistic Type Inference, arXiv preprint '20
Learning type annotation: is big data enough?, ESEC/FSE '21