JoernTI x CodeTIDAL5

Artifact for the Learning Type Inference for Enhanced Dataflow Analysis paper

This repository provides means to add neural type inference to the code analysis platform Joern. The newly introduced pass makes use of a Large Language Model during the usual post-processing passes for the jssrc2cpg language frontend to infer additional type information where it is missing.

Installation

For this process to make use of the neural type inference server, the JoernTI backend must be installed first. You can initialize the joernti submodule by running:

git submodule update --init --recursive

Before running the type inference passes with Joern, follow the JoernTI install instructions and start the backend server:

joernti codetidal5 --run-as-server
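Before moving on, it can help to confirm that the backend is actually accepting connections. The following is a minimal sketch, assuming the server listens on a plain TCP socket at the default hostname and port listed under Configuration (`localhost`, `1337`). The function name `server_is_reachable` is hypothetical and not part of JoernTI.

```python
import socket


def server_is_reachable(host: str = "localhost", port: int = 1337,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: the JoernTI server accepts plain TCP connections on the
    hostname/port documented in the CLI usage text.
    """
    try:
        # create_connection resolves the host and opens a TCP connection;
        # the context manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    print("JoernTI server reachable:", server_is_reachable())
```

This only checks that something is listening on the port; it does not verify that the listener is JoernTI or that the model has finished loading.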

You can then proceed to use JoernTI together with Joern:

sbt stage astGenDlTask
./joernti-codetidal5 <target_source_directory> -Dlog4j.configurationFile=log4j2.xml

Configuration

While the default values are usually all that is necessary, there are additional configurations available:

=== JoernTI x CodeTIDAL5 ===
Usage: joernti-codetidal5 [options] input

  --help
  input                    source code directory (JavaScript or TypeScript)
  -o, --output <value>     output path for the CPG (Default 'cpg.bin')
  -h, --hostname <value>   JoernTI server hostname (Default 'localhost')
  -p, --port <value>       JoernTI server port (Default 1337)
  --typeDeclDir <value>    the TypeScript type declaration files to improve type info of the analysis
  --logTypeInference       log the slice based type inference results (Default false for performance)
  -m, --min-calls <value>  the minimum number of calls required for a usage slice (Default 1)
  --exclude-op-calls       excludes <operator> calls from the slices, e.g. <operator>.add, <operator>.assignment, etc.

One notable configuration is to set --typeDeclDir ./type_decl_es5, which checks for type constraint violations according to the ES5 standard library types.

For validating this artifact with the results of the paper, a good combination would be:

./joernti-codetidal5 <target_source_directory> --logTypeInference --typeDeclDir ./type_decl_es5

The --logTypeInference argument produces CSV files listing what was inferred and prints any schema-violating inferences.
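When the tool is run repeatedly over several projects, it can be convenient to assemble the command line programmatically. The sketch below builds an argument list using only the options shown in the usage text above. The helper name `build_command` and its parameter names are hypothetical, and the `-Dlog4j.configurationFile` setting from the Installation section is left out for brevity.

```python
from typing import List, Optional


def build_command(source_dir: str,
                  output: str = "cpg.bin",
                  hostname: str = "localhost",
                  port: int = 1337,
                  type_decl_dir: Optional[str] = None,
                  log_type_inference: bool = False,
                  min_calls: int = 1,
                  exclude_op_calls: bool = False) -> List[str]:
    """Assemble an argv list for ./joernti-codetidal5.

    Defaults mirror the values printed by the tool's usage text.
    """
    cmd = [
        "./joernti-codetidal5", source_dir,
        "--output", output,
        "--hostname", hostname,
        "--port", str(port),
        "--min-calls", str(min_calls),
    ]
    # Optional settings are only added when explicitly requested.
    if type_decl_dir is not None:
        cmd += ["--typeDeclDir", type_decl_dir]
    if log_type_inference:
        cmd.append("--logTypeInference")
    if exclude_op_calls:
        cmd.append("--exclude-op-calls")
    return cmd


if __name__ == "__main__":
    # Reproduces the validation setup suggested above for a placeholder path.
    print(build_command("<target_source_directory>",
                        type_decl_dir="./type_decl_es5",
                        log_type_inference=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the backend server is running.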

Note: This demo is aimed at version v0.0.44 of JoernTI.

Model

We make a CodeTIDAL5 checkpoint available on Hugging Face: https://huggingface.co/joernio/codetidal5

The current version is fine-tuned for 175k steps on the adjusted (cf. Experiments) ManyTypes4TypeScript dataset. We plan on uploading refined versions in the future.
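For quick experiments outside of Joern, the checkpoint can presumably be loaded with the Hugging Face `transformers` library. The following is a rough sketch under two assumptions that are not stated in this README: that the checkpoint loads through the generic `AutoTokenizer` / `AutoModelForSeq2SeqLM` classes, and that the type to be inferred is marked with the T5-style sentinel token `<extra_id_0>`. The helper names and the `<TYPE>` placeholder are hypothetical; consult the model card for the exact input format.

```python
def mask_type_slot(code: str, marker: str = "<TYPE>") -> str:
    """Replace the first hypothetical <TYPE> marker with a T5 sentinel token.

    Assumption: the model expects the unknown type to be marked with
    <extra_id_0>, as is common for T5-style span infilling.
    """
    return code.replace(marker, "<extra_id_0>", 1)


def load_codetidal5(model_id: str = "joernio/codetidal5"):
    """Download tokenizer and model from the Hugging Face Hub."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def infer_type(code_with_marker: str, tokenizer, model,
               max_new_tokens: int = 16) -> str:
    """Generate a completion for the masked type slot and decode it."""
    inputs = tokenizer(mask_type_slot(code_with_marker), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call could look like `infer_type("let n: <TYPE> = items.length;", *load_codetidal5())`; the quality of the result depends on how closely the input matches the format used during fine-tuning.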

Experiments

For experimenting with the ML model and the datasets used in ./experiments, install the dependencies including CUDA and PyTorch 2.0 (GPU required):

cd ./experiments
./install_cuda_pytorch.sh
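After the installation, a short check can confirm that PyTorch sees a GPU. This is a generic sketch that relies only on the standard `torch.cuda` API; the function name `describe_torch_cuda` is hypothetical and not part of the experiment scripts.

```python
def describe_torch_cuda() -> str:
    """Summarize whether PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} installed, but no CUDA device is visible"
    # Report the name of the first visible GPU.
    return f"PyTorch {torch.__version__} with CUDA device: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_torch_cuda())
```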

You can find scripts and instructions on how to generate a training dataset for type inference with a decoder model such as CodeT5 in ./experiments/training_dataset.

Slice Dataset

We also publish a dataset of object usage slices for ~300k TypeScript programs, extracted with Joern Slice. The slices have been obtained from open source programs in The Stack dataset.

An example can be found in ./testcode/test_slice.

Citation

If you use JoernTI / CodeTIDAL5 in your research or wish to refer to the baseline results, we kindly ask you to cite us:

@inproceedings{joernti2023,
  title={Learning Type Inference for Enhanced Dataflow Analysis},
  author={Seidel, Lukas and {Baker Effendi}, David and Pinho, Xavier and Rieck, Konrad and {van der Merwe}, Brink and Yamaguchi, Fabian},
  booktitle={28th European Symposium on Research in Computer Security (ESORICS)},
  year={2023}
}

Some code and graphics in this repository are part of the work first published in the 28th European Symposium on Research in Computer Security by Springer Nature.