
An integration of JoernTI's CodeTIDAL5 neural type inference model.


joernio/joernti-codetidal5



JoernTI x CodeTIDAL5

CodeTIDAL5 ESORICS paper

Artifact for the paper "Learning Type Inference for Enhanced Dataflow Analysis"

This repository adds neural type inference to the code analysis platform Joern. It introduces a new pass that runs alongside the usual post-processing passes of the jssrc2cpg language frontend and queries a Large Language Model to infer type information where it is missing.

Installation

The pass relies on the neural type inference server, so the JoernTI backend must be installed first. You can initialize the joernti submodule by running:

git submodule update --init --recursive

Before running the type inference passes with Joern, follow the JoernTI install instructions and start the backend server:

joernti codetidal5 --run-as-server

You can then proceed to use JoernTI together with Joern:

sbt stage astGenDlTask
./joernti-codetidal5 <target_source_directory> -Dlog4j.configurationFile=log4j2.xml
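
Because the pass depends on a running backend, it can help to confirm that the server is accepting connections before starting the analysis. The following is a minimal sketch, assuming the JoernTI server listens on a plain TCP socket at the default hostname and port listed under Configuration. The helper name `is_server_reachable` is hypothetical and not part of JoernTI.

```python
import socket


def is_server_reachable(host: str = "localhost", port: int = 1337, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: the JoernTI server accepts plain TCP connections.
    This only checks that the port is open, not that the model is loaded.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False
```

If the check returns `False`, start the backend with `joernti codetidal5 --run-as-server` and try again.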

Configuration

The default values are usually sufficient, but additional options are available:

=== JoernTI x CodeTIDAL5 ===
Usage: joernti-codetidal5 [options] input

  --help
  input                    source code directory (JavaScript or TypeScript)
  -o, --output <value>     output path for the CPG (Default 'cpg.bin')
  -h, --hostname <value>   JoernTI server hostname (Default 'localhost')
  -p, --port <value>       JoernTI server port (Default 1337)
  --typeDeclDir <value>    the TypeScript type declaration files to improve type info of the analysis
  --logTypeInference       log the slice based type inference results (Default false for performance)
  -m, --min-calls <value>  the minimum number of calls required for a usage slice (Default 1)
  --exclude-op-calls       excludes <operator> calls from the slices, e.g. <operator>.add, <operator>.assignment, etc.

One notable option is --typeDeclDir ./type_decl_es5, which checks inferred types for constraint violations against the ES5 standard library types.

For validating this artifact with the results of the paper, a good combination would be:

./joernti-codetidal5 <target_source_directory> --logTypeInference --typeDeclDir ./type_decl_es5

The --logTypeInference flag produces CSV files listing what was inferred and prints any schema-violating inferences.
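
When running the tool over many projects, the command line can be assembled programmatically. The sketch below is an illustration only: it uses just the flags listed in the usage text above, and the function `build_command` and its parameter names are hypothetical helpers, not part of this repository.

```python
from typing import List, Optional


def build_command(
    source_dir: str,
    output: Optional[str] = None,
    hostname: Optional[str] = None,
    port: Optional[int] = None,
    type_decl_dir: Optional[str] = None,
    log_type_inference: bool = False,
    min_calls: Optional[int] = None,
    exclude_op_calls: bool = False,
    executable: str = "./joernti-codetidal5",
) -> List[str]:
    """Assemble an argument list; options left unset fall back to the tool's defaults."""
    cmd = [executable, source_dir]
    if output is not None:
        cmd += ["--output", output]
    if hostname is not None:
        cmd += ["--hostname", hostname]
    if port is not None:
        cmd += ["--port", str(port)]
    if type_decl_dir is not None:
        cmd += ["--typeDeclDir", type_decl_dir]
    if log_type_inference:
        cmd.append("--logTypeInference")
    if min_calls is not None:
        cmd += ["--min-calls", str(min_calls)]
    if exclude_op_calls:
        cmd.append("--exclude-op-calls")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. For example, `build_command("<target_source_directory>", type_decl_dir="./type_decl_es5", log_type_inference=True)` yields the same flags as the validation command above.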

Note: This demo is aimed at version v0.0.44 of JoernTI.

Model

We make a CodeTIDAL5 checkpoint available on Hugging Face: https://huggingface.co/joernio/codetidal5

The current version is fine-tuned for 175k steps on the adjusted (cf. Experiments) ManyTypes4TypeScript dataset. We plan on uploading refined versions in the future.

Experiments

To experiment with the ML model and the datasets used in ./experiments, install the dependencies, including CUDA and PyTorch 2.0 (GPU required):

cd ./experiments
./install_cuda_pytorch.sh

You can find scripts and instructions on how to generate a training dataset for type inference with an encoder-decoder model such as CodeT5 in ./experiments/training_dataset.

Slice Dataset

We also publish a dataset of object usage slices for ~300k TypeScript programs, extracted with Joern Slice. The slices were obtained from open source programs in The Stack dataset.

An example can be found in ./testcode/test_slice.

Citation

If you use JoernTI / CodeTIDAL5 in your research or wish to refer to the baseline results, we kindly ask you to cite us:

@inproceedings{joernti2023,
  title={Learning Type Inference for Enhanced Dataflow Analysis},
  author={Seidel, Lukas and {Baker Effendi}, David and Pinho, Xavier and Rieck, Konrad and {van der Merwe}, Brink and Yamaguchi, Fabian},
  booktitle={28th European Symposium on Research in Computer Security (ESORICS)},
  year={2023}
}

Some code and graphics in this repository are part of the work first published in the 28th European Symposium on Research in Computer Security by Springer Nature.

Related Work

ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference, International Conference on Mining Software Repositories (MSR) 2022