This document describes the main parts of Redexer, focusing on how the logging command generates an instrumented apk.
In a nutshell, the process Redexer goes through to instrument an Android app looks like this:
- Unpack the apk file to extract the dex files. Then, for each dex file, do the following:
  - Parse the app's dex file into an internal data structure.
  - Parse Redexer's logging dex file into the same internal data structure.
  - Merge the logging dex into the app's dex.
  - Traverse the dex structure to add logging code to each method.
  - Modify the registers of some instructions to satisfy instruction requirements.
  - Dump the internal data structure back into a dex file.
- After this is completed for each dex file, repack the dex files into an instrumented apk.
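The per-dex steps above can be read as a simple function pipeline. The sketch below is illustrative only: the `dex` record and every function name in it are hypothetical stand-ins rather than Redexer's actual API, and each stage merely records that it ran.

```ocaml
(* Hypothetical stand-in for Redexer's internal dex structure:
   it only remembers which stages have been applied to it. *)
type dex = { name : string; stages : string list }

let parse name = { name; stages = [ "parse" ] }
let stage label d = { d with stages = d.stages @ [ label ] }

(* One function per step of the list above; all names are illustrative. *)
let merge_logging ~logging app = stage ("merge " ^ logging.name) app
let add_logging = stage "instrument"
let fix_registers = stage "fix-registers"
let dump = stage "dump"

(* Run the per-dex steps, in order, on a single dex file. *)
let instrument_dex path =
  let app = parse path in
  let logging = parse "logging.dex" in
  app |> merge_logging ~logging |> add_logging |> fix_registers |> dump

(* The surrounding scripts repeat this for every dex file in the apk. *)
let instrument_apk dex_paths = List.map instrument_dex dex_paths
```

Calling `instrument_apk [ "classes.dex"; "classes2.dex" ]` yields one record per input file, each listing the stages in the order given above.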
Unpacking the apk and running Redexer on each of the dex files actually happens outside of what is officially 'Redexer'. Redexer itself only accepts dex files as input, so supplemental scripts assist with working on an entire apk. These scripts, found in the scripts directory, unpack the apk and run the desired command on each of the dex files. When cmd.rb is run with the flag --cmd logging to create an instrumented apk, each dex file gets instrumented individually. The scripts can also pull additional information out of the apk if needed. For example, they update the apk's facebook_app_id to ensure that Facebook integration works in the instrumented apk.
The first thing Redexer does is parse the input dex file into its own internal data structure. This allows Redexer to make alterations to the dex before dumping it back to a file. Parsing in Redexer is well-tested, and it is unlikely that any Redexer developer will need to worry about that code. The internal structure, however, is the main interface for observing and modifying the dex, and can be found in src/dex.ml. The internal structure is closely aligned with Google's Dex Format Documentation, but differs in a few key ways. Mainly, Redexer uses a type called a link to record the location of an item in the dex (an 'item' being any structure outlined in Google's format doc). A link can be either an index or an offset. This is useful because different structures in the dex point to items in different ways. Sometimes an item is referred to by its index in an array (its position in the string_id pool, for example), and sometimes by its byte offset from the start of the file (the location of a data_item in the data pool, for example). Each serves a specific purpose in Android's dex format, but the distinction can be abstracted away for most computations in Redexer.
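To make the index/offset distinction concrete, here is a minimal sketch of a link as an OCaml variant. The constructor and function names are assumptions chosen for illustration and may not match the definitions in src/dex.ml.

```ocaml
(* Illustrative only: names are assumptions, not necessarily those
   used in src/dex.ml. *)
type link =
  | Idx of int (* position in an id pool, e.g. the string_id pool *)
  | Off of int (* byte offset from the start of the dex file *)

(* Code that only needs the raw number can ignore the kind of link. *)
let to_int = function Idx i | Off i -> i

let describe = function
  | Idx i -> Printf.sprintf "index %d" i
  | Off o -> Printf.sprintf "offset 0x%x" o

(* A toy lookup table keyed by link, standing in for the dex contents. *)
let lookup (table : (link * string) list) (l : link) : string option =
  List.assoc_opt l table
```

Because the kind is part of the key, `Idx 3` and `Off 3` never collide, while helpers such as `to_int` let most code treat the two uniformly.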
In addition to dex.ml, which outlines the internal dex structure, instr.ml handles the structure of the instructions themselves. Here you will find the mappings from bytecode integers to instruction types (called opcodes), as well as from opcodes to strings. These mappings correspond one-to-one with Google's Dex Bytecode Documentation. Redexer developers mainly use the functions defined in instr.ml to get information about specific instructions.
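The two mappings might look roughly like the sketch below. The opcode values and mnemonics shown are taken from the Dalvik bytecode reference, but the type and function names are hypothetical and only a handful of opcodes are covered.

```ocaml
(* A few opcodes from the Dalvik bytecode reference. The type and
   function names are illustrative, not those of instr.ml. *)
type opcode =
  | Nop
  | Move
  | Return_void
  | Const_4
  | Invoke_virtual
  | Invoke_static

(* Bytecode integer -> opcode. *)
let opcode_of_int = function
  | 0x00 -> Some Nop
  | 0x01 -> Some Move
  | 0x0e -> Some Return_void
  | 0x12 -> Some Const_4
  | 0x6e -> Some Invoke_virtual
  | 0x71 -> Some Invoke_static
  | _ -> None (* a full table would cover every defined opcode *)

(* Opcode -> mnemonic string. *)
let string_of_opcode = function
  | Nop -> "nop"
  | Move -> "move"
  | Return_void -> "return-void"
  | Const_4 -> "const/4"
  | Invoke_virtual -> "invoke-virtual"
  | Invoke_static -> "invoke-static"
```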
Finally, there is modify.ml, which defines a number of functions for modifying the dex. Examples include adding a method_id, inserting instructions into a method, and modifying register usage to meet instruction requirements (more on that below).
In order to add instrumentation to apks, logging code had to be written, compiled, and inserted into the app. This code implements the actual logging features and interface, specifying what data gets logged and how each line is written to file. The logging code can be found in logging/app/src/main/java/org/umd/logging/. The most important files are Logger.java, LoggerI.java, and FileWriterHandler.java. LoggerI.java defines an interface for the logging functions. Logger.java implements this interface, taking specific inputs depending on the method being logged and passing them on to FileWriterHandler.java, which actually writes the lines to file. FileWriterHandler.java runs on a background thread to reduce the performance impact on the instrumented app. It is also where a developer can decide what should be logged: because logging is time-intensive, it is not practical to log the value stored in every variable, so FileWriterHandler.java determines what gets written and what does not. To build this code, run logging/build_and_copy.sh. This initiates a gradle build, then moves the built dex to data/loggingFull.dex. loggingFull.dex gets inserted into the instrumented apk, but is not merged into each dex file.
In addition to loggingFull.dex, there is a separate dex file called logging.dex that gets merged with each dex from an apk. Its source can be found in logging-interface/src. This directory contains the same LoggerI.java as logging/, plus LoggerShim.java. LoggerShim.java defines a shim that uses reflection to call the Logger interface. This allows logging.dex to be built without ever implementing the Logger interface. The implementation lives in loggingFull.dex, which is added to the instrumented apk at the end. To build logging.dex, run logging-interface/make.sh. This builds the dex file and then moves it to data/logging.dex.
For the app's dex code to be able to call methods in the Logger interface, some code must be inserted into it. logging.dex, described above, gets merged into the app's dex for this reason. All inserted code (see next section) calls into the Logger interface in logging.dex. This merge happens in combine.ml. Previous developers have attempted to modify this code to control what gets merged from logging.dex, to no avail. Altering the code in combine.ml should be avoided as much as possible.
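One reason merging two dex files is delicate is that the dex format requires pools such as string_ids to be sorted and free of duplicates, so merging changes indices and every reference must be remapped. The sketch below shows that idea on plain string lists. It is not how combine.ml works; `merge_pools` is a hypothetical name, and OCaml's `compare` is only a stand-in for the ordering the dex format actually specifies.

```ocaml
(* Merge two string pools into one sorted, duplicate-free pool. Also
   return, for each input pool, the new index of each of its entries
   so that references into the old pools can be rewritten. *)
let merge_pools (a : string list) (b : string list) =
  let merged = List.sort_uniq compare (a @ b) in
  let index_of s =
    let rec go i = function
      | [] -> raise Not_found
      | x :: rest -> if x = s then i else go (i + 1) rest
    in
    go 0 merged
  in
  (merged, List.map index_of a, List.map index_of b)
```

For example, merging `[ "a"; "c" ]` with `[ "b"; "c" ]` moves `"c"` from index 1 to index 2 in both inputs, so anything that pointed at index 1 has to be updated.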
To traverse the internal dex data structure, Redexer uses a modular Visitor class, implemented in visitor.ml. This class follows the internal structure from class definition to method ID to code item in order to visit every item of the dex. The Visitor class is used throughout Redexer, and is used twice during the logging process. The first time, Redexer uses it to instrument each method of the dex. If any class or method should be skipped, the Visitor class offers functionality to make that easy: it reads a skip list from data/skip.txt and also has its own internal, hardcoded list of methods to skip.
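The visitor pattern with a skip list can be sketched with OCaml classes as follows. The toy `cls` and `meth` records, the method names on the class, and the "Class->method" form of skip entries are all assumptions made for this example; they do not describe the real visitor.ml or the format of data/skip.txt.

```ocaml
(* Toy dex model: classes that contain named methods. *)
type meth = { mname : string }
type cls = { cname : string; methods : meth list }

(* Base visitor with overridable hooks; all names are illustrative. *)
class visitor (skip : string list) =
  object (self)
    method skipped cname mname = List.mem (cname ^ "->" ^ mname) skip
    method v_method (_ : cls) (_ : meth) = ()

    method v_class (c : cls) =
      List.iter
        (fun m -> if not (self#skipped c.cname m.mname) then self#v_method c m)
        c.methods

    method v_dex (classes : cls list) = List.iter self#v_class classes
  end

(* A concrete pass overrides only the hook it cares about. *)
class collector skip =
  object
    inherit visitor skip
    val mutable seen : string list = []
    method! v_method c m = seen <- (c.cname ^ "->" ^ m.mname) :: seen
    method result = List.rev seen
  end
```

A pass such as `collector` never has to repeat the traversal or the skip check; it inherits both and supplies only the per-method action.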
Redexer distinguishes between two kinds of function entries/exits. The primary difference between them is where they get logged. 'APIs' are functions whose bytecode Redexer does not have access to, generally because they are Android OS system calls. Because of this, APIs must be logged at the site of the call/return. A 'Method' is any function that can be instrumented because its bytecode is available. Typically, these are user-defined functions or included libraries. Methods get logged at the start of the function and at each return statement. The bytecode snippets, as well as the code that inserts them into the dex structure, can be found in logging.ml. It is also important to note that, to avoid register conflicts, Redexer shifts all registers by 6 and can then safely use registers v0-v5 for the necessary logging code.
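The following sketch combines both ideas on a toy instruction set: every register is shifted by 6, API calls are wrapped with logging at the call site, and methods are logged at entry and before each return. The `instr` type, the `Log` placeholder, and the `instrument` function are hypothetical simplifications, not the bytecode snippets found in logging.ml.

```ocaml
(* Toy instruction set; constructor names are illustrative. *)
type instr =
  | Move of int * int (* move vA, vB *)
  | Invoke of string * int list (* call with argument registers *)
  | Return of int option (* return-void or return vA *)
  | Log of string (* placeholder for an inserted logging call *)

let shift = 6 (* frees v0-v5 for the logging code *)

let shift_instr = function
  | Move (a, b) -> Move (a + shift, b + shift)
  | Invoke (m, args) -> Invoke (m, List.map (fun r -> r + shift) args)
  | Return r -> Return (Option.map (fun x -> x + shift) r)
  | Log _ as i -> i

(* [is_api m] says whether the callee's bytecode is unavailable. *)
let instrument ~is_api name body =
  let rewrite = function
    | Invoke (m, _) as i when is_api m ->
        [ Log ("api-enter " ^ m); i; Log ("api-exit " ^ m) ]
    | Return _ as i -> [ Log ("method-exit " ^ name); i ]
    | i -> [ i ]
  in
  Log ("method-enter " ^ name)
  :: List.concat (List.map rewrite (List.map shift_instr body))
```

Note that the exit log is placed before every `Return`, so a method with several return statements gets one exit log per return.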
In order to be as space-efficient as possible, certain dex instructions can only address registers v0-v15. However, once Redexer has shifted all registers by 6, a register that was previously, say, v14 is now v20. For this reason, Redexer must traverse the entire dex structure a second time, finding instructions that require registers v15 or lower but now refer to a higher register. The traversal uses visitor.ml again, but this time it is implemented in modify.ml. Each category of instruction (move instruction, if instruction, etc.) has its own special case. Some cases make use of a dataflow analysis (DFA) to determine the type of the value that a register is holding. This DFA is the main bottleneck in Redexer, and can take more than 20 minutes to analyze a single method. The analysis itself can be found in dataflow.ml. The DFA determines which instruction 'defines' the type for each register. Back in modify.ml, Redexer uses that instruction to calculate exactly what type the register holds, and updates the instruction it is working on accordingly.
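As a small illustration of the register constraint, the Dalvik format defines three encodings of a plain register-to-register move: move (4-bit register fields), move/from16 (8-bit destination, 16-bit source), and move/16 (16-bit fields). The sketch below picks the narrowest encoding that can address both registers. The function names are hypothetical, and this is only one way to repair a move instruction; modify.ml may handle the case differently, and instructions without a wider variant need other treatment.

```ocaml
(* The three move encodings named above. *)
type move = Move | Move_from16 | Move_16

(* Can register [r] be addressed by a field that is [bits] wide? *)
let fits bits r = r >= 0 && r < (1 lsl bits)

(* Pick the narrowest encoding that can address both registers. *)
let select_move ~dst ~src =
  if fits 4 dst && fits 4 src then Some Move
  else if fits 8 dst && fits 16 src then Some Move_from16
  else if fits 16 dst && fits 16 src then Some Move_16
  else None
```

With the example from the text, `select_move ~dst:14 ~src:3` can use the compact `Move`, but after shifting by 6 the same instruction becomes `select_move ~dst:20 ~src:9`, which no longer fits in 4-bit fields and needs `Move_from16`.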
Dumping the internal dex data structure back to a .dex file happens in src/dump.ml. This code relies heavily on the assumptions that Redexer makes about its internal structure, and is able to generate output that satisfies the (very picky) dex verifier.