Bytecode Verification

Code in modern days is mobile in the sense that it gets downloaded and executed on the fly very often. In JVM they are usually shipped with:

Platform independent code
Organized into class files.

Class Loaders

The class loader is the part of the JVM that takes the bytecode and interprets all the class files to make the program runnable.

Security Considerations

Requirements for Security

Mobile code cannot be trusted
Code may not be type safe
Code may destroy or modify data
Code may expose personal information
Code may crash the underlying VM
Code may purposefully degrade performance (denial ofservice)

This is somewhat similar to the trust relations we explored in Notions of Security. Anyways, since the code is not trusted, we need a way to prove some of its properties.

Some properties that are easy to check are:

Proper execution requires that
Each instruction is type correct
Only initialized variables are read
No stack over- or underflow occurs

This can be done by verification of the types and runtime checking.

Sandboxed Execution Environment

The program only accesses the underlying OS through some API that have access control implemented. In this way we have a limitation of the damage that the system can have.

Sandbox- Applets get access tosystem resources onlythrough an API- Access control can beimplemented- Security relies on- Type safety- Code does not by-passsandbox

Summary of the JVM

The Java Virtual Machine is stack-based similar to the original assembly programs you have studied at starting of your bachelor.

Most operations popoperands from a stack and push a result.
But here, we also have Registers that store method parametersand local variables
Instructions are typed
Load and store instructions access registers
Control is handled by intra-method jumps(goto, conditional branches)

Type Verification

We can use type inference or type checking, at the end we still need some runtime checking for covariant or contravariant types.

Why Use Type-Inference?

We can simulate the execution of the code, just using the types instead of values (for some sort of abstract resolution). Each operation has some transition rules. We can follow these transition rules and check if there will be still a transition at the end or not, in this way if there is no transition, I already know that there is an error.

Types of the Engine

In the JVM inference engine we have the first layer of primitive types, then composite (Objects and arrays) and null for null reference.

graph BT
  subgraph "Type Hierarchy"
    int --> T
    Object --> T
    float --> T
    intArr["int[]"] --> Object
    floatArr["float[]"] --> Object
    C --> Object
    objArr["Object[]"] --> Object
    cArr["C[]"] --> objArr
    null --> intArr
    null --> floatArr
    null --> C
    null --> cArr
  end

Where T is the top type for uninitialized registers and C be used for Class, Collection, or a generic Type

Some Rules

MS stands for maximum stack size. MR is a maximum number of local register variable And dot stands for concatenation.

$iconst\ n:( S,R ) \rightarrow ( int.S,R ), \text{ if } \lvert S \rvert < M_S$

$iload\ n:( S,R ) \rightarrow ( int.S,R ),\text{ if } 0 \leq n < M_R \land R( n ) = int \land \lvert S \rvert < M_S$

$astore\ n:( t.S,R ) \rightarrow ( S,R\{ n \leftarrow t \} ),\text{ if } 0 \leq n < M_R \land t <: Object$, we use object so that we have a reference (ints or floats would fail).

$invokevirtual\ C.m.\sigma:( t'_n...t'_1.t'.S,R ) \rightarrow ( r.S,R ), \text{ if }\sigma = r ( t_1,...,t_n ) \land t' <: C \land t'_i <: t_i$. C stands for class name, m for method name and $\sigma$ the method signature. The stack needs to contain the parameters in reverse order, the end stack needs the result. The parameters of the function need to be subtypes of the signture, while the receiver subtype of the class $C$.

Bytecode Verification-20251030103644263 — Example from the Course Slides

Smallest Common Supertype

Jumps in the control flow can lead to joins. When you have a top object in the stack with different types, the smallest common supertype is used now. If we have multiple subtyiping, the result would not be well-defined, for this reason we:

Ignore interfaces (they are all objects)
Then in Java you can only subclass once, and we have resolved this problem.

But we need runtime checks for invokeinterface

Type Inference Algorithm

$in( 0 ) := ( [ \ ] , [ P_0,...,P_n,T,...,T ] )$ $worklist := \{ i \mid \text{instr}_i \text{ is an instruction of the method} \}$ $\mathbf{while}\ worklist \neq \emptyset\ \mathbf{do}$ $\quad i := \min( worklist )$ $\quad \text{remove } i \text{ from } worklist$ $\quad out( i ) := apply\_rule( \text{instr}_i, in( i ) )$ $\quad \mathbf{foreach}\ q \text{ in } successors( i )\ \mathbf{do}$ $\quad \quad in( q ) := pointwise\_scs( in( q ), out( i ) )$ $\quad \quad \mathbf{if}\ in( q ) \text{ has changed}\ \mathbf{then}\ worklist := worklist \cup \{ q \}$ $\quad \mathbf{end}$ $\mathbf{end}$

Remember q is the following instruction (could be more than one).

There is very little information that is needed to make this algorithm work, but it may be slow.

Pointwise SCS

$pointwise\_scs\left( \left( [ s_1.....s_k ] , [ t_0,...,t_n ] \right) ,\left( [ s'_1.....s'_k ] , [ t'_0,...,t'_n ] \right) \right) =$ $\quad\left( \left[ scs( s_1,s'_1 ).....scs( s_k,s'_k ) \right] , \left[ scs( t_0,t'_0 ),...,scs( t_n,t'_n ) \right] \right)$

pointwise_scs is undefined for stacks of different heights Bytecode verification results in an error if called with this.

Type Checking Algorithm

We can avoid SCS and fixed point computation by marking what is the expected type at the beginning and at the end of some specific code block. This means we need to extend the class code with more type information.

$\mathbf{foreach}\ \text{basic block of a method body}\ \mathbf{do}$ $\quad \mathbf{foreach}\ \{ i \mid \text{instr}_i \text{ is an instruction of basic block} \}\ \mathbf{do}$ $\quad \quad \mathbf{if}\ types( i ) \text{ is declared}\ \mathbf{then}\ in := types( i )$ $\quad \quad in := apply\_rule( \text{instr}_i, in )$ $\quad \quad \mathbf{foreach}\ q \text{ in } successors( i )\ \mathbf{do}$ $\quad \quad \quad \mathbf{if}\ types( q ) \text{ is declared}\ \mathbf{then}$ $\quad \quad \quad \quad \text{check that } in \text{ is assignable to } types( q )$ $\quad \quad \quad \mathbf{end}$ $\quad \quad \mathbf{end}$ $\quad \mathbf{end}$ $\mathbf{end}$

Bytecode Verification-20251030111912553 — Example of a type checking run, slide from the Course Slides

Bytecode Verifiers are General

There are some codes that are not verifiable in source code but completely fine in bytecode. For example dynamic assignment codes like this:

x = 5
if (...) {
	x = "hello"
	print(x)
	x = 7;
}
x = x * 2;

Since the bytecode knows the type after every step, the SCS after the block is still int, and that operation can be verified. Usually this is usable for the compiler (rediscovered using fixed point? No, type checking is far more efficient if you already have that information). The important thing to remember is that we are computing the types at the start of code blocks!

Type Inference for Source Programs

We can also apply type-inference at the start, at the source, not only at the bytecode level. If you have experience in programming, this is the standard version that you probably might know. Type inference can be done at this level to levy some typing burdens to the programmer. You can also use type annotations as in python to help the inference.

Type inference is only for local variables, fields and other kinds need still to be written (if not this would be not modular (e.g. you would need to know all overrides as well, and the types would change)).

Inference of method signatures generally requires knowledge of all implementations
Inference of field types generally requires knowledge of all accesses to the field
Inference of these types is non-modular
- Or based on speculation

Class Loaders#

Security Considerations#

Requirements for Security#

Sandboxed Execution Environment#

Summary of the JVM#

Type Verification#

Why Use Type-Inference?#

Types of the Engine#

Some Rules#

Smallest Common Supertype#

Type Inference Algorithm#

Pointwise SCS#

Type Checking Algorithm#

Bytecode Verifiers are General#

Type Inference for Source Programs#