A JVM in Rust part 2 - The class files format

Published Sunday, Jul 23, 2023 - 1627 words, 8 minutes

This post is part of the Writing a JVM in Rust series.

In this post, I will discuss the .class file format. I originally planned to also discuss how I implemented it in my JVM written in Rust, rjvm, but the post turned out a bit too long, so I decided to split it into two parts.

A primer on class files

The JVM is a virtual machine that executes Java bytecode, stored in .class files. Java is not the only language that can emit JVM bytecode though - the most famous alternatives are Kotlin, Scala, and Groovy. The bytecode format is independent of the language you have used to write your code.

The Java Virtual Machine specification

The documentation I have followed is the official Java Virtual Machine Specification, Java SE 7 Edition, which describes version 7 of the .class file format. It is very well written and accessible.

If you are interested in newer versions, for example, Java 20, you will notice that the class format isn’t that different. Newer versions add things like information about the Java module system, or the structures that make the underlying mechanism for lambdas work, but since I was not going to implement those in my JVM, I have opted for parsing an older version of the files.

The class files format

So, how is a .class file structured? First and foremost - there is one class per file. If you have, in your Java code, a nested class such as:

package com.andreabergia;

class Outer {
    class Inner {}
}

you will end up with two files: one named Outer.class and one named Outer$Inner.class. The internal name of the nested class includes the dollar sign, a character that is legal but discouraged in identifiers written in Java source code, and perfectly ordinary at the JVM level.

Note also that the internal name of each class includes the package and uses slashes as separators, rather than dots; therefore, the real name of the class would be something like com/andreabergia/Outer. I have no idea why the slashes are used - I imagine it is just a historical artifact related to a filesystem implementation, before .jar files became common.
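To make the naming rule concrete, here is a minimal sketch of the conversion from the dotted name to the internal form. The function names to_internal_name and class_file_path are made up for this illustration and are not taken from rjvm.

```rust
/// Converts a class name written with dots between packages into the
/// internal form used inside class files, which uses slashes instead.
/// The `$` that separates a nested class from its outer class is kept as-is.
/// (Hypothetical helper, for illustration only.)
fn to_internal_name(binary_name: &str) -> String {
    binary_name.replace('.', "/")
}

/// Path of the class file relative to a classpath root.
/// (Hypothetical helper, for illustration only.)
fn class_file_path(binary_name: &str) -> String {
    format!("{}.class", to_internal_name(binary_name))
}
```

For the nested class above, to_internal_name("com.andreabergia.Outer$Inner") gives com/andreabergia/Outer$Inner.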

A class file will always be stored in big-endian and will follow this structure:

ClassFile {
    u4             magic;
    u2             minor_version;
    u2             major_version;
    u2             constant_pool_count;
    cp_info        constant_pool[constant_pool_count-1];
    u2             access_flags;
    u2             this_class;
    u2             super_class;
    u2             interfaces_count;
    u2             interfaces[interfaces_count];
    u2             fields_count;
    field_info     fields[fields_count];
    u2             methods_count;
    method_info    methods[methods_count];
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

Thus, a class file will always start with a u4 (four bytes) representing a magic number: 0xCAFEBABE. This is a sort of joke since Java is also the most populous island in the world and a somewhat common name for coffee grown there.

It is then followed by the version numbers, in the form “major/minor” - which is honestly a legacy concept, given that every version of Java for a couple of decades has come with an increment of the major version and a zero for the minor. 😊 For example, Java 7 uses version 51.0, Java 8 uses 52.0 and Java 17 uses 61.0.
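As a rough sketch of how the first eight bytes could be read, the snippet below checks the magic number and extracts the two version fields, all in big-endian. The names ClassVersion, parse_header and java_release are assumptions made for this example, not rjvm's API. The mapping from major version to Java release (major minus 44) only holds from Java 5 onward, so the helper returns None for older values.

```rust
/// The version stored right after the magic number.
#[derive(Debug, PartialEq)]
struct ClassVersion {
    major: u16,
    minor: u16,
}

/// Reads a big-endian u2 at `offset`, or None if the input is too short.
fn read_u16(bytes: &[u8], offset: usize) -> Option<u16> {
    let b = bytes.get(offset..offset + 2)?;
    Some(u16::from_be_bytes([b[0], b[1]]))
}

/// Checks the magic number and reads the version.
/// Returns None if the input is too short or does not start with 0xCAFEBABE.
fn parse_header(bytes: &[u8]) -> Option<ClassVersion> {
    let m = bytes.get(0..4)?;
    if u32::from_be_bytes([m[0], m[1], m[2], m[3]]) != 0xCAFEBABE {
        return None;
    }
    // Note the order on disk: the minor version comes before the major one.
    let minor = read_u16(bytes, 4)?;
    let major = read_u16(bytes, 6)?;
    Some(ClassVersion { major, minor })
}

/// Maps a major version to the Java release that introduced it.
/// Only defined here for Java 5 (major 49) and later.
fn java_release(major: u16) -> Option<u16> {
    if major >= 49 {
        Some(major - 44)
    } else {
        None
    }
}
```

Feeding it the bytes CA FE BA BE 00 00 00 33 yields major 51 and minor 0, which java_release maps to Java 7.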

The constant pool

The next part of a class file is the constant pool: this is a section that includes the constants used by the code. It is used to encode all the strings referenced by the class, including the class, field, and method names, their types and signatures, the name of any referenced class and method, and any strings in the code. Furthermore, numeric constants that are too large to be embedded directly in a bytecode instruction are stored here as well and referenced by the bytecode.

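For example, a line of Java such as:

int x = 424242;

will be implemented via a constant valued 424242 stored in a constant slot, say at position 8, and referred to in the code by an instruction such as ldc 8, as we will see in a future post when we discuss the bytecode instructions. Smaller values, such as 42 or 4242, are typically embedded directly in the instruction (bipush or sipush) and need no constant pool entry.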

In a class file, each constant is preceded by a byte identifying its type. For example, integer constants have a type of 3 and are stored as follows:

CONSTANT_Integer_info {
    u1 tag;     // = 3
    u4 bytes;
}

The constant pool has some strange idiosyncrasies for (I imagine) historical reasons. For example, its entries are indexed starting with one and not zero.

Another complication is the concept of references, for example, the name of the class itself. This is encoded in two steps: first, the Java compiler will create a constant of type CONSTANT_Utf8_info (which actually uses a slight variation on the real UTF-8 encoding, called modified UTF-8 and closely related to CESU-8). Then, there will be another constant of type CONSTANT_Class_info, which refers to the utf8 constant.

The final strange thing is how long and double constants are stored: they take two entries in the constant table. That is, if the constant in position 7 is of type CONSTANT_Long_info, then no constant will be stored at position 8! Here is one of the many places where you can see that the JVM was originally designed for 32-bit CPUs - we will see more in the future. 😏 I am going to quote the official spec on this topic:

In retrospect, making 8-byte constants take two constant pool entries was a poor choice.

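Constants are indexed by two bytes, and constant_pool_count itself is a u2 with a maximum value of 65,535. Since index 0 is never used and valid indexes run from 1 to constant_pool_count - 1, a class can have at most 65,534 constant pool slots.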
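The quirks above (one-based indexes, the two-step class name reference, and the extra slot taken by long constants) can be sketched as follows. This is a deliberately partial illustration under stated assumptions: it only understands tags 1 (Utf8), 3 (Integer), 5 (Long) and 7 (Class), it decodes Utf8 entries as standard UTF-8 rather than modified UTF-8, and the names Constant, parse_constant_pool and class_name are invented for the example.

```rust
/// A small subset of the constant pool entry types.
#[derive(Debug, PartialEq)]
enum Constant {
    /// Placeholder for index 0 and for the slot following a long.
    Unusable,
    Utf8(String),
    Integer(i32),
    Long(i64),
    /// Holds the index of the Utf8 entry that stores the class name.
    Class(u16),
}

/// Parses a constant pool, starting at the constant_pool_count field.
/// Returns None on truncated input or on any tag not handled by this sketch.
fn parse_constant_pool(bytes: &[u8]) -> Option<Vec<Constant>> {
    let count = u16::from_be_bytes([*bytes.get(0)?, *bytes.get(1)?]) as usize;
    let mut pos = 2;
    // Index 0 is never used, so the vector starts with a placeholder
    // and entry N of the class file ends up at pool[N].
    let mut pool = vec![Constant::Unusable];
    while pool.len() < count {
        let tag = *bytes.get(pos)?;
        pos += 1;
        match tag {
            1 => {
                let b = bytes.get(pos..pos + 2)?;
                let len = u16::from_be_bytes([b[0], b[1]]) as usize;
                let text = bytes.get(pos + 2..pos + 2 + len)?;
                // Simplification: standard UTF-8 decoding, which differs from
                // modified UTF-8 for NUL and supplementary characters.
                pool.push(Constant::Utf8(String::from_utf8(text.to_vec()).ok()?));
                pos += 2 + len;
            }
            3 => {
                let b = bytes.get(pos..pos + 4)?;
                pool.push(Constant::Integer(i32::from_be_bytes([b[0], b[1], b[2], b[3]])));
                pos += 4;
            }
            5 => {
                let mut raw = [0u8; 8];
                raw.copy_from_slice(bytes.get(pos..pos + 8)?);
                pool.push(Constant::Long(i64::from_be_bytes(raw)));
                // A long takes two entries: the next index stays unusable.
                pool.push(Constant::Unusable);
                pos += 8;
            }
            7 => {
                let b = bytes.get(pos..pos + 2)?;
                pool.push(Constant::Class(u16::from_be_bytes([b[0], b[1]])));
                pos += 2;
            }
            _ => return None,
        }
    }
    Some(pool)
}

/// Follows a Class entry to the Utf8 entry that stores its name.
fn class_name(pool: &[Constant], index: u16) -> Option<&str> {
    match pool.get(index as usize)? {
        Constant::Class(name_index) => match pool.get(*name_index as usize)? {
            Constant::Utf8(name) => Some(name.as_str()),
            _ => None,
        },
        _ => None,
    }
}
```

Keeping a placeholder at index 0 and after each long is just one possible design; it wastes a little memory but lets the indexes found in the class file be used directly, without any offset arithmetic.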

Flags, superclass, interfaces

The next entry in the class file is a bit field representing the various flags of the class: these include public, final, abstract, and so on.
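As an illustration of how such a bit field can be decoded, here is a small sketch. The bit values are the class-level flags listed in the JVM specification; the names CLASS_FLAGS and class_flag_names are assumptions made for this example.

```rust
/// Class-level flag bits and a readable name for each of them.
const CLASS_FLAGS: [(u16, &str); 8] = [
    (0x0001, "public"),
    (0x0010, "final"),
    (0x0020, "super"),
    (0x0200, "interface"),
    (0x0400, "abstract"),
    (0x1000, "synthetic"),
    (0x2000, "annotation"),
    (0x4000, "enum"),
];

/// Returns the names of all the flags that are set in `access_flags`,
/// in the order of the table above.
fn class_flag_names(access_flags: u16) -> Vec<&'static str> {
    let mut names = Vec::new();
    for &(bit, name) in CLASS_FLAGS.iter() {
        if access_flags & bit != 0 {
            names.push(name);
        }
    }
    names
}
```

A typical public class compiled by a modern javac has the value 0x0021, that is, public plus the legacy super flag.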

Afterward, a reference to a constant of type CONSTANT_Class_info follows, representing the name of the class itself. It is followed by the name of the superclass, which can be 0 for java/lang/Object, the only class in Java that has no superclasses.

The next section contains all the interfaces implemented by the class. The simple pattern we have seen for the constant pool repeats: first the length is stored, then all the entries.

Fields

The fields are the following entry in a class file. Each field has the following structure:

field_info {
    u2             access_flags;
    u2             name_index;
    u2             descriptor_index;
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

These represent, in order:

a bit field representing the flags (final, private, static, …);
a reference to a constant with the field name;
a reference to another constant with the type descriptor;
the field’s attributes.

The type descriptor for a field represents, well, its type. For compactness, it is not stored in the same form as in the Java code, but in a shorter one. For example, int becomes I, long becomes J, while a field of type String becomes Ljava/lang/String; and an array of double becomes [D. Check out the JVM specs for all the details.
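A descriptor like these can be decoded with a small recursive function. The following is a minimal sketch, where FieldType and parse_field_type are names chosen for this example rather than rjvm's actual types. It returns the unparsed remainder too, so that it could be reused for strings that contain several descriptors in a row.

```rust
/// A field type, decoded from its descriptor.
#[derive(Debug, PartialEq)]
enum FieldType {
    Byte,
    Char,
    Double,
    Float,
    Int,
    Long,
    Short,
    Boolean,
    /// A class type, holding the internal name (for example java/lang/String).
    Object(String),
    /// An array, holding the type of its elements.
    Array(Box<FieldType>),
}

/// Parses one descriptor from the front of `input` and returns the type
/// together with whatever text follows it. Returns None on invalid input.
fn parse_field_type(input: &str) -> Option<(FieldType, &str)> {
    let mut chars = input.chars();
    let first = chars.next()?;
    let rest = chars.as_str();
    let primitive = match first {
        'B' => FieldType::Byte,
        'C' => FieldType::Char,
        'D' => FieldType::Double,
        'F' => FieldType::Float,
        'I' => FieldType::Int,
        'J' => FieldType::Long,
        'S' => FieldType::Short,
        'Z' => FieldType::Boolean,
        'L' => {
            // Class types run up to the terminating semicolon.
            let end = rest.find(';')?;
            let name = rest[..end].to_string();
            return Some((FieldType::Object(name), &rest[end + 1..]));
        }
        '[' => {
            // Each leading bracket adds one array dimension.
            let (element, remainder) = parse_field_type(rest)?;
            return Some((FieldType::Array(Box::new(element)), remainder));
        }
        _ => return None,
    };
    Some((primitive, rest))
}
```

With this sketch, [D parses to an Array wrapping Double, and Ljava/lang/String; parses to an Object holding java/lang/String.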

Attributes

Attributes are a generic mechanism used in class files to attach various sorts of data to fields. The same mechanism is used also for methods and for the class itself. Examples of attributes include:

annotations;
values of constant fields;
the code of a method;
the list of exceptions thrown by a method;
the exception table of a method’s code, used to implement try/catch;
the source file name for a class.
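What makes the mechanism generic is that every attribute starts with the same header: a u2 index of the constant holding its name, followed by a u4 length of the payload. A parser can therefore skip over any attribute it does not understand. Here is a minimal sketch of that idea; RawAttribute and parse_attributes are hypothetical names used only for this example.

```rust
/// One attribute as stored in the file: the index of its name in the
/// constant pool, plus a payload that is not interpreted here.
#[derive(Debug, PartialEq)]
struct RawAttribute<'a> {
    name_index: u16,
    info: &'a [u8],
}

/// Reads attributes_count followed by that many attributes.
/// Returns the attributes and the bytes that follow them,
/// or None if the input is truncated.
fn parse_attributes<'a>(bytes: &'a [u8]) -> Option<(Vec<RawAttribute<'a>>, &'a [u8])> {
    let count = u16::from_be_bytes([*bytes.get(0)?, *bytes.get(1)?]);
    let mut rest = bytes.get(2..)?;
    let mut attributes = Vec::new();
    for _ in 0..count {
        let header = rest.get(0..6)?;
        let name_index = u16::from_be_bytes([header[0], header[1]]);
        let length =
            u32::from_be_bytes([header[2], header[3], header[4], header[5]]) as usize;
        let end = 6usize.checked_add(length)?;
        // The payload is kept as raw bytes: unknown attributes are
        // stepped over without looking inside them.
        let info = rest.get(6..end)?;
        attributes.push(RawAttribute { name_index, info });
        rest = &rest[end..];
    }
    Some((attributes, rest))
}
```

Interpreting a payload then means looking up name_index in the constant pool and dispatching on the resulting string, such as Code or SourceFile.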

Most new versions of the JVM have extended the constant types and the set of valid attributes. For example, in Java 17 a new attribute has been added to implement sealed classes.

Methods

After fields, we get to the methods, which have a very similar format:

method_info {
    u2             access_flags;
    u2             name_index;
    u2             descriptor_index;
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

The type descriptor for methods is built upon the fields’ descriptors and has the form (<parameter 1 descriptor><parameter 2 descriptor>...)<return type>, where the parameter descriptors are concatenated without any separator and V is used as the return type of void methods. For example:

// descriptor: (I)J
long method(int a)

// descriptor: (FI)V
void method(float a, int b)

// descriptor: (Ljava/lang/String;I)Ljava/lang/String;
String method(String a, int b)
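Since the parameter descriptors are simply concatenated, splitting them requires walking the string one type at a time. The sketch below does that and returns the pieces as string slices; split_method_descriptor is a name made up for this example, and the return type is passed through without being validated.

```rust
/// Splits a method descriptor into the descriptors of its parameters
/// and the descriptor of its return type.
/// Returns None if the parameter list is malformed.
fn split_method_descriptor(descriptor: &str) -> Option<(Vec<&str>, &str)> {
    if !descriptor.starts_with('(') {
        return None;
    }
    let close = descriptor.find(')')?;
    let mut params = &descriptor[1..close];
    let return_type = &descriptor[close + 1..];
    if return_type.is_empty() {
        return None;
    }
    let mut result = Vec::new();
    while !params.is_empty() {
        // Array dimensions come first, as leading '[' characters.
        let dims = params.bytes().take_while(|&b| b == b'[').count();
        let len = match *params.as_bytes().get(dims)? {
            // Class types run up to and including the semicolon.
            b'L' => dims + params[dims..].find(';')? + 1,
            // Primitive types are a single character.
            b'B' | b'C' | b'D' | b'F' | b'I' | b'J' | b'S' | b'Z' => dims + 1,
            _ => return None,
        };
        result.push(&params[..len]);
        params = &params[len..];
    }
    Some((result, return_type))
}
```

Applied to the last example above, it yields the parameters Ljava/lang/String; and I, with Ljava/lang/String; as the return type.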

Method’s code

A method, unless it is native or abstract, must always have an attribute of type Code, which is unusual because it is the only attribute that has its own attributes (that is, until records were introduced). This is its format:

Code_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u2 max_stack;
    u2 max_locals;
    u4 code_length;
    u1 code[code_length];
    u2 exception_table_length;
    {   u2 start_pc;
        u2 end_pc;
        u2 handler_pc;
        u2 catch_type;
    } exception_table[exception_table_length];
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}

A few interesting things:

the actual bytecode is stored in the code array, preceded by its length;
the maximum depth of the operand stack (the stack of values that the instructions work on) at any time during the method’s execution will be stored in the class file - therefore, the JVM can allocate a stack with the correct maximum size once and avoid resizing it while executing the method;
the same is true for the local variables table;
the exception table is used to implement any catch block. We will discuss this in a future post.
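The layout above can be read field by field. The following sketch parses the payload of a Code attribute, that is, the bytes that come after attribute_name_index and attribute_length. The names Code, ExceptionTableEntry and parse_code are assumptions for this example, and the nested attributes at the end are simply ignored.

```rust
/// One row of the exception table.
#[derive(Debug, PartialEq)]
struct ExceptionTableEntry {
    start_pc: u16,
    end_pc: u16,
    handler_pc: u16,
    catch_type: u16,
}

/// The parts of a Code attribute that this sketch extracts.
#[derive(Debug, PartialEq)]
struct Code<'a> {
    max_stack: u16,
    max_locals: u16,
    code: &'a [u8],
    exception_table: Vec<ExceptionTableEntry>,
}

/// Reads a big-endian u2 at `pos`, or None if the input is too short.
fn u16_at(bytes: &[u8], pos: usize) -> Option<u16> {
    let b = bytes.get(pos..pos + 2)?;
    Some(u16::from_be_bytes([b[0], b[1]]))
}

/// Parses the payload of a Code attribute, starting at max_stack.
fn parse_code<'a>(info: &'a [u8]) -> Option<Code<'a>> {
    let max_stack = u16_at(info, 0)?;
    let max_locals = u16_at(info, 2)?;
    let b = info.get(4..8)?;
    let code_length = u32::from_be_bytes([b[0], b[1], b[2], b[3]]) as usize;
    let code_end = 8usize.checked_add(code_length)?;
    let code = info.get(8..code_end)?;
    let mut pos = code_end;
    let table_length = u16_at(info, pos)?;
    pos += 2;
    let mut exception_table = Vec::new();
    for _ in 0..table_length {
        exception_table.push(ExceptionTableEntry {
            start_pc: u16_at(info, pos)?,
            end_pc: u16_at(info, pos + 2)?,
            handler_pc: u16_at(info, pos + 4)?,
            catch_type: u16_at(info, pos + 6)?,
        });
        pos += 8;
    }
    // The nested attributes (such as LineNumberTable) would follow here;
    // this sketch does not read them.
    Some(Code { max_stack, max_locals, code, exception_table })
}
```

An interpreter could use max_stack and max_locals from the result to size the frame of the method before executing the first instruction in code.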

An interesting attribute of the code is the LineNumberTable, which is used to map ranges of bytecode instructions to line numbers in the source code. This is useful to implement debuggers, but also to include the line numbers in a stack trace when an exception is generated.
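Each entry of the LineNumberTable is a pair of a start_pc and a line number, meaning that the given source line begins at that bytecode offset. Finding the line for an instruction therefore means picking the entry with the greatest start_pc that does not exceed the instruction’s offset. Here is a minimal sketch of that lookup, assuming the table has already been decoded into (start_pc, line_number) pairs; line_for_pc is a hypothetical name.

```rust
/// Finds the source line for the instruction at offset `pc`.
/// `table` holds (start_pc, line_number) pairs, in any order.
/// Returns None if no entry starts at or before `pc`.
fn line_for_pc(table: &[(u16, u16)], pc: u16) -> Option<u16> {
    table
        .iter()
        .filter(|&&(start_pc, _)| start_pc <= pc)
        .max_by_key(|&&(start_pc, _)| start_pc)
        .map(|&(_, line)| line)
}
```

For a table with the entries (0, 10), (4, 11) and (9, 13), an instruction at offset 5 belongs to line 11.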

Class attributes

The last entry in the class file is the class attributes. Some of the most interesting of these are Signature, used for generic classes, and BootstrapMethods, which is used to implement the invokedynamic instruction - part of the lambda infrastructure.

Conclusions

The format for .class files has some beautiful ideas, like the attributes mechanism, which allowed it to evolve without requiring big changes in its structure even when huge features such as records or modules have been added to the language. However, it also carries some historical baggage, in particular storing long and double constants in two entries, and the weird modified UTF-8 encoding.

It is pretty easy to parse though, and we will discuss in the next blog post of this series how rjvm does it. Thanks for reading!