A JVM in Rust part 3 - Parsing class files

Published Sunday, Jul 30, 2023 - 2164 words, 11 minutes

Tagged:

This post is part of the Writing a JVM in Rust series.

In this post, I will discuss how rjvm parses .class files. The code I will discuss today is contained in the reader crate.

A warning before you read: this is the earliest part of the project and, since I have written this project to learn Rust, it is also the one that contains the least idiomatic code. Don’t take this as an example of the best Rust ever written! 😅

Modeling class files Link to heading

Broadly speaking, when parsing data, you can follow one of two approaches:

model your structures as necessary by your logic, regardless of its input form, and deal with the complexity of the transformation during the parsing phase;
or model your data structures to follow very closely the input structure, so that the parsing is simpler.

For many applications, the first strategy is the best choice, but I have opted for the second approach in rjvm and I think I made the right call. While some of the complexity of the file format ended up exposed in the VM code that executes the methods, the JVM specs for the bytecode instructions refer directly to concepts such as the constant pool, and thus my code ended up following quite closely the specification.

The following is the main struct that models a .class file:

/// Represents the content of a .class file.
#[derive(Debug, Default)]
pub struct ClassFile {
    pub version: ClassFileVersion,
    pub constants: ConstantPool,
    pub flags: ClassAccessFlags,
    pub name: String,
    pub superclass: Option<String>,
    pub interfaces: Vec<String>,
    pub fields: Vec<ClassFileField>,
    pub methods: Vec<ClassFileMethod>,
    pub deprecated: bool,
    pub source_file: Option<String>,
}

As you can see, it follows quite closely the actual format that we have discussed in the previous part. The deprecated and source_file fields are taken from the class attributes.

The ClassFileVersion enum is quite simple:

#[derive(Debug, PartialEq, Default, strum_macros::Display)]
#[allow(dead_code)]
pub enum ClassFileVersion {
    // ...
    Jdk7,
    #[default]
    Jdk8,
    Jdk9,
    // ...

To get an implementation of std::fmt::Display I have used the handy crate strum, which is an example of my favorite kinds of libraries: it does one thing, and it does it well.

I am going to skip for a second the constant pool, as it is a bit more complex than the rest. Next up are the class flags, modeled via ClassAccessFlags:

bitflags! {
    /// Class flags
    pub struct ClassAccessFlags: u16 {
        const PUBLIC = 0x0001;
        const FINAL = 0x0010;
        const SUPER = 0x0020;
        const INTERFACE = 0x0200;
        const ABSTRACT = 0x0400;
        const SYNTHETIC = 0x1000;
        const ANNOTATION = 0x2000;
        const ENUM = 0x4000;
    }
}

Given that the flags are modeled in the JVM spec as a bit mask, I have used the great bitflags crate to avoid having to do bit twiddling by hand.

The class name is a string. Here I did not follow my own advice and I used a raw string. I should have wrapped this in a newtype such as struct ClassName(String).

The superclass name is implemented via an Option<String>, as there is one class that does not have a superclass - java.lang.Object.

The implemented interfaces are simply a Vec<String>.

The constant pool Link to heading

As we discussed last time, the constant pool is by far the most complex part of the .class file format. In the code, I have modeled each constant type explicitly:

/// Types of a constant in the constant pool of a class, following the JVM spec:
/// https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4
#[derive(Debug, PartialEq)]
pub enum ConstantPoolEntry {
    Utf8(String),
    Integer(i32),
    Float(f32),
    Long(i64),
    Double(f64),
    ClassReference(u16),
    StringReference(u16),
    FieldReference(u16, u16),
    MethodReference(u16, u16),
    InterfaceMethodReference(u16, u16),
    NameAndTypeDescriptor(u16, u16),
}

The constant pool is not simply a vector of these, but has a little trick:

/// Implementation of the constant pool of a java class.
/// Note that constants are 1-based in java.
#[derive(Debug, Default)]
pub struct ConstantPool {
    entries: Vec<ConstantPoolPhysicalEntry>,
}

/// Constants in the pool generally take one slot, but long and double take two. We do not use
/// the second one, so we have a tombstone to ensure the indexes match.
#[derive(Debug)]
enum ConstantPoolPhysicalEntry {
    Entry(ConstantPoolEntry),
    MultiByteEntryTombstone(),
}

Recall that a long or double constant takes two slots in the constant pool, meaning that if it is stored at index i, then the index i + 1 is invalid. I have modeled this explicitly via the ConstantPoolPhysicalEntry - normal constants are wrapped in an Entry and take one slot in the vector, but a long or double constant takes two entries: one Entry followed by one MultiByteEntryTombstone. The client API hides this complexity though and is expressed in terms of entries and indexes:

impl ConstantPool {
    pub fn add(
      &mut self, entry: ConstantPoolEntry
    )

    pub fn get(
        &self, input_index: u16,
    ) -> Result<&ConstantPoolEntry, InvalidConstantPoolIndexError>
}

Fields Link to heading

Fields are implemented as follows:

/// Models a field in a class
#[derive(Debug, PartialEq)]
pub struct ClassFileField {
    pub flags: FieldFlags,
    pub name: String,
    pub type_descriptor: FieldType,
    /// Fields which model a constant (final) will have an attribute specifying the value
    pub constant_value: Option<FieldConstantValue>,
    pub deprecated: bool,
}

FieldFlags is yet another bitset, so I won’t show it here. FieldType is implemented as follows:

/// Models the type of one field, or one parameter of a method
#[derive(Debug, Clone, PartialEq)]
pub enum FieldType {
    /// Primitive types
    Base(BaseType),

    /// Standard object
    Object(String),

    /// Array
    Array(Box<FieldType>),
}

/// Possible primitive types
#[derive(Debug, Clone, PartialEq, strum_macros::Display)]
#[repr(u8)]
pub enum BaseType {
    Byte,
    Char,
    Double,
    Float,
    Int,
    Long,
    Short,
    Boolean,
}

Quite simple, and yet another place where I love having sum types in Rust. Note that FieldType is recursive for arrays, and therefore I have used a Box to implement it. There is a rather simple method FieldType::parse that can parse the JVM field type descriptor.

Methods Link to heading

Methods are a bit more complex than fields:

/// Models a method in a class
#[derive(Debug, PartialEq)]
pub struct ClassFileMethod {
    pub flags: MethodFlags,
    pub name: String,
    /// The type descriptor in the internal JVM form, i.e. something like (L)I in the unparsed form
    pub type_descriptor: String,
    /// Parsed form of the method descriptor
    pub parsed_type_descriptor: MethodDescriptor,
    /// Generic attributes of the method
    // TODO: replace with some proper struct
    pub attributes: Vec<Attribute>,
    pub code: Option<ClassFileMethodCode>,
    pub deprecated: bool,
    /// List of exceptions in the `throws` clause of the method
    pub thrown_exceptions: Vec<String>,
}

Notice that I have kept both the unparsed and the parsed forms of the method descriptor. This is useful because when invoking a method, the VM will look up the parsed descriptor to check the arguments and return types, but when resolving a method, the unparsed descriptor will be used. We will discuss this in more detail when talking about the execution of methods, in a future post.

The parsed descriptor of a method is modeled as follows:

/// Models the signature of a method, i.e. the type of the parameters it takes and the type
/// of the return value
#[derive(Debug, Default, Clone, PartialEq)]
pub struct MethodDescriptor {
    pub parameters: Vec<FieldType>,
    pub return_type: Option<FieldType>,
}

Notice also that, since I have not modeled explicitly all the attributes that a method can have, I have opted to store a vector with the raw bytes of the attributes. Notice the “TODO” above it to remind me that I should remove it! 😅

Finally, the field code is wrapped in an Option since, for native methods, there is no bytecode!

Method code Link to heading

The code of a method is modeled by this struct:

/// Code of a given method
#[derive(Debug, Default, PartialEq)]
pub struct ClassFileMethodCode {
    /// Maximum depth of the stack at any time
    pub max_stack: u16,
    /// Number of local variables used by the method
    pub max_locals: u16,
    /// Raw bytecode
    pub code: Vec<u8>,
    pub exception_table: ExceptionTable,
    pub line_number_table: Option<LineNumberTable>,

    /// Generic unmapped attributes of the code
    // TODO: replace with some proper struct
    pub attributes: Vec<Attribute>,
}

I will leave the discussion of the exception table to a dedicated blog post since it would be a bit complex to explain before I discuss the JVM bytecode instructions.

Line number table Link to heading

The LineNumberTable is used in the VM to add the line number when creating a stack trace, for example when throwing an exception. Its implementation follows closely the spec:

/// Table that models the relationship between program counters and line numbers
/// in the source code. Entries are sorted by program counter. A table with two entries,
/// the first starting at 0 and the second at 3, means that the first three instructions
/// in the bytecode correspond to line 1 and the rest to line 2.
#[derive(Debug, PartialEq)]
pub struct LineNumberTable {
    entries: Vec<LineNumberTableEntry>,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct LineNumberTableEntry {
    pub program_counter: ProgramCounter,
    pub line_number: LineNumber,
}

/// Models the program counter, i.e. the address
/// of an instruction in the bytecode of a method
#[derive(Debug, PartialEq, Eq, Clone, Copy, PartialOrd, Ord)]
pub struct ProgramCounter(pub u16);

/// Line number in the source code
#[derive(Debug, PartialEq, Eq, Clone, Copy, PartialOrd, Ord)]
pub struct LineNumber(pub u16);

where here I have used some newtype! 😊 LineNumberTable includes a simple method lookup_pc(&self, pc: ProgramCounter) -> LineNumber that is implemented via a binary search.

Parsing classes Link to heading

The parser that rjvm uses is pretty basic and hand-written. In hindsight, I should have used the fantastic nom crate, to which I plan to dedicate a post in the future.

The basic structure I have built is a simple Buffer around a byte slice. I’m sure there are libraries that do this better, but writing this by hand was pretty easy. Here’s how it works:

/// A buffer reader, used to marshall data from a generic byte array
pub struct Buffer<'a> {
    buffer: &'a [u8],
    position: usize,
}

impl<'a> Buffer<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Buffer {
            buffer: data,
            position: 0,
        }
    }

    pub fn read_u8(&mut self) -> Result<u8> {
        self.advance(std::mem::size_of::<u8>())
            .map(|bytes| u8::from_be_bytes(bytes.try_into().unwrap()))
    }

    fn advance(&mut self, size: usize) -> Result<&'a [u8]> {
        if self.position + size > self.buffer.len() {
            Err(BufferError::UnexpectedEndOfData)
        } else {
            let slice = &self.buffer[self.position..self.position + size];
            self.position += size;
            Ok(slice)
        }
    }

    // ...
}

The public API for reading a class is this one:

/// Reads a class from a byte slice.
pub fn read_buffer(buf: &[u8]) -> Result<ClassFile> {
    ClassFileReader::new(buf).read()
}

The structure of the ClassFileReader, which is just an implementation detail and not exposed publically, is as follows:

/// A reader of a byte array representing a class. Supports only a subset of Java 7 class format,
/// in particular it does not support generics.
struct ClassFileReader<'a> {
    buffer: Buffer<'a>,
    /// The class being read, created empty and updated in place
    class_file: ClassFile,
}

/// Reference: https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html
impl<'a> ClassFileReader<'a> {
    fn new(data: &[u8]) -> ClassFileReader {
        ClassFileReader {
            buffer: Buffer::new(data),
            class_file: Default::default(),
        }
    }

    fn read(mut self) -> Result<ClassFile> {
        self.check_magic_number()?;
        self.read_version()?;
        self.read_constants()?;
        self.read_access_flags()?;
        self.class_file.name = self.read_class_reference()?;
        self.class_file.superclass = self.read_class_reference_optional()?;
        self.read_interfaces()?;
        self.read_fields()?;
        self.read_methods()?;
        self.read_class_attributes()?;

        Ok(self.class_file)
    }

where an example method of the parser is:

    fn read_version(&mut self) -> Result<()> {
        let minor_version = self.buffer.read_u16()?;
        let major_version = self.buffer.read_u16()?;

        self.class_file.version = ClassFileVersion::from(major_version, minor_version)?;
        Ok(())
    }

Most of the code follows the same, simple structure. I have used the question mark operator ? everywhere to simply bubble up errors and stop the parsing, which feels very ergonomic and nice to write.

Testing Link to heading

Given that this is a toy project and not production code, I decided to not be too precise with tests: basically, all the error paths are not tested. In any case, I think the best strategy for testing this code would be via fuzzing, but it is something I have not investigated.

While there are some simple unit tests where it made sense, most tests are written as integration tests and I have used some real .class files, generated by the javac compiler. For example, the test suite includes this simple Java class:

public class Complex implements Cloneable, Serializable {
    private final double real;
    private final double imag;

    public Complex(double real) {
        this.real = real;
        this.imag = 0;
    }

    public Complex(double real, double imag) {
        this.real = real;
        this.imag = imag;
    }

    public double getReal() {
        return this.real;
    }

    public double getImag() {
        return this.imag;
    }

    public double abs() {
        return Math.sqrt(this.real * this.real + this.imag * this.imag);
    }
}

and there is a Rust test that checks that this class file can be parsed:

fn can_read_pojo_class_file() {
    let class = read_class_from_bytes(include_bytes!("../resources/rjvm/Complex.class"));
    assert_eq!(ClassFileVersion::Jdk6, class.version);
    assert_eq!(
        ClassAccessFlags::PUBLIC | ClassAccessFlags::SUPER,
        class.flags
    );
    assert_eq!("rjvm/Complex", class.name);
    assert_eq!(
        "java/lang/Object",
        class.superclass.as_ref().expect("a valid superclass")
    );
    // ...
}

This simple strategy has gotten me enough code coverage to give me confidence in the code’s behavior.

Conclusions Link to heading

There you have it - an overview of a simple crate that can parse .class files. In the next post of this series, I plan to cover how the JVM bytecode works.

Thanks for reading, and remember that you can follow this blog via its RSS (Atom) feed.