A JVM in Rust part 3 - Parsing class files
This post is part of the Writing a JVM in Rust series.
In this post, I will discuss how rjvm parses .class
files. The code I will discuss today is contained in the reader
crate.
A warning before you read: this is the earliest part of the project and, since I have written this project to learn Rust, it is also the one that contains the least idiomatic code. Don’t take this as an example of the best Rust ever written! ๐
Modeling class files Link to heading
Broadly speaking, when parsing data, you can follow one of two approaches:
- model your structures as necessary by your logic, regardless of its input form, and deal with the complexity of the transformation during the parsing phase;
- or model your data structures to follow very closely the input structure, so that the parsing is simpler.
For many applications, the first strategy is the best choice, but I have opted for the second approach in rjvm and I think I made the right call. While some of the complexity of the file format ended up exposed in the VM code that executes the methods, the JVM specs for the bytecode instructions refer directly to concepts such as the constant pool, and thus my code ended up following quite closely the specification.
The following is the main struct that models a .class
file:
/// Represents the content of a .class file.
#[derive(Debug, Default)]
pub struct ClassFile {
pub version: ClassFileVersion,
pub constants: ConstantPool,
pub flags: ClassAccessFlags,
pub name: String,
pub superclass: Option<String>,
pub interfaces: Vec<String>,
pub fields: Vec<ClassFileField>,
pub methods: Vec<ClassFileMethod>,
pub deprecated: bool,
pub source_file: Option<String>,
}
As you can see, it follows quite closely the actual format that we have discussed in the previous part. The deprecated
and source_file
fields are taken from the class attributes.
The ClassFileVersion
enum is quite simple:
#[derive(Debug, PartialEq, Default, strum_macros::Display)]
#[allow(dead_code)]
pub enum ClassFileVersion {
// ...
Jdk7,
#[default]
Jdk8,
Jdk9,
// ...
To get an implementation of std::fmt::Display
I have used the handy crate strum, which is an example of my favorite kinds of libraries: it does one thing, and it does it well.
I am going to skip for a second the constant pool, as it is a bit more complex than the rest. Next up are the class flags, modeled via ClassAccessFlags
:
bitflags! {
/// Class flags
pub struct ClassAccessFlags: u16 {
const PUBLIC = 0x0001;
const FINAL = 0x0010;
const SUPER = 0x0020;
const INTERFACE = 0x0200;
const ABSTRACT = 0x0400;
const SYNTHETIC = 0x1000;
const ANNOTATION = 0x2000;
const ENUM = 0x4000;
}
}
Given that the flags are modeled in the JVM spec as a bit mask, I have used the great bitflags crate to avoid having to do bit twiddling by hand.
The class name is a string. Here I did not follow my own advice and I used a raw string. I should have wrapped this in a newtype such as struct ClassName(String)
.
The superclass name is implemented via an Option<String>
, as there is one class that does not have a superclass - java.lang.Object
.
The implemented interfaces are simply a Vec<String>
.
The constant pool Link to heading
As we discussed last time, the constant pool is by far the most complex part of the .class
file format. In the code, I have modeled each constant type explicitly:
/// Types of a constant in the constant pool of a class, following the JVM spec:
/// https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4
#[derive(Debug, PartialEq)]
pub enum ConstantPoolEntry {
Utf8(String),
Integer(i32),
Float(f32),
Long(i64),
Double(f64),
ClassReference(u16),
StringReference(u16),
FieldReference(u16, u16),
MethodReference(u16, u16),
InterfaceMethodReference(u16, u16),
NameAndTypeDescriptor(u16, u16),
}
The constant pool is not simply a vector of these, but has a little trick:
/// Implementation of the constant pool of a java class.
/// Note that constants are 1-based in java.
#[derive(Debug, Default)]
pub struct ConstantPool {
entries: Vec<ConstantPoolPhysicalEntry>,
}
/// Constants in the pool generally take one slot, but long and double take two. We do not use
/// the second one, so we have a tombstone to ensure the indexes match.
#[derive(Debug)]
enum ConstantPoolPhysicalEntry {
Entry(ConstantPoolEntry),
MultiByteEntryTombstone(),
}
Recall that a long
or double
constant takes two slots in the constant pool, meaning that if it is stored at index i
, then the index i + 1
is invalid. I have modeled this explicitly via the ConstantPoolPhysicalEntry
- normal constants are wrapped in an Entry
and take one slot in the vector, but a long
or double
constant takes two entries: one Entry
followed by one MultiByteEntryTombstone
. The client API hides this complexity though and is expressed in terms of entries and indexes:
impl ConstantPool {
pub fn add(
&mut self, entry: ConstantPoolEntry
)
pub fn get(
&self, input_index: u16,
) -> Result<&ConstantPoolEntry, InvalidConstantPoolIndexError>
}
Fields Link to heading
Fields are implemented as follows:
/// Models a field in a class
#[derive(Debug, PartialEq)]
pub struct ClassFileField {
pub flags: FieldFlags,
pub name: String,
pub type_descriptor: FieldType,
/// Fields which model a constant (final) will have an attribute specifying the value
pub constant_value: Option<FieldConstantValue>,
pub deprecated: bool,
}
FieldFlags
is yet another bitset, so I won’t show it here. FieldType
is implemented as follows:
/// Models the type of one field, or one parameter of a method
#[derive(Debug, Clone, PartialEq)]
pub enum FieldType {
/// Primitive types
Base(BaseType),
/// Standard object
Object(String),
/// Array
Array(Box<FieldType>),
}
/// Possible primitive types
#[derive(Debug, Clone, PartialEq, strum_macros::Display)]
#[repr(u8)]
pub enum BaseType {
Byte,
Char,
Double,
Float,
Int,
Long,
Short,
Boolean,
}
Quite simple, and yet another place where I love having sum types in Rust. Note that FieldType
is recursive for arrays, and therefore I have used a Box
to implement it. There is a rather simple method FieldType::parse
that can parse the JVM field type descriptor.
Methods Link to heading
Methods are a bit more complex than fields:
/// Models a method in a class
#[derive(Debug, PartialEq)]
pub struct ClassFileMethod {
pub flags: MethodFlags,
pub name: String,
/// The type descriptor in the internal JVM form, i.e. something like (L)I in the unparsed form
pub type_descriptor: String,
/// Parsed form of the method descriptor
pub parsed_type_descriptor: MethodDescriptor,
/// Generic attributes of the method
// TODO: replace with some proper struct
pub attributes: Vec<Attribute>,
pub code: Option<ClassFileMethodCode>,
pub deprecated: bool,
/// List of exceptions in the `throws` clause of the method
pub thrown_exceptions: Vec<String>,
}
Notice that I have kept both the unparsed and the parsed forms of the method descriptor. This is useful because when invoking a method, the VM will look up the parsed descriptor to check the arguments and return types, but when resolving a method, the unparsed descriptor will be used. We will discuss this in more detail when talking about the execution of methods, in a future post.
The parsed descriptor of a method is modeled as follows:
/// Models the signature of a method, i.e. the type of the parameters it takes and the type
/// of the return value
#[derive(Debug, Default, Clone, PartialEq)]
pub struct MethodDescriptor {
pub parameters: Vec<FieldType>,
pub return_type: Option<FieldType>,
}
Notice also that, since I have not modeled explicitly all the attributes that a method can have, I have opted to store a vector with the raw bytes of the attributes. Notice the “TODO” above it to remind me that I should remove it! ๐
Finally, the field code
is wrapped in an Option
since, for native
methods, there is no bytecode!
Method code Link to heading
The code of a method is modeled by this struct:
/// Code of a given method
#[derive(Debug, Default, PartialEq)]
pub struct ClassFileMethodCode {
/// Maximum depth of the stack at any time
pub max_stack: u16,
/// Number of local variables used by the method
pub max_locals: u16,
/// Raw bytecode
pub code: Vec<u8>,
pub exception_table: ExceptionTable,
pub line_number_table: Option<LineNumberTable>,
/// Generic unmapped attributes of the code
// TODO: replace with some proper struct
pub attributes: Vec<Attribute>,
}
I will leave the discussion of the exception table to a dedicated blog post since it would be a bit complex to explain before I discuss the JVM bytecode instructions.
Line number table Link to heading
The LineNumberTable
is used in the VM to add the line number when creating a stack trace, for example when throwing an exception. Its implementation follows closely the spec:
/// Table that models the relationship between program counters and line numbers
/// in the source code. Entries are sorted by program counter. A table with two entries,
/// the first starting at 0 and the second at 3, means that the first three instructions
/// in the bytecode correspond to line 1 and the rest to line 2.
#[derive(Debug, PartialEq)]
pub struct LineNumberTable {
entries: Vec<LineNumberTableEntry>,
}
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct LineNumberTableEntry {
pub program_counter: ProgramCounter,
pub line_number: LineNumber,
}
/// Models the program counter, i.e. the address
/// of an instruction in the bytecode of a method
#[derive(Debug, PartialEq, Eq, Clone, Copy, PartialOrd, Ord)]
pub struct ProgramCounter(pub u16);
/// Line number in the source code
#[derive(Debug, PartialEq, Eq, Clone, Copy, PartialOrd, Ord)]
pub struct LineNumber(pub u16);
where here I have used some newtype! ๐ LineNumberTable
includes a simple method lookup_pc(&self, pc: ProgramCounter) -> LineNumber
that is implemented via a binary search.
Parsing classes Link to heading
The parser that rjvm
uses is pretty basic and hand-written. In hindsight, I should have used the fantastic nom crate, to which I plan to dedicate a post in the future.
The basic structure I have built is a simple Buffer
around a byte slice. I’m sure there are libraries that do this better, but writing this by hand was pretty easy. Here’s how it works:
/// A buffer reader, used to marshall data from a generic byte array
pub struct Buffer<'a> {
buffer: &'a [u8],
position: usize,
}
impl<'a> Buffer<'a> {
pub fn new(data: &'a [u8]) -> Self {
Buffer {
buffer: data,
position: 0,
}
}
pub fn read_u8(&mut self) -> Result<u8> {
self.advance(std::mem::size_of::<u8>())
.map(|bytes| u8::from_be_bytes(bytes.try_into().unwrap()))
}
fn advance(&mut self, size: usize) -> Result<&'a [u8]> {
if self.position + size > self.buffer.len() {
Err(BufferError::UnexpectedEndOfData)
} else {
let slice = &self.buffer[self.position..self.position + size];
self.position += size;
Ok(slice)
}
}
// ...
}
The public API for reading a class is this one:
/// Reads a class from a byte slice.
pub fn read_buffer(buf: &[u8]) -> Result<ClassFile> {
ClassFileReader::new(buf).read()
}
The structure of the ClassFileReader
, which is just an implementation detail and not exposed publically, is as follows:
/// A reader of a byte array representing a class. Supports only a subset of Java 7 class format,
/// in particular it does not support generics.
struct ClassFileReader<'a> {
buffer: Buffer<'a>,
/// The class being read, created empty and updated in place
class_file: ClassFile,
}
/// Reference: https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html
impl<'a> ClassFileReader<'a> {
fn new(data: &[u8]) -> ClassFileReader {
ClassFileReader {
buffer: Buffer::new(data),
class_file: Default::default(),
}
}
fn read(mut self) -> Result<ClassFile> {
self.check_magic_number()?;
self.read_version()?;
self.read_constants()?;
self.read_access_flags()?;
self.class_file.name = self.read_class_reference()?;
self.class_file.superclass = self.read_class_reference_optional()?;
self.read_interfaces()?;
self.read_fields()?;
self.read_methods()?;
self.read_class_attributes()?;
Ok(self.class_file)
}
where an example method of the parser is:
fn read_version(&mut self) -> Result<()> {
let minor_version = self.buffer.read_u16()?;
let major_version = self.buffer.read_u16()?;
self.class_file.version = ClassFileVersion::from(major_version, minor_version)?;
Ok(())
}
Most of the code follows the same, simple structure. I have used the question mark operator ?
everywhere to simply bubble up errors and stop the parsing, which feels very ergonomic and nice to write.
Testing Link to heading
Given that this is a toy project and not production code, I decided to not be too precise with tests: basically, all the error paths are not tested. In any case, I think the best strategy for testing this code would be via fuzzing, but it is something I have not investigated.
While there are some simple unit tests where it made sense, most tests are written as integration tests and I have used some real .class
files, generated by the javac
compiler. For example, the test suite includes this simple Java class:
public class Complex implements Cloneable, Serializable {
private final double real;
private final double imag;
public Complex(double real) {
this.real = real;
this.imag = 0;
}
public Complex(double real, double imag) {
this.real = real;
this.imag = imag;
}
public double getReal() {
return this.real;
}
public double getImag() {
return this.imag;
}
public double abs() {
return Math.sqrt(this.real * this.real + this.imag * this.imag);
}
}
and there is a Rust test that checks that this class file can be parsed:
fn can_read_pojo_class_file() {
let class = read_class_from_bytes(include_bytes!("../resources/rjvm/Complex.class"));
assert_eq!(ClassFileVersion::Jdk6, class.version);
assert_eq!(
ClassAccessFlags::PUBLIC | ClassAccessFlags::SUPER,
class.flags
);
assert_eq!("rjvm/Complex", class.name);
assert_eq!(
"java/lang/Object",
class.superclass.as_ref().expect("a valid superclass")
);
// ...
}
This simple strategy has gotten me enough code coverage to give me confidence in the code’s behavior.
Conclusions Link to heading
There you have it - an overview of a simple crate that can parse .class
files. In the next post of this series, I plan to cover how the JVM bytecode works.
Thanks for reading, and remember that you can follow this blog via its RSS (Atom) feed.