Decompiler Concepts: Reconstructing a flattened class hierarchy

Recently I’ve done some experimentation with creating a compiler toolchain in JavaScript (well, technically TypeScript, but…). You might think me crazy, but it actually worked out really well. Anyway, I wanted to share some of the more-interesting aspects of this project, so here’s the first post in that series.

The language I was targeting was an undocumented, compiled, Java-like language where the class hierarchy was flattened at compile time. So for example, code like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
class animal {
animal() {
}
}

class mammal extends animal {
legs;
mammal(legs) {
this.animal();
this.legs = legs;
}
speak() {
print("...");
}
}

class chimpanzee extends mammal {
chimpanzee() {
this.mammal(2);
}
speak() {
print("ooh ooh ooh ooh");
}
}

class wolf extends mammal {
wolf() {
this.mammal(4);
}
speak() {
print("owooooo");
}
}

Might compile to assembly code like this (vastly simplified for illustrative purposes):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
begin class "animal"
method "animal" sub_0
end

begin class "mammal"
property "legs"
method "animal" sub_0
method "mammal" sub_1
method "speak" sub_2
end

begin class "chimpanzee"
property "legs"
method "animal" sub_0
method "chimpanzee" sub_3
method "mammal" sub_1
method "speak" sub_4
end

begin class "wolf"
property "legs"
method "animal" sub_0
method "mammal" sub_1
method "speak" sub_6
method "wolf" sub_5
end

begin subroutine sub_0
ReturnVoid
end

begin subroutine sub_1
PushArg 0
SetThis "legs"
ReturnVoid
end

begin subroutine sub_2
PushStr "..."
Print
ReturnVoid
end

begin subroutine sub_3
PushInt 2
CallThis "mammal"
ReturnVoid
end

begin subroutine sub_4
PushStr "ooh ooh ooh ooh"
Print
ReturnVoid
end

begin subroutine sub_5
PushInt 4
CallThis "mammal"
ReturnVoid
end

begin subroutine sub_6
PushStr "owooooo"
Print
ReturnVoid
end

Take note, this ASM does not specific what classes extends which. The properties and methods have been flattened in a lossy operation, so each subclass simply has a copy of the properties and method pointers, discarding this valuable information for decompiling the original hierarchy.

So what can we do about it? Is it possible to accurately rebuild the hierarchy from the resulting assembly? Actually, yes!

First, we start with what we do and don’t know:

  • A child class might choose not to add a new constructor (at-least in my case).
  • Sub-classes have all the properties and methods of their ancestors (not just their direct parent), and then some.
  • A class with all the properties and methods of another class is not necessarily a child class (in the above example, a class that happens to have an animal method is not necessarily a child of animal).
  • Two completely unrelated classes could have the same property and method names (class a with method b is not a child of class b with method a).
  • If a subroutine is a constructor for a class (same name as class), it is always a constructor, so any subclass that has that constructor subroutine as a method is a descendant (not necessarily direct child) of that class.
  • Each child class only adds zero or one constructor method from their parent class.
  • When a child class overrides a methods, that subroutine cannot ever be referenced be a child again. If one candidate has 3 methods in common while the other has 2, the class is clearly a direct child of the former. Similarly, you can use the number of properties in common to further choose the best candidate, although not as strongly.
  • A child class does not necessarily ever call the parent constructor, so simply analyzing the constructor code is not reliable.
  • The order of the properties and methods does not relate to the class hierarchy.

With this information, we can map out the class parents in just a few passes:

  1. Find possible ancestors: Double-loop over all the classes, comparing their property and method names, looking for possible ancestor candidates. If a class has all the properties and methods of the child, it may be an ancestor, but if any are missing, it cannot be an ancestor. This phase can produce false-positives.
  2. Find possible direct parents: Create a set of all the constructor subroutines, then loop over each class and their possible parents. If the child class adds more than one new constructor, it cannot be a direct child. In most cases, there will only be one candidate at this point, but if any ancestor does not add a new constructor function there will be multiple candidates.
  3. Find the child with the most in-common: Of the possible direct parents, compare the methods in common, and see how many point to the same subroutine, keeping only those with the highest number in common. Additionally, if multiple candidates still remain, compare the number of properties in common, and select the candidates with the most properties in common.
  4. Look for parent constructor call (optional): In the highly-unlikely scenario that multiple candidates still exist, you could analyze the instructions in the subroutine. If there is a call to a constructor subroutine, that could be the direct parent, or an ancestor there of. The actual direct parent may not have add a unique constructor though, so this is a complex way to try to match the parent.
  5. Finally: In an extreme edge-case, it’s possible that there may still be multiple candidates. In that case, you can just pick the first candidate since their is no discernible difference between them anyhow.

So there we have it, a rock-solid strategy for reconstructing a flattened class hierarchy, that would produce a perfect mapping in virtually all scenarios. Even in the most-extreme edge-cases, it would only map a class to a class with no functional differences (meaning re-compiling the code would produce the same result).

Despite the lossy operation of flattening the class hierarchy, with some analysis, it’s still possible to recreate the class hierarchy. With this information, it’s possible to continue disassembling or decompiling each individual class without duplicating all the code from the parent. Just what we wanted!

Got any other questions, or want to hear more about my adventures in creating a compiler toolchain in JavaScript with TypeScript? Leave a comment below!

Comments