Biome Type Inference: A Look Behind The Scenes


A little over a month ago, Biome announced a partnership with Vercel to work on improving Biome’s type inference.

Concretely, this meant I was contracted to implement type inference in Biome in order to enable the noFloatingPromises rule, as well as the similar, still upcoming noMisusedPromises rule.

So why is this newsworthy? What challenges did we face enabling such rules? And what is still to come?

If these questions interest you, this post is for you. I’ll start by explaining the challenges we face and the constraints we operate under, and finally move on to the technical details of how we’re tackling all this.

The Challenge of Type Inference

Nowadays, if you are familiar with the JavaScript scene, you probably recognise this simple reality: If you want type safety, you turn to TypeScript. TypeScript is like JavaScript, but with types.

So what does it mean when we say TypeScript has types? Doesn’t JavaScript have types too?

And yes, JavaScript does have types. It even has a typeof operator that can tell us the type of any given value. So what is it really that sets TypeScript apart from JavaScript?

Well, two things:

  • TypeScript can annotate type expectations.
  • TypeScript can validate type assignments.

I’ll illustrate both with an example.

Type Annotations

The first thing TypeScript does that JavaScript doesn’t is to provide the ability to annotate variables and parameters with the type they expect. Let’s first look at an example without type annotations:

function greet(subject) {
    console.log(`Hello, ${subject}`);
}

greet('World');

Looking at this code, we can understand that greet accepts a string as the type of its subject parameter. After all, try the following in any JavaScript console:

> typeof 'World'
"string"

Well, indeed. The type of 'World' is a string.

But when we look at the function signature of greet, this is not exactly obvious. We can look at the body of the function, then look at the console.log() statement, and understand that yes, subject is being used to display part of a string message. And while in this example that’s not exactly rocket science, imagine if greet had been 10, 20, or even 100 lines long.

We don’t want to look at function bodies to understand what type a parameter has; we want it laid out for us. That’s why TypeScript allows us to write:

function greet(subject: string) {
    console.log(`Hello, ${subject}`);
}

Such a simple and subtle change. But now we don’t need to look at the function body anymore to understand that yes, subject should be a string.

Those few extra characters, : string, that’s what we call a type annotation. And it tells us, immediately, what the type of a given variable or parameter is.

Type Validation

Compared to type annotations, type validation may be considered the other side of the coin: if type annotations tell us what a variable should be, how do we know what it actually is?

That’s where type validation comes into play. Thanks to the type annotation in our greet function, we, and the computer, know that the subject parameter is of type string. That means we know we should pass strings to it, and the computer can validate that’s what we do.

The computer in this case is tsc, or the TypeScript Compiler. While TypeScript is the language, which dictates what we can and can’t do, it’s the compiler that enforces that we follow the rules.

So what happens if we call greet like this instead:

greet(1);

The TypeScript Compiler will complain and tell us:

Argument of type 'number' is not assignable to parameter of type 'string'.

Well, fair enough. It’s trying to prevent us from making mistakes, which is generally a good thing. In fact, that’s pretty much the entire reason for TypeScript’s existence: by detecting type errors during compilation instead of at runtime, we discover mistakes earlier, and thus save ourselves time during development.

Thank you, TypeScript.

So What’s the Challenge with Inference?

Type inference is the art of figuring out what type something evaluates to. In the examples above that was easy. The type of 'World' is a string, and the type of the subject parameter is also a string. In the first case we know this because 'World' is a string literal, and those are always of type string. In the second case we know it because of the type annotation.

Similarly, we learned that the type of 1 is number. But what’s the type of 1 + 1? It doesn’t take a lot of imagination to figure out it’s also number. But what about 1 + ' ounce'?

See where we’re going here?

Reader nods understandingly.

Thankfully, even though the results are not always intuitive, JavaScript defines pretty well what each kind of expression should evaluate to. For operators such as +, we can find reference tables for what every combination of operands should evaluate to: 1 + ' ounce', for instance, evaluates to the string '1 ounce', so its type is string.

But what about myFunction()? This is what we call a call expression: it calls the function myFunction and evaluates to whatever that function returns.

JavaScript has no way of knowing what myFunction() evaluates to, unless it’s a predefined function written down in some standard. But assuming myFunction is defined in someone’s own codebase, the answer to what myFunction() evaluates to should be sought in that codebase as well.

TypeScript’s annotations make this easier again. Assume the following function signature:

function myFunction(): Promise<number> {
    return Promise.resolve(1);
}

Great! We know the answer! myFunction() evaluates to something of type Promise<number>.

“That was easy! So where’s the challenge?”

The problem is that TypeScript’s type system is complicated. So far, we’ve barely scratched the surface, and this blog post is too short to go into all of its dark corners. But just to scare you with some imposing terminology, imagine how the above could be complicated by generics, mapped types, conditional types, interfaces and inheritance, template literal types, method overloads, unions and intersections, declaration files, and more.

If those words don’t sound imposing enough on their own, consider that their meanings aren’t even formally defined. TypeScript doesn’t have a specification that says exactly how all those features are supposed to work. Instead, the law of the land is: Whatever the TypeScript Compiler says is right, is right.

Except when a bug report is filed, and accepted by the TypeScript team, in which case what is right may change when the next version of TypeScript is released.

So now you hopefully see the real problem: TypeScript is a complex, moving target that isn’t even strictly defined to begin with. As such, it is no wonder that most tools have hardly attempted to interpret TypeScript’s type system. It’s a complex endeavour, and 100% compatibility seems an ever-elusive goal.

What Is Type Inference Really?

“But… didn’t you say you were contracted by Vercel to implement type inference in Biome? How are you going to do that if it’s an ever-elusive goal?”

Yes, I did say that 100% compatibility was an ever-elusive goal. Thankfully, Biome is a linter, not a type checker. As mentioned before, TypeScript offers two things: type annotations and type validation.

Type validation falls into the domain of a type checker. It’s what the TypeScript Compiler is good at, and we as Biome developers expect our users to keep using it for the foreseeable future. So type validation is not part of type inference, and we don’t even need to strive for compatibility on that front.

Type annotations, on the other hand, are a tool to make type inference easier, not harder. Type inference can be performed on arbitrary JavaScript expressions. Having annotations just makes it easier, although admittedly, supporting all the type system features of TypeScript does add complexity again. But the basics of TypeScript are pretty well defined at this point, even if it lacks an official specification.

And because Biome is a linter, not a type checker, it doesn’t require 100% compatibility to provide value to its users. Seen in the context of rules such as noFloatingPromises, which flags promises that are neither awaited nor otherwise handled, if Biome can infer 80% of the instances where a floating promise would occur, it can offer 80% of the value such a rule could theoretically provide. And to put things into perspective, not even the TypeScript Compiler can offer 100% correctness here.

So to sum things up:

  • Type inference attempts to figure out what type a given expression evaluates to.
  • Type annotations help to narrow down applicable types.
  • A high degree of TypeScript compatibility is desired to conform to user expectations.
  • 100% compatibility with the TypeScript Compiler is not required for providing (significant) value.

So what this means is: for Biome, which needs mere inference rather than full type checking, an independent implementation that closely approximates the behaviour of TypeScript’s compiler may be worthwhile, even if it doesn’t offer perfect compatibility with tsc.

Why Is It Worthwhile?

“All right, so you made a case that Biome could deliver something worthwhile… But why not just use tsc if it’s already there?”

Two reasons:

  • tsc is slow. That’s especially the case today. The TypeScript team has announced a port to Go, which will supposedly speed things up, but how fast it will really be remains to be seen: even if Biome were to integrate with a native-speed version of tsc, unless it were a shared library that could directly utilise our data structures, there would be duplicated parsing of source files, duplicated analysis, and so on.
  • tsc is difficult to integrate into Biome. Biome is written in Rust and can easily utilise libraries written in Rust or even C. But integrating a project written with Node.js means we have to launch an external process and communicate through IPC (Inter-Process Communication), which generally involves quite a bit of complexity. Moreover, when Biome needs to launch an external process, it shifts the burden onto our users to make sure that process can even be launched. In a CI environment, that means our users are responsible for making sure the executable is installed. We can try to help them make things easy to set up, but we’ll have to shoulder both that effort and the blame when things don’t work as they’re supposed to. And unfortunately, it appears the Go port will offer little relief there.

Biome Type Architecture

“Okay, so you really decided to do it yourself. But how?”

Not so fast there. Before I dive into our architecture, I’ll attempt to explain the constraints that led us to make the choices we did…

Architecture Constraints

The main thing to understand about Biome is that we put our User Experience front and center. Whether it’s our Rule Pillars, our Batteries-Included approach, the biome migrate command for users coming from other tools, or our focus on IDE support, we know that without users we are nowhere.

And it’s precisely this last point, our IDE support, that’s so important in this story. IDE support was already an important consideration in our approach to multi-file support, and this seeps through into our type inference architecture.

For many tools, such as bundlers, it is sufficient to optimise the performance for CLI usage. Development servers may have an interest in optimising hot-reload performance as well, but they tend to do so by pushing responsibility to the client instead of rebuilding their bundles faster.

For Biome, priorities are different: If a user changes file A, they want the diagnostics for file B, which may depend on file A, to update in their IDE. Updates need to happen near-instantaneously, and the IDE is not a client we can offload responsibility to.

Module Graph

Biome’s module graph is central to our multi-file support and is designed with these considerations in mind. And our type architecture is built upon this module graph. The module graph is effectively just a fancy hash map with an entry for every module (every JS/TS file in a repository), including metadata such as which other modules that module depends on, which symbols it exports, and yes, also which types it contains.

The key constraint the module graph operates under is this: No module may copy or clone data from another module, not even if that data is behind an Arc. The reason for this is simple: Because of our focus on IDE support, we maintain the idea that any module in the module graph may be updated at any point in time due to a user action. Whenever that happens, we shouldn’t have trouble figuring out which other modules need their data to be invalidated, and that could become a real problem if modules were to copy each other’s data.

Some other tools use complex systems to track dependencies between modules, both explicit dependencies as well as implicit ones, so they can do very granular cache invalidation. With Biome we’re trying radical simplicity instead: just make sure we don’t have such dependencies between entries in our module graph. So far, that appears to be working well enough, but naturally, it comes with its own challenges.
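
To make the shape of this concrete, here’s a minimal sketch of the idea in Rust (the fields are simplified for illustration; the real JsModuleInfo holds richer metadata):

use std::collections::HashMap;
use std::path::PathBuf;

// Every entry is fully self-contained: entries never point into one
// another, so replacing one module's entry can never leave another
// entry holding on to stale data.
struct ModuleGraph {
    modules: HashMap<PathBuf, JsModuleInfo>,
}

struct JsModuleInfo {
    dependencies: Vec<PathBuf>, // which other modules this one depends on
    exports: Vec<String>,       // which symbols it exports
    types: Vec<TypeData>,       // which types it contains (see below)
}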

Type Data Structures

“You’re finally getting to the interesting stuff, aren’t you?”

I guess that depends. If, like me, you’re a nerd for the technical nitty-gritty, yeah, we’re getting there :)

In Biome, the most basic data structure for type information is a giant enum, called TypeData:

enum TypeData {
    Unknown,
    Boolean,
    Number,
    String,
    Null,
    Undefined,
    Function(Box<Function>),
    Object(Box<Object>),
    Class(Box<Class>),
    Reference(Box<TypeReference>),
    // many, many more...
}

This enum has many different variants in order to cover all the different kinds of types that TypeScript supports. But a few are specifically interesting to mention here:

  • TypeData::Unknown is important because our implementation of type inference is only a partial implementation. Whenever something is not implemented, we default to Unknown to indicate that, well, the type is unknown. This is practically identical to the unknown keyword that exists in TypeScript, but we do have a separate TypeData::UnknownKeyword variant for that so that we can distinguish between situations where our inference falls short versus situations where we can’t infer because the user explicitly used unknown. They’re semantically identical, so the difference is only for measuring the effectiveness of our inference.
  • Complex types such as TypeData::Function and TypeData::Object carry extra information, such as definitions of function parameters and object properties. Because function parameters and object properties themselves also have a type, we can recognise that TypeData is potentially a circular data structure.
  • Rather than allowing the data structure itself to become circular/recursive, we use TypeReference to refer to other types, as the sketch below illustrates. And because we try to avoid duplicating types if we can, we have TypeData::Reference to indicate a type is nothing but a reference to another type.
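
For example, take a self-referencing Node type with a next property pointing to another Node. Rather than nesting the data, the property holds a reference (a sketch; the Object fields and the ID constants are assumed for illustration):

// type Node = { value: number, next: Node };
// The `next` property doesn't embed the Node type's data; it merely
// stores a reference to the slot where Node itself is registered.
TypeData::Object(Box::new(Object {
    properties: vec![
        ("value".to_string(), TypeReference::Resolved(NUMBER_ID)),
        ("next".to_string(), TypeReference::Resolved(NODE_ID)),
    ],
}))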

Why Use Type References?

“You mentioned types could become circular/recursive. But is that a problem? Why can’t you use something like Arc to reuse TypeData instances and let them reference each other directly?”

Glad you asked! Yes, we could use Arc and let types reference each other directly. But remember that module graph I mentioned? If a type from module A were to reference a type from module B, and we’d store the type from module B behind an Arc, then what would happen if module B were replaced in our module graph?

The result would be that the module graph would have an updated version of module B, but the types in module A would hang on to old versions of those types, because the Arc would keep those old versions alive. Of course we could try to mitigate that, but solutions tend to become either very complex or very slow, and possibly both.

We wanted simplicity, so we opted to sidestep this problem using TypeReferences instead.

But even though the constraints of our module graph were our primary reason for choosing to use type references, they have other advantages too:

  • By not putting the type data behind Arcs, we can store data for multiple types in a linear vector (see the sketch after this list). This improves data locality, and with it, performance.
  • Storing type data in a vector also makes it more convenient to see which types have been registered, which in turn helps with debugging and test snapshots.
  • Not having to deal with recursive data structures made some of our algorithms easier to reason about as well. If we want to perform some action on every type, we just run it on the vector instead of traversing a graph while tracking which parts of the graph have already been visited.
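
In its simplest form, that vector storage amounts to no more than this (a sketch; the real resolvers deduplicate types and do more bookkeeping):

struct TypeId(u32);

struct TypeStore {
    types: Vec<TypeData>,
}

impl TypeStore {
    // Registering a type hands back the index it was stored at...
    fn register(&mut self, data: TypeData) -> TypeId {
        let id = TypeId(self.types.len() as u32);
        self.types.push(data);
        id
    }

    // ...and following a TypeId back to its data is a plain index.
    fn get(&self, id: TypeId) -> &TypeData {
        &self.types[id.0 as usize]
    }
}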

Type Resolution Phases

“I get it, you like type references. But how do they work?”

Which kind?

“Huh?”

You see, type references come in multiple variants:

enum TypeReference {
    Qualifier(TypeReferenceQualifier),
    Resolved(ResolvedTypeId),
    Import(TypeImportQualifier),
    Unknown,
}

There isn’t a singular answer to how type references work. The reason for this is that type resolution, the process of resolving type references, works in multiple phases.

Biome recognises three levels of type inference, and has different resolution phases to support those…

Local Inference

Local inference is when we look at an expression and derive a type definition from it alone. Consider this seemingly trivial example:

a + b

It looks like this should be easy, but because local inference doesn’t have any context such as definitions from surrounding scopes, it will never be able to understand what a or b refers to.

“That doesn’t sound very useful. If you know neither a nor b, how can you make anything more useful of this expression than TypeData::Unknown?”

It’s true that local inference cannot resolve this to a concrete type. But with the help of type references, we can rewrite the expression into something useful:

TypeData::TypeofExpression(TypeofExpression::Addition {
    left: TypeReference::from(TypeReferenceQualifier::from_name("a")),
    right: TypeReference::from(TypeReferenceQualifier::from_name("b"))
})

Local inference doesn’t do any type resolution yet; it only creates type references. So in most cases we won’t know a concrete type yet, but it still provides a useful starting point for later inference.

Module-Level (“Thin”) Inference

Module-level inference, or as I sometimes like to call it: “thin inference”, allows us to put those types from the local inference phase into context. This is where we look at a module as a whole, take its import and export definitions, look at the scopes that are created, as well as the types derived using local inference, and apply another round of inference to it.

Within the scope of a module, we do our first round of type resolution: We take all the references of the variant TypeReference::Qualifier (the only ones created thus far), and attempt to look them up in the relevant scopes. If a local scope declaration is found, we consider the type resolved and convert the reference into a TypeReference::Resolved variant with an associated ResolvedTypeId structure, which looks like this:

struct ResolvedTypeId(ResolverId, TypeId);

Both ResolverId and TypeId are a u32 internally, so this is a really compact representation for referencing another type: no bigger than a regular 64-bit pointer. The TypeId is a literal index into a vector where types are stored, while the ResolverId is a slightly more complex identifier that allows us to determine which vector we need to look in, because every module has its own vector (and there are a few more places to look besides).
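
So resolving such an ID boils down to two steps. A sketch (the TypeStores collection and its vector_for helper are assumed for illustration):

// The ResolverId tells us which type vector to look in; the TypeId
// is a plain index into that vector.
fn lookup(stores: &TypeStores, id: ResolvedTypeId) -> &TypeData {
    let ResolvedTypeId(resolver_id, TypeId(index)) = id;
    let types: &[TypeData] = stores.vector_for(resolver_id); // assumed helper
    &types[index as usize]
}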

“But what if the type reference qualifier could not be found in one of the scopes?”

Good to see you’re still paying attention!

Another possibility is that the qualifier references a binding from an import statement, such as import { a } from "./a.ts". In this case, we cannot fully resolve the type yet, because thin inference cannot look beyond the boundaries of its own module. But we can mark this case as an explicit import reference. This is what the TypeReference::Import variant is for.

And if the qualifier exists neither as a local declaration, nor as an imported binding, then we know it must come from the global scope, where we can find predefined bindings such as Array and Promise, or the window object. If a global reference is found, it also gets converted to a TypeReference::Resolved variant, where the ResolverId can be used to indicate this type can be looked up from a vector of predefined types.

But ultimately, if not even a global declaration was found, then we’re at a loss and fall back to TypeReference::Unknown.
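
Putting those cases together, module-level resolution of a qualifier follows a cascade along these lines (a simplified sketch of a method on the module-level resolver; the helper names are assumed):

fn resolve_qualifier(&self, qualifier: &TypeReferenceQualifier) -> TypeReference {
    // 1. A declaration in one of the module's own scopes?
    if let Some(type_id) = self.find_type_in_scopes(qualifier) {
        return TypeReference::Resolved(ResolvedTypeId(self.resolver_id(), type_id));
    }
    // 2. A binding created by an import statement? Mark it explicitly;
    //    thin inference can't look beyond the module's boundaries.
    if let Some(import) = self.find_import_binding(qualifier) {
        return TypeReference::Import(import);
    }
    // 3. A predefined global such as Array, Promise, or window?
    if let Some(resolved_id) = self.globals.resolve(qualifier) {
        return TypeReference::Resolved(resolved_id);
    }
    // 4. Not even a global declaration was found: we're at a loss.
    TypeReference::Unknown
}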

Full Inference

Full inference is where we can tie all the loose ends together. It’s where we have the entire module graph at our disposal, so that whenever we run into an unresolved TypeReference::Import variant, we can resolve it on the spot, at which point it becomes a TypeReference::Resolved variant again.

Today, results from our full inference cannot be cached for the same reason we’ve seen before: Such a cache would get stale the moment a module is replaced, and we don’t want to have complex cache invalidation schemes. But we may have one more trick up our sleeve…

“Hey, do tell! What’s the trick?”

One moment, dear reader, I will get to that, but first I have to explain about…

Type Resolvers

The thing about having type references all over the place is that you need to perform explicit type resolution to follow them. That’s why we have type resolvers. I use the plural, because we have a whole bunch of them: there’s a TypeResolver trait (sketched after the list below), and at this point we already have 6 implementations of it.

That may sound worse than it is though. As of today, the implementations are:

  • HardcodedSymbolResolver. This one is purely for test purposes.
  • GlobalsResolver. This is the one that is responsible for resolving globals such as Promise and Array. The way we do this is still rather primitive, with hardcoded, predefined symbols. At some point we should probably be able to use TypeScript’s own global .d.ts files, such as es2023.array.d.ts, directly.
  • JsModuleInfoCollector. This one is responsible for collecting information about a module, and for performing our module-level inference.
  • JsModuleInfo. Once the JsModuleInfoCollector has done its job, a JsModuleInfo instance is created, which is stored as an entry in our module graph. But this data structure also implements TypeResolver so that our full inference can access the module’s types too.
  • ScopedResolver. This is the one that is responsible for our actual full inference. It’s named as it is because it is the only resolver that can really resolve things in any arbitrary scope. Compare this to the JsModuleInfoCollector which only cares about the global scope of a module, because at least so far that’s all we need to determine types of exports (we don’t determine the return type of functions without annotations yet, and it’s not yet decided when or if we’ll do this).
  • ScopeRestrictedRegistrationResolver may sound impressive, but is merely a helper for ScopedResolver that sets the correct scope ID on certain references, so that when the time comes for the ScopedResolver to resolve such a reference, it still knows which scope to use.
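
All of these implement the TypeResolver trait, whose essence looks something like this (a simplified sketch; the actual trait has more methods and its signatures differ in detail):

trait TypeResolver {
    // Identifies this resolver, so that ResolvedTypeIds can point back
    // to the types it owns.
    fn resolver_id(&self) -> ResolverId;

    // Stores a type in this resolver's own type vector.
    fn register_type(&mut self, data: TypeData) -> TypeId;

    // Follows a reference to the type it points to, if possible.
    fn resolve_reference(&self, reference: &TypeReference) -> Option<ResolvedTypeId>;

    // Returns the data for a type this resolver owns.
    fn get_by_id(&self, id: TypeId) -> &TypeData;
}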

I’ve mentioned before that types are stored in vectors. Those type vectors are stored inside the structures that implement TypeResolver, and with the exception of ScopeRestrictedRegistrationResolver, they all have their own internal storage for types.

“Super interesting! But I’m still waiting for that trick you were teasing.”

Yeah, well. By now you can hopefully see that while type resolvers are something of a necessity because of our choice to use type references pervasively, they also offer us quite a bit of flexibility.

One potential area of flexibility — as yet unexplored — is that it is entirely feasible to imagine a type resolver that does cache the results of our full inference. Such a resolver would be very hard to get right for our language server, where modules may be updated at any time, but it might work very well for our CLI, where we can assume this doesn’t happen.
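
Such a resolver could be little more than a memoising wrapper around another resolver. Purely hypothetical, nothing like this exists in Biome today, but something like:

use std::collections::HashMap;

// Hypothetical: caches resolutions performed by an inner resolver.
// Only sound when modules are guaranteed not to change during the
// run, as in a single CLI invocation. Assumes TypeReference
// implements Hash + Eq + Clone, and ResolvedTypeId is Copy.
struct CachingResolver<R> {
    inner: R,
    cache: HashMap<TypeReference, Option<ResolvedTypeId>>,
}

impl<R: TypeResolver> CachingResolver<R> {
    fn resolve_reference(&mut self, reference: &TypeReference) -> Option<ResolvedTypeId> {
        if let Some(cached) = self.cache.get(reference) {
            return *cached;
        }
        let resolved = self.inner.resolve_reference(reference);
        self.cache.insert(reference.clone(), resolved);
        resolved
    }
}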

Flattening

Apart from type resolution, there’s one last important piece to type inference: type flattening.

Let’s look at the a + b expression again. After local inference, it was interpreted as this:

TypeData::TypeofExpression(TypeofExpression::Addition {
    left: TypeReference::from(TypeReferenceQualifier::from_name("a")),
    right: TypeReference::from(TypeReferenceQualifier::from_name("b"))
})

But at some point, one of the resolvers will presumably be able to resolve a and b, and the expression becomes something such as:

TypeData::TypeofExpression(TypeofExpression::Addition {
    left: TypeReference::from(ResolvedTypeId(/* resolver ID and type ID */)),
    right: TypeReference::from(ResolvedTypeId(/* resolver ID and type ID */))
})

At this point we know the actual types we are dealing with. If the types for both left and right resolve to TypeData::Number, the entire expression can be flattened to TypeData::Number, because that’s the result of adding two numbers. And in most other cases it will become TypeData::String instead.
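
To make that concrete, here’s what the flattening rule for additions could look like (a simplified sketch; the real logic also needs to deal with bigints, unions, and the other corners of +):

// Flatten an addition once both operands have been resolved. Returns
// None if either side is still unresolved, so the typeof expression
// is kept around for a later phase.
fn flatten_addition(left: Option<&TypeData>, right: Option<&TypeData>) -> Option<TypeData> {
    match (left, right) {
        // number + number yields another number...
        (Some(TypeData::Number), Some(TypeData::Number)) => Some(TypeData::Number),
        // ...while most other known combinations coerce to a string.
        (Some(_), Some(_)) => Some(TypeData::String),
        // One side is still unresolved: don't flatten yet.
        _ => None,
    }
}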

What’s Next

Biome’s type architecture is still very young. I’ve been working on this for about 6 weeks since the first code was written. The module graph was already there before that, but not a single type-related structure was present in Biome’s codebase.

Even so, preliminary results on one of Vercel’s codebases show that our inference is able to detect about 40% of the cases that should be flagged by our noFloatingPromises rule. No false positives have been reported as of yet.

While a 40% success rate doesn’t sound particularly impressive, we have a clear indication of where we need to focus next: Currently our path resolution only works for relative paths. So import ... from "./foo.ts"; works, but import ... from "#/foo" with a path alias won’t. Nor does import ... from "foo" work with external libraries. With such major limitations, it’s no surprise that we have a limited success rate.

But looking at the bright side, we’re only 6 weeks in, and we clearly know where to go next: At this moment the limiting factor isn’t even our type resolution, it’s our import resolution.

And thankfully, that’s not the only thing we’re doing. Our core contributor Siketyan has already implemented a new useExhaustiveSwitchCases rule, improving some of our inference in the process. Such efforts reinforce one another, and as we attract more people to work on these type-informed rules, our inference only stands to become better and better.

Wrapping Up

“Thanks. I thought you were never getting there.”

Thank you for reading all this way. Hopefully this has provided an informative look behind the scenes on Biome’s type inference. I’ll get back to hacking on our type inference and our import resolution :)

“Before you go, I have one question: Do you think Biome will ever implement type checking too?”

Heh, that’s a great question! To be honest, it’s hard to give a definitive answer, because I can’t look into the future. What I can say is that it is certainly outside the scope of my current contract.

That said, there is a rather natural path towards implementing type checking too. TypeScript has conditional types, which look something like A extends B ? C : D and which can be read as: If type A can be assigned to type B, use type C, otherwise use type D. That condition is awfully close to type validation. And if we have type validation, we’re awfully close to being a type checker as well.

So, who knows, we may venture into type checking territory some day. But remember that full compatibility with tsc is a rather elusive goal for any type checker, so please temper your expectations and keep in mind you will definitely still be using tsc for the foreseeable future.

Thanks again for reading! And if you have any other questions, feel free to leave a comment here. Or reach out to us on our Discord in the #type-inference channel.

