Jan 24, 2023


Demystifying bundlers

A friendly introduction to whats and hows

If you read my Demystifying interpreters post, you know that I like to write to learn, and that's the case here too. If you want to learn from an expert, this is not the place. I can and will make mistakes.

# Bundler?!

Let's begin with the basics. What is a bundler? Many people (myself included, in the past) get confused when trying to understand what a bundler is. I had difficulty understanding webpack and what it did differently from Babel, for example.

Babel, SWC, and other compilers simply produce code from other code. The idea is: you write JavaScript in a modern syntax (which some browsers cannot understand yet) and the compiler emits JavaScript in an older syntax (which those browsers can understand). Browsers are the main use case, but not the only one: a compiler can also output code that targets older versions of Node.js.

That's it. They transform code. But, what about the bundler?

The bundler is not as easy to explain, because its concept is higher level and more specific. Today, different kinds of bundlers do different kinds of jobs. I will talk about bundlers for the web, like browserify, webpack, parcel, etc.

A bundler's main objective is to simplify the way you write JavaScript for the web. Its output is always plain HTML, CSS, and JavaScript in a form that browsers can understand. It not only lets you write modern syntax (bundlers use compilers like Babel and SWC for that) but also lets you use imports inside your files that work seamlessly in the browser. Another main objective is to simplify and improve the development experience, with features like hot module replacement, faster rebuilds, etc. But I will focus on the first.

You simply cannot import or require other files in the browser as you do in Node. So the bundler identifies all dependencies starting from an entry point, maps them, compiles them, and puts them all together in a final file. You write your code as you would in Node.js, importing or requiring things, and the bundler takes care of all the work needed to transform everything into a single script.

Note: nowadays you can import other files natively using ES modules (`<script type="module">`), but only in modern browsers, and shipping many unbundled files is often not a practical option for production code. So let's ignore this for now.

# The bundler in theory

Let's exemplify with some code. Take a source with those two files:

app.js

export const app = document.querySelector("#app");

index.js

import { app } from "./app.js";
if (app) {
  app.innerHTML = "Hello from index.js";
}

A bundler should be able to take `index.js` as an entry point, map all the dependencies (in this case only `./app.js`) into modules, transform each module's code into an older JavaScript syntax using a compiler, and put everything in a single file that provides a `require` function able to resolve each mapped module.

The output code looks like this:

(function (modules) {
  function require(id) {
    const [fn, mapping] = modules[id];
    function localRequire(name) {
      return require(mapping[name]);
    }
    const module = { exports: {} };
    fn(localRequire, module, module.exports);
    return module.exports;
  }
  require(0);
})({
  0: [
    function (require, module, exports) {
      "use strict";

      var _app = require("./app.js");

      if (_app.app) {
        _app.app.innerHTML = "Hello from index.js";
      }
    },
    { "./app.js": 1 },
  ],
  1: [
    function (require, module, exports) {
      "use strict";

      Object.defineProperty(exports, "__esModule", {
        value: true,
      });
      exports.app = void 0;
      var app = document.querySelector("#app");
      exports.app = app;
    },
    {},
  ],
});

If you look at the code carefully, you'll see that `index.js` and `app.js` are there, just written differently. They were transformed using Babel into a syntax the browser can understand.

So this:

import { app } from "./app.js";

if (app) {
  app.innerHTML = "Hello from index.js";
}

Became this:

"use strict";

var _app = require("./app.js");

if (_app.app) {
  _app.app.innerHTML = "Hello from index.js";
}

Although the browser can understand the syntax, it doesn't have `require`, `module` and `exports` references in its context. If you try to run only this piece of code, `require("./app.js")` will throw an Uncaught ReferenceError:

Uncaught ReferenceError: require is not defined
    at <anonymous>:1:1

That's why each module is wrapped in a function that receives these three arguments, so when the code executes, require will be a valid function.

function (require, module, exports) {
  "use strict";

  var _app = require("./app.js");

  if (_app.app) {
    _app.app.innerHTML = "Hello from index.js";
  }
}

Now when the code executes `require("./app.js")` it will not throw an error, because it actually receives a function as its first argument.

But how does this `require` function work? How does it know what `"./app.js"` is?

With mappings. As I said before, the bundler scans your code to identify all dependencies. It does this recursively, generating a graph, so it knows all modules and their dependencies. It knows that `index.js` depends on `app.js`, and when it generates the final code, each module receives a unique id (0 for `index.js` and 1 for `app.js`) along with a mapping from the import strings it uses to the ids of its dependencies.

const modules = {
  0: [ // index.js module has id 0
    function (require, module, exports) {
      // index.js code
    },
    { "./app.js": 1 }, // index.js dependencies
  ],
  1: [ // app.js module has id 1
    function (require, module, exports) {
      // app.js code
    },
    {}, // app.js dependencies
  ],
};

Now all the `require` function has to do is look up the module for the given id and run its code. Inside a module, the import string is first translated to an id through the mapping (so it knows `"./app.js"` has id 1).

function require(id) {
  const [fn, mapping] = modules[id];

  function mappedRequire(name) {
    return require(mapping[name]);
  }
  
  const module = { exports: {} };
  fn(mappedRequire, module, module.exports);
  return module.exports;
}

require(0); // executes entry point module that is always 0.

The only difference from the original output code is that the bundler wraps everything in a self-invoking function so that it doesn't pollute the global scope. A self-invoking function (also known as an IIFE, an immediately invoked function expression) is a technique where you create an anonymous function and run it at the same time. For example:

(function (name /* will be "bar" */) {
  const greet = "Hello " + name;
  console.log(greet); // Hello bar
})("bar");

In the code above, the `greet` const lives in the anonymous function's scope and cannot be accessed from the browser's global scope. The bundle uses the same technique to keep `require` private:

(function (modules) {
  function require(id) {
    // require implementation
  }
  require(0);
})({
  0: [function (require, module, exports) { /* index.js code */ }, { "./app.js": 1 }],
  1: [function (require, module, exports) { /* app.js code */   }, {}],
});
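To see the runtime from this section working end to end outside the browser, here is a minimal typed sketch. It keeps the same `require` logic, but `runBundle`, the `./greeter.js` module, and the `log` array are hypothetical names made up for the example, and the two modules are written by hand instead of being produced by a compiler. They also avoid the DOM so the code can run in Node.

```typescript
// A module function receives the three CommonJS-style references.
type ModuleFn = (
  require: (name: string) => unknown,
  module: { exports: Record<string, unknown> },
  exports: Record<string, unknown>
) => void;

// Each entry pairs a module function with its import-string -> id mapping.
type ModuleTable = Record<number, [ModuleFn, Record<string, number>]>;

const runBundle = (modules: ModuleTable): Record<string, unknown> => {
  function require(id: number): Record<string, unknown> {
    const [fn, mapping] = modules[id];
    // Translate the import string used inside the module into a module id.
    const localRequire = (name: string) => require(mapping[name]);
    const module = { exports: {} as Record<string, unknown> };
    fn(localRequire, module, module.exports);
    return module.exports;
  }
  // The entry point is always module 0.
  return require(0);
};

const log: string[] = [];

const entryExports = runBundle({
  0: [
    (require, module, exports) => {
      const greeter = require("./greeter.js") as {
        greet: (name: string) => string;
      };
      log.push(greeter.greet("index.js"));
      exports.done = true;
    },
    { "./greeter.js": 1 },
  ],
  1: [
    (require, module, exports) => {
      exports.greet = (name: string) => "Hello from " + name;
    },
    {},
  ],
});

console.log(log[0]); // Hello from index.js
```

Note that, like the runtime above, this sketch runs a module's function again every time it is required. That is fine for a small example, but it is a simplification.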

# The bundler in practice

Now that you know what a bundler is and how it works, it's time to create one so we can learn it better in practice, right? A complete bundler is not that simple to write; it involves many complexities and specifics. To fit this post, we will create a simple and limited bundler: it does not resolve dependencies in node_modules, does not code-split, and does not have loaders or other important aspects of a full-featured bundler. We will do the basics: transform modules into CommonJS, join more than one file into a single script, and ensure the browser can run it.

# Dependency graph

Let's start from... the beginning. The idea is to create a graph of our code's dependencies, starting from an entry point. But first, let's understand what a graph is. In short, it is a non-linear structure composed of nodes and edges. In our context, nodes are the modules (files) and edges are dependencies. We will have more than one node, and each one can have zero, one, or many dependencies. Every module except the entry point will have at least one dependent (a module that nothing imports is unused, so it is never reached and never included in the bundle). The example below should make this concrete.

Imagine you have this structure:

src
├── path
│   ├── a.js
│   ├── b.js
│   └── c.js
└── index.js

With this code:

src/index.js

import { a } from "./path/a";
a();

src/path/a.js

import { b } from "./b";
import { c } from "./c";
export const a = () => {
  b();
  c();
}

src/path/b.js

export const b = () => console.log("b");

src/path/c.js

export const c = () => console.log("c");

Analyzing the dependencies from `src/index.js`, we get this graph: `index.js` depends on `a.js`, which in turn depends on `b.js` and `c.js`.

In code, this structure can be represented something like the following. Every object in the array is a node, and the edges are represented by the `dependencies` property, which points to other nodes by their ids.

const graph = [{
  id: 0, // src/index.js
  dependencies: [1],
}, {
  id: 1, // src/path/a.js
  dependencies: [2, 3],
}, {
  id: 2, // src/path/b.js
  dependencies: [],
}, {
  id: 3, // src/path/c.js
  dependencies: [],
}];
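To make the "unused modules are not included" point concrete, here is a small sketch that walks a graph of this shape from the entry node and collects every id it can reach. The `reachableFrom` helper and the extra node with id 4 are assumptions added for the example; they are not part of the bundler built later in this post.

```typescript
type GraphNode = { id: number; dependencies: number[] };

// Same shape as the graph above, plus one module (id 4) that nothing imports.
const graph: GraphNode[] = [
  { id: 0, dependencies: [1] },
  { id: 1, dependencies: [2, 3] },
  { id: 2, dependencies: [] },
  { id: 3, dependencies: [] },
  { id: 4, dependencies: [] }, // hypothetical unused file
];

// Depth-first walk that follows the edges starting at the entry node.
const reachableFrom = (nodes: GraphNode[], entryId: number): number[] => {
  const byId = new Map<number, GraphNode>();
  for (const node of nodes) byId.set(node.id, node);

  const visited = new Set<number>();
  const visit = (id: number): void => {
    if (visited.has(id)) return; // already seen, which also guards against cycles
    visited.add(id);
    const node = byId.get(id);
    if (!node) return;
    for (const dep of node.dependencies) visit(dep);
  };

  visit(entryId);
  return Array.from(visited);
};

const included = reachableFrom(graph, 0);
console.log(included.join(", ")); // 0, 1, 2, 3
```

Module 4 never shows up in the result because no edge leads to it, which is exactly why a file that nobody imports does not end up in the bundle.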

To know what the dependencies of a file are, we need to do a static analysis of it. There is more than one way to do this. We could use regex to identify the imports, for example, but for simplicity we will use a parser to transform our code into an AST and then read the imports from the generated structure.

For this, we will use SWC, which will also be used later to transform our modules into ES5 and CommonJS.
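For comparison, the regex approach mentioned above could look roughly like the sketch below. `IMPORT_RE` and `findImports` are hypothetical names, not part of SWC or of the bundler in this post, and the pattern is deliberately naive. It is only meant to show why a parser is the safer choice.

```typescript
// Matches `import ... from "x"` and bare `import "x"` at the start of a line.
// Deliberately naive: it would also match import-looking lines inside template
// strings or block comments, and it ignores dynamic import() calls.
const IMPORT_RE = /^\s*import\s+(?:[^'"]*?\s+from\s+)?["']([^"']+)["']/gm;

const findImports = (source: string): string[] => {
  const found: string[] = [];
  let match: RegExpExecArray | null;
  IMPORT_RE.lastIndex = 0; // the regex is global, so reset its cursor first
  while ((match = IMPORT_RE.exec(source)) !== null) {
    found.push(match[1]); // the captured import string
  }
  return found;
};

const sample = `
import { b } from "./b";
import { c } from './c';
import "./side-effect";
export const a = () => { b(); c(); };
`;

console.log(findImports(sample).join(", ")); // ./b, ./c, ./side-effect
```

This works for tidy code, but every edge case (comments, strings, unusual formatting) needs another patch to the pattern, while an AST gives you import declarations as structured nodes.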

const createGraph = async (entry: string) => {
  type Module = {
    id: number;
    dependencies: number[];
  };

  const modules: Module[] = [];

  const createModule = async (filename: string) => {
    // Recursively create modules & add to modules array
  };

  await createModule(entry);

  return modules;
};

In the `createModule` function we should:

1. read the file content;
2. parse the file content to get its AST;
3. get all import declarations from the AST;
4. recursively call `createModule` for each import (dependency);
5. save the dependency ids to the parent module;
6. add the created module to the modules array.

import * as fs from "node:fs";
import * as path from "node:path";
import * as util from "node:util";
import * as swc from "@swc/core";

const readFile = util.promisify(fs.readFile);

const createGraph = async (entry: string) => {
  let ID = 0;

  type Module = {
    id: number;
    dependencies: number[];
  };

  const modules: Module[] = [];

  const createModule = async (filename: string) => {
    const id = ID++;

    // To simplify we will not implement module resolution, just append `.js`
    // extension if omitted.
    const absoluteFile =
      path.join(process.cwd(), filename) +
      (path.extname(filename) === "" ? ".js" : "");

    // 1. read the file content;
    const content = await readFile(absoluteFile, "utf8");

    // 2. parse the file content to get its AST;
    const ast = await swc.parse(content, { syntax: "ecmascript" });

    // 3. get all import declarations from AST;
    const imports = ast.body.filter(
      (node): node is Extract<typeof node, { type: "ImportDeclaration" }> => {
        return node.type === "ImportDeclaration";
      }
    );

    const dependencies: number[] = [];
    for (const node of imports) {
      // 4. recursively call createModule for each import (dependency);
      const mod = await createModule(
        path.join(path.dirname(filename), node.source.value)
      );
      // 5. save dependency ids to parent module;
      dependencies.push(mod.id);
    }

    // 6. add the created module to the modules array;
    const mod: Module = { id, dependencies };
    modules.push(mod);
    return mod;
  };

  await createModule(entry);

  return modules;
};

Testing it against an example app with 2 files:

example
├── path
│   └── to
│       └── app.js -> export const app = document.querySelector("#app");
└── index.js       -> import { app } from "./path/to/app";

We got this:

createGraph("./example/index.js").then(console.log);
// => [ { id: 1, dependencies: [] }, { id: 0, dependencies: [1] } ]

Note: module 1 comes before module 0 because `createModule` runs recursively: an id is assigned as soon as a module is visited, but the module is only pushed to the array after all of its dependencies have been created, so leaf dependencies are pushed first. This is not a problem for us, since the order here doesn't matter.
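One limitation worth knowing about: the version above creates a new module every time a file is imported, so a file imported from two places appears twice, and two files that import each other would recurse forever. A possible fix is to cache modules by resolved filename, sketched below. To keep it runnable without disk access or SWC, the `files` table (which lists each file's import strings directly) and the `resolveFrom` helper are assumptions invented for this example.

```typescript
type Module = { id: number; filename: string; dependencies: number[] };

// Hypothetical in-memory "file system": filename -> import strings it contains.
// a.js and b.js import each other, and both import shared.js.
const files: Record<string, string[]> = {
  "src/index.js": ["./a.js"],
  "src/a.js": ["./b.js", "./shared.js"],
  "src/b.js": ["./a.js", "./shared.js"],
  "src/shared.js": [],
};

// Minimal relative-path resolution: handles "./" and "../" segments only.
const resolveFrom = (importer: string, specifier: string): string => {
  const parts = importer.split("/").slice(0, -1); // directory of the importer
  for (const segment of specifier.split("/")) {
    if (segment === "." || segment === "") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return parts.join("/");
};

const createGraph = (entry: string): Module[] => {
  let nextId = 0;
  const modules: Module[] = [];
  // Cache keyed by resolved filename: each file becomes exactly one module.
  const seen = new Map<string, Module>();

  const createModule = (filename: string): Module => {
    const cached = seen.get(filename);
    if (cached) return cached;

    const imports = files[filename];
    if (imports === undefined) throw new Error("File not found: " + filename);

    const mod: Module = { id: nextId++, filename, dependencies: [] };
    // Register before recursing, so a circular import finds this module
    // instead of recursing forever.
    seen.set(filename, mod);
    modules.push(mod);

    for (const specifier of imports) {
      mod.dependencies.push(createModule(resolveFrom(filename, specifier)).id);
    }
    return mod;
  };

  createModule(entry);
  return modules;
};

const graph = createGraph("src/index.js");
for (const m of graph) {
  console.log(`${m.id} ${m.filename} -> [${m.dependencies.join(", ")}]`);
}
// 0 src/index.js -> [1]
// 1 src/a.js -> [2, 3]
// 2 src/b.js -> [1, 3]
// 3 src/shared.js -> []
```

Because each module is registered before its dependencies are visited, the array here ends up in id order rather than leaf-first. As noted above, the order does not matter for the bundle.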

Now that we have our graph, we are going to add other information to our module so that in the end it is possible to transform everything into a single bundle.

First, using SWC, we will transform our module code into CommonJS (roughly speaking, `import ... from` becomes `require`, `export` becomes an assignment on `exports`, and so on).

import * as fs from "node:fs";
import * as path from "node:path";
import * as util from "node:util";
import * as swc from "@swc/core";

const readFile = util.promisify(fs.readFile);

const createGraph = async (entry: string) => {
  let ID = 0;

  type Module = {
    id: number;
    code: string;
    dependencies: number[];
  };

  const modules: Module[] = [];

  const createModule = async (filename: string) => {
    const id = ID++;

    const absoluteFile =
      path.join(process.cwd(), filename) +
      (path.extname(filename) === "" ? ".js" : "");
    const content = await readFile(absoluteFile, "utf8");

    const ast = await swc.parse(content, { syntax: "ecmascript" });
    const imports = ast.body.filter(
      (node): node is Extract<typeof node, { type: "ImportDeclaration" }> => {
        return node.type === "ImportDeclaration";
      }
    );

    const dependencies: number[] = [];
    for (const node of imports) {
      const mod = await createModule(
        path.join(path.dirname(filename), node.source.value)
      );
      dependencies.push(mod.id);
    }

    const { code } = await swc.transform(content, {
      filename,
      module: { type: "commonjs" },
    });

    const mod: Module = { id, code, dependencies };
    modules.push(mod);
    return mod;
  };

  await createModule(entry);

  return modules;
};

Second, we need to make a simple change to `dependencies` so that we can do the mappings at the end. Modules import other files relatively. That is, an import of `./app.js` in one file does not necessarily refer to the same dependency as an import of `./app.js` in another file.

Consider this scenario:

src
├── a
│   ├── app.js
│   └── index.js -> import { app } from "./app.js"; 
└── b
    ├── app.js
    └── index.js -> import { app } from "./app.js";

In the bundled code, both modules will contain `const app = require("./app.js")`, and we need a way to find the id of the module that `"./app.js"` stands for so that the correct module code is run. But how can we do this if `require("./app.js")` is ambiguous? To solve this, we will add a mapping to each module. Instead of being just an array of ids, the dependencies of each module will now be a Map where the key is the import string (`./app.js`) and the value is the module id (a number).

So in one case you will have `dependencies: Map { "./app.js" => 0 }` and in another `dependencies: Map { "./app.js" => 1 }`. That way we can disambiguate, because each module maps its own import strings to module ids.

import * as fs from "node:fs";
import * as path from "node:path";
import * as util from "node:util";
import * as swc from "@swc/core";

const readFile = util.promisify(fs.readFile);

const createGraph = async (entry: string) => {
  let ID = 0;

  type Module = {
    id: number;
    code: string;
    dependencies: Map<string, number>;
  };

  const modules: Module[] = [];

  const createModule = async (filename: string) => {
    const id = ID++;

    const absoluteFile =
      path.join(process.cwd(), filename) +
      (path.extname(filename) === "" ? ".js" : "");
    const content = await readFile(absoluteFile, "utf8");

    const ast = await swc.parse(content, { syntax: "ecmascript" });
    const imports = ast.body.filter(
      (node): node is Extract<typeof node, { type: "ImportDeclaration" }> => {
        return node.type === "ImportDeclaration";
      }
    );

    const dependencies = new Map<string, number>();
    for (const node of imports) {
      const mod = await createModule(
        path.join(path.dirname(filename), node.source.value)
      );
      dependencies.set(node.source.value, mod.id);
    }

    const { code } = await swc.transform(content, {
      filename: absoluteFile,
      module: { type: "commonjs" },
    });

    const mod: Module = { id, code, dependencies };
    modules.push(mod);
    return mod;
  };

  await createModule(entry);

  return modules;
};
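To see the disambiguation in isolation, here is a small sketch with hand-built modules for the `src/a` and `src/b` scenario. The ids, the `filename` field, and the `resolveImport` helper are illustrative assumptions; the real graph comes from `createGraph`.

```typescript
type Module = {
  id: number;
  filename: string;
  dependencies: Map<string, number>;
};

// Hand-built modules for the src/a and src/b scenario (ids are illustrative).
const modules: Module[] = [
  { id: 0, filename: "src/a/index.js", dependencies: new Map([["./app.js", 1]]) },
  { id: 1, filename: "src/a/app.js", dependencies: new Map() },
  { id: 2, filename: "src/b/index.js", dependencies: new Map([["./app.js", 3]]) },
  { id: 3, filename: "src/b/app.js", dependencies: new Map() },
];

// Resolve an import string from the point of view of one specific module.
const resolveImport = (fromId: number, specifier: string): Module => {
  const from = modules.find((m) => m.id === fromId);
  if (!from) throw new Error("Unknown module id: " + fromId);

  const targetId = from.dependencies.get(specifier);
  if (targetId === undefined) {
    throw new Error(`Module ${fromId} does not import ${specifier}`);
  }

  const target = modules.find((m) => m.id === targetId);
  if (!target) throw new Error("Unknown module id: " + targetId);
  return target;
};

// The same import string leads to different files depending on who asks.
console.log(resolveImport(0, "./app.js").filename); // src/a/app.js
console.log(resolveImport(2, "./app.js").filename); // src/b/app.js
```

The import string alone is never used as a global key. It only has meaning together with the module that contains it.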

Now we just need to output all our modules inside an IIFE and our tiny bundle is ready!

export const bundle = async (entry: string) => {
  const modules = await createGraph(entry);

  const output = `
    (function (modules) {
      function require(id) {
        const [fn, dependencies] = modules[id];
        function mappedRequire(name) {
          return require(dependencies[name]);
        }
        const module = { exports : {} };
        fn(mappedRequire, module, module.exports);
        return module.exports;
      }
      require(0);
    })({
      ${modules.map(
          (mod) => `${mod.id}: [
            function (require, module, exports) {
              ${mod.code}
            },
            ${JSON.stringify(Object.fromEntries(mod.dependencies))}
          ]`
        )
        .join(",\n")}
    })
  `;

  const minified = await swc.minify(output, { compress: true });
  return minified.code;
};
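If you want to check the output step without installing SWC, the string generation can be tried on its own. The sketch below is an approximation under a few assumptions: the module code is hand-written CommonJS standing in for the transform output, `render` and `toPlainObject` are hypothetical helpers, minification is skipped, and the IIFE returns `require(0)` (the original does not) so that the result can be inspected.

```typescript
type Module = { id: number; code: string; dependencies: Map<string, number> };

// Hand-written CommonJS standing in for what the transform step would emit.
const modules: Module[] = [
  {
    id: 0,
    code: `var greeter = require("./greeter.js"); module.exports = greeter.greet("bundle");`,
    dependencies: new Map([["./greeter.js", 1]]),
  },
  {
    id: 1,
    code: `exports.greet = function (name) { return "Hello from " + name; };`,
    dependencies: new Map(),
  },
];

// Convert the Map into a plain object so it can be serialized with JSON.
const toPlainObject = (map: Map<string, number>): Record<string, number> => {
  const result: Record<string, number> = {};
  map.forEach((id, name) => {
    result[name] = id;
  });
  return result;
};

// Build the bundle source: the runtime IIFE applied to the module table.
const render = (mods: Module[]): string => `
(function (modules) {
  function require(id) {
    var fn = modules[id][0];
    var mapping = modules[id][1];
    function mappedRequire(name) { return require(mapping[name]); }
    var module = { exports: {} };
    fn(mappedRequire, module, module.exports);
    return module.exports;
  }
  return require(0);
})({
${mods
  .map(
    (mod) => `${mod.id}: [
  function (require, module, exports) { ${mod.code} },
  ${JSON.stringify(toPlainObject(mod.dependencies))}
]`
  )
  .join(",\n")}
})`;

const source = render(modules);
// Evaluate the generated string. The parentheses keep `return` and the
// expression on the same statement despite the leading newline.
const result = new Function("return (" + source + ");")();
console.log(result); // Hello from bundle
```

Evaluating the string with `new Function` is only a convenience for testing here. In a real setup you would write the output to a file and load it with a script tag.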

You can check the full code (with a few changes) in the renatorib/caixinha repo.

# Conclusion

A bundler isn't that hard once you start to understand the concepts. Still, it's hard to build an efficient and complete bundler with all the modern features a bundler should have. We didn't talk about tree shaking, loaders, caching, etc. Maybe in a future post?
