The Methods & Motivations of Balena's Organizational Structure

Nov 12, 2021

Part 1: The Motivations

Traditional organizational scaling is broken. Most successful companies grow despite their organizational structures, not because of them. At balena, we take aim at the heart of this problem by putting as much thought into our internal structure as we do into our commercial products. To better understand the challenges of organizational scaling, and the need for new organizational paradigms, let’s first start by taking a look at the life cycle of a company.

The company starts life as a startup, a small team with a unique insight around a problem. The early work isn’t limited to just building the solution; it also involves strategizing around the distribution of that solution, as well as administrative work like securing funding. A startup’s early days see the entire team exposed to both product disorder - work around the solution - and commercial disorder - ensuring the company can continue to work on the solution. While the company is still small, the flow of information is akin to that of a tribe and, like in a tribe, everyone contributes everywhere. This is what allows young companies to be nimble. There are countless stories of tiny upstarts upending giant incumbents, even though they’re outmatched on all other fronts. This nimbleness is a direct result of the upstart’s organizational efficiency, and, as such, it’s hard to overstate the advantage that organizational efficiency confers. And yet, prevailing wisdom says that preserving even a fraction of this efficiency as a company grows is impossible.

A startup sets out on that supposedly inevitable path of becoming a lumbering giant when it gets too big to act as a tribe. This specific point is different for every company, but a useful model comes from British anthropologist Robin Dunbar. Dunbar’s number, as it’s commonly known, suggests that a person can maintain stable social relationships with up to 150 people. Dunbar explained it informally as “the number of people you would not feel embarrassed about joining uninvited for a drink if you happened to bump into them in a bar.” It’s easy to see how a team dynamic like this would be invaluable to fast-growth companies.

Everything past the Dunbar number, however, is qualitatively different. When the company gets too big for us to be involved in everything, our “innate firmware” reaches its limits. Here, we begin to put structures in place to neatly order the chaos. The managerial status quo includes departments, org charts, and deadlines as the go-to infrastructure for scaling the company. Slowly, but surely, employees drift away into their own silos. The company structure gets fragmented - even though the product they ship remains a whole that can’t be neatly divided into pieces. This artificial abstraction leads teams to be exposed to separate types of uncertainty. Some only see product disorder, while others only see commercial disorder.

The curse of bureaucratic legibility

While the commercial product gets feedback from the market, thus incentivizing it to fit the needs of paying customers, organizational structures have a much vaguer feedback loop. The way most companies deal with this brings us to the crux of the issue: traditional organizational structures optimize for anything but product quality. They optimize for satisfying the executives’ egos, making as few changes as possible, and, perhaps most interestingly, legibility. That is to say that the organization orders itself to be most easily understood by authorities such as executives, boards of directors, and certification agencies who want easy answers as to attribution of responsibility. This is very far from optimizing for the best product (no evidence of tribal org charts has been discovered in prehistoric caves).

The legibility of an organization allows neat structures to exist. It’s very easy to picture the pyramid shape of a few executives at the top with a bunch of managers under them and a bunch of everyday employees at the bottom. It’s easy to think in terms of different departments, especially when they are separated by the particular skillsets of employees. Engineering deals with product architecture, marketing and sales get the product to market, and finance makes sure everyone has money to keep going. This is very easy to imagine in one’s mind, but how often is the best solution the one that’s the easiest to understand?

As it turns out, optimizing for legibility is an insidious cancer that slowly builds bloat until it turns the organization into an unworkable bureaucracy - and, inevitably, impacts the final product. A commonly quoted maxim inside balena is “Conway’s law,” which states: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” A simple example of Conway's law is the redundant number of distinct CPUs and microcontrollers that go into the average automobile. If the vehicle were designed from first principles, there would be one or two large computers that would do the job of a dozen smaller ones with a lot less complexity. However, since cars were digitized retroactively, this architecture reflects the artificial role separations the manufacturer employs internally. What starts as a well-intentioned method of organizing large-scale cooperation slowly devolves into a series of intertwined races to meet deadlines, hit KPIs, and satisfy managers’ need to save face.

The backfiring bait-and-switch of a deadline

Within a traditional organization, a new project has managers assigned to it, timelines defined, and measurements set. Initially this makes sense, as these are seen as guard rails and guideposts put in place with the intention of ensuring a quality product. Once the project gets underway, however, it is these measurements themselves which end up defining the end result much more than the fuzzy ideal of “building a good product.”

Let’s take a look at an example in the context of building a software product. When the project first begins, a timeline is defined. Right off the bat, we have a problem: estimating timelines for big projects is a fool’s errand. At balena, we genuinely believe that, for a team that wants to do things properly, there are roughly two groups of projects. Ones that should take no time, and ones that will take an unknown amount of time. Our logic goes like this: only projects that contain no uncertainty at all can have accurately defined timelines, and if a task contains no uncertainty, it should be automated, hence the category of projects that take no time.

This is a problem, of course: how often do new projects contain no uncertainty? When you are doing something for the first time, you only know how long the optimal path will take. It's like being asked “how long will it take to cross this 200 meter minefield?” The only answer is “5 minutes, if I don't step on a landmine.” If you do, it might take months. Or it might take hours; you really don’t know…

The reality, of course, is that most engineers are asked for timelines, regardless of how useless they may be. And, after giving their best guesses that satisfy management, they embark on a project that, sooner or later, will invariably run into some unforeseen issue and become pressed for time. And teams that are pressed for time will cut corners. They will not simply incur technical debt. They will incur the equivalent of paying off their previous payday loan by mortgaging their kidney to the local loan shark. I'm kidding about their kidney... They mortgaged your kidney. Because a software engineer can always leave. And all the hurt, and hope, and promise, will become a few open PRs on a codebase nobody can read. Who pays the cost of your foolishness? Why, that would be you, startup founder and/or project leader. It's you that will end up with a deadlocked codebase that nobody dares to touch. It's you that will have to resort to doubling down on pricing tweaks and sales commissions to juice sales, as new features are no longer flowing, and your development team is bogged down fighting fires.

Bad incentives are no joke, but cutting corners isn’t the worst of it. The guard rails and guideposts themselves often end up serving the opposite of their intended purpose. This happens when organizations take estimates and turn them into targets. “It looks like a task that might take a month” becomes “you promised me it would be ready in a month.” And here we come to another adage that we often repeat inside balena: Goodhart’s law. Courtesy of British economist Charles Goodhart, Goodhart’s law essentially states that “When a measure becomes a target, it ceases to be a good measure.” The timelines example illustrates this problem well - a guidepost that was meant to keep the product on schedule, and thus deliver a better outcome, eventually forces the team to cut corners, and the project actually ends up worse off than if the measure hadn’t existed in the first place. Examples of Goodhart’s law are plentiful across organizations, so much so that a whole book, “The Tyranny of Metrics,” is dedicated to this idea. Not only do measurements derail the development of quality products, but insisting on measuring everything also implicitly grants value only to that which can be measured. This is another major shortcoming of legible organizations since, often, the highest-value initiatives cannot be easily measured. This can be succinctly expressed in programmerese as “Valuable != Measurable.”

The book “Thinking in Systems: A Primer” by Donella Meadows (also a piece of widely circulated literature within balena) sums up the peril of heedless scaling with the following quote, “Large organizations of all kinds, from corporations to governments, lose their resilience, simply because the feedback mechanisms by which they sense and respond to their environment have to travel through too many layers of distortion and delay.”

The Anti-corruption agency

A final hypothetical to illustrate the problem: Say you're the Chief of an anti-corruption agency. You're trusted with broad decision-making authority. Your agency has a large budget, staff, career ladder, pension plan, union, etc. One day, a scientist shows you a button that will end corruption immediately and forever. You are shown enough evidence to be convinced the button will work as described. Do you press it?

This thought experiment shows a number of things at the heart of our current civilizational predicament. On some naive level, one would expect the Chief of an anti-corruption agency not only to press such a button immediately, but to have been desperately seeking it already. The fact that it seems intuitively obvious that the Chief would be motivated to find a reason not to press the button, even if it was someone deeply honest and caring, shows us that traditional systems are structured in such a way as to force good people to do bad things. Core to the agency's structure is an assumption of perpetuity. By creating a career path and structures that imply the problem will never be truly solved, we not only discourage the staff from looking for a solution, we actively motivate them to prevent the solution.

What’s even worse is that by calling it “The Anti-Corruption Agency,” we prevent others from contributing by giving off the impression that the problem is under control. If someone else had an idea, they'd be told to take it to the anti-corruption people. Surely they know how to put good ideas to use, right? Instead, what you're likely to see is the anti-corruption agency having its own 10-year or 20-year plans, complete with seemingly-plausible ideas that hit all the politically-savvy talking points but never work. And if another idea takes hold in the public, the Chief of the agency will be expected to take a public stance against “pseudo-scientific approaches to the problem that come from non-domain experts and interfere with the much-needed work of the Anti-Corruption Agency.”

So we have a paradox: The very existence of an Anti-Corruption Agency not only fails to weed out corruption, it actually ensures that corruption will continue in perpetuity and that any good ideas to combat it will be murdered in the crib. We do this today for every large-scale problem. (For the fans of the pithy axioms, there’s also one for this organizational pitfall: the Shirky principle).

Part 2: The Methods

Building a company out of loops

A broad question we ask at balena that encompasses the scope of the problem is, “How can we make the most of our collective skills, talents, and intelligence?” A follow-up question that this poses is: “Rather than succeeding despite organizational structure, how does a company benefit from its organizational structure, and, in turn, become greater than the sum of its parts?” Our answer, in the broadest sense, is this: structure information first, then structure people around information.

This is why balena is made up of a series of feedback loops that are simply referred to as the “Loops”. While most traditional companies are segmented into functional departments, our belief is that teams of people should see themselves as building products that can be iterated based on feedback and ultimately judged on their own merit - just like any end-user commercial product. Each of the four main Loops at balena has a mission statement and defined “customers” that use its products. Thus, these Loops are less static “departments,” and more dynamic processes that use feedback to continuously align better with the needs of their customers.

Balena’s four main Loops are balena.io, ProductOS, TeamOS, and CompanyOS. Each Loop is responsible for a different aspect of the company’s operation.

balena.io covers the commercial balena product itself: balenaCloud, balenaOS, balenaEtcher, balenaFin, etc. The mission of balena.io is to “unlock the promise of physical computing by reducing friction for fleet owners.” Everything an embedded device fleet owner might need to succeed, including balena’s cloud services, hardware offerings, software infrastructure, and any other customer-facing aspect of the commercial product, is a part of the balena.io Loop.

ProductOS is a platform for building products. The supporting infrastructure of “loops” is itself a “product” that would come from ProductOS. This one is a bit meta and harder to grok, so for now we’ll just leave it at this: ProductOS is the part of balena responsible for coming up with and building better systems to get things done. The mission statement of ProductOS is to “reduce friction for teams building products” and its customers are product builders (more on this in a bit). Currently, ProductOS’s biggest function is building and maintaining Jellyfish, balena’s proprietary internal software that organizes Loops and provides the communication tools for the organization’s unique structure.

TeamOS strives to deliver the best fit team for each Loop. Anything team members need to make the best of their time in the company lives in TeamOS. This includes everything from how people are hired and how they work within the company, including interactions with other members, to evolving their broadly defined roles.

CompanyOS deals with the needs of shareholders, board members, and other legal and regulatory authorities. CompanyOS covers the financials, legal, and all other administrative aspects necessary to allow every other Loop to thrive in their unconventional structures, safe from the harsh world outside.

We view these four main Loops as the minimum infrastructure required to build a fully functioning company. The main product Loop supports all other Loops financially, TeamOS provides talent to each Loop, CompanyOS provides a legal and financial framework, and ProductOS provides software tools. Each supports all the others and they all, in conjunction, cover the entirety of our everyday operations. Atop this foundation, more Loops can be added for additional functionality.

The “OS” in the Loop name stands for “operating system” and serves to stress the idea that the process of a Loop’s operation is itself a product. The key difference between a product and an organization is that a product can be viewed from the outside, examined as a whole, questioned, and intentionally and continuously iterated. Instead of tribal knowledge, half-written wikis, and painful human-driven processes, each Loop aspires to be as smooth and automated as our main product is. If we can make something as complex as fleet management easy and self-serve, why can’t we do the same for the core functions our team needs, especially given the fact that our team is the one thing we spend the vast majority of our budget on? Teams tend to take care of the things that are intended for them last, leading to all sorts of chaos and pain. This bias is why they need to be seen as customers to internal platforms, and it’s how we aspire to stave off entropy in all aspects of balena.

How Loops actually work

This is all well and good in theory, but how does a Loop actually work? Let’s now turn to the process of a Loop. They’re called Loops in the first place because inherent to their function is a “loop” process of constant feedback that results in continuous improvement. This process is itself also called a loop (lowercase-l) and it defines the way in which the big four (capital-L) Loops improve based on feedback.

All four Loops use the same loop process to function. Let’s go through the steps of that process to get an idea. We’ll start from the node at the top called “surface”.

The surface is all that is externally available to users. Put simply, any part of the product that the user interacts with can be considered the surface - whether it is the product itself, documentation, a blog post, a twitter account, or anything else that the outside world can see and associate to the product. In addition, any ongoing operational activities that are required for these things to continue to be available are also included in the surface.

Any interaction with the surface may produce a signal. These “signals” are points of feedback that indicate how users are interacting with the product. Every signal generated from the surface arrives through a channel. Channels, such as customer support, outreach, and security, are monitored by the team, and signals are processed with two main goals: first, handle the issue at hand, and second, learn as much as possible about the underlying cause. Of the two, the second is the more important function, as it diagnoses and, ideally, addresses the underlying cause of the problem so it does not recur in the future. All channel signals should ideally be attached to patterns in the knowledge base.

A pattern is a step in the loop in which a team member uses a signal to note an issue or general opportunity to improve a product or function. This may be anything from already reported problems to general improvement ideas the team may have based on prior observations. Signals emerge out of user interactions with the product, and patterns help classify those signals. Any point of friction is a pattern waiting to be identified and improved upon. For the balena.io Loop, for example, a pattern may be created after user reports indicate that a hardware device is running out of memory in certain circumstances, while for TeamOS it may be multiple internal discussions around scheduling that lead a team member to identify the need for a new way of running team meetings. Any team member that notices something that can be improved is encouraged to formalize their observation as a pattern, even if they don’t necessarily have specific ideas for improvement. The purpose of patterns is to start discussions around the product’s points of friction.

The knowledge base consists of patterns that the Loop receives signals about, across all channels. It’s important to have a single knowledge base that unifies all patterns so that any two signals from any channel can be combined into one pattern. For instance, if there is an incident that users experience as a particular error message (recent example: unable to create a new application), we may hear about it from support, DevOps, customer success, and more. It is important for a pattern to be created so that all channels can discover it, share context, and have a coherent response, rather than each agent doing their own investigation and responding with a different answer to each occurrence. As a pattern receives a higher volume of signals, or higher-urgency signals, its relative importance is raised so that the Loop team can see it emerge.
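To make the idea of a single, shared knowledge base more concrete, here is a minimal Python sketch. It is not how Jellyfish is implemented; the class names, the 1-3 urgency scale, and the rule that importance is the sum of signal urgencies are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    channel: str      # illustrative channel name, e.g. "support" or "devops"
    text: str
    urgency: int = 1  # hypothetical 1-3 scale; the post does not define one


@dataclass
class Pattern:
    title: str
    signals: list = field(default_factory=list)

    @property
    def importance(self) -> int:
        # Assumed scoring rule: summing urgencies means importance rises
        # with both the volume and the urgency of attached signals.
        return sum(s.urgency for s in self.signals)

    @property
    def channels(self) -> set:
        return {s.channel for s in self.signals}


class KnowledgeBase:
    """A single store of patterns shared by every channel."""

    def __init__(self):
        self.patterns = {}

    def attach(self, pattern_title: str, signal: Signal) -> Pattern:
        # Create the pattern on first sight, then reuse it for later signals.
        pattern = self.patterns.setdefault(pattern_title, Pattern(pattern_title))
        pattern.signals.append(signal)
        return pattern

    def ranked(self) -> list:
        # Most important patterns first.
        return sorted(self.patterns.values(), key=lambda p: p.importance, reverse=True)


kb = KnowledgeBase()
kb.attach("Cannot create a new application", Signal("support", "User sees an error", urgency=2))
kb.attach("Cannot create a new application", Signal("devops", "API error rate alert", urgency=3))
kb.attach("Docs unclear on provisioning", Signal("customer success", "Customer asked twice"))

top = kb.ranked()[0]
print(top.title, top.importance, sorted(top.channels))
```

Because both the support and DevOps signals land on the same pattern, anyone looking at it sees the shared context from every channel instead of two separate investigations.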

Patterns, usually the higher-priority ones, are linked to improvements. Improvements are proposals for changes to the product that address one or more patterns. We mentioned earlier that we encourage everyone to create patterns when they notice product friction, even if they don’t have specific ideas for how the product could be made better. It is the improvements step where patterns are brought together into an idea for a solution. Framed another way: while patterns are problem-statements, improvements are solution-statements.

So far in the loop, we’ve identified signals, noted patterns, and come up with an improvement, but no code has been written. Our next set of steps concerns the implementation part of the loop, but before we get there, we need to note that we’ve hit an inflection point in our loop diagram. Given that we’ve reached the end of our ideation stage and we have a concrete change that we’d like to implement, we have now altered our model. The model is the inverse of the surface. While the surface is the product as it exists today, the model is the ideal way that we would like our product to work in the future. It includes any improvements that are queued up but haven’t yet started down the road of implementation.

An improvement that is approved for implementation is converted into issues that are filed in the respective source code repositories. From this point on, things start to look a lot like a classic software development process, with issues becoming pull requests (PRs), that in turn get merged into new versions of a particular source code repository.

These versions of various components flow into the various products. A component may be used in multiple products, in which case it can trigger all those products to produce a new release.
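The fan-out from one component to several products can be sketched as a simple lookup. This is an illustrative Python example only; the product and component names are placeholders, and the mapping structure is an assumption rather than anything described in the post.

```python
# Hypothetical product-to-component mapping; none of these names come from balena.
PRODUCT_COMPONENTS = {
    "product-a": {"component-x", "component-y"},
    "product-b": {"component-x"},
    "product-c": {"component-z"},
}


def products_to_release(component: str) -> list:
    """Return, in alphabetical order, every product that includes the component."""
    return sorted(
        product
        for product, components in PRODUCT_COMPONENTS.items()
        if component in components
    )


# A new version of component-x would trigger a release of both products using it.
affected = products_to_release("component-x")
print(affected)
```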

The product releases in turn are deployed back into the surface, thus taking us back to where we started with our loop. At any point in the process, if a technical problem, lack of clarity, confusion, or other unforeseen issue appears, the team is encouraged to raise their question for a brainstorm. Brainstorms happen almost every day of the week, dedicated to different Loops and aspects of each (we usually differentiate architecture or “how” questions from product or “what” questions).

A unique feature of the loop worth mentioning here, since all the aforementioned terms are tracked via interlinked data objects, is that when a new release is deployed to production, we are able to trace back our steps and automatically mark versions as released, issues as resolved, improvements as completed, patterns as addressed, and even reach all the way back into the original channels that started the process. For instance, if a feature that was originated via a support request is released, we have enough information in the system to automatically resurface the appropriate support ticket and let the user know of the change made as a result of their request!
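One way to picture this trace-back is as a walk over linked records, from the release up to the signals that started it. The Python sketch below is a simplified illustration under stated assumptions: the `Node` class, the `sources` links, and the status labels are hypothetical, and a real system would likely check that every linked issue or improvement is finished before marking its parent done.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str    # "signal", "pattern", "improvement", "issue", "version", or "release"
    name: str
    status: str = "open"
    sources: list = field(default_factory=list)  # upstream nodes this one was derived from


# Assumed status labels, following the wording of the paragraph above.
DONE_STATUS = {
    "release": "deployed",
    "version": "released",
    "issue": "resolved",
    "improvement": "completed",
    "pattern": "addressed",
    "signal": "notified",
}


def snapback(release: Node) -> list:
    """Walk upstream from a deployed release, marking each node as done.

    Returns the original signals so that their authors can be notified.
    """
    signals, stack, seen = [], [release], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:  # a node may be reachable through several paths
            continue
        seen.add(id(node))
        node.status = DONE_STATUS[node.kind]
        if node.kind == "signal":
            signals.append(node)
        stack.extend(node.sources)
    return signals


# Made-up example chain, from a support ticket to a product release.
ticket = Node("signal", "support ticket: device runs out of memory")
pattern = Node("pattern", "memory exhaustion on small devices", sources=[ticket])
improvement = Node("improvement", "cap log buffer size", sources=[pattern])
issue = Node("issue", "add log buffer limit", sources=[improvement])
version = Node("version", "component 1.2.0", sources=[issue])
release = Node("release", "product release", sources=[version])

to_notify = snapback(release)
print([s.name for s in to_notify], pattern.status)
```

The key design point is that every object keeps a link to what it was derived from, so a single deployment event is enough to find the support thread that should hear about it.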

We refer to this function as the “snapback” and we have been running this workflow at balena for years. The surprised reactions of our customers when they realize that a support request led to material improvement without any need for payment or cajoling (and that we even remembered their original thread to inform them) are priceless. It is this kind of otherwise-impossible operation that makes us hopeful that the loop is the foundation of an entirely novel approach to collective intelligence - one that can even exhibit collective memory of specific contexts in a way traditional organizations typically don’t.

Enter Jellyfish

A unique organizational structure requires unique organizational tools. This may sound like a platitude, but we actually spent several years attempting to facilitate our distinct process through traditional SaaS communication tools. The result was a spatter of over two dozen intertwined pieces of software that were not only impossible to organize, but also achieved only a fraction of our organizational vision. As a result, the ProductOS team has spent the last few years hard at work on Jellyfish, our proprietary software platform that integrates information from all of our sources and is eventually set to become the single tool that team members use to organize, communicate, and, ultimately, build products.

To illustrate the shortcomings of off-the-shelf software, and the need for an in-house solution, let’s use an example of a critical support function that was once nearly impossible: having a single, complete user timeline for customers of the balena.io product. The timeline starts when a user signs up for the product. That part is easy enough to track, but what happens after they sign up? Their first step may be to submit a survey about the signup process, which would be recorded in a survey software backend. Their next set of actions, however, may get recorded as analytics in Amplitude. Later, their interactions with customer success end up on a third platform, all while they’re talking about their experience on Twitter. To the customer, their set of actions is an uninterrupted, linear experience. For the internal team trying to chase them down across various platforms, however, it’s anything but. This fragmentation of information blinds the team and can lead to absurd situations like trying to sell to the customer while they’re loudly complaining about the product elsewhere. It’s easy to agree that this is an unacceptable state of affairs, and yet, it is the status quo inside many organizations. Most importantly, this is an excellent example of an infrastructure headed towards unworkability, not unlike the deadlocked codebase in our kidney mortgaging example earlier. Working from first principles, the balena team reasoned that if anyone at the company wants to see a complete user timeline, it should always be within reach of a few clicks. Proprietary software like Jellyfish allows “broad decisions” like this to become a frictionless reality, ensuring the teams can iterate on their processes and build fundamentally sound products.
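As a rough illustration of what a single user timeline involves, the Python sketch below merges per-tool event streams into one chronological list. The events, timestamps, and source labels are invented for the example, and it assumes each tool can export its events already sorted by time; it says nothing about how Jellyfish actually ingests data.

```python
import heapq
from datetime import datetime

# Each source yields (timestamp, source, description) tuples, sorted by time.
# All events below are made up for illustration.
signup = [(datetime(2021, 1, 4, 9, 0), "signup", "account created")]
surveys = [(datetime(2021, 1, 4, 9, 5), "survey", "signup survey submitted")]
analytics = [
    (datetime(2021, 1, 4, 9, 30), "analytics", "first device added"),
    (datetime(2021, 1, 6, 14, 0), "analytics", "first release pushed"),
]
support = [(datetime(2021, 1, 5, 11, 0), "support", "asked about provisioning")]


def user_timeline(*sources):
    """Merge per-tool event streams into one chronological timeline."""
    # heapq.merge relies on every input already being sorted by the key.
    return list(heapq.merge(*sources, key=lambda event: event[0]))


timeline = user_timeline(signup, surveys, analytics, support)
for when, source, what in timeline:
    print(when.isoformat(), source, what)
```

The merge itself is trivial; the hard part the paragraph describes is getting every tool's events into one place with a shared user identity, which is the gap an integrated platform is meant to close.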

There’s another pitfall inherent in siloed software, and that’s the notion of “priesthood” - which is the idea that only certain members of the team can provide certain functions since they’re the only ones familiar with the specific tools used for those functions. Esoteric software setups implicitly communicate to the team that only those ordained in the sacred knowledge of a set of practices and tools are allowed to do a job that, like support, for example, should otherwise be easily picked up and done by anyone. Here we have another example of insidious organizational patterning which, if scaled, makes for a deadlocked and unworkable communication structure, and, ultimately, a slow, bureaucratic organization (what Samo Burja calls a “dead player”).

The way of the Product Builder

The final piece of balena’s organizational agility is the team itself. The team culture is a major part of our innovation DNA, and a full-length essay could be dedicated just to the cultural makeup of the company. However, since the scope of this piece is organizational structure, what’s included here is just the tip of the iceberg.

Everyone at balena strives toward the ideal of being a “Product Builder”. Given that all processes inside the company are products to be iterated and improved, it makes sense that we encourage the individual team members to adopt a product building mindset. While the notion encompasses a wide range of values including first principles thinking, open mindedness, and continuous improvement, for our purposes, a product builder is most concisely described as someone who takes ownership. Many of the most important values of a product builder are intrinsic to a team member that owns the outcome of their work. This also makes them responsible for improving the processes that go into that work.

If successful, everyone in the company should be a platform for everyone else. In this sense, it’s not necessary for improvements to happen just on the level of the big four Loops; rather, every person on the team should be able to sense and provide what’s necessary for everyone else to succeed, based on their individual talents and value adds. What’s more, if the loop does its job well enough, each team member’s individual skills can be provided as platforms for everyone else to take advantage of. This is a mindset that we encourage the team to have anytime they’re building anything - that, although we may not actually do this, the culture of product building should be robust enough to be provided to other companies as a service, wherein it can be applied to any product or process and ultimately make it a nimble and resilient structure. This is the true benchmark of a self-improving system which has replaced every static function with an iterable product.

Part 3: Conclusion

Balena has been succeeding for years in a space where giants have been unable to. Google, Ubuntu, GE, Intel, and hundreds of startups have tried to become the go-to computational paradigm for edge computing, but have been unsuccessful. We have chosen the path of compounding value and going slow in order to go fast, which has enabled us to build on a solid foundation as others around us falter. Our goal is to enable a more embodied connection between humans and machines by bringing physical space back into the equation. Through this quest, we have created a fundamentally different way of organizing collective intelligence. We’re not done yet, not in our quest to make balena an unchallenged success, nor in building a completely autonomous collective intelligence, but we’ve now got enough pieces of the puzzle to start sharing and taking feedback.

The company’s commercial function, however, is to provide everything a mainstream developer needs to succeed on the edge. As such, this organizational structure is not an intellectual exercise in spending venture capital to pursue commercial applications of systems theory. We fundamentally believe that balena will not succeed unless we can do significantly more with fewer resources. An analogy we like to give to the team that illustrates this is the following: if you’re digging a tunnel and you want to dig a few meters in front of you, you get a team and you all pick up some pickaxes and you dig; if you want to dig 100 meters, you’ll need a bigger crew and more time, but to dig 200 kilometers, no pickaxe-wielding crew is enough. You’ll need to hire engineers and build a lab where you can create a boring machine and then go back to dig your tunnel. Part of the reason neither startups nor big players have succeeded in dominating the space is that an edge platform requires a company that’s nimble like a startup but also has the time and resources to think long-term like a big company. To claim this spot, balena will need to scale while keeping its organizational dexterity.

Or, in the words of productivity guru Abraham Lincoln:

Ultimately, while it’s necessary for our end goal, balena’s unique organizational structure is more than a means to revolutionize edge computing. It’s an organizational paradigm designed to beat the tradeoff of agility and scale. In fact, we don’t intend to simply prevent corrosion of our startup agility. We fully intend to increase the speed at which we evolve as we grow. The value of the company is the sum of the problems it solves, and this is the problem whose solution is most valuable. By building a company that creates platforms for solving problems, we put that value at the forefront. All that’s left to see is if we can keep both kidneys as we work to get there.

Ourovoros
