Good Platforms, Good Platform Engineers
I’ve built on developer platforms for almost ten years and have built them for the last five. I’d like to think I’ve developed a sense of what “good” looks like for both the Platform and the Platform Engineer, so I’m sharing it. The content comes from my experience building and using developer platforms, seeing what works and what doesn’t, and hearing people’s complaints. Mostly the complaints. The opinions are my own.
What is a “Platform” anyway?
Short answer
It’s software used to build other software.
Longer answer
There are roughly two types of platforms:
A Developer Platform. This is often one or more frameworks, a code repository, infrastructure running in dev/staging/prod environments, a prescribed way of building and path to Production, and automation & tooling around that way of building.
A Core Service. This is an abstraction over a critical business function (e.g. processing a payment, sending an email or SMS) made available via an API. The API could be REST, gRPC, or something similar, like a collection of classes & functions inside a repo.
Both types of platforms are used to build applications. An application is software that directly solves a business problem. Engineers use a Developer Platform to build, ship, and host applications. They use core services to power specific features in those applications. There are platforms sold as products, e.g. Vercel (dev platform) or Stripe (core service), but I’ll focus on internal platforms, i.e. a platform built by one engineer at a company in service to another engineer building applications.
I’ll mostly focus on developer platforms because I have the most experience with them. I’m also a Data Engineer by trade, so most of my examples will be Data Engineer-y, but the underlying principles apply to platforms that serve other types of engineers.
Examples
Back-end developer platform. This is a developer platform that enables back-end software engineers to build applications & features on a monolith back-end. It runs on Kubernetes in AWS EKS and Aurora Postgres. The code is written in Python and lives in a single repo. The back-end is a REST service running as a FastAPI web server. Table schemas are managed with SQLAlchemy and Alembic. Asynchronous tasks are managed with Temporal. Developers can run the application locally and send requests to it with Postman. Deployments are managed with GitHub Actions, and trigger when a developer merges to main.
A Data Engineering platform. Developers write data pipelines using Airflow, and they load data from external sources into a RAW database in Snowflake. Data is transformed using dbt in a single daily dbt run, also orchestrated with Airflow. All Airflow and dbt code lives in a single repo, but each has a separate deployment process. Airflow runs in a Kubernetes cluster on AWS EKS, while dbt runs in an Airflow DAG. Developers can provision fully-functional local Airflow environments or developer-specific dbt databases and can run Airflow DAGs and dbt runs locally.
The Fundamentals of a “Good” Platform
My favorite analogy for what a “good” developer platform looks like is a lane in a bowling alley:
The bowling pins. The primary use case(s) of the platform are at the center. They’re clearly documented and broadly known throughout the engineering org.
The bumper rails (I’m terrible at bowling). It’s impossible - or extremely difficult - to do the wrong thing. Static analyzers block bad patterns at time of commit or PR. The abstractions catch common errors and show descriptive error messages and suggest good solutions. Non-prod environments are as Prod-like as possible, giving developers high confidence that if their code runs in dev, then it’ll run in prod.
The lane - It’s dead simple and low friction to do the right thing. Any services, patterns, or other abstractions on our platform are intuitive, and building with them is concise. We have reasonable defaults for inputs into these abstractions. Secrets management and authentication are largely hidden from the developer. There are minimal/no footguns - no data/schema consistency issues, weird permission differences between environments, flaky tests or deployment quirks, or arcane errors. There’s no “tribal knowledge” - any knowledge a new developer needs in order to build is documented or immediately obvious.
The bowling ball - applications are independent of each other. There’s minimal/no resource contention between applications and deployments of one application don’t affect the state of other applications.
The pinsetter - Deployments are automated, reliable, and fast. Dev or staging environments are easy and fast to spin up and use. We’re using the latest tools like uv, ruff (I’m a Python guy), ast-grep, etc. to reduce developer idle time. The Production environment is highly reliable.
The scoring monitor - At any given time, the developer knows the state of their application, deployment, job, or other Stuff on the platform. They have a Datadog dashboard that measures the latency and success rate of their APIs or the success/failure rate of asynchronous jobs. They can watch the CI/CD pipeline progress, and they know exactly when their change lands in Production. There are high-signal, automated alerts for anomalous and problematic behavior. The platform team also has infrastructure-level or broad observability tracking overall system health.
There’s probably more, but you get the idea. These are general guidelines, and not all of the details apply to you. An internal developer platform is only good if it enables and accelerates other builders in the org and its baked-in opinions fit the company’s way of building. Opinions are critical - more on that later.
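To make the bumper rails concrete, here is a minimal sketch of a static check that could run at commit or PR time. It uses Python’s standard `ast` module to flag string literals assigned to secret-looking variable names. The function name `find_hardcoded_secrets` and the list of name hints are my own assumptions for illustration, not part of any real tool, and a real platform would more likely reach for an existing linter rule.

```python
import ast

# Hypothetical list of substrings that mark a variable name as secret-looking.
SECRET_NAME_HINTS = ("secret", "password", "token", "key_id")


def find_hardcoded_secrets(source: str) -> list[str]:
    """Return one message per string literal assigned to a secret-looking name."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # Only plain assignments whose right-hand side is a string literal.
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and any(
                hint in target.id.lower() for hint in SECRET_NAME_HINTS
            ):
                # Descriptive message that also points at the right thing to do.
                problems.append(
                    f"line {node.lineno}: '{target.id}' is assigned a string literal. "
                    "Read it from the platform's secrets manager instead."
                )
    return problems
```

A pre-commit hook or CI step could call this on each changed file and fail the check when the returned list is non-empty. The error message matters as much as the check: it tells the developer what to do next, not only what went wrong.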
Qualities of a “good” Platform Engineer
There are obvious qualities - technical excellence, agreeableness, can-do attitude. I’ll focus on the nonobvious ones.
Understands Asymmetries
The most important quality of a platform engineer is that they understand the company’s asymmetries and tradeoffs. A developer platform at a Series A startup and a high frequency trading firm might be built on similar technology but will look completely different, because those companies and their engineering teams don’t operate the same way. A high frequency trading firm has more asymmetric downsides, i.e. each change could interrupt normal trading operations, causing millions of dollars in losses. A young startup has more asymmetric upsides, as each change they make could bring them closer to product-market fit. One company needs high reliability, while the other needs product velocity. This isn’t static either. If that young startup builds a business-critical product that tens of thousands of customers depend on, the tradeoffs change, because reliability and scalability matter more than they did previously. If the platform engineer understands this, they can build a platform with the right amount of complexity at each stage to maximize the platform’s business value.
Empathetic
Next is empathy for the application engineers and their use cases. The use cases’ requirements dictate the technology choice and the abstractions built into the platform. Let’s use the Data Engineering platform example from before, and let’s say we want to establish the pattern for loading data into Snowflake. The simplest approach we could take may look something like:
from airflow.decorators import task


@task
def load_data(file_path: str, **context):
    import pandas as pd  # I'm a simple man with simple solutions.
    from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

    hook = SnowflakeHook()
    df = pd.read_csv(file_path)
    df.to_sql(...)  # etc...
    # etc...
This will work great if the files are small and are in a format that pandas can read. It won’t work for terabytes of data or files with poor/no structure. Likewise, you shouldn’t force other engineers to use Spark if they’re only dealing with small & medium-sized data. The platform engineer can only know what pattern to implement if they work with the stakeholder engineer and understand the business problems they solve. The best engagement model for this is the embedded model, where the platform engineer embeds on the stakeholder engineer’s team and solves the problem with them. I have a lot of thoughts on embedding platform engineers on other teams, but I’ll save that for a future post.
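Once you’ve learned the size range of your stakeholders’ files, one way to encode it is a small dispatch step in front of the loader. The sketch below is hypothetical: the names `strategy_for_size` and `choose_load_strategy`, the strategy labels, and the 500 MiB threshold are all assumptions, and the right cutoff depends on your workers’ memory and the file formats involved.

```python
import os

# Hypothetical cutoff; tune it to worker memory and observed file sizes.
MAX_PANDAS_BYTES = 500 * 1024 * 1024  # 500 MiB


def strategy_for_size(size_bytes: int, max_pandas_bytes: int = MAX_PANDAS_BYTES) -> str:
    """Map a file size to a loading pattern name."""
    if size_bytes < 0:
        raise ValueError(f"size_bytes must be non-negative, got {size_bytes}")
    # Small files take the simple in-memory path; anything larger is routed
    # to a bulk path (for example, a Snowflake COPY INTO from a stage).
    return "pandas" if size_bytes <= max_pandas_bytes else "bulk_copy"


def choose_load_strategy(file_path: str) -> str:
    """Pick a loading pattern from the file's size on disk."""
    return strategy_for_size(os.path.getsize(file_path))
```

The point isn’t the threshold itself. It’s that the platform makes the decision once, based on real use cases, so each application engineer doesn’t have to rediscover it.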
Opinionated
The next quality is opinions. Engineers who use a platform do not want to hear “you can approach it in X or Y way. Up to you!” from the platform engineer. They want Dr. Platform Engineer to prescribe a solution to their problem, i.e. an opinion. The difference between a Good and a Great developer platform is the opinions baked into it through abstractions, guardrails, default arguments, code organization, and other stuff. Let’s use loading CSV files into Snowflake as an example again. Here’s an unopinionated way to do it:
COPY INTO RAW.EXAMPLE.TABLE
FROM 's3://my-internal-bucket-name/path/to/file.csv'
CREDENTIALS = (
    AWS_KEY_ID = 'yep'
    AWS_SECRET_KEY = 'mhmm'
)
FILE_FORMAT = (
    TYPE = CSV,
    SKIP_HEADER = 1,
    FIELD_OPTIONALLY_ENCLOSED_BY = '"',
    NULL_IF = ('nan', '')
);
and here’s an opinionated way:
COPY INTO RAW.EXAMPLE.TABLE
FROM @RAW.PUBLIC.MY_BUCKET/path/to/file.csv
FILE_FORMAT = (FORMAT_NAME = 'RAW.PUBLIC.DEFAULT_CSV_FORMAT');
The second one is better because nobody needs to think about the name of the bucket (I’m surprisingly bad at this), the values to pass into FILE_FORMAT, or AWS credentials. The platform engineer already set up an S3 integration, stage, and file format that other engineers can use off-the-shelf, knowing that it “just works.” Opinions from the platform engineer reduce the cognitive load on everyone else. They accelerate development, improve the success rate, and improve maintainability. Opinions come from:
Repeated use of a common pattern, which you formalize into an abstraction.
Previous incidents or small bugs that reveal new failure modes, so you add guardrails.
Your experience with the technology. You know the footguns and the “best” default parameters to use, so you make those the defaults.
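Here is a rough sketch of how those three sources can show up in a single helper that builds the opinionated COPY statement. The stage and file format names come from the example above; the function name `build_copy_statement` and its validation rules are my own assumptions. It only builds the SQL string and does not sanitize arbitrary input, so treat it as an illustration and not a production implementation.

```python
import re

# Defaults owned by the platform team (names taken from the example above).
DEFAULT_STAGE = "RAW.PUBLIC.MY_BUCKET"
DEFAULT_FILE_FORMAT = "RAW.PUBLIC.DEFAULT_CSV_FORMAT"

# Guardrail: require a fully qualified DATABASE.SCHEMA.TABLE name.
_TABLE_NAME = re.compile(r"^[A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*$")


def build_copy_statement(
    table: str,
    path: str,
    stage: str = DEFAULT_STAGE,
    file_format: str = DEFAULT_FILE_FORMAT,
) -> str:
    """Build a COPY INTO statement with the platform's defaults baked in."""
    if not _TABLE_NAME.match(table):
        raise ValueError(
            f"'{table}' is not fully qualified. Use DATABASE.SCHEMA.TABLE, "
            "e.g. RAW.EXAMPLE.TABLE."
        )
    if path.startswith("s3://"):
        raise ValueError(
            "Pass a path relative to the stage (e.g. 'path/to/file.csv'), "
            "not a full s3:// URL. The stage already knows the bucket."
        )
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage}/{path.lstrip('/')}\n"
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}');"
    )
```

Calling `build_copy_statement("RAW.EXAMPLE.TABLE", "path/to/file.csv")` yields the opinionated statement with no bucket name, credentials, or format options in sight. The common pattern became the abstraction, the two `ValueError` checks are the guardrails, and the default arguments carry the platform engineer’s experience.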
That’s all for now! Thanks for reading.
