Ryan Delgado

How AI changes Platform Engineering

Ryan Delgado — Mon, 11 May 2026 01:52:56 GMT

(writing this as of May 2026. It might be out of date by the time you read this)

AI is a really big deal, including for Platform Engineering. The role, possibilities, and expectations of the platform engineer change significantly with coding agents. I’ve spent the last year and a half building with coding agents (mainly Claude Code) and building platforms for engineers who use coding agents. I’ve seen agents excel in tedious but important tasks, lower the barrier to entry onto platforms, and reduce the cost of code and time to delivery. I’ve also seen them cause problems. We’re in a different world now, and platform engineering teams must adapt. I’ll review what I’ve experienced and learned since I started building with and for coding agents.

I feel bold talking about the Topic of our Times in blog two, but eh. In that vein, I’ll start off in the most obnoxious way possible by saying

IN THE AGE OF AI

Everything from before still matters, and actually matters more. Before, we built on platforms by writing code with our bare hands in our IDEs. Nowadays, we instead prompt a coding agent, and the coding agent writes the code and takes other actions around the platform in order to build The Thing. Two major risks here: the agent writes the code wrong or it does something it shouldn’t. Agents are rapidly becoming as good as or better than humans at writing code, but their doggone persistence and naivety about the risks they take make them sometimes at best annoying and at worst dangerous.

At the same time, the - enormous - productivity gains that agents provide make the trade worth it. When the cost of generating code approaches zero, bottlenecks in other parts of the dev lifecycle become more pronounced. When writing the code took three hours, that ten-minute spin-up time for a cloud QA environment was nothing. If it takes three minutes for Claude to generate the code, then it takes >3x as long to spin up a testing environment as it does to write the code, the tests, and maybe even run the tests. The platform becomes the bottleneck.
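To put rough numbers on that shift, here is a small sketch using only the figures from the paragraph above (three hours versus three minutes of coding, and a ten-minute environment spin-up). The function name is made up for illustration, and the model ignores everything else in the cycle, such as review time.

```python
def platform_share(coding_minutes: float, env_minutes: float) -> float:
    """Fraction of one write-then-test cycle spent waiting on the environment."""
    return env_minutes / (coding_minutes + env_minutes)


# Figures from the text: a ten-minute QA environment spin-up, with three hours
# of hand-written code versus three minutes of agent-generated code.
before = platform_share(coding_minutes=180, env_minutes=10)
after = platform_share(coding_minutes=3, env_minutes=10)

print(f"by hand:    {before:.0%} of the cycle is environment wait")  # 5%
print(f"with agent: {after:.0%} of the cycle is environment wait")   # 77%
```

The spin-up time did not change at all. It went from a rounding error to most of the cycle purely because the coding step shrank.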

AI-by-default is the norm in AI-pilled engineering teams and will become the norm industry-wide soon (best guess is two years, tops). This means that platform teams need to build platforms for agents driven by engineers rather than just engineers. A lot of stuff changes in this new paradigm.

Agents - not humans - are the primary users of platforms

AI is trained on text produced by humans, its reasoning has similar thought patterns to humans, and AI agents take similar actions to humans. They’re also similar to us in how we build software, and they have the same needs as us in order to build (except food and water, and sleep apparently). Their default interface is also the terminal rather than the browser (for now, at least). Because agentic engineering will become the default mode of working, a platform isn’t ready for Prime Time until agents can interact with it and build on it. Let’s use the Back-end Engineering platform from my first blog as an example. Coding agents should be able to

Clone and fully interact with the git repo. Create branches, commits, PRs, etc.
Spin up a local dev environment, and iterate with it.
Spin up an ephemeral QA environment in the cloud, including a database.
Query Datadog, Sentry, and build logs in the CI tool.
Query the database directly.
Interact with stuff in AWS.

These are all things we human developers take for granted, but if the coding agent can’t do all/most of these things then it can’t build on the platform effectively. If your developers primarily use local agents, this is actually easy, because the agent will assume the human’s identity and act on their behalf while running. Remote agents (e.g. Devin or maybe you’ve built your own) are trickier. They can’t (yet?) assume your identity, and they’re likely running in a different VPC than the rest of your infra. You’ll need to work this out with your Infra and Security teams. I’ll discuss this more later on.
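One cheap way to find out whether an agent's environment can do the things listed above is a preflight check that runs before the agent starts. The sketch below only checks that the relevant CLIs are on `PATH`. The tool list is an assumption made for illustration (the post does not name specific CLIs), and a real check would also verify credentials and network reachability.

```python
import shutil

# Hypothetical list of CLIs a coding agent would need for the tasks above.
# Swap in whatever your platform actually uses.
REQUIRED_TOOLS = {
    "git": "clone the repo, create branches and commits",
    "gh": "open PRs and read CI build logs",
    "docker": "run the local dev environment",
    "aws": "interact with AWS",
    "psql": "query the database directly",
}


def missing_tools(required: dict[str, str]) -> dict[str, str]:
    """Return the subset of required tools that are not on PATH."""
    return {name: why for name, why in required.items() if shutil.which(name) is None}


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    for name, why in missing.items():
        print(f"missing {name}: needed to {why}")
    if not missing:
        print("agent environment looks ready")
```

Running this in the remote agent's sandbox as well as on a developer laptop is a quick way to see where the two environments differ.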

SKILL.md files, not Notion docs

Well, mostly. Notion and other knowledge management tools still have a place. But much of the content related to writing code and interacting with other aspects of the platform should live in agent skills. Specifically:

How to write code/use the abstractions in the best way on the platform.
How to provision and interact with a dev or testing environment.
What metrics to look at or what log filters to use in Datadog to monitor application health.
How to run tests, linters, and other checks before raising a PR.

Most - but not all - tasks that a human developer normally does on the platform should also be doable by an agent with the same or almost the same quality. North star: users of your platform should be able to one-shot a working v0 of their application >80% of the time. You get here with agent skills and other context provided to your agents.
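For a sense of how small a skill can be, here is a sketch that renders a minimal `SKILL.md` for the last item in the list above: YAML frontmatter with `name` and `description`, followed by plain instructions. The skill name, the specific commands (`ruff`, `pytest`), and the install path mentioned in the comment are all assumptions made for illustration, so check your agent's documentation for the exact layout it expects.

```python
def render_skill(name: str, description: str, steps: list[str]) -> str:
    """Render a minimal SKILL.md: YAML frontmatter plus numbered instructions."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"---\nname: {name}\ndescription: {description}\n---\n\n# {name}\n\n{numbered}\n"


skill = render_skill(
    name="pre-pr-checks",  # hypothetical skill name
    description="Run tests, linters, and other checks before raising a PR.",
    steps=[
        "Run `ruff check .` and fix any reported issues.",  # assumed linter
        "Run `pytest` and make sure every test passes.",  # assumed test runner
        "Only raise the PR once both commands succeed.",
    ],
)

# The rendered text would then be saved somewhere the agent discovers skills,
# e.g. .claude/skills/pre-pr-checks/SKILL.md (assumed location).
print(skill)
```

The `description` field matters more than it looks, since it is typically what the agent reads when deciding whether a skill applies to the task at hand.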

On access - More breadth, less depth

Think about all of the systems you use in your day-to-day as an engineer. It probably consists of your IDE, Slack, Notion, Linear/Jira, Datadog, and the CLI & web interface for your cloud provider. Sub out any of those apps for a similar app you use. Let’s say there’s a Linear ticket that says the /search/invoices route in your back-end service has become slow for larger enterprise customers. If you’re a human, you’d probably do something like this:

Read the relevant Linear ticket. Mark it as In Progress.
Examine the existing code and tables around the slow route.
Noodle on an approach.
Write the code for that approach. Iteratively test & write it in your local dev environment.
Test it again, but in a testing environment that’s more prod-like.
Code is ready, so you post a “yo can I get a review?: ” message in a Slack channel.
Wait for a review.
Code merges and deploys.
You check Datadog to see if the route sped up after the change. Assuming it does, you mark the Linear ticket as “Done.”

A human can do this no problem. So can an agent, if it has access to Linear, Slack, GitHub, Datadog, and AWS. To perform a task successfully, an agent needs a clear objective to meet, an environment in which to experiment with solutions, a way to evaluate those solutions, a way to promote those solutions to production, and time. An agent with enough access to all of the dev tooling paired with a well-written prompt can be as effective as a human engineer in most/all shallow engineering tasks.

The keyword in that last sentence is “enough” access, and I’d even say just enough and nothing more is more correct. An agent with full access to production systems can wreak havoc.

Another prescient joke from Silicon Valley

The appropriate safeguards are guardrails, not friction. Some specific advice:

Agents should have baseline read access to public knowledge stores (Notion, Slack), GitHub, the data warehouse, observability tools like Datadog and Sentry, and - yes - Production databases. The more systems they have access to, the more context they can gather, and the more effective they’ll be.
Agents should not be able to write or mutate data or code in Production. Database migrations, code deployments, backfills, and similar functions should be handled by non-agent automated processes. Agents should be able to monitor them and troubleshoot issues, but they should not be able to trigger them.
Agents should not be able to read PII. Gate your database behind a proxy layer (Formal is a good solution, or you could build your own), and set up ACLs on PII fields across your tables/collections.
Data warehouses need fine-grained ACLs. A marketing professional querying Snowflake through Claude Desktop doesn’t need access to the super_sensitive_financial_data schema to do their job, so they shouldn’t have access.
Only humans with elevated privileges should be able to delete, restart, or make similar changes to critical systems or data. Default to using delete protection on critical infrastructure.
Credentials should be hidden from agents. Dangling database passwords, API keys, etc. are ways agents can get elevated privilege unintentionally.
Data egress should always have a human in the loop. Agents shouldn’t be able to send emails, upload files to an external S3 bucket, or post in external Slack channels when they also have access to sensitive data. If you give them that access, then that’s a security incident waiting to happen.
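The rules above can be read as a small decision table. The sketch below encodes a few of them as a pure function so the policy can be unit-tested. The `Action` fields, the `Decision` names, and the choice to treat "prod" as the only critical environment are all simplifying assumptions. In practice this logic would live in IAM policies, a database proxy, and the agent harness rather than in one Python function.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_HUMAN = "needs_human"


@dataclass(frozen=True)
class Action:
    kind: str  # "read", "write", "delete", or "egress"
    environment: str  # "dev", "staging", or "prod"
    touches_pii: bool = False


def decide(action: Action) -> Decision:
    """Apply the guardrails above, most restrictive rule first."""
    if action.touches_pii:
        return Decision.DENY  # agents should not read PII
    if action.kind == "egress":
        return Decision.NEEDS_HUMAN  # data egress keeps a human in the loop
    if action.environment == "prod" and action.kind == "delete":
        return Decision.NEEDS_HUMAN  # only privileged humans delete critical data
    if action.environment == "prod" and action.kind == "write":
        return Decision.DENY  # no agent writes or mutations in Production
    return Decision.ALLOW  # baseline: broad reads, writes outside prod


print(decide(Action("read", "prod")))  # Decision.ALLOW
print(decide(Action("write", "prod")))  # Decision.DENY
print(decide(Action("egress", "dev")))  # Decision.NEEDS_HUMAN
```

Keeping the policy as data plus one function makes it easy to review with a security engineer and to test every rule in isolation.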

The shrewd security engineer reading this is likely thinking “these were good ideas before agents.” Yep, except they’re way more important now. Platform engineers need to be more security-minded and partner closely with their security engineers on setting up guardrails. A bad security breach could cost you your job, and it’s way more likely to happen with agents in the loop.

The barrier to entry is lower

With coding agents, customer experience team members can take a first pass at fixing bugs reported by customers. Members of the Finance team can build their own Streamlit apps rather than relying on someone on the Data team. Members of the Growth team can tune the outbound lead scoring model and ship changes to the ML pipeline without help from an Applied Scientist. Since more stuff gets shipped, this is mostly a good thing. Mostly. Non-engineers should be able to specify intent and get something working and shipped as long as it’s low risk. Prompts like these should work on the first try:

> I'm just starting in this repo. Set up my local environment, and give me directions on how to build and test Streamlit apps.

> Build a Streamlit app that plots the sale of widgets across our three major hubs. Use the `orders` table in Delta Lake as the primary data source. Also give me the ability to include or exclude customers from a particular industry. Read through the Streamlit docs to find the best way to do it, and then implement. Execute the appropriate command to run the Streamlit app locally.

> A customer reported a UI bug [pasted screenshot in prompt]. It's supposed to look like  but it changed within the last week or so. Please fix it.

Doing this well while staying sane requires easier & more automated onboarding, easier dev & testing environments, stronger opinions, better enablement materials, stiffer guardrails, and direct engagement with new stakeholders.

The quality bar is much higher

Spinning up a local dev environment, provisioning an ephemeral QA environment, running deployments, building Docker images, and other not-so-obvious components and tasks on the platform will become painfully obvious if they’re slow or unreliable. Those flaky tests that developers always ignore because they can easily retry the GitHub Action now become a snag that agents get caught on during an overnight job, or a stampeding herd problem from agents cancelling and retrying when things seem to hang or fail intermittently. (Btw this is a huge reason why Modal is taking off - stuff spins up really quickly, which makes it awesome for agents.) Smaller issues become especially pronounced when engineers use long-running & complex teams of agents to ship something complex: one agent gets snagged by a temporary failure, and all of the downstream steps fail with it. Optimizing provisioning latency and reliability will be a constant task for platform engineers (or their agents, more ideally). Platform teams need to adopt a “If we can do it in 10 minutes, why not 5? If in 5, why not 2?” mantra when it comes to quality.
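Fixing the flakiness itself is the real goal, but the herd effect can also be softened on the retry side. One common mitigation (not something the post prescribes) is exponential backoff with jitter, so that many agents hitting the same intermittent failure do not all retry at the same instant. The sketch below is a generic helper. Its name and its defaults are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, sleeping a random, growing delay between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            # "Full jitter": pick a random delay up to an exponentially growing
            # cap, so concurrent retries spread out instead of arriving together.
            cap = min(max_delay, base_delay * 2**attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists only so the helper can be tested without actually waiting. In normal use the default `time.sleep` applies.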

Build more, buy less

The main value props of a managed service are:

Outsourced infra management. They (often) own the infrastructure, and are thus responsible for all of the standard infra duties like patching software, scaling, and ensuring reliability.
Premium support. When something goes wrong, you have someone you can depend on to help out.
Additional features. A polished UI, faster startup times, observability out of the box, or other pre-baked opinions that provide a better user experience.
(if it’s open source) A better version of what you can build for free. Maybe it’s a database product and their forked version provides more scalability or performance.

A good vendor means you need to hire one - or more! - fewer engineers on your team. There are tradeoffs, though:

Data security. Outsourcing the infra often means the data resides off-prem. As mentioned before, security is more important with agents, and it’s heightened by the imminent security risks from more powerful AI.
Product velocity. Vendors build to meet the needs of most customers rather than prioritizing your needs. If you flag sharp edges in the product that slow you down, they’ll likely answer “We’ll get to it in the next few quarters.” With how fast things are moving and how quickly your needs likely change, this often isn’t acceptable.

I could go on-and-on about buy vs build (and I will in a future post), but let’s go back to agents. With coding agents much of the vendor value prop erodes. Recurring agent jobs can proactively discover new software patches and automate much of the infra upgrades. An on-call agent can be extremely effective with support when it has full access to the documentation, source code, infra diagnostics, and human-written skills. Most of the nice-to-have features offered by a managed service can be easily replicated by a cracked engineer with Claude Code. With full control of the platform direction, you and your coding agents can compress timelines of new platform improvements from months to days. Over time, the compounding effects add up. Platform teams should choose their vendors carefully and default to building first instead of buying for most problems, at least to understand why the problem is hard enough to warrant a vendor.

Platforms are less sticky

“If we could migrate off of , we could save 600k a year.”

“I wish we could upgrade to version 3.0, but we’ll need to make small changes in hundreds of places in our codebase.”

“Our data scientists could be flying if we used , but we’ve used for years and it’s so ingrained. A migration will be painful.”

Platform engineers often groan at these tasks because the work is tedious and the business value takes time to realize or goes unnoticed. As a result, it gets pushed off. This changes with agents. Prompts like these are possible and successful now:

> Find every Kubeflow pipeline in the `pipelines/` subdirectory, and look at the git blame to determine the owner. Create a new "Kubeflow to Prefect" migration Notion Database with these columns: Pipeline Name, Owner Name, Status, PR. Then for each pipeline
  1. Spin up an Opus sub-agent.
  2. Use the kubeflow-to-prefect-migration skill to rewrite the pipeline as a Prefect flow.
  3. Run the flow to test it. Specifically, test its outputs. If it writes to Snowflake, run a parity check between the Kubeflow-derived data and Prefect-derived data - row counts, aggregations over columns, etc. If it trains a model, then confirm the version is saved in MLflow.
  4. Raise a PR and assign to the owner.
  5. Update the entry for the flow in the Notion database.
  
  Use good judgement on expensive training jobs. If the code runs a large, expensive distributed training job, then either aggressively downsample the dataset or adjust the training parameters to cut down costs. Leave a comment in the PR indicating that you made this change.

> Upgrade us to version 2.15.0 of the Snowflake Terraform provider. Use the snowflake-in-localstack skill to run `terraform plan` and `terraform apply` in LocalStack to test changes. "Done" means
- We've upgraded to version 2.15.0
- `terraform plan` results in no changes to the infra, and `terraform apply` runs successfully.
Make changes only in the `snowflake/` and `modules/snowflake/` subdirectories. Raise a PR after you're done.

or similar. Coding agents can automate much of the tedium with large scale migrations. Now, a migration that would take six+ months and 15 engineers from three different teams can largely be done by one or two cracked platform engineers in a single quarter. Agents excel at extremely tedious tasks, especially migrations and upgrades.
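The parity check in the first prompt ("row counts, aggregations over columns, etc.") is worth having as real code rather than leaving it to the agent's imagination. Below is a minimal sketch that compares two result sets already pulled into memory as lists of dicts. The function names and the choice of `sum` as the only aggregation are assumptions, and for large tables you would push the same comparison down into Snowflake SQL instead.

```python
from math import isclose

Row = dict[str, float]


def summarize(rows: list[Row], numeric_columns: list[str]) -> dict[str, float]:
    """Row count plus a sum per numeric column."""
    summary = {"row_count": float(len(rows))}
    for col in numeric_columns:
        summary[f"sum({col})"] = sum(row[col] for row in rows)
    return summary


def parity_report(
    old_rows: list[Row],
    new_rows: list[Row],
    numeric_columns: list[str],
    rel_tol: float = 1e-9,
) -> dict[str, tuple[float, float]]:
    """Return every metric where the two outputs disagree (empty means parity)."""
    old = summarize(old_rows, numeric_columns)
    new = summarize(new_rows, numeric_columns)
    return {
        metric: (old[metric], new[metric])
        for metric in old
        if not isclose(old[metric], new[metric], rel_tol=rel_tol)
    }


# Hypothetical outputs from the old and new pipeline.
kubeflow_rows = [{"amount": 10.0}, {"amount": 5.5}]
prefect_rows = [{"amount": 10.0}, {"amount": 5.5}]
print(parity_report(kubeflow_rows, prefect_rows, ["amount"]))  # {}
```

An empty report is a necessary signal rather than a sufficient one: matching counts and sums can still hide offsetting row-level differences, so a skill might add per-key comparisons for critical tables.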

Self-managing platforms

The automation that agents provide raises the possibility - and expectations - of how proactive platform engineers can be. With the right prompt and access to code & production systems, agents can manage codebases and platforms for the platform engineer. This part of the platform engineer’s job transitions from managing platforms themselves to providing context and setting up agents to manage the platform for the engineer. Specifically it can look like:

An on-call agent spins up upon creation of each incident and kicks off the initial investigation. Before the on-call engineer has even acknowledged the page and logged in, the agent has spent five minutes investigating the issue, and maybe even raised a PR to resolve the issue. The on-call agent uses skills written by platform engineers to query the correct systems in the correct way for context.
A recurring agent looks for antipatterns in the codebase not caught by static analyzers and proactively fixes them. It makes code changes, tests them, and assigns PRs to the code owners.
A recurring agent looks at telemetry data on the Spark clusters spun up for each data pipeline and right-sizes overprovisioned clusters to save money.
A recurring agent looks at a Slowest Database Queries Datadog dashboard and identifies the top five queries that are the slowest and frequently run. For each query, it spins up a sandbox environment to iterate on faster ways to query the same data. If it finds one, it makes a change, tests it, and raises a PR.
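For the last item, "slowest and frequently run" needs a concrete ranking before an agent can act on it. One simple option, shown below as a sketch, is to rank by total time consumed per day (mean latency times call count). The `QueryStats` fields and the sample numbers are hypothetical, and how you export these stats from Datadog is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStats:
    fingerprint: str  # normalized SQL text or a query ID
    calls_per_day: int
    mean_latency_ms: float

    @property
    def total_ms_per_day(self) -> float:
        """Daily time spent in this query: slow AND frequent both push it up."""
        return self.calls_per_day * self.mean_latency_ms


def top_candidates(stats: list[QueryStats], limit: int = 5) -> list[QueryStats]:
    """Return the queries that consume the most total time, worst first."""
    return sorted(stats, key=lambda q: q.total_ms_per_day, reverse=True)[:limit]


# Hypothetical numbers for illustration only.
stats = [
    QueryStats("rare_but_slow", calls_per_day=10, mean_latency_ms=9000.0),
    QueryStats("fast_but_hot", calls_per_day=1_000_000, mean_latency_ms=2.0),
    QueryStats("slow_and_hot", calls_per_day=50_000, mean_latency_ms=400.0),
]
for q in top_candidates(stats, limit=2):
    print(q.fingerprint, q.total_ms_per_day)
```

Ranking by total time keeps the agent from spending a sandbox session on a nine-second query that runs ten times a day while a 400 ms query runs fifty thousand times.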

The context provided to agents in recurring jobs provides a new way to express and enforce the opinions of the platform engineer. When done correctly, technical debt is addressed, costs stay under control, and performance is optimized proactively rather than reactively or never at all.

Tying it all together

With AI, the fundamentals of building and running a platform still matter, and the qualities that make a platform engineer successful still hold. However, like other types of engineering (and other professions) the role is changing significantly. The quality bar is higher, the barrier to entry lower, the platform will be more automated and nimble, the needs will evolve faster, and solid security controls are P0. This makes the engineering (and people) problems around platforms more difficult, but also more interesting.

Thanks for reading!

Good Platforms, Good Platform Engineers

Ryan Delgado — Mon, 04 May 2026 01:20:08 GMT

I’ve built on developer platforms for almost ten years and have built them for the last five. I’d like to think I’ve developed a sense of what “good” looks like for both the Platform and the Platform Engineer, so I’m sharing. The content comes from my experience building and using developer platforms, seeing what works and doesn’t, and hearing people’s complaints. Mostly the complaints. The opinions are my own.

What is a “Platform” anyway?

Short answer

It’s software used to build other software.

Longer answer

There are roughly two types of platforms:

A Developer Platform. This is often one or more frameworks, a code repository, infrastructure running in dev/staging/prod environments, a prescribed way of building and path to Production, and automation & tooling around that way of building.
A Core Service. This is an abstraction over a critical business function (e.g. processing a payment, sending an email or SMS) made available via an API. The API could be REST, gRPC, or something similar like a collection of classes & functions inside of a repo.

Both types of platforms are used to build applications. An application is software that directly solves a business problem. Engineers use a Developer Platform to build, ship, and host applications. They use core services to power specific features in those applications. There are platforms sold as products, e.g. Vercel (dev platform) or Stripe (core service), but I’ll focus on internal platforms, i.e. a platform built by one engineer at a company in service to another engineer building applications.

I’ll mostly focus on developer platforms because I have the most experience with them. I’m also a Data Engineer by trade, so most of my examples will be Data Engineer-y, but the underlying principles apply to platforms that serve other types of engineers.

Examples

Back-end developer platform. This is a developer platform that enables back-end software engineers to build applications & features on a monolith back-end. It runs on Kubernetes in AWS EKS and Aurora Postgres. The code is written in Python and lives in a single repo. The back-end is a REST service running as a FastAPI web server. Table schemas are managed with SQLAlchemy and Alembic. Asynchronous tasks are managed with Temporal. Developers can run the application locally and send requests to it with Postman. Deployments are managed with GitHub Actions, and trigger when a developer merges to main.
A Data Engineering platform. Developers write data pipelines using Airflow, and they load data from external sources into a RAW database in Snowflake. Data is transformed using dbt in a single daily dbt run, also orchestrated with Airflow. All Airflow and dbt code lives in a single repo but has separate deployment processes. Airflow runs in a Kubernetes cluster on AWS EKS, while dbt runs in an Airflow DAG. Developers can provision fully-functional local Airflow environments or developer-specific dbt databases and can run Airflow DAGs and dbt runs locally.

The Fundamentals of a “Good” Platform

My favorite analogy for what a “good” developer platform looks like is a lane in a bowling alley:

The bowling pins. The primary use case(s) of the platform are at the center. They’re clearly documented and broadly known throughout the engineering org.
The bumper rails (I’m terrible at bowling). It’s impossible - or extremely difficult - to do the wrong thing. Static analyzers block bad patterns at time of commit or PR. The abstractions catch common errors and show descriptive error messages and suggest good solutions. Non-prod environments are as Prod-like as possible, giving developers high confidence that if their code runs in dev, then it’ll run in prod.
The lane - It’s dead simple and low friction to do the right thing. Any services, patterns, or other abstractions on our platform are intuitive, and building with them is concise. We have reasonable defaults for inputs into these abstractions. Secrets management and authentication are largely hidden from the developer. There are minimal/no footguns - no data/schema consistency issues, weird permission differences between environments, flaky tests or deployment quirks, or arcane errors. There’s no “tribal knowledge” - any knowledge a new developer needs in order to build is documented or immediately obvious.
The bowling ball - applications are independent of each other. There’s minimal/no resource contention between applications and deployments of one application don’t affect the state of other applications.
The pinsetter - Deployments are automated, reliable, and fast. Dev or staging environments are easy and fast to spin up and use. We’re using the latest tools like uv, ruff (I’m a Python guy), ast-grep, etc. to reduce developer idle time. The Production environment is highly reliable.
The scoring monitor - At any given time, the developer knows the state of their application, deployment, job, or other Stuff on the platform. They have a Datadog dashboard that measures the latency and success rate of their APIs or the success/failure rate of asynchronous jobs. They can watch the CICD pipeline progress, and they know exactly when their change lands in Production. There are high-signal, automated alerts for anomalous and problematic behavior. The platform team also has infrastructure-level or broad observability tracking overall system health.

There’s probably more, but you get the idea. These are general guidelines, and not all of the details apply to you. An internal developer platform is only good if it enables and accelerates other builders in the org and its baked-in opinions fit the company’s way of building. Opinions are critical - more on that later.
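To make the bumper-rail idea concrete, here is a sketch of a tiny custom static check that could run at commit or PR time. It uses Python's standard `ast` module to flag string literals assigned to secret-looking variable names. The list of suspicious names is an assumption, and a real setup would more likely lean on an existing linter or secret scanner than a hand-rolled script.

```python
import ast

# Assumed naming convention for secret-like variables.
SUSPICIOUS_NAMES = ("password", "secret", "api_key", "token")


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for string literals assigned to secret-looking names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue  # only plain string literals count, not env lookups or calls
        for target in node.targets:
            if isinstance(target, ast.Name) and any(
                s in target.id.lower() for s in SUSPICIOUS_NAMES
            ):
                findings.append((node.lineno, target.id))
    return findings


sample = 'db_password = "hunter2"\nretries = 3\napi_key = os.environ["API_KEY"]\n'
print(find_hardcoded_secrets(sample))  # [(1, 'db_password')]
```

The check only parses the source and never runs it, which is why the sample can reference `os` without importing it.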

Builders will use your platform in ways you don’t expect.

Qualities of a “good” Platform Engineer

There are obvious qualities - technical excellence, agreeableness, can-do attitude. I’ll focus on the nonobvious ones.

Understands Asymmetries

The most important quality of a platform engineer is they understand the company’s asymmetries and tradeoffs. A developer platform at a Series A startup and a high frequency trading firm might be built on similar technology but will look completely different, because those companies and their engineering teams don’t operate the same way. A high frequency trading firm has more asymmetric downsides, i.e. each change could interrupt normal trading operations causing millions of dollars in losses. A young startup has more asymmetric upsides, as each change they make could bring them closer to product-market fit. One company needs high reliability, while the other needs product velocity. This isn’t static either. If that young startup builds a business critical product that tens of thousands of customers depend on, the tradeoffs change, because reliability and scalability matter more than they did previously. If the platform engineer understands this, they can build a platform with the right amount of complexity at each stage to maximize the platform’s business value.

Empathetic

Next is empathy for the application engineers and their use cases. The use cases’ requirements dictate the technology choice and the abstractions built into the platform. Let’s use the Data Engineering platform example from before, and let’s say we want to establish the pattern for loading data into Snowflake. The simplest approach we could take may look something like

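from airflow.decorators import task

@task
def load_data(file_path: str, **context):
    # Imports live inside the task so they only load when the task runs.
    import pandas as pd # I'm a simple man with simple solutions.
    from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
    hook = SnowflakeHook()  # uses the default Snowflake connection
    df = pd.read_csv(file_path)  # reads the whole file into memory
    df.to_sql(...) # etc...

# etc...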

This will work great if the files are small and are in a format that pandas can read. It won’t work for terabytes of data or files with poor/no structure. Likewise, you shouldn’t force other engineers to use Spark if they’re only dealing with small & medium-sized data. The platform engineer can only know what pattern to implement if they work with the stakeholder engineer and understand the business problems they solve. The best engagement model for this is the embedded model, where the platform engineer embeds on the stakeholder engineer’s team and solves the problem with them. I have a lot of thoughts on embedding platform engineers on other teams, but I’ll save that for a future post.

Opinionated

The next quality is opinions. Engineers that use a platform do not want to hear “you can approach it in X or Y way. Up to you!” from the platform engineer. They want Dr. Platform Engineer to prescribe a solution to their problem, i.e. an opinion. The difference between a Good and a Great developer platform is the opinions baked into it through abstractions, guardrails, default arguments, code organization, and other stuff. Let’s use loading CSV files into Snowflake as an example again. Here’s an unopinionated way to do it:

COPY INTO RAW.EXAMPLE.TABLE
FROM 's3://my-internal-bucket-name/path/to/file.csv'
CREDENTIALS = (
  AWS_KEY_ID = 'yep'
  AWS_SECRET_KEY = 'mhmm'
)
FILE_FORMAT = (
  TYPE = CSV,
  SKIP_HEADER = 1,
  FIELD_OPTIONALLY_ENCLOSED_BY = '"', 
  NULL_IF = ('nan', '')
);

and here’s an opinionated way:

COPY INTO RAW.EXAMPLE.TABLE
FROM @RAW.PUBLIC.MY_BUCKET/path/to/file.csv
FILE_FORMAT = (FORMAT_NAME = 'RAW.PUBLIC.DEFAULT_CSV_FORMAT');

The second one is better because nobody needs to think about the name of the bucket (I’m surprisingly bad at this), values to pass into FILE_FORMAT, or AWS credentials, because the platform engineer already set up an S3 Integration, stage, and file format that other engineers can use off-the-shelf and know it “just works.” Opinions from the platform engineer reduce the cognitive load on everyone else. They accelerate development, improve the success rate, and improve the maintainability. Opinions come from

Repeated use of a common pattern, which you formalize into an abstraction.
Previous incidents or small bugs that reveal new failure modes, so you add guardrails.
Your experience with the technology. You know the footguns and the “best” default parameters to use, so you make those the defaults.
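The same opinion can also be baked into code through default arguments. The sketch below wraps the opinionated `COPY INTO` from earlier in a helper whose defaults are the stage and file format names used in that example. The helper itself is hypothetical, and it writes the file format using Snowflake's `FORMAT_NAME` option.

```python
def copy_into(
    table: str,
    path: str,
    stage: str = "RAW.PUBLIC.MY_BUCKET",
    file_format: str = "RAW.PUBLIC.DEFAULT_CSV_FORMAT",
) -> str:
    """Build a COPY INTO statement using the platform's default stage and file format."""
    # Identifiers are interpolated directly, so callers must pass trusted,
    # code-defined names and never raw user input.
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage}/{path.lstrip('/')}\n"
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}');"
    )


print(copy_into("RAW.EXAMPLE.TABLE", "path/to/file.csv"))
```

Callers name the table and the file and nothing else. The defaults can still be overridden for the rare case that needs it, which keeps the opinion strong without making it a wall.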

That’s all for now! Thanks for reading.