How AI changes Platform Engineering
(writing this as of May 2026. It might be out of date by the time you read this)
AI is a really big deal, including for Platform Engineering. The role, possibilities, and expectations of the platform engineer change significantly with coding agents. I’ve spent the last year and a half building with coding agents (mainly Claude Code) and building platforms for engineers who use coding agents. I’ve seen agents excel at tedious but important tasks, lower the barrier to entry onto platforms, and reduce the cost of code and time to delivery. I’ve also seen them cause problems. We’re in a different world now, and platform engineering teams must adapt. I’ll review what I’ve experienced and learned building with and for coding agents.
I feel bold talking about the Topic of our Times in blog two, but eh. In that vein, I’ll start off in the most obnoxious way possible by saying
IN THE AGE OF AI
Everything from before still matters, and actually matters more. Before, we built on platforms by writing code with our bare hands in our IDEs. Nowadays, we instead prompt a coding agent, and the coding agent writes the code and takes other actions around the platform in order to build The Thing. Two major risks here: the agent writes the code wrong, or it does something it shouldn’t. Agents are rapidly becoming as good as or better than humans at writing code, but their doggone persistence and naivety about the risks they take make them at best annoying and at worst dangerous.
At the same time, the - enormous - productivity gains that agents provide make the trade worth it. When the cost of generating code approaches zero, bottlenecks in other parts of the dev lifecycle become more pronounced. When writing the code took three hours, that ten-minute spin-up time for a cloud QA environment was nothing. If it takes three minutes for Claude to generate the code, then it takes >3x as long to spin up a testing environment as it does to write the code, write the tests, and maybe even run the tests. The platform becomes the bottleneck.
AI-by-default is the norm in AI-pilled engineering teams and will become the norm industry-wide soon (best guess is two years, tops). This means that platform teams need to build platforms for agents driven by engineers rather than just engineers. A lot of stuff changes in this new paradigm.
Agents - not humans - are the primary users of platforms
AI is trained on text produced by humans, its reasoning follows thought patterns similar to ours, and AI agents take actions similar to ours. They’re also similar to us in how they build software, and they have the same needs as us in order to build (except food, water, and - apparently - sleep). Their default interface is also the terminal rather than the browser (for now, at least). Because agentic engineering will become the default mode of working, a platform isn’t ready for Prime Time until agents can interact with it and build on it. Let’s use the Back-end Engineering platform from my first blog as an example. Coding agents should be able to:
Clone and fully interact with the git repo. Create branches, commits, PRs, etc.
Spin up a local dev environment, and iterate with it.
Spin up an ephemeral QA environment in the cloud, including a database.
Query Datadog, Sentry, and build logs in the CI tool.
Query the database directly.
Interact with stuff in AWS.
These are all things we human developers take for granted, but if the coding agent can’t do all/most of these things, it can’t build on the platform effectively. If your developers primarily use local agents, this is actually easy, because the agent will assume the human’s identity and act on their behalf while running. Remote agents (e.g. Devin, or maybe you’ve built your own) are trickier. They can’t (yet?) assume your identity, and they’re likely running in a different VPC than the rest of your infra. You’ll need to work this out with your Infra and Security teams. I’ll discuss this more later on.
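A cheap way to keep yourself honest on this list is to script it. Here’s a minimal smoke-test sketch - the repo URL, Make target, and env var names are placeholders for whatever your platform actually uses:

```python
import os
import shutil
import subprocess

# Hypothetical smoke test: can an agent running in this environment reach
# the things it needs to build on the platform? Commands are placeholders.
CHECKS = [
    ("git access", ["git", "ls-remote", "git@github.com:your-org/your-service.git"]),
    ("aws identity", ["aws", "sts", "get-caller-identity"]),
    ("local dev env", ["make", "-n", "dev-up"]),  # dry-run the dev env target
]

def run_checks() -> None:
    for name, cmd in CHECKS:
        if shutil.which(cmd[0]) is None:
            print(f"{name}: FAIL ({cmd[0]} not installed)")
            continue
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            print(f"{name}: FAIL (timed out)")
            continue
        print(f"{name}: {'ok' if result.returncode == 0 else 'FAIL'}")
    # For API-based tools (Datadog, Sentry), checking that scoped read-only
    # credentials are present is a reasonable first pass.
    for var in ("DD_API_KEY", "SENTRY_AUTH_TOKEN"):
        print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")

if __name__ == "__main__":
    run_checks()
```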
SKILL.md files, not Notion docs
Well, mostly. Notion and other knowledge management tools still have a place. But much of the content related to writing code and interacting with other aspects of the platform should live in agent skills. Specifically:
How to write code/use the abstractions in the best way on the platform.
How to provision and interact with a dev or testing environment.
What metrics to look at or what log filters to use in Datadog to monitor application health.
How to run tests, linters, and other checks before raising a PR.
Most - but not all - tasks that a human developer normally does on the platform should also be doable by an agent at the same or almost the same quality. North star: users of your platform should be able to one-shot a working v0 of their application >80% of the time. You get there with agent skills and other context provided to your agents.
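For a sense of shape: here’s a minimal SKILL.md sketch for the QA-environment task, using the YAML-frontmatter shape of Claude Code’s agent skills (check the current docs for the exact fields). The `platform` CLI and its subcommands are made up - substitute your own tooling:

```markdown
---
name: qa-environment
description: Provision, use, and tear down an ephemeral QA environment for this service
---

# Ephemeral QA environments

1. Run `platform qa create --branch <your-branch>` (hypothetical CLI). This
   provisions the service plus a seeded Postgres database and prints a URL.
2. Health-check it: `curl <url>/healthz` should return 200 within ~2 minutes.
3. Tail logs with `platform qa logs` while you test.
4. Always run `platform qa destroy` when you're done. Environments left
   running longer than 24 hours get reaped, and the owner gets paged.
```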
On access - More breadth, less depth
Think about all of the systems you use in your day-to-day as an engineer. It probably consists of your IDE, Slack, Notion, Linear/Jira, Datadog, and the CLI & web interface for your cloud provider. Sub out any of those apps for a similar app you use. Let’s say there’s a Linear ticket that says the /search/invoices route in your back-end service has become slow for larger enterprise customers. If you’re a human, you’d probably do something like this:
Read the relevant Linear ticket. Mark it as In Progress.
Examine the existing code and tables around the slow route.
Noodle on an approach.
Write the code for that approach. Iteratively write & test it in your local dev environment.
Test it again, but in a testing environment that’s more prod-like.
Code is ready, so you post a “yo can I get a review?: <link to pr>” message in a Slack channel.
Wait for a review.
Code merges and deploys.
You check Datadog to see if the route sped up after the change. Assuming it does, you mark the Linear ticket as “Done.”
A human can do this no problem. So can an agent, if it has access to Linear, Slack, GitHub, Datadog, and AWS. To perform a task successfully, an agent needs a clear objective to meet, an environment to experiment with solutions, a way to evaluate those solutions, a way to promote those solutions to production, and time. An agent with enough access to all of the dev tooling, paired with a well-written prompt, can be as effective as a human engineer on most/all shallow engineering tasks.
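The kickoff prompt for that loop doesn’t need to be fancy. Something like this would do (the ticket ID and channel name here are made up):

> Pick up LIN-1482: the /search/invoices route has become slow for larger enterprise customers. Mark the ticket as In Progress, reproduce the slowness locally, fix it, and verify the fix in an ephemeral QA environment. Then raise a PR and ask for a review in #backend-reviews. After the change merges and deploys, confirm in Datadog that the route sped up, and mark the ticket as Done.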
The keyword in all of this is “enough” access, and I’d even say just enough and nothing more is more correct. An agent with full access to production systems can wreak havoc.
The appropriate safeguards are guardrails, not friction. Some specific advice:
Agents should have baseline read access to public knowledge stores (Notion, Slack), GitHub, the data warehouse, observability tools like Datadog and Sentry, and - yes - Production databases. The more systems they have access to, the more context they can gather, and the more effective they’ll be.
Agents should not be able to write or mutate data or code in Production. Database migrations, code deployments, backfills, and similar functions should be handled by non-agent automated processes. Agents should be able to monitor them and troubleshoot issues, but they should not be able to trigger them.
Agents should not be able to read PII. Gate your database behind a proxy layer (Formal is a good solution, or you could build your own), and set up ACLs on PII fields across your tables/collections.
Data warehouses need fine-grained ACLs. A marketing professional querying Snowflake through Claude Desktop doesn’t need access to the `super_sensitive_financial_data` schema to do their job, so they shouldn’t have access.
Only humans with elevated privileges should be able to delete, restart, or make similar changes to critical systems or data. Default to using delete protection on critical infrastructure.
Credentials should be hidden from agents. Dangling database passwords, API keys, etc. are ways agents can unintentionally gain elevated privileges.
Data egress should always have a human in the loop. Agents shouldn’t be able to send emails, upload files to an external S3 bucket, or post in external Slack channels when they also have access to sensitive data. If you give them that access, then that’s a security incident waiting to happen (a minimal sketch of this guardrail is below).
The shrewd security engineer reading this is likely thinking “these were good ideas before agents.” Yep, except they’re way more important now. Platform engineers need to be more security-minded and partner closely with their security engineers on setting up guardrails. A bad security breach could cost you your job, and it’s way more likely to happen with agents in the loop.
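To make the egress guardrail concrete, here’s a minimal sketch of a human-in-the-loop gate. In real life the queue would live in a durable store and approvals would come from Slack or an internal admin tool; every name here is a placeholder, not a real library:

```python
import uuid

# Minimal sketch: the agent's tool layer calls request_egress() instead of
# sending anything directly. Nothing leaves until a human calls approve().
_pending: dict[str, dict] = {}
_approved: set[str] = set()

def request_egress(action: str, payload: dict) -> str:
    """Called by the agent. Queues the egress and notifies a reviewer."""
    request_id = str(uuid.uuid4())
    _pending[request_id] = {"action": action, "payload": payload}
    notify_reviewer(request_id, action)  # e.g. post to a private Slack channel
    return request_id

def approve(request_id: str, reviewer: str) -> None:
    """Invoked by a human reviewer - never exposed to the agent."""
    if request_id not in _pending:
        raise KeyError("unknown egress request")
    print(f"{reviewer} approved {request_id}")
    _approved.add(request_id)

def execute_egress(request_id: str) -> None:
    """Performs the send only if a human has approved it."""
    if request_id not in _approved:
        raise PermissionError("egress not approved by a human")
    req = _pending.pop(request_id)
    send_for_real(req["action"], req["payload"])

def notify_reviewer(request_id: str, action: str) -> None:  # stub for the sketch
    print(f"review needed: {action} ({request_id})")

def send_for_real(action: str, payload: dict) -> None:  # stub for the sketch
    print(f"sending {action}")
```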
The barrier to entry is lower
With coding agents, customer experience team members can take a first pass at fixing bugs reported by customers. Members of the Finance team can build their own Streamlit apps rather than relying on someone on the Data team. Members of the Growth team can tune the outbound lead scoring model and ship changes to the ML pipeline without help from an Applied Scientist. Since more stuff gets shipped, this is mostly a good thing. Mostly. Non-engineers should be able to specify intent and get something working and shipped as long as it’s low risk. Prompts like these should work on the first try:
> I'm just starting in this repo. Set up my local environment, and give me directions on how to build and test Streamlit apps.
> Build a Streamlit app that plots the sale of widgets across our three major hubs. Use the `orders` table in Delta Lake as the primary data source. Also give me the ability to include or exclude customers from a particular industry. Read through the Streamlit docs to find the best way to do it, and then implement. Execute the appropriate command to run the Streamlit app locally.
> A customer reported a UI bug [pasted screenshot in prompt]. It's supposed to look like <description> but it changed within the last week or so. Please fix it.
Doing this well while staying sane requires easier & more automated onboarding, easier dev & testing environments, stronger opinions, better enablement materials, stiffer guardrails, and direct engagement with new stakeholders.
The quality bar is much higher
Spinning up a local dev environment, provisioning an ephemeral QA environment, running deployments, building Docker images, and other not-so-obvious components and tasks on the platform will become painfully obvious if they’re slow or unreliable. Those flaky tests that developers always ignore because they can easily retry the GitHub Action now become a snag that agents get caught on during an overnight job, or a stampeding herd when agents cancel and retry en masse because things seem to hang or fail intermittently. (Btw, this is a huge reason why Modal is taking off - stuff spins up really quickly, which makes it awesome for agents.) Smaller issues become especially pronounced when engineers use long-running, complex teams of agents to ship something big: one agent gets snagged by a transient failure, and every downstream step fails with it. Optimizing provisioning latency and reliability will be a constant task for platform engineers (or their agents, more ideally). Platform teams need to adopt an “If we can do it in 10 minutes, why not 5? If in 5, why not 2?” mantra when it comes to quality.
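On the stampede problem specifically: if agents are going to retry, make them retry politely. Exponential backoff with full jitter is the standard move; a quick sketch (the provisioning call in the usage comment is a stand-in):

```python
import random
import time

def with_backoff(operation, max_attempts: int = 5, base_delay: float = 2.0):
    """Retry a flaky operation with exponential backoff and full jitter.

    The jitter spreads retries out in time, so a fleet of agents that all
    hit the same transient failure doesn't hammer the provisioning service
    in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # in real code, catch only retryable errors
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, base_delay * 2**attempt)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage (hypothetical provisioning function):
# env = with_backoff(lambda: provision_qa_environment(branch="my-branch"))
```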
Build more, buy less
The main value props of a managed service are:
Outsourced infra management. They (often) own the infrastructure, and are thus responsible for all of the standard infra duties like patching software, scaling, and ensuring reliability.
Premium support. When something goes wrong, you have someone you can depend on to help out.
Additional features. A polished UI, faster startup times, observability out of the box, or other pre-baked opinions that provide a better user experience.
(if it’s open source) A better version of what you can build for free. Maybe it’s a database product and their forked version provides more scalability or performance.
A good vendor means you need one - or more! - fewer engineers on your team. There are tradeoffs, though:
Data security. Outsourcing the infra often means the data resides off-prem. As mentioned before, security is more important with agents, and it’s heightened by the imminent security risks from more powerful AI.
Product velocity. Vendors build to meet the needs of most customers rather than prioritizing your needs. If you flag sharp edges in the product that slow you down, they’ll likely answer “We’ll get to it in the next few quarters.” With how fast things are moving and how quickly your needs change, this often isn’t acceptable.
I could go on and on about buy vs. build (and I will in a future post), but let’s go back to agents. With coding agents, much of the vendor value prop erodes. Recurring agent jobs can proactively discover new software patches and automate much of the infra upgrades. An on-call agent can be extremely effective at support when it has full access to the documentation, source code, infra diagnostics, and human-written skills. Most of the nice-to-have features offered by a managed service can be easily replicated by a cracked engineer with Claude Code. With full control of the platform direction, you and your coding agents can compress timelines for new platform improvements from months to days. Over time, the compounding effects add up. Platform teams should choose their vendors carefully and default to building first instead of buying for most problems, at least to understand why the problem is hard enough to warrant a vendor.
Platforms are less sticky
“If we could migrate off of <managed service>, we could save 600k a year.”
“I wish we could upgrade <framework> to version 3.0, but we’ll need to make small changes in hundreds of places in our codebase.”
“Our data scientists could be flying if we used <other framework>, but we’ve used <current framework> for years and it’s so ingrained. A migration will be painful.”
Platform engineers often groan at these tasks because the work is tedious and the business value takes time to realize or goes unnoticed. As a result, it gets pushed off. This changes with agents. Prompts like these are possible and successful now:
> Find every Kubeflow pipeline in the `pipelines/` subdirectory, and look at the git blame to determine the owner. Create a new "Kubeflow to Prefect" migration Notion Database with these columns: Pipeline Name, Owner Name, Status, PR. Then for each pipeline
1. Spin up an Opus sub-agent.
2. Use the kubeflow-to-prefect-migration skill to rewrite the pipeline as a Prefect flow.
3. Run the flow to test it. Specifically, test its outputs. If it writes to Snowflake, run a parity check between the Kubeflow-derived data and the Prefect-derived data - row counts, aggregations over columns, etc. If it trains a model, then confirm the version is saved in MLflow.
4. Raise a PR and assign to the owner.
5. Update the entry for the flow in the Notion database.
Use good judgement on expensive training jobs. If the code runs a large, expensive distributed training job, then either aggressively downsample the dataset or adjust the training parameters to cut down costs. Leave a comment in the PR indicating that you made this change.
or
> Upgrade us to version 2.15.0 of the Snowflake Terraform provider. Use the snowflake-in-localstack skill to run `terraform plan` and `terraform apply` in LocalStack to test changes. "Done" means
- We've upgraded to version 2.15.0
- `terraform plan` results in no changes to the infra, and `terraform apply` runs successfully.
Make changes only in the `snowflake/` and `modules/snowflake/` subdirectories. Raise a PR after you're done.
or similar. Coding agents can automate much of the tedium with large scale migrations. Now, a migration that would take six+ months and 15 engineers from three different teams can largely be done by one or two cracked platform engineers in a single quarter. Agents excel at extremely tedious tasks, especially migrations and upgrades.
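For what it’s worth, the parity check in step 3 of the first prompt is the piece that makes the whole thing trustworthy, so it’s worth spelling out in the skill. A rough sketch of what the agent might run - table and column names are placeholders, and it assumes the snowflake-connector-python package:

```python
import snowflake.connector  # pip install snowflake-connector-python

def parity_check(conn, old_table: str, new_table: str, numeric_cols: list[str]) -> bool:
    """Compare a Kubeflow-derived table against its Prefect-derived twin.

    Row counts plus per-column sums is a crude but cheap first pass; add
    group-bys over key dimensions for anything business-critical.
    """
    cur = conn.cursor()
    ok = True
    checks = ["COUNT(*)"] + [f"SUM({col})" for col in numeric_cols]
    for expr in checks:
        cur.execute(f"SELECT {expr} FROM {old_table}")
        old_val = cur.fetchone()[0]
        cur.execute(f"SELECT {expr} FROM {new_table}")
        new_val = cur.fetchone()[0]
        if old_val != new_val:
            print(f"mismatch on {expr}: {old_val} vs {new_val}")
            ok = False
    return ok

# conn = snowflake.connector.connect(...)  # connection params omitted
# parity_check(conn, "analytics.orders_kf", "analytics.orders_prefect",
#              ["order_total", "item_count"])
```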
Self-managing platforms
The automation that agents provide raises the possibility - and the expectation - of how proactive platform engineers can be. With the right prompt and access to code & production systems, agents can manage codebases and platforms for the platform engineer. This part of the job shifts from managing platforms yourself to providing context and setting up agents to manage the platform for you. Specifically, it can look like:
An on-call agent spins up upon creation of each incident and kicks off the initial investigation. Before the on-call engineer has even acknowledged the page and logged in, the agent has spent five minutes investigating the issue, and maybe even raised a PR to resolve it. The on-call agent uses skills written by platform engineers to query the correct systems in the correct way for context.
A recurring agent looks for antipatterns in the codebase not caught by static analyzers and proactively fixes them. It makes code changes, tests them, and assigns PRs to the code owners.
A recurring agent looks at telemetry data on the Spark clusters spun up for each data pipeline and right-sizes overprovisioned clusters to save money.
A recurring agent looks at a “Slowest Database Queries” Datadog dashboard and identifies the five slowest, most frequently run queries. For each query, it spins up a sandbox environment to iterate on faster ways to query the same data. If it finds one, it makes the change, tests it, and raises a PR.
The context provided to agents in recurring jobs provides a new way to express and enforce the opinions of the platform engineer. When done correctly, technical debt is addressed, costs stay under control, and performance is optimized proactively rather than reactively or never at all.
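Mechanically, a “recurring agent” can be as simple as a scheduled job that shells out to your agent CLI in non-interactive mode. Here’s a sketch using Claude Code’s headless print mode - the skill name and prompt body are placeholders, and your scheduler (cron, GitHub Actions, whatever) does the recurring part:

```python
import subprocess

# Placeholder prompt for the Spark right-sizing job described above.
PROMPT = """
Use the spark-rightsizing skill. Pull the last 7 days of cluster telemetry,
find pipelines whose peak CPU/memory usage is under 40% of what's provisioned,
shrink their cluster specs, run the affected pipelines against staging, and
raise one PR per pipeline assigned to its owner.
"""

def run_recurring_agent() -> None:
    # `claude -p` runs Claude Code non-interactively and prints the result.
    result = subprocess.run(
        ["claude", "-p", PROMPT],
        capture_output=True, text=True, timeout=3600,
    )
    print(result.stdout)  # ship the transcript somewhere a human will see it
    if result.returncode != 0:
        raise RuntimeError(f"agent run failed: {result.stderr[:500]}")

if __name__ == "__main__":
    run_recurring_agent()
```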
Tying it all together
With AI, the fundamentals of building and running a platform still matter, and the qualities that make a platform engineer successful still hold. However, like other types of engineering (and other professions), the role is changing significantly. The quality bar is higher, the barrier to entry is lower, the platform will be more automated and nimble, the needs will evolve faster, and solid security controls are P0. This makes the engineering (and people) problems around platforms more difficult, but also more interesting.
Thanks for reading!
