Why Your Azure Subscription Looks Like a Teenager's Bedroom (And How to Fix It)
🎬 The Scene: It's Monday Morning...
You open the Azure portal. There are 47 resource groups. Nobody knows who created 23 of them. There's a VM called test-final-v2-REAL-final that has been running since 2024. Someone deployed an $800/month App Gateway for a dev environment. The tagging strategy? What tagging strategy?

Sound familiar? Welcome to Azure Cloud Architecture Therapy, where we turn your chaotic cloud into something a Principal Engineer would be proud of. Grab coffee. This is going to be fun.

Before we fix anything, let's understand the plumbing. Every single thing you do in Azure, whether you're clicking buttons in the portal or running terraform apply, goes through one gateway:

You → Azure Resource Manager (ARM) → The Actual Resource
ARM is the bouncer at the club. It checks:

1. Who are you? (Authentication via Entra ID)
2. Can you do this? (Authorization via RBAC)
3. Should we let this through? (Policies & throttle limits)
4. OK, forwarding to the bartender (Resource Provider)

The Error:

```
Status=429 Code="TooManyRequests"
Message="The request was throttled. Retry after 37 seconds"
```
What Happened: A team ran terraform plan on a monolithic root module with 2,000+ resources. ARM limits you to roughly 12,000 read requests/hour and 1,200 write requests/hour per subscription. Their plan consumed the entire read budget, blocking other teams' deployments.

The Fix:

- Split infrastructure across multiple subscriptions (not just resource groups)
- Break that mega Terraform root module into smaller state files
- Use terraform plan -parallelism=5 instead of the default 10
- Schedule pipeline runs to avoid peak hours

💡 Principal Insight: ARM throttling is the #1 reason to adopt a multi-subscription strategy. If you think "we'll just use one subscription," you haven't hit scale yet.

Think of Azure organization like a company org chart, except everyone actually follows it (unlike real company org charts):

```
Tenant Root Group (The CEO nobody talks to)
├── Platform (The boring-but-essential stuff)
│   ├── Identity Subscription (AD DS, DNS, PKI)
│   ├── Management Subscription (Log Analytics, Monitoring)
│   └── Connectivity Subscription (Hub Network, Firewall, VPN)
├── Landing Zones (Where the real work happens)
│   ├── Corp (Internal apps, no internet exposure)
│   │   ├── team-alpha-subscription
│   │   └── team-bravo-subscription
│   └── Online (Internet-facing apps)
│       ├── public-web-app-subscription
│       └── api-platform-subscription
├── Sandbox (The "break stuff here" zone)
│   └── dev-playground-subscription
└── Decommissioned (The graveyard. RIP test-final-v2.)
    └── old-projects-subscription
```
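Back on the throttling story: whatever you do with subscriptions, clients should also honor ARM's Retry-After hint instead of hammering the API. A minimal POSIX-sh sketch; `retry_with_backoff` is a hypothetical helper, and the wrapped command is assumed to print `429 <seconds>` when throttled (a stand-in for reading the real Retry-After response header):

```shell
# Hypothetical helper: retry a command that may fail with HTTP 429.
# Convention assumed here: on throttle, the command prints "429 <seconds>"
# and exits non-zero; a real client would read the Retry-After header.
retry_with_backoff() {
  max_attempts=$1; shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if output=$("$@"); then
      printf '%s\n' "$output"
      return 0
    fi
    case $output in
      "429 "*) sleep "${output#429 }" ;;  # honor the server's retry hint
      *) return 1 ;;                      # non-throttle failure: give up
    esac
    attempt=$((attempt + 1))
  done
  return 1
}
```

Calling it looks like `retry_with_backoff 5 az resource list ...`; the point is that waiting the server-specified interval keeps one noisy pipeline from burning the whole subscription's read budget.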
| Pattern | Best For | Gotcha |
|---|---|---|
| App-per-subscription | Large orgs, strict isolation | Too many subscriptions to manage without automation |
| Environment-per-subscription | Medium orgs | Apps from 15 teams sharing a "prod" subscription = chaos |
| Team-per-subscription | Autonomy-focused orgs | Cross-team app dependencies get messy |
| Workload-per-subscription | CAF recommended | Requires solid IaC automation |
What Happened: A fintech startup put everything (dev, staging, prod, the CEO's demo environment) into one subscription. An intern with the Contributor role on the subscription accidentally deleted the production resource group. Yes, the production resource group. On a Tuesday.

The Fix:

- Separate subscriptions for prod vs. non-prod (at minimum)
- Azure Resource Locks on production resource groups:
```shell
az lock create --name "CannotDelete" \
  --lock-type CanNotDelete \
  --resource-group rg-payments-prod-eastus
```
- PIM (Privileged Identity Management) for elevated access: no one gets permanent Owner
- Delete locks + RBAC deny assignments for dangerous operations

I know, I know. Naming conventions. About as exciting as watching paint dry. But here's the thing: when it's 2 AM and you're debugging a production issue, the difference between rg-payments-prod-eastus-001 and myResourceGroup7 is the difference between finding the problem and updating your LinkedIn.

The pattern:

```
{resource-type}-{workload}-{environment}-{region}-{instance}
```
Examples:

- rg-payments-prod-eastus-001 → I know exactly what this is
- aks-payments-prod-eastus-001 → AKS cluster for payments, prod
- kv-payments-prod-eastus-001 → Key Vault
- stpaymentsprodeastus001 → Storage (no hyphens allowed, thanks Azure)
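If you adopt the pattern, it's worth scripting it so nobody free-hands myResourceGroup7 again. A sketch following the convention above; `azname` is a hypothetical helper (not an Azure CLI command), and the special case for the `st` prefix mirrors the storage-account no-hyphens rule:

```shell
# Hypothetical name builder for:
#   {resource-type}-{workload}-{environment}-{region}-{instance}
azname() {
  type=$1; workload=$2; env=$3; region=$4; instance=$5
  name="${type}-${workload}-${env}-${region}-${instance}"
  # Storage accounts allow no hyphens and only lowercase alphanumerics
  if [ "$type" = "st" ]; then
    name=$(printf '%s' "$name" | tr -d '-' | tr 'A-Z' 'a-z')
  fi
  printf '%s\n' "$name"
}
```

Usage: `azname rg payments prod eastus 001` prints `rg-payments-prod-eastus-001`, and the `st` prefix collapses the hyphens automatically.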
| Tag | Why You Need It At 3 AM |
|---|---|
| environment | "Is this prod or dev?" (crucial before you kubectl delete) |
| owner | "Who do I page?" |
| cost-center | "Who's paying for this $3,000/month GPU VM?" |
| application | "Which app does this belong to?" |
| data-classification | "Can I share this log with the vendor?" |
| created-by | "Did Terraform create this or did someone ClickOps it?" |
The Error: Finance escalates that Azure spend jumped $47K in one month. Nobody knows why.

Root Cause: A performance test spun up 50 Standard_E64s_v5 VMs (64 vCPU, 512 GB RAM each) with no auto-shutdown and no cost tags. The test ran on a Friday. Nobody noticed until billing closed.

The Fix:

- Azure Policy to deny resource creation without required tags
- Cost anomaly alerts at subscription and resource group level
- Auto-shutdown policy for dev/test VMs
- Tag-based cost reporting in Azure Cost Management
Azure Policy rule: require the 'cost-center' tag, deny creation without it:

```json
{
  "if": {
    "field": "[concat('tags[', 'cost-center', ']')]",
    "exists": "false"
  },
  "then": {
    "effect": "deny"
  }
}
```
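You can also catch this client-side, before ARM rejects the deployment. A hedged sketch: `check_tags` is a hypothetical helper (not part of the Azure CLI), and the required list below is just a subset of the tag table above:

```shell
# Hypothetical preflight: given "key=value" pairs, report which required
# tags are missing before the deployment ever reaches Azure Policy.
check_tags() {
  required="environment owner cost-center application"  # assumed subset
  missing=""
  for key in $required; do
    case " $* " in
      *" $key="*) ;;                  # tag present
      *) missing="$missing $key" ;;   # tag absent
    esac
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

Wire it into the pipeline before terraform apply and the feedback loop drops from "deployment denied by policy" to an instant local failure.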
Azure networking is where even senior engineers start sweating. Let's make it simple.

```
            The Internet
                 │
        ┌────────┴────────┐
        │     Hub VNet    │
        │  Firewall, VPN/ │
        │  ExpressRoute,  │
        │  DNS            │
        └────────┬────────┘
                 │
       ┌─────────┼─────────┐
       ▼         ▼         ▼
    Spoke 1   Spoke 2   Spoke 3
    (App A)   (App B)   (Shared)
```
The Hub = your security checkpoint. All traffic flows through here. Spokes = where your applications live, isolated from each other.

- NO public endpoints on backend services. Period.
- Private Endpoints for every PaaS service (SQL, Key Vault, Storage, ACR)
- Service endpoints are the poor man's Private Endpoints; use them only when budget is truly tight
- All traffic stays on the Microsoft backbone network

The Alert:

```
Microsoft Defender for Cloud: CRITICAL
"Azure SQL Server has public network access enabled"
"3,847 failed login attempts from IP: 185.x.x.x in the last hour"
```
What Happened: A developer enabled "Allow Azure services" on an Azure SQL Server "just for testing" and never turned it off. This essentially opens your SQL server to traffic from any Azure IP, including attacker VMs running in Azure. The Fix:
```shell
# Disable public access
az sql server update --name sql-prod --resource-group rg-app \
  --public-network-access Disabled
```
```shell
# Use a Private Endpoint instead
az network private-endpoint create \
  --name pe-sql-prod \
  --resource-group rg-app \
  --vnet-name vnet-spoke-app \
  --subnet snet-data \
  --private-connection-resource-id /subscriptions/.../sql-prod \
  --group-id sqlServer \
  --connection-name sql-private-connection
```
When you create a Private Endpoint, you need DNS to resolve the service name to the private IP, not the public IP. This trips up EVERYONE.

What should happen:

```
sql-prod.database.windows.net
  → CNAME → sql-prod.privatelink.database.windows.net
  → A record → 10.0.5.4 (private IP in your VNet)
```
What goes wrong: "I created the Private Endpoint but my app still connects to the public IP!" That almost always means you forgot to create the Private DNS Zone and link it to your VNet.
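A cheap sanity check for the fix: resolution from inside the VNet should come back with a private (RFC 1918) address like the 10.0.5.4 example, not a public one. `is_private_ip` below is a hypothetical helper; in practice you would feed it the address returned by `nslookup sql-prod.database.windows.net` run from a VM inside the VNet:

```shell
# Hypothetical check: does this IPv4 address fall in an RFC 1918
# private range (10/8, 172.16/12, 192.168/16)?
is_private_ip() {
  case $1 in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}
```

If the lookup still returns a public address, go back to the checklist: the Private DNS Zone is either missing or not linked to the VNet you're testing from.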
The checklist:

1. Create the Private Endpoint
2. Create the Private DNS Zone (e.g., privatelink.database.windows.net)
3. Link the DNS Zone to your Hub VNet (and spoke VNets)
4. Let the DNS records auto-populate
5. Test from inside the VNet: nslookup sql-prod.database.windows.net

This is 2026. If your applications are still connecting to Azure resources with connection strings that have passwords in them, we need to have a serious conversation.

🥇 Tier 1: Managed Identity (BEST: no credentials at all)
App → Azure Resource, zero secrets involved
🥈 Tier 2: Workload Identity Federation (K8s pods → Azure)
Pod → Federated Token → Azure Resource
🥉 Tier 3: OIDC Federation (CI/CD → Azure)
Pipeline → Short-lived token → Azure Resource
💀 Tier Last: Service Principal + Client Secret
"We rotated the secret and broke prod at 4 AM"
The 3 AM PagerDuty Alert:

```
CRITICAL: Deployment pipeline failed
Error: AADSTS7000222: The provided client secret keys for app
'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' are expired.
```
What Happened: A service principal secret was set to expire in six months. Nobody set up a reminder. Six months passed. The production deployment pipeline stopped working, and the release was blocked for 4 hours while someone figured out how to rotate the secret without breaking the other services using it.

The Fix: Stop using client secrets entirely.
For pipelines: use OIDC federation (no secrets!):

```shell
az ad app federated-credential create \
  --id <app-object-id> \
  --parameters '{
    "name": "github-main-branch",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:myorg/myrepo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```
For Azure resources: use Managed Identity:

```shell
az webapp identity assign --name myapp --resource-group rg-prod
```
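While you migrate, it helps to flag secrets before they ambush you at 4 AM. A sketch: `expiring_soon` is a hypothetical helper that takes epoch seconds (e.g. parsed from the expiry dates that `az ad app credential list` reports) and, grep-style, exits 0 when a credential is inside the warning window:

```shell
# Hypothetical helper: given "now" and "expiry" as epoch seconds,
# warn when a credential expires within window_days (default 30).
# Exit status is grep-style: 0 = expiring soon, 1 = still fine.
expiring_soon() {
  now=$1; expiry=$2; window_days=${3:-30}
  days=$(( (expiry - now) / 86400 ))
  if [ "$days" -le "$window_days" ]; then
    echo "EXPIRING in ${days}d"
    return 0
  fi
  echo "ok (${days}d left)"
  return 1
}
```

Run it from a scheduled pipeline over every service principal you haven't replaced yet, and the 4 AM surprise becomes a boring ticket instead.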
Every week someone asks: "Should we use AKS or App Service?" Here's the cheat sheet:
| Need | Use This | Why |
|---|---|---|
| "We have microservices and K8s expertise" | AKS | Full control, service mesh, custom operators |
| "Simple web app, REST API" | App Service | Managed, easy, cost-effective |
| "Containers but no K8s, please" | Container Apps | Serverless containers, KEDA built-in |
| "Event-driven, sporadic traffic" | Azure Functions | Scale-to-zero, pay-per-execution |
| "We need GPUs" | AKS (GPU node pools) | Only K8s gives you GPU scheduling flexibility |
| "Legacy .NET app" | App Service | Or containerize it for Container Apps |
The Situation: A 4-person startup with one API and one frontend deployed to a 3-node AKS cluster with Istio service mesh, Prometheus, Grafana, Kyverno, and ArgoCD. Monthly cloud bill: $2,800. Total users: 47. The Fix: Migrated to Azure Container Apps. Monthly bill: $12.

💡 Principal Insight: The right tool depends on your actual needs, not your resume aspirations. AKS is the right call when you have the scale and team to justify it. For everything else, there are simpler options.

Cloud cost isn't someone else's problem. At the Principal level, cost optimization is part of your architecture decisions.
| Action | Typical Savings |
|---|---|
| Right-size VMs (Azure Advisor recommendations) | 20-40% |
| Reserved Instances (1-3 year commitment) | 30-72% |
| Spot VMs for batch/test workloads | 60-90% |
| Auto-shutdown for dev/test | 40-60% |
| Storage lifecycle policies (hot → cool → archive) | 50-80% on storage |
| Delete orphaned disks, IPs, load balancers | Immediate savings |
```shell
# Find orphaned resources (no associated resource)
az disk list --query "[?managedBy==null].{Name:name, Size:diskSizeGb, RG:resourceGroup}" -o table
az network public-ip list --query "[?ipConfiguration==null].{Name:name, RG:resourceGroup}" -o table
```
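To turn that disk list into a number Finance understands, a back-of-envelope sketch: `estimate_disk_cost` is a hypothetical helper, and the 5 cents per GB-month rate is an assumed ballpark for standard managed disks, not a quoted Azure price:

```shell
# Hypothetical helper: rough monthly cost of orphaned disks, given their
# sizes in GB. Uses integer cents to avoid floating point in POSIX sh.
estimate_disk_cost() {
  total_gb=0
  for size in "$@"; do
    total_gb=$((total_gb + size))
  done
  cents=$((total_gb * 5))  # assumed ~5 cents per GB-month
  printf '%d GB orphaned ~ $%d.%02d/month\n' \
    "$total_gb" $((cents / 100)) $((cents % 100))
}
```

Feed it the Size column from the az disk query above; even a handful of forgotten 512 GB disks adds up to real money over a year.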
I guarantee you'll find at least 3 orphaned disks you're paying for right now. Go check. I'll wait.

The takeaways:

- ARM throttling is real: design for multi-subscription from the start
- Management groups + Landing Zones = the foundation of enterprise Azure
- Tag everything or drown in mystery costs
- Private Endpoints everywhere: no public backends, no exceptions
- Managed Identity > Workload Identity > OIDC > ... > secrets (secrets are the worst)
- Pick the right compute: don't bring AKS to a Container Apps fight
- FinOps is architecture: cost is a first-class design requirement

Your homework:

1. Run the orphaned disk command above. Screenshot the results (I dare you to have zero).
2. Check if ANY of your production SQL databases have public network access. Fix them.
3. Find one service principal with an expired or expiring secret. Replace it with Managed Identity or OIDC.

Next up in the series: Kubernetes: The Drama of Pods, Nodes, and the Scheduler Who Hates Everyone, where we decode K8s internals, real production meltdowns, and why your pod keeps getting OOMKilled at 2 AM.

💬 Drop a comment if you've survived any of these disasters. Bonus points if your war story is worse. (I know it is.)