What a Cross-Account Migration Forces You to Confront

Mon, Aug 15, 2022

A cross-account database move is the diagnostic, not the work. It surfaces architectural decisions a team has been making implicitly for years: who owns the encryption key, what each account is for, which services depend on the database that nobody documented.

The technical sequence is well-known: snapshot, encrypt, share, restore, cutover. What surfaces around it isn’t, and the surfacing is what determines whether the migration is a routine project or the start of a useful conversation about how the environment really works.

flowchart LR
    A[Pre-migration<br/>audit] --> B[Migration<br/>window]
    B --> C[Post-migration<br/>window]
    C --> D[Architectural<br/>conversations]
    A -.surfaces.-> A1[KMS ownership<br/>Account topology<br/>Client dependencies]
    B -.surfaces.-> B1[Network reality<br/>Hidden integrations<br/>DNS hardcoding]
    C -.produces.-> C1[Backlog with<br/>named owners]
    D -.outputs.-> D1[Data governance<br/>Account architecture<br/>Service ownership]

Figure 1. The technical work runs along the top. The architectural surfacing, which is the actual value, runs along the bottom and is what’s worth keeping after the database has moved.

The audit you didn’t ask for

The first phase is supposed to be planning. You write the runbook. You diagram the network. You list the steps. What actually happens is that you start asking simple questions and getting uncomfortable answers.

Who owns the KMS key? It’s not a question about which IAM principal can encrypt and decrypt. That’s the easy version. The harder version is who decided what gets encrypted with this particular key, what the rotation policy is, who can revoke access, and who would notice if the key were compromised. In a healthy environment, you have answers. In most environments, you have a key that someone created two years ago, with a name that hints at its original purpose, and a policy that’s been incrementally extended to support workloads nobody is sure should still be using it.

A SaaS company I worked with earlier this year, one that had built a solid product and was consolidating accounts to tighten their security posture, was preparing exactly this kind of move. Production RDS instance from one account to another, encrypted at rest, target migration window of a couple of weeks. The first conversation in week one was about the KMS key. The customer-managed key encrypting the source database had been provisioned by a contractor years before. The key policy had been modified many times since then, mostly to add cross-account access for analytics and reporting workloads. By the time we audited it, the key was reachable from several accounts that nobody on the current team had heard of. A couple belonged to a sister team. One belonged to a vendor deprecated the year prior. Nobody had revoked the access because nobody had remembered it existed.

flowchart TD
    KMS[Customer-managed KMS key<br/>created by a contractor<br/>two years ago]
    SrcDB[(Source RDS instance<br/>production)]
    subgraph Known["Known accounts"]
        Acct1[Sister team analytics<br/>active and owned]
    end
    subgraph Unknown["Forgotten accounts"]
        Acct2[Old vendor account<br/>deprecated last year<br/>access never revoked]
        Acct3[Marketing tooling<br/>nobody on team aware]
    end
    KMS --> SrcDB
    KMS -.cross-account.-> Acct1
    KMS -.cross-account.-> Acct2
    KMS -.cross-account.-> Acct3
    style Acct2 fill:#fdd,stroke:#ef5753,color:#464646
    style Acct3 fill:#fdd,stroke:#ef5753,color:#464646

Figure 2. The KMS audit at one SaaS company. The key was reachable from several accounts the current team did not track, including a vendor account deprecated the previous year. None of this was visible until somebody asked who owned the key.
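The “who can reach this key” half of that audit can be roughed out from the key policy text itself. What follows is a minimal sketch, not a complete audit: `external_accounts` is a hypothetical helper name, `999999999999` stands in for the account that owns the key, and the function only pattern-matches account IDs in the policy. It does not evaluate policy conditions, and it does not see KMS grants, which would need a separate `aws kms list-grants` pass.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every AWS account ID named in a KMS key policy
# (read from stdin) except the account that owns the key.
# Assumption: principals appear as IAM ARNs or as quoted 12-digit account IDs.
external_accounts() {
  local owner_account="$1"
  grep -oE 'arn:aws:iam::[0-9]{12}|"[0-9]{12}"' \
    | grep -oE '[0-9]{12}' \
    | sort -u \
    | grep -vx "$owner_account" || true
}

# Example invocation (not run here; KEY_ID is a placeholder you supply):
#   aws kms get-key-policy --key-id "$KEY_ID" --policy-name default \
#     --query Policy --output text | external_accounts 999999999999
```

Every account ID this prints is a question for the backlog: who is that, and do they still need the key?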

The migration didn’t proceed for another month. The KMS audit became a data governance project. The data governance project became a conversation about which accounts the company should still have, which it should retire, and how to do that without breaking the analytics dashboards the CFO looked at every Friday. That’s a flavor of governance debt the team had been carrying for years without naming.

That’s the diagnostic value. The migration could have proceeded technically without resolving any of this. The encryption would have worked. The new database would have stood up. But the trust map of the data would have been even more confused, and the next migration (there’s always a next migration) would have started from a worse place.

The audit also surfaces what the runbook never asks about: trust relationships you didn’t know you had, client dependencies nobody wrote down, and a backup story that’s never what the policy document says.

The destination account had a trust path to a corporate IT account established for SSO federation, never revisited, broader than the documentation suggested. Dozens of services were connecting to the database, only a fraction documented, several on hardcoded IPs, a handful on DNS with TTLs that wouldn’t survive a fast cutover. The backup policy said daily with month-long retention; the configuration said daily with about a week; actual snapshots were weekly with no defined retention. Three stories, only one of them true.
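The gap between the policy document and the configuration is the easiest of those three stories to check mechanically. As a rough sketch, assuming a hypothetical helper named `retention_drift` and illustrative values of 30 and 7 days for “month-long” and “about a week”:

```shell
#!/usr/bin/env bash
# Hypothetical check: compare the retention promised by the policy document
# with the retention configured on the instance (both in days).
# Prints a one-line verdict; returns non-zero when the config falls short.
retention_drift() {
  local stated_days="$1" configured_days="$2"
  if [ "$configured_days" -lt "$stated_days" ]; then
    echo "DRIFT: policy says ${stated_days}d, instance keeps ${configured_days}d"
    return 1
  fi
  echo "OK: configured retention (${configured_days}d) meets policy (${stated_days}d)"
}

# Example (not run here): read the configured value, then compare.
#   configured=$(aws rds describe-db-instances \
#     --db-instance-identifier prod-core-db \
#     --query 'DBInstances[0].BackupRetentionPeriod' --output text)
#   retention_drift 30 "$configured"
# The third story, which snapshots actually exist, needs its own look:
#   aws rds describe-db-snapshots --db-instance-identifier prod-core-db \
#     --query 'DBSnapshots[].SnapshotCreateTime' --output text
```

The comparison is trivial on purpose. The value is in running it at all, and in writing down which of the three numbers the team intends to be true.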

The right move at this phase isn’t to fix everything. It’s to capture every uncomfortable answer in a backlog with named owners, and to be honest with leadership about what the audit uncovered. The migration can still proceed. The conversation has changed.

During the migration, the dependencies surface

If the pre-migration phase was about uncomfortable answers to questions you asked, the migration window is about uncomfortable answers to questions you didn’t think to ask.

The technical sequence is well-defined. Each step has documented edge cases. None of them prepare you for the integration tests that don’t pass for reasons you couldn’t have predicted.

Below is the AWS CLI sequence that exposes where the hidden dependencies sit. Each step is a potential surface for something that wasn’t in the runbook:

# Set once and reuse, so every step refers to the same snapshot.
SNAPSHOT_ID="prod-core-db-migration-$(date +%Y%m%d)"

# Step 1: Create a snapshot of the source RDS instance.
# If the instance uses a customer-managed KMS key,
# note the key ARN now: the destination account needs explicit
# access to that key before it can decrypt the snapshot.
aws rds create-db-snapshot \
  --db-instance-identifier prod-core-db \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --region us-east-1

# Wait until the snapshot is available before sharing it.
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --region us-east-1

# Step 2: Share the snapshot with the destination account.
# This step commonly surfaces undocumented KMS key policies:
# the share will succeed, but the copy in Step 3 will fail if the
# destination account lacks kms:Decrypt on the source key.
aws rds modify-db-snapshot-attribute \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --attribute-name restore \
  --region us-east-1 \
  --values-to-add 123456789012   # destination account ID

# Step 3: In the destination account, copy the shared snapshot
# and re-encrypt it with the destination KMS key.
# (SNAPSHOT_ID must hold the same value in this shell.)
# Missing access to the source key surfaces here as a failed copy.
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier \
    "arn:aws:rds:us-east-1:999999999999:snapshot:${SNAPSHOT_ID}" \
  --target-db-snapshot-identifier prod-core-db-dst \
  --kms-key-id arn:aws:kms:us-east-1:123456789012:key/mrk-abc123 \
  --region us-east-1

# Step 4: Restore from the copied snapshot.
# VPC route gaps, security group mismatches, IAM role drift,
# and hardcoded endpoints surface here,
# during connectivity testing before the cutover decision.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier prod-core-db-dst \
  --db-snapshot-identifier prod-core-db-dst \
  --db-subnet-group-name prod-data-subnet-group \
  --vpc-security-group-ids sg-0abc123456def \
  --region us-east-1
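One thing worth checking before the share in Step 2 is which kind of key encrypted the snapshot. RDS will not share a snapshot encrypted with the account’s default AWS managed key, so that case needs a re-copy under a customer-managed key first. The sketch below is a hypothetical pre-flight: `snapshot_shareable` is an invented name, and it assumes you have already pulled the snapshot’s `Encrypted` flag and the key’s `KeyManager` value from the CLI.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight: decide whether a snapshot can be shared across
# accounts, given whether it is encrypted and who manages its KMS key.
#   $1: "True" or "False" (the Encrypted field as the CLI prints it in text mode)
#   $2: "CUSTOMER" or "AWS" (KeyMetadata.KeyManager from `aws kms describe-key`)
snapshot_shareable() {
  local encrypted="$1" key_manager="${2:-}"
  if [ "$encrypted" != "True" ]; then
    echo "shareable: unencrypted snapshot"
  elif [ "$key_manager" = "CUSTOMER" ]; then
    echo "shareable: customer-managed key (destination still needs key access)"
  else
    echo "not shareable: AWS managed key; re-copy under a customer-managed key first"
    return 1
  fi
}

# Example (not run here; SNAP is a placeholder snapshot identifier):
#   read -r encrypted key_arn < <(aws rds describe-db-snapshots \
#     --db-snapshot-identifier "$SNAP" \
#     --query 'DBSnapshots[0].[Encrypted,KmsKeyId]' --output text)
#   manager=$(aws kms describe-key --key-id "$key_arn" \
#     --query 'KeyMetadata.KeyManager' --output text)
#   snapshot_shareable "$encrypted" "$manager"
```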

Picture a data operations team that had done the planning carefully and wasn’t taking shortcuts: a cross-account RDS migration, several-hour cutover window. The technical migration completed in well under an hour. The cutover took most of a day.

A pair of undocumented VPC peering connections ate the first chunk. They reached into analytics services in a third account, set up by an engineer who’d left over a year earlier, never documented, load-bearing for the daily operational reports the team relied on. The team discovered them when the analytics service couldn’t reach the new database and the operations team paged asking why their morning report was missing.

A Lambda function in the source account ate the second chunk. It was tightly coupled to the source database via a hardcoded endpoint. Nobody knew it existed until the database moved and the Lambda started failing in CloudWatch with an error that took over an hour to track to its source.

The third chunk was an IAM role attached to one of the application servers, depending on a permission only granted in the source account. The application started up, connected, ran queries, and silently failed on writes for the first stretch after cutover. The write failures only surfaced when a researcher noticed her records weren’t saving. That’s the part of the migration that mattered, and the part that almost slipped through.

None of these dependencies were in the planning document. None of them showed up in the runbook. All of them existed before the migration started. The migration just made them visible.
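At least one of these, the Lambda with the hardcoded endpoint, is the kind of thing a crude pre-cutover sweep might catch. The sketch below assumes the endpoint lives in the function’s environment variables; an endpoint embedded in the code itself would slip past it. `functions_referencing` is a hypothetical helper, and the endpoint in the example is a made-up placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical sweep: given lines of "<function-name><TAB><env-vars-as-JSON>"
# on stdin, print the names of functions whose environment mentions the
# database endpoint. Endpoints embedded in function code are not detected.
functions_referencing() {
  local endpoint="$1"
  grep -F -- "$endpoint" | cut -f1 | sort -u || true
}

# Example (not run here):
#   aws lambda list-functions \
#     --query 'Functions[].[FunctionName, to_string(Environment.Variables)]' \
#     --output text \
#     | functions_referencing prod-core-db.example.us-east-1.rds.amazonaws.com
```

Run per account and per region, this won’t find everything, but each name it prints is one fewer surprise during the cutover window.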

Two things are worth taking from this. First, your runbook needs slack for things you couldn’t have predicted. If your migration window is sized for the happy path, you’ll be improvising in real time on the surprise path, and the surprise path is where most of the actual work happens. Second, and more useful: the surprise path is genuinely valuable. The Lambda nobody knew about is a service-ownership problem the migration just made urgent. The undocumented VPC peering is a network-topology problem your security team has been wanting to address for two years and now has evidence to fund. The IAM permission gap is the audit your access-review team has been asking for, delivered for free. Don’t waste the finding.

After the migration, what to do with what surfaced

The most common failure mode in cross-account migrations isn’t technical. The team executes the migration successfully, files a “completed” ticket, and loses the diagnostic value within two weeks. I’ve watched this happen more than once, including with teams that fully understood the risk going in: the pressure to move on is real and it doesn’t always yield to good intentions.

The post-migration window is short. Engineers who have the context move to other work. The findings get filed in a Confluence page nobody returns to. The “we should really clean up that KMS access” item never gets prioritized because nothing is actively broken.

flowchart TB
    subgraph Before["Before the migration"]
        direction TB
        A1[KMS owners: unclear]
        A2[Account trust:<br/>undocumented]
        A3[Service deps:<br/>tribal knowledge]
        A4[Backup story:<br/>three versions]
    end
    subgraph After["After the migration<br/>if you capture the value"]
        direction TB
        B1[KMS: data team owns,<br/>quarterly rotation]
        B2[Account topology:<br/>prod / dev / data<br/>with named purpose]
        B3[Service catalog<br/>with named owners]
        B4[Backup verified to<br/>match stated RPO]
    end
    Before ==>|"migration's<br/>deliverable"| After
    style B1 fill:#eaf2fa
    style B2 fill:#eaf2fa
    style B3 fill:#eaf2fa
    style B4 fill:#eaf2fa

Figure 3. The actual deliverable of a cross-account migration is not the moved database. It’s the explicit map of trust boundaries, ownership, and integrations you didn’t have on Monday morning.

The work that holds in the two-week post-migration window is small but specific. Capture the surfaced architectural questions as tickets with named owners, not as a Confluence page. The items need to compete for priority alongside other work. A separate document quietly ages out.
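“Named owners” is easy to check mechanically if the findings live somewhere exportable. As a small sketch, assuming a tab-separated export with finding, category, and owner columns (the layout and the `unowned_findings` name are my assumptions, not a feature of any particular tracker):

```shell
#!/usr/bin/env bash
# Hypothetical backlog check: read tab-separated
# "finding<TAB>category<TAB>owner" rows on stdin and print every finding
# whose owner column is empty or missing. Blank lines are skipped.
unowned_findings() {
  awk -F'\t' 'NF >= 1 && $3 == "" { print $1 }'
}

# Example (not run here; findings.tsv is a placeholder file name):
#   unowned_findings < findings.tsv
```

An empty result is the goal before the migration ticket gets closed.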

The pattern across these migrations is consistent: KMS ownership, account topology, and the long tail of undocumented integrations almost always carry the most architectural weight. Those three items, addressed in the next quarter, are worth more to the architecture than the migration itself was.

Schedule the architectural review within four weeks of the cutover, while the context is fresh. The agenda is the load-bearing items. The output is decisions, not more documentation. Then write the runbook for the next migration. The next one is coming, and the team that runs it deserves what you learned.

What each phase tends to surface

The surfacing isn’t random. After running several of these, a pattern emerges for which phase surfaces which class of problem:

Migration phase	Common findings	Architectural category	Typical owner
Pre-migration audit	KMS policy drift, forgotten cross-account access	Governance debt	Security team
Pre-migration audit	Undocumented trust relationships, stale service dependencies	Knowledge debt	Platform team
Migration window	Hardcoded endpoints, DNS TTL mismatches	Operational debt	App teams
Migration window	IAM permission gaps, VPC route holes	Governance debt	Security team
Post-migration	Backup policy vs. configuration mismatch	Operational debt	Platform team
Post-migration	Account topology questions, retirement candidates	Optionality debt	Architecture

Knowing which phase surfaces which class of problem helps with resourcing. If the security team isn’t available during the pre-migration audit, the KMS and trust findings land without the right owner in the room. That’s when they get deferred, and deferred findings are the ones that surface again in the next migration.

The architectural conversations the migration enables

A cross-account migration gets you a hearing for conversations that have been waiting for the right moment. KMS ownership becomes a data governance conversation: who owns the encryption keys for customer data, and what do we want the answer to be? Account topology becomes a security architecture conversation: what are the accounts for, and which should be retired? Cross-account integrations become a service-ownership conversation: who owns the services that depend on the database, and who is responsible when they fail? Each was theoretical six months earlier. Each is concrete now, because the migration produced specific evidence in front of the people who can decide.

Evidence has a half-life. Two months after the cutover, the conversations are still possible. Six months later, they sound abstract again, and the next migration will surface the same evidence all over again.

The SaaS company I described finished its migration about a month late. The database moved cleanly. What stayed with them was the map: the keys, the trust paths, the services nobody knew who owned. Within a couple of months after cutover, the team retired one account, named owners for two more, and revoked KMS access from a vendor a year gone. None of that was the migration. All of it was the migration’s real deliverable.

If your next cross-account move is on the roadmap as a routine task, the question worth sitting with is this: have you resourced the follow-through, or just the move itself? Done while the work is fresh, the cost of follow-through is small. Done never, the same evidence surfaces for the third time in the next migration, and nobody remembers it was ever captured.
