Migrate Bank from one Cloud to another

The Cloud Journey!

Yes! We did something which most organizations would stay away from — Cloud to Cloud migration!

Cloud migration, much like an on-premise migration, is something people tend to push back on and frown upon. Here, we will talk about the Why, What, and How of our migration from one cloud to another.

· Why?
Background
Constraints
Risks

· What?
What helped?
Phased migration vs Lift and Shift
Managed vs self-managed services

· How?
Checklists
Pre-prep
Phases and feedback cycle: POC → dev → staging → pt → prod
Rollback plan
High-Level Steps
Migrating Compute layer
Migrating Data layer
Infra
Security considerations
Testing
External parties — Intranet communication
Order of migration for non-related systems
On the D-day
Prod support/ SRE

· What went well?
· What could have been improved?
· Conclusion

The timeline below depicts the Bank App’s journey across different cloud providers.

Timelines of foo, bar, baz Cloud-Providers

With the delay in the Largest Cloud Provider (foo) gaining a footing in the Indonesia region (the team had more expertise with it), we started understanding, designing, and eventually began development on Asia’s Largest Cloud Provider (bar). Then a new player, let’s call it Cloud Provider (baz), entered the regional (Indonesia) market in June 2020. The team got excited about multi-cloud and some clear benefits we got out of the box by running some services in Cloud Provider (baz):

(Disclaimer: All provider names are undisclosed to maintain objectivity. This is based on our understanding and the setup we did under a multitude of constraints. We by no means intend to compare which cloud is better than the other; situations and constraints vary.)

Being in the business of one of the most regulated markets tied our hands on multiple fronts, namely:

With these existing regulatory and compliance constraints came the technical constraints:

Anything in technology and distributed systems comes with tradeoffs.

Hence, the risks:

With a unanimous decision to go multi-cloud and migrate key services to the new cloud, lots of questions popped up:

Cloud-agnostic Technologies: Choosing a cloud-agnostic, open-source tech stack for persistent datastores, ephemeral datastores, artifact management, CICD pipelines, Gateway, CDN/WAF solutions, inverted index stores, compute resources, deployments, containerization, stateless applications, a clear demarcation between compute and persistent store, generalist message queues for event-driven use cases, security tools, APM, etc. helped us get away with minimal to no code changes when we migrated.
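To make that concrete, here is a minimal TypeScript sketch (not the team’s actual code; the `MessageProducer` and `KafkaProducerAdapter` names are hypothetical) of the kind of abstraction that keeps application code decoupled from any provider-specific queue:

```typescript
// A minimal sketch of a generic message-queue boundary that keeps application
// code independent of any one cloud's native queue. Names are hypothetical.

interface MessageProducer {
  publish(topic: string, payload: object): Promise<void>;
}

// Backed by a portable, open-source broker (e.g. Kafka) rather than a
// provider-specific service, so the implementation moves with the workload.
class KafkaProducerAdapter implements MessageProducer {
  constructor(private readonly brokers: string[]) {}

  async publish(topic: string, payload: object): Promise<void> {
    // In a real service this would delegate to a Kafka client (e.g. kafkajs);
    // kept as a stub here to illustrate the boundary only.
    console.log(`publish to ${topic} via ${this.brokers.join(',')}`, payload);
  }
}

// Application code depends only on the interface, so a cloud migration needs
// little more than new broker endpoints in config.
async function emitOrderCreated(producer: MessageProducer, orderId: string) {
  await producer.publish('orders.created', { orderId, at: Date.now() });
}
```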

Seeders: We have seeder services (which cater to and abstract cross-cutting concerns like CICD pipelines, static code quality analysis, SAST, DDD folder structure, linters, test frameworks, open-tracing, middleware functions like auth, swagger, common utilities (retries, circuit breaker, payload validator, enums, etc.), and datastore connections) from which all of the MS (micro-services) are forked. So any change in a seeder is super easy to pull from the upstream remote of the respective service.

Reusable Modules: We have wrapper modules for anything and everything that’s done more than once across services: common configs as code, making HTTP calls, logging, retrying connections to data services, producing/consuming messages from the MQ, message validation, auth utilities, middleware functions, caching utils, serializing/deserializing, and so on.
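As an illustration, a shared retry helper of the kind described above might look like the sketch below; the function name and defaults are illustrative, not our actual module:

```typescript
// A minimal sketch of a shared retry helper with exponential backoff.

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff between attempts.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage: services call the wrapper instead of hand-rolling retries,
// e.g. when (re)connecting to a datastore after a cutover:
// await withRetry(() => connectToMongo(process.env.MONGO_URI!));
```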

We came to the consensus that Lift and Shift is the approach to be taken because:

Since our approach was Lift and Shift, the next question was: how long will the downtime be? Let’s answer that below with feedback from migrating the lower environments.

Other types of migrations are well explained in the article below.

The next major question we had to answer was whether to go with managed or self-managed services for the ones not available out of the box in the new Cloud’s ecosystem (services like Kafka, MongoDB, Elasticsearch, Redis, PostgreSQL). It was super clear that we wanted to be cloud-agnostic and not get married to any cloud. At least as much as possible! (Choosing pub-sub, Big Table, Cloud Datastore, etc. wasn’t an option.)

With that, we needed to look for accelerators and experts on these services if we had to self-manage. That meant more roles and more hiring: DB admins, domain experts, system engineers, etc.

This is where the Cloud marketplace was a boon: there were multiple offerings for the tech stacks we needed, for example Bitnami-managed Mongo or Kafka, in the Cloud we were migrating to, connected via VPC peering.

Sigh! Need to take a deep breath. A lot went in here. Let us try to break the things we did into smaller chunks.

This served as a feedback cycle and a go/no-go document for stakeholders, capturing overall health and the TODOs for migrating from one environment to another.

We came up with a MECE (mutually exclusive, collectively exhaustive) checklist covering infra component readiness, application services, monitoring, CICD pipelines, communications, the access matrix, DNS changes, the step-by-step order of data source migration, and the operation timeline.

The plan

Labs: This was the crucial step. This is where we answered a lot of unknowns. We did lots of POCs and pilots to learn the things we didn’t know. We had a labs environment for the Infra team to test their IaC and pipelines.

Configs: We structured configuration as code and maintained multiple versions of dependent lock files (e.g. yarn.lock) referencing the previous Cloud’s and the new Cloud’s configs in module-config (the single source of truth for all common configs as code).
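A minimal sketch of what this can look like, with hypothetical keys and endpoints: both clouds’ settings live side by side in one config-as-code module and the active one is picked via an environment variable, so a cutover (or rollback) is a config flip rather than a code change.

```typescript
// Sketch of a config-as-code module keeping both clouds' settings in one place.

type CloudName = 'previous' | 'new';

const clouds: Record<CloudName, { mqBrokers: string[]; objectStoreBase: string }> = {
  previous: {
    mqBrokers: ['kafka-a.internal:9092'],
    objectStoreBase: 'https://assets.previous-cloud.example.com',
  },
  new: {
    mqBrokers: ['kafka-b.internal:9092'],
    objectStoreBase: 'https://assets.new-cloud.example.com',
  },
};

// The deployment pipeline sets TARGET_CLOUD; defaulting to the previous cloud
// keeps behaviour unchanged until the cutover is explicitly triggered.
export const activeCloud: CloudName =
  (process.env.TARGET_CLOUD as CloudName) ?? 'previous';

export const config = clouds[activeCloud];
```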

Pipelines: Until we had migrated an environment completely, we deployed to both the old and the new Cloud in parallel. This catered to BAU (Business as Usual): things were deployed and tested in baby steps while sprint goals and feature building kept going.

POC → dev → staging → pt (perf test) → prod

Since we understood how big a change this was, keeping the feedback cycle as short as possible was critical.

With that in mind, we took a staged-environments approach: labs (especially for POCs of IaC and Infra components), dev as the test environment, and staging and above as simulations of prod.

Each phase/environment was given X days to be sure we caught anything we had missed (configs as code, automation scripts to switch between the old and new Cloud’s components, CICD pipelines, data stores). For dev, this X was quite long, as we had things to fix. We did see some red flags from integrations with external partners, routing traffic for on-premise use cases, firewall rule tuning, asset movement, etc.

Feedback from this was crucial on many facets for many squads: the checklists were made more extensive, regrouped into multiple sections, and PICs/owners were added for driving/owning each item in the list.

Drills for exactly how the migration would happen in production were conducted in the staging and pt environments, plus a mock drill in prod as well, to be sure of the migration time, followed by cleaning up the environment afterwards. This helped in:

We structured our plan in such a way that after migration we ran a quick smoke test (with a special app build) conducted by a selected internal team, checking for red flags; if things didn’t improve after X minutes, we would switch back to the previous Cloud.

The key to this was that when the Lift and Shift was triggered, we only turned processes off (via firewall rules and DNS mappings) and could turn them back on quickly if things went south.
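For illustration, a go/rollback gate along these lines could be scripted roughly as below; the endpoints and timings are hypothetical, and in our case the actual decision was made by people reviewing the checklist:

```typescript
// Hypothetical go/rollback gate: poll smoke-test health checks for a fixed
// window; if they never turn green, signal a rollback to the previous cloud.

const SMOKE_ENDPOINTS = [
  'https://new-cloud.example.com/healthz',
  'https://new-cloud.example.com/api/ping',
];

async function allHealthy(): Promise<boolean> {
  const results = await Promise.all(
    SMOKE_ENDPOINTS.map(async (url) => {
      try {
        const res = await fetch(url);
        return res.ok;
      } catch {
        return false;
      }
    }),
  );
  return results.every(Boolean);
}

async function goOrRollback(windowMinutes: number): Promise<'go' | 'rollback'> {
  const deadline = Date.now() + windowMinutes * 60_000;
  while (Date.now() < deadline) {
    if (await allHealthy()) return 'go';
    await new Promise((r) => setTimeout(r, 30_000)); // re-check every 30s
  }
  return 'rollback';
}
```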

This was quite straightforward since most of our Compute layer is stateless and containerized. We have a single pipeline called Bulk-deploy for deploying all services at once. The scripts were tested in lower environments and deployed the entire stack in a matter of seconds.

The artifact versions to be deployed for each service were zeroed in on; they had already been tested multiple times, including with complete regression suites, in the lower environments.
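A simplified sketch of what a Bulk-deploy style script could look like is below; the manifest shape, service names, and deploy call are hypothetical stand-ins, not our pipeline:

```typescript
// Sketch: read the pinned artifact version for every service and trigger all
// the deployments in parallel.

interface ServicePin {
  name: string;
  image: string;   // container image
  version: string; // artifact version already validated in lower environments
}

const manifest: ServicePin[] = [
  { name: 'accounts', image: 'registry.internal/accounts', version: '1.42.0' },
  { name: 'payments', image: 'registry.internal/payments', version: '2.7.3' },
];

async function deploy(pin: ServicePin): Promise<void> {
  // Stand-in for the real deployment call (e.g. applying a Kubernetes
  // manifest or invoking the CD tool's API).
  console.log(`deploying ${pin.name} -> ${pin.image}:${pin.version}`);
}

async function bulkDeploy(pins: ServicePin[]): Promise<void> {
  await Promise.all(pins.map(deploy)); // fan out; stateless services make this safe
  console.log(`deployed ${pins.length} services`);
}

bulkDeploy(manifest).catch((err) => {
  console.error('bulk deploy failed', err);
  process.exit(1);
});
```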

This is the elephant in the room! To get this right we came up with:

The entire Infra in the new Cloud was built from scratch, with a network design completely different from that of the previous Cloud (VPC-wide constructs, Interconnect, secondary IPs, security groups over firewall rules, logical NAT gateways, etc.).

Once the network and the Org/folder/project structure (a construct specific to the new Cloud) were created, each Compute and Data resource was POC’d in a labs-like setup where the team understood the nitty-gritty, best practices, what’s possible and what’s not, the IaC tooling mapping to the specific technologies, etc.

After reaching a comfortable level of understanding, the different components were implemented as IaC, and Infra deployment pipelines for the different environments were created in stages, with feedback from each environment.

It’s awesome to look at this!

Though the new cloud offered better security constructs in multiple dimensions, like at-rest/in-transit encryption, it was necessary to re-do the pen-testing for Infra and Application before migrating.

Our existing test model prevailed as tests were run from one environment to another through multiple cycles from a functional perspective.

One thing we were quite paranoid about was the non-functional pieces: behavior when we scale, fault tolerance, zone-failure handling, etc., or just anything we were missing that would only show up at scale. To address this paranoia, with help from our PT (performance automation) team, we did extensive performance tests in our PT environment to get as comfortable as we could.

Also, for testing the app post-migration with smoke/regression tests, the team came up with a special build so that only traffic from that build was allowed in for testing before letting the full-blown customer traffic in.
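One way such gating can be implemented (a rough sketch; the header name and build ID are made up) is to check an app-build identifier at the edge before the public cutover:

```typescript
// Sketch: allow only requests carrying the smoke-test build identifier until
// public traffic is explicitly re-enabled.

import { createServer, IncomingMessage, ServerResponse } from 'node:http';

const SMOKE_BUILD_ID = process.env.SMOKE_BUILD_ID ?? 'smoke-build-placeholder';
let publicTrafficEnabled = false; // flipped to true after the final go decision

function isAllowed(req: IncomingMessage): boolean {
  if (publicTrafficEnabled) return true;
  return req.headers['x-app-build'] === SMOKE_BUILD_ID;
}

createServer((req: IncomingMessage, res: ServerResponse) => {
  if (!isAllowed(req)) {
    res.writeHead(503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ message: 'Service under maintenance' }));
    return;
  }
  // ...hand off to the normal application routing here...
  res.writeHead(200);
  res.end('ok');
}).listen(8080);
```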

These changes were coordinated with communications across multiple channels and by exercising the same change in at least one lower environment. Also, as a pre-check, we validated connectivity to and from the different external systems we interacted with well in advance using simple pings/telnet.

Some systems, like the official website and other streams of business completely decoupled from the Bank App journeys, were migrated N days before D-day to increase the comfort level.

Here it comes!

D-day was chosen based on both objective and subjective considerations. As you can guess, we were doing this at midnight! Teams were asked to rest well for the single largest collaboration we were doing in one go. For some, this was the first time they were involved in any kind of migration. Some were anxious, some were relaxed, some were doing last-minute prep!

As the clock struck, communication was sent out to customers about X hours of downtime. The final go/no-go checklist was reviewed by stakeholders, along with a pulse check on how the team felt.

We created a special war-room channel for everybody to jump in and out of calls, since everybody was working remotely. Aside from helping the team coordinate better, the channel helped record things for regulatory requirements.

It was a go! Run books were followed with a detailed checklist for each item. The migration was completed, and the smoke tests with the special build were completed well before the estimated time. The team posted test results on the collaboration page. Things looked OK, no blockers at least (some config mismatches were quickly corrected and the services redeployed). Wait…

Did I tell you there was another go/no-go to open up for public traffic or roll back? Some monitoring/alerting setup got done at the last moment because of a multitude of reasons (will cover this in a while). Eventually, this was resolved with short-term/long-term resolutions. It was a nail-biting go/no-go since some of the alerts were supercritical. It was a go!

The team was fully awake, watching monitoring, testing with full excitement!
It was almost dawn, we heard in the group call someone’s parent asking “How come you woke up so early today?”. We all chuckled!

Traffic ingress was enabled back for the public and monitoring continued with most of us opening dashboards with eyes glued. So far so good!

We did come across a few platform-specific issues and missed whitelistings which caused problems as traffic grew in the morning. One highlight was a critical API becoming ridiculously slow. Upon quickly doing an RCA, we found that the available VM NAT ports were getting exhausted, which meant any new requests (post exhaustion of ports) had to wait for ephemeral ports to become free before firing an API call to the external SaaS service we interacted with. This got quickly resolved. The reason this was not caught in the performance tests is that external entities were mocked using mock servers, which meant traffic did not go out and did not use ephemeral VM NAT ports at that scale.
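The write-up above doesn’t spell out the exact fix, but a common mitigation for NAT/ephemeral-port exhaustion is to reuse outbound connections with a keep-alive agent so bursts of calls consume far fewer ports. A sketch using Node’s built-in https agent (hostnames are placeholders):

```typescript
// Sketch: a shared keep-alive agent caps concurrent sockets and reuses them,
// so bursts of calls to the external SaaS consume far fewer NAT ports.

import https from 'node:https';

const keepAliveAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,      // upper bound on concurrent outbound connections
  maxFreeSockets: 10,  // idle connections kept warm for reuse
});

function callExternalSaas(path: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const req = https.request(
      { host: 'api.external-saas.example.com', path, agent: keepAliveAgent },
      (res) => {
        res.resume(); // drain the body so the socket is released for reuse
        resolve(res.statusCode ?? 0);
      },
    );
    req.on('error', reject);
    req.end();
  });
}
```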

Another functional issue that popped up was a failure in API response validation: a character-length check was exceeded by the signed URLs for cloud storage object access in some cases.

And another issue was with referencing static assets in-app, where the blob storage treated an extra slash (/) in the URL differently between the old and the new Cloud.
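A small guard of the following kind (a sketch, with a placeholder base URL) avoids the double-slash pitfall by normalizing asset paths before the URL is built, so both blob stores resolve the same object key:

```typescript
// Sketch: build asset URLs with any run of slashes collapsed to a single one.

function assetUrl(base: string, ...segments: string[]): string {
  const path = segments
    .join('/')
    .replace(/\/{2,}/g, '/')   // collapse any run of slashes
    .replace(/^\/|\/$/g, '');  // trim leading/trailing slash
  return `${base.replace(/\/+$/, '')}/${path}`;
}

// assetUrl('https://assets.example.com/', '/images//', 'logo.png')
//   -> 'https://assets.example.com/images/logo.png'
```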

We did not have a dedicated technical prod support/SRE team. A few services were still in the previous cloud while most had migrated to the new one. Even during development time, the team did the heavy lifting in supporting/troubleshooting their peers’ issues on both clouds.

Teams were given enough time to get used to the changes in APM metrics as a whole. The tooling was kept the same as much as possible to reduce the learning curve.

Overall, the migration went smoothly. The learnings from unscheduled learning opportunities were awesome. How the team members came together to solve and step in on ad-hoc issues was inspiring. The orchestration was well planned and tested across systems. Teams and stakeholders understood the tradeoffs of time.

Stakeholders gave kudos to the team, congratulating them on the most effective migration they had witnessed in their careers.

Communications! There is always something or somewhere that gets missed or misunderstood. I can’t stress enough the 3 Cs (Clear, Concise, Consistent) and proactive engagement.

On T-1 day some surprises came in: a missed communication regarding alerting on log errors not working (the tool we used was no longer able to snoop and alert on GKE logs), and missing consumer-lag metrics and alerts for the MQ. Both risks were mitigated with quick alternative solutions from the team as short- and long-term resolutions.

Our systems were migrated successfully. We did have hiccups; we sipped some cold water and resolved them! We cannot thank the entire team enough for helping us on this journey, and for the unplanned learnings!

So not all migrations are painful!
