On a recent government project, the DevOps team I was leading was tasked with spearheading the migration of our applications from a datacenter to the cloud. It did not go well. We fell into many of the traps that I imagine plague most cloud migrations, ranging from configuration management failures, to communication breakdowns, to delayed security involvement. In this post, I will go over some of the troubles that beset our migration.
Background and Configuration Management
Our team was originally brought on to create a CI/CD pipeline for the existing on-premise system. We were also tasked with helping a separate cloud team set up a similar CI/CD pipeline in their new cloud environment. We chose to build all of our infrastructure and configuration management with Chef so that, when the cloud was ready for it, the cloud team could simply run our recipes and easily recreate the same pipeline we had already been using. In reality, the cloud team never got Chef approved for use in the cloud, so everything we built, they rebuilt with PowerShell scripts and lengthy build documents.
When migrating to the cloud, or really migrating your system anywhere, I would ALWAYS recommend using a configuration management tool like Chef, Puppet, or Ansible to maintain consistency in your pipeline. In our case, if the cloud team had put more effort into getting Chef approved, they could have saved the months they spent recreating work we had already done in Chef. Short of that, a better understanding of our tooling limitations up front would have helped reduce rework.
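To make the point concrete, here is a minimal sketch of what a Chef recipe for one pipeline node might look like. The package names, versions, and attribute keys are assumptions for illustration, not our actual cookbooks; the value is that the same recipe can be converged identically on-premise and in the cloud.

```ruby
# Hypothetical recipe sketch: package names, versions, and attributes
# are illustrative, not the project's real cookbook.
package 'openjdk-11-jdk'

# Install and pin the Jenkins package so every converge yields the same build.
package 'jenkins' do
  version node['pipeline']['jenkins_version'] # assumed attribute, set per environment
end

service 'jenkins' do
  action [:enable, :start]
end

# Lay down the same base configuration everywhere from a template.
template '/var/lib/jenkins/jenkins.yaml' do
  source 'jenkins.yaml.erb'
  owner  'jenkins'
  notifies :restart, 'service[jenkins]', :delayed
end
```

Because the recipe, not a person following a build document, is the source of truth, "rebuild the pipeline in a new environment" becomes a converge rather than a months-long transcription exercise.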
Once the pipeline and infrastructure in the cloud had been created, it was time to demo it to high-level agency management. The cloud team created a very good demo. It showed an application going through a pipeline in a simple way, explained many of the steps so that the less-technical managers could follow along, and showed how deployments could work. Unfortunately, this demo was mostly smoke and mirrors: the cloud team had built their own very basic example application and created their own pipeline process to push it through. In effect, they proved that a made-up app could be pushed through a pipeline that didn't matter. This wasn't the way our AppDev teams were going to use the pipeline, and many of the plugins and tools that would be needed were still not installed.
The government loved the demo. Even though they were aware it was only a demo application, I believe they were left with the impression that the pipeline was more ready than it actually was: that once the cloud received an ATO (Authority to Operate), all of the applications could easily be migrated over. This was far from the truth, not only for the reasons stated above, but also because the security compliance team hadn't been involved yet.
There were two real problems here: a failure to communicate with the AppDev teams, and a failure to communicate with management. First, building a new application and pipeline (even with the same tool suite: Jenkins, Sonar, Fortify, etc.) without working closely with the application development teams that already had an established pipeline made the demo ineffective; we were demoing tools and apps that would never actually be used. The simplest solution would have been to use the configuration management tool (in this case, running the Chef cookbooks we had already used to create the pipeline on premise). Had that not been an option, the cloud team should have understood that a pipeline is more than just a tool stack of applications like Jenkins, Fortify, and Sonar. In recreating the pipeline, there were many questions they needed to ask and resolve, including:

- Are we using pipeline as code or traditional freestyle Jenkins jobs?
- Which plugins do Jenkins and the other tools need?
- What technologies do the applications contain, and how are they compiled?
- What gates are set up?
- How are applications being deployed?
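For readers unfamiliar with "pipeline as code": it means the pipeline definition lives in the repository alongside the application, rather than being clicked together in the Jenkins UI. A minimal declarative Jenkinsfile might look like the following; the stage contents and script names are assumptions for illustration, not our project's actual jobs.

```groovy
// Hypothetical declarative Jenkinsfile; stage contents are illustrative.
pipeline {
    agent any
    stages {
        stage('Build') {
            steps { sh 'mvn -B clean package' }  // assumes a Maven-built app
        }
        stage('Static Analysis') {
            steps { sh 'mvn sonar:sonar' }       // SonarQube scan as a gate
        }
        stage('Deploy') {
            when { branch 'main' }               // gate: deploy only from main
            steps { sh './deploy.sh dev' }       // hypothetical deploy script
        }
    }
}
```

Answering the questions above is exactly what fills in these stages; a demo pipeline whose stages don't reflect the real apps' build tools, plugins, and gates demonstrates very little.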
Delayed Security/Compliance Involvement
The security team took several weeks to scan through the pipeline. It was known from the start that this would be a destructive scan, so the cloud team knew they would have to rebuild a large portion of their infrastructure no matter what. On top of that, security came back with a LOT of things that needed to change. From complicated network fixes to updated authentication methods within the pipeline applications themselves, these remediations took months' worth of work. The team was forced back to the drawing board to redo much of the work they had initially done to set up the environment in the first place.
When designing anything, security needs to be involved from day one. It is ALWAYS better to think about security at architecture time, rather than trying to squeeze it on top.
Backup and Cutover
The cloud team and our team eventually worked through all of the issues, got an ATO, and successfully demoed the simplest application we had going through the cloud pipeline. The pipeline was not ready to run any other applications, though, and several technologies that many of our other apps needed were still not installed (.NET was available, but we still couldn't migrate any code that used Java or Node, and some of our functional and user-experience testing tools weren't available yet). But because the government had seen one successful demo, and communication was still an issue, high-level agency management decided to completely shut down access to the on-premise dev environment with one month's notice. The government direction that trickled down to our contract management (a week later) was something along the lines of: "No need to tell developers that dev is going away, we don't want to worry them; we will just be replicating your dev environment for you, no need to do anything." Fun fact: they didn't replicate or back up anything for us.
With around two and a half weeks left, we had to make a plan to back up everything in the CI/CD pipeline that we couldn't afford to lose when the environment was turned off, but we had been given nowhere to put any of it. First, we reached out to the cloud team and were able to move our code repositories over to their TFS server. Then we had to figure out where to back up all of the historical data we needed to keep, focusing mostly on the Sonar and Fortify databases and the Nexus Repository Manager (which held every artifact we had ever built). With the short deadline, and no designated place to put anything, we had to get creative in finding places to store everything, and it was a mess.
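Even a crude, scripted pre-cutover backup beats scrambling by hand. The sketch below (paths and directory names are stand-ins, not our actual layout) archives each pipeline data directory into a dated location; in the real migration the inputs would be the Sonar and Fortify database dumps and the Nexus storage directory.

```shell
#!/bin/sh
# Hedged sketch of a pre-cutover backup step; all paths are illustrative.
BACKUP_ROOT="${BACKUP_ROOT:-/tmp/pipeline-backup}"
STAMP=$(date +%Y%m%d)
mkdir -p "$BACKUP_ROOT/$STAMP"

# Archive one data directory (e.g. Nexus storage, or a DB dump output dir)
# into a dated, compressed tarball under the backup root.
backup_dir() {
  tar -czf "$BACKUP_ROOT/$STAMP/$(basename "$1").tar.gz" \
      -C "$(dirname "$1")" "$(basename "$1")"
}

# Example usage with a stand-in directory playing the role of Nexus storage:
mkdir -p /tmp/demo-nexus-storage
echo "artifact" > /tmp/demo-nexus-storage/app-1.0.war
backup_dir /tmp/demo-nexus-storage
```

A script like this only helps if there is somewhere durable to point `BACKUP_ROOT` at, which is exactly what we were never given.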
While having a hard cutover date isn't a bad idea, the government should have confirmed that the cloud side was ready to handle the cutover. On top of that, a sufficient place to back up any historical data from the pipeline is a must. That way, a real backup plan can be created, and if the migration leaves developers in a limbo state where on-premise is gone but the cloud isn't quite ready, the data can eventually be recovered.
While I could go on forever about all the downfalls of this cutover and how to avoid them, there are four key points you should take with you when migrating to the cloud.
- Use a configuration management tool; it just makes everything easier.
- Communication is your friend. Make sure everyone is on the same page, and when giving a demo of your progress, be clear about what still needs to be done.
- Get security involved early and often. It's always easier to build things securely from the start than to go back and wedge security on top of what you have, or, in many cases, redo things you thought were done.
- Come up with a real cutover and backup plan. Make sure the teams have ample time, space, and access to migrate and back up all of their data and tools into the new cloud environment before cutting off access to the on-premise environment.