A Brief Summary of DevOps at Ackee in 2020
2020 wasn't a year most of us would like to remember. Although our team did not have many opportunities to work together in person, I think we did a lot of great work. Let me explain what we are dealing with: we are a team of roughly three DevOps engineers with a fourth, honourable member, our team leader. All the non-honourable members are part of the support rotation, which means that for one week each month you have to be available to developers whenever they are in trouble. We also have 24x7 on-call duty for one project. That sounds more or less like an SRE team's responsibility, and under normal circumstances in a larger company, you would be right. Somehow, we had to find a way to make it all work alongside our DevOps duties.
But wait, there is more! Our tech stack spans from Node.js to Android apps. If you are interested, check it out at https://stackshare.io/ackee_2/ackee (yes, it's `ackee_2`, somebody lost the credentials to `ackee`). Around seventy people work at Ackee, split into four development teams, each focused on a different platform. Our work would be much harder if the developers weren't so helpful.
Standardize your stack
In some cases, we have to spend a lot of time on support. Fortunately, those situations are rare, largely thanks to the unified, standardized stack we provide to developers. I guess that could be one of the takeaways from this article: if you find yourself in a DevOps department providing support for many developers and projects, standardize your stack. That also means you have to kill plenty of the dreams project managers have. Once you lose focus and promise many fancy new features, be aware that your team has limited time to spare. Especially with new things, you are almost always adding to your future support workload. Remember the SRE bible handed down to us by Google itself? You should spend at most 50 % of a DevOps engineer's time on support.
So in retrospect, I would say that 2020 was, at least for us, a year of improvement. For example, we had used Terraform before, but in 2020 we migrated the pipeline to GitLab CI, added tfsec and plenty more. Pretty amazing, huh? I will expand on this by summarizing the things we started to use in 2020, covering only the points that were interesting to me and my fellow workhorses. These may well be things you already know and use daily; in that case, leave a comment, I'd be happy to read about your experience.
Going for atomic deployments
We faced a few edge cases where we wanted a deployment that switches all the traffic to the new version at once: zero downtime, zero overhead on our stack, just a switch once the new containers are running. I was amazed by how surprisingly difficult this was on our GKE cluster. Rolling updates did not help, because there was a short window when users were served both new and old content. Kubernetes' Recreate strategy obviously creates downtime too. Switching labels on the Service also caused downtime. Changing the backend service in GKE's Ingress, you guessed it, also created downtime. What we needed was a deployment with no visible disruption that switches to the new set of pods as atomically as possible. So it seemed to us that a service mesh was the next likely step we needed to take.
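For illustration, the label-switching attempt is essentially a blue/green setup: two Deployments run side by side and the Service selector is repointed at the new one. A minimal sketch (the names `my-app` and `blue`/`green` are made up, not our actual manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app               # hypothetical service name
spec:
  selector:
    app: my-app
    version: green           # flip "blue" <-> "green" once the new pods are ready
  ports:
    - port: 80
      targetPort: 8080
```

Even then the switch is not truly atomic: the updated endpoints still have to propagate through kube-proxy and the load balancer, which is presumably where the brief downtime came from.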
A service mesh is a great, fantastic thing. But! And there is always a very important "but" you have to face:
- Our stack is built on Helm chart deployments, which doesn't play nicely with a service mesh where you switch weights in virtual services (there is a sketch of such a virtual service at the end of this section).
- GKE can manage Istio for you, but the add-on is still in beta and causes many issues. See release-notes and count all the Istio references.
- Any other service mesh would have to be managed by our team. Remember my words from the introduction: you only have up to 50 % of your time for support.
- Integration with the GKE application load balancer is not as straightforward as we would like it to be.
- Plenty of well-known tools behave very mysteriously with GKE's Istio. In our case, the Kiali deployment failed, then worked after the Kiali developers fixed a few issues upstream, and then stopped working again.
This is not a situation we would like to present to our customers. We are looking forward to a more stable Istio environment in GKE; until then, we would rather wait.
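For completeness, this is roughly what the weight switching looks like in an Istio virtual service; a minimal sketch with hypothetical names (`my-app`, subsets `v1`/`v2` defined in a matching DestinationRule), not something we run in production:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app                 # hypothetical
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: v1         # old version
          weight: 0
        - destination:
            host: my-app
            subset: v2         # new version
          weight: 100
```

An atomic switch is then a single change of the weights from 100/0 to 0/100, which is exactly what is awkward to drive from a plain Helm release.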
Terraform, Terraform and again, Terraform
Terraform made a great leap forward in 2020: first 0.13 in August and then 0.14 in December. Each version was very promising. But let's face the hard truth: ever since Terraform stopped sucking with version 0.12 in May 2019, we no longer needed to upgrade that quickly. The reason is that 0.12 brought the features people needed the most, e.g. for_each and a proper interpolation syntax. Still, we started migrating to 0.13 in the last quarter of the year. The most significant improvement for us was the single, unified place in the configuration for provider versions.
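That unified place is the required_providers block inside the terraform block. A minimal sketch of how the pinning can look since 0.13 (the provider and version constraints are illustrative):

```hcl
terraform {
  required_version = ">= 0.13"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 3.50"    # illustrative constraint
    }
  }
}
```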
Of course, we were happy to see count introduced for modules in 0.13. But by that time, all our modules already used maps, which had allowed us to create multiple resources long before the count keyword arrived. Since we used for_each, referring to the created resources was just a matter of using the right map keys. For us, the most significant improvement is the readability that count can bring to a configuration.
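To illustrate the map pattern, here is a hedged sketch (the variable, bucket names and the google_storage_bucket resource are purely illustrative):

```hcl
variable "buckets" {
  type = map(object({ location = string }))
  default = {
    assets  = { location = "EU" }
    backups = { location = "US" }
  }
}

# One bucket per map entry; the keys become stable resource addresses.
resource "google_storage_bucket" "this" {
  for_each = var.buckets

  name     = "my-project-${each.key}"   # hypothetical naming scheme
  location = each.value.location
}

# Referring to a created resource is just a matter of the right key:
# google_storage_bucket.this["assets"].url
```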
We also faced plenty of issues with collaboration on Terraform projects. The state files grew larger as the projects themselves grew, and sooner or later (or next Tuesday) we reached a point where managing the state became impossible. We investigated Terragrunt, but we did not find any best practice that would fit our use-case.
Having passed the Terraform certification exam this year, I was thrilled about Terraform Cloud. It covers a lot of our use-cases. It wouldn't make our state files any smaller, but at least we could apply configuration without dealing with potential conflicts. Using a remote backend also helped with state management: we were no longer responsible for the storage where the state file resides.
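Switching to the remote backend is a small change in the configuration; a minimal sketch with a hypothetical organization and workspace name:

```hcl
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "my-org"            # hypothetical organization

    workspaces {
      name = "my-project-production"   # hypothetical workspace
    }
  }
}
```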
Of course, there is always a catch. Our company workflow is to develop an app and move on. In some cases, we need to add customers to our GitLab so they can see the Terraform configuration files. Terraform Cloud currently offers up to five users in its free tier and charges 20 USD per user per month for the Team plan (as of 3 January 2021). With just three team members, that wouldn't be an issue; once we add more users, billing could become a huge obstacle. That is why we decided to improve our CI/CD system instead and left Terraform Cloud for 2021.
GitLab CI and leaving Jenkins
This part of the blog post is probably a bit controversial. I personally know many colleagues who would swear that Jenkins is the only CI/CD system: mature, robust, stable. My feeling is that once your pipeline fails with a null pointer exception thrown by the pipeline code rather than by the application code, you are not following best practices for writing pipelines. Those situations happen more often in Jenkins than in GitLab CI, and once the pipeline codebase grows old and the people who wrote it have left, you can bet you will see them very often. I gave a talk at DevOpsCon last year about our path to GitLab CI (see gitlab-ci-pipelines-for-a-whole-company) and struggled hard not to blame Jenkins for the issues we had. I know that writing pipelines in YAML can also become challenging, but it is enough for our use-case. With that, we slowly rewrote every pipeline we manage from Groovy to GitLab CI YAML.
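For a taste of the target format, a minimal .gitlab-ci.yml for a Node.js project might look like the sketch below (the stage names and image are illustrative, not our standard company pipeline):

```yaml
stages:
  - test
  - build

default:
  image: node:14-alpine   # illustrative image

test:
  stage: test
  script:
    - npm ci
    - npm test

build:
  stage: build
  script:
    - npm ci
    - npm run build
  artifacts:
    paths:
      - build/
```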
A few pipelines got a bit slower. That is mainly due to the lack of a persistent workspace directory, which Jenkins has. The nature of GitLab CI runners, where every job runs in its own independent environment, can be challenging; the issue is the large node_modules folders that we share between jobs through caches. Every challenge is an opportunity, so we decided to tackle the issue with the following steps:
- Cache everything. Cache node packages on the local network, use MinIO for GitLab CI caching, and cache packages on runners in local directories mounted into the Docker containers.
- Don't be afraid of Docker BuildKit (see build_enhancements). Tomáš Hejátko, our senior DevOps engineer, tested BuildKit and deployed it to our GitLab CI runners, and we have had no trouble ever since.
- Think twice about the cache key. For example, we now cache node_modules by the `CI_PROJECT_ID` variable (the project's ID). We had used `CI_COMMIT_REF_SLUG` before, which created a fresh node_modules cache for every branch and tag; it turned out that wasn't necessary (see the sketch below).
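A hedged sketch of what such a cache section can look like in a job (the job name and image are illustrative):

```yaml
install_dependencies:
  image: node:14-alpine      # illustrative image
  cache:
    key: "$CI_PROJECT_ID"    # one cache per project instead of one per branch/tag
    paths:
      - node_modules/
  script:
    - npm install
```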
With those improvements, we got pipeline runtimes back to the level we had with Jenkins. But was that enough? We did not want the same. We wanted more! And there was, of course, more:
- Now we can scale the pipeline runners. That wasn't that simple with Jenkins.
- We manage pipelines directly in GitLab UI.
- We no longer struggle with merge request builders.
- Our codebase for pipelines improved.
- We spend less time on support.
So yes, if you find yourself in our shoes, I strongly suggest implementing your pipelines in GitLab CI instead of Jenkins.
Hosting React apps in GCS vs Firebase Hosting
Hosting React applications should be simple. It should boil down to the following steps (roughly as sketched after the list):
- build the app,
- sync the app to the bucket,
- set correct permissions,
- profit!
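In GCS terms, the happy path might look roughly like this (the bucket name and build directory are hypothetical, and your permissions model may differ):

```sh
# Build the app and sync the output into an existing bucket.
npm run build
gsutil -m rsync -r -d build gs://my-app.example.com

# Make the objects publicly readable and serve index.html by default.
gsutil iam ch allUsers:objectViewer gs://my-app.example.com
gsutil web set -m index.html -e index.html gs://my-app.example.com
```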
Well, what if you want to set up TLS for your application? Following Google's guidelines, you should preferably create an application load balancer and provision a managed certificate for it. Suddenly it is no longer as simple as you would like it to be.
Also, naming the bucket after a domain means you first have to verify ownership of that domain. Sure, it's not that hard, it's just a few clicks. But could it be simpler? Could any developer do it instead of you? Remember, you only have up to 50 % of your time to sit on support.
For us, the answer is Firebase Hosting. We are currently testing it to see whether it fulfills all our requirements. The Firebase SDK is very developer-friendly, and developers now manage the things we used to manage for them (TLS, caching, ...) in a single configuration file (see full-config).
But this approach can be tricky too. There was a reason caching was managed by the DevOps team in the first place: developers tend to forget that caching can be painful. To give you one example, setting a TTL longer than five minutes for 404 responses can lead to disaster. Through a misunderstanding, we set the TTL for 404s to half a year. That works fine as long as newly deployed files carry a hash directly in their names. The problem begins once you deploy a new version: a few unlucky clients may fetch the new index, request content that is not fully uploaded yet, and then have that 404 cached for half a year. For these reasons, we keep a close eye on our Firebase Hosting experiments and hope we can iron out all the kinks.
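To make the caching point concrete, the hosting section of firebase.json might look roughly like this (paths and values are illustrative, not the configuration that caused our incident): long-lived caching only for hashed assets, none for index.html.

```json
{
  "hosting": {
    "public": "build",
    "rewrites": [
      { "source": "**", "destination": "/index.html" }
    ],
    "headers": [
      {
        "source": "/static/**",
        "headers": [
          { "key": "Cache-Control", "value": "public, max-age=31536000, immutable" }
        ]
      },
      {
        "source": "/index.html",
        "headers": [
          { "key": "Cache-Control", "value": "no-cache" }
        ]
      }
    ]
  }
}
```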
2021 here we are
It wouldn't be a summary blog post if I did not thank my team. Everything I wrote about above was possible only because we could focus on our job and work hard.
The best feedback on my position was the kudos I received at Christmas (see the article about sustainable Christmas from our HR, Czech version only). Especially if you work on support, you sooner or later find yourself in the position of being a NO man. You have to stop a lot of the dreams project managers have, and sometimes you have to disappoint developers; your time is limited. In the end, you find out that this position is not that hard on others, it's hard only on you. You are the one who is there to limit the dreams in order to provide reliability. Receiving those kudos helped me understand that others are familiar with a DevOps engineer's situation and there are no hard feelings. At least for 2020, I guess we did alright.
We are looking forward to plenty of next steps:
- We can’t wait to dive deep into the service mesh. We would be thrilled to find somebody who would give us a lecture. If there is any interested reader, don't hesitate to reach out.
- We can't wait to improve our Terraform setup further. A migration alone is not the same as fully utilizing all the new features.
- We are ready to combat all the support issues we are going to face.
And one last thing, I hope we will soon return to the office.