I'm posting about a technical issue in our recent 1.2.0 App Store release on June 25th that caused some of our customers to experience a crash when they opened the Coconut app, until they deleted and reinstalled it.

Within 5 hours of seeing the issue we'd managed to locate, identify, fix and reissue a new version of the app (1.2.1) to the App Store, which is now available (June 27th).

We're very focussed on building high-quality engineering processes to ensure that our service – both the app and its backing infrastructure – is stable and reliable for our customers. In fact, app stability, measured as the percentage of crash-free sessions, is the only key metric I currently report on as CTO, such is the importance of a great customer experience to Coconut.

So this was really disappointing to me personally, and for all of us, as we've been used to 98–100% crash-free sessions for a while.
The importance of quality
There are lots of ways you can think about quality: of experience, of aesthetics, of feature breadth, of performance and so on. We choose to care about all of these and more, but I want to talk about quality of experience, of which stability and the reliable functioning of our service for customers is part, and especially in the context of software engineering.

Being a startup is extremely fun and also very hard work; if you don't believe me, please do try it once in your life! With a limited set of resources, everyone wears different hats and juggles a multitude of priorities on an hourly basis. It often feels a bit like building a house on quicksand. As such, startups need to continually make the best tradeoffs and hard decisions they can at a point in time, monitor, rinse and repeat.

One decision we took from the very start was to believe in following processes and best practices no matter our size. That comes from previous experience and seeing what happens when you don't.

However, it doesn't need to be black or white. You can take the most lightweight process in the world for, say, task management, or a simple approach to code testing, and you are immediately better off than not doing those things. You don't need to be Netflix, whose modus operandi includes throwing spanners into all their systems to test their ability to self-heal; you just need to do something and evolve as needed over time. You'll know when.
Our current approach
So, the best practices we have operated and evolved to date are:
- We operate a copy of the entire service called the User Acceptance Testing (UAT) environment so that we can ensure all our work is tested before it gets into customers' hands. This UAT environment (on AWS) is created by Terraform and deployed to via Ansible which means it is always repeatable and our infrastructure is declared in files we can look at, reason about and change vs. pointing and clicking in AWS and forgetting how we got our platform running.
- At a code level we write unit and integration tests using the standard Python unittest library and Factory Boy, which are really great. I am not dogmatic about it – I simply ask my team to ensure we have a fair level of coverage, especially for the important core elements of our platform. We run tests manually prior to releasing; this will soon become an automated, git hook-based continuous integration approach, because now feels like a good time to remove that cognitive load from our growing developer team as we move even faster.
- We document everything we know should or could be better in a Technical Debt wiki, so we remember to revisit and improve it in future.
- We have close to 1,000 user acceptance tests documented in TestRail so that anybody can perform a full scripted regression test before we release new versions of the app. Of course, the test cases are only as good as what we have thought of testing, and we do miss things from time to time. The key is, for every miss, to capture it in the test suite and, if it needs a test, add one. Before you know it, stability will emerge.
- We leverage a version release strategy called Git Flow, which is also married into how we plan product feature development in Jira. This gives us a clear way to plan releases, ensure they move through the QA and UAT flow, and rapidly react to issues should we need to on deployment to Live. We try not to over-commit releases, ensuring we move our feature set forward whilst not introducing lots of risk, which is what happens when you release too much at once.
- For major releases we test via Apple's pre-flight testing platform called TestFlight. This enables us to give a selection of our customers the chance to trial our new version.
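To make the testing point above concrete, here is a minimal sketch of the kind of unittest-based check we mean. The `receipt_filename` helper and its behaviour are invented for illustration; they are not Coconut's actual code.

```python
import unittest
from datetime import date


def receipt_filename(merchant: str, purchased_on: date, receipt_id: int) -> str:
    """Build a readable, unique filename for an exported receipt.

    Hypothetical helper for illustration only.
    """
    safe_merchant = "-".join(merchant.lower().split())
    return f"{purchased_on.isoformat()}-{safe_merchant}-{receipt_id}.pdf"


class ReceiptFilenameTests(unittest.TestCase):
    def test_merchant_names_are_normalised(self):
        name = receipt_filename("Coffee  Shop", date(2018, 6, 25), 42)
        self.assertEqual(name, "2018-06-25-coffee-shop-42.pdf")

    def test_ids_keep_same_day_purchases_distinct(self):
        a = receipt_filename("Cafe", date(2018, 6, 25), 1)
        b = receipt_filename("Cafe", date(2018, 6, 25), 2)
        self.assertNotEqual(a, b)
```

A file like this runs under the standard runner with `python -m unittest`, so even a lightweight suite like the one described above costs very little to keep alongside the code.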
So why did we have a crashing release this time?
You might be wondering, if you are still with me: if our processes and QA are good, then how did the crash make it to the App Store?

Well, there are some very technical reasons and some procedural ones.
- On the technical side, the new release improved the filenames that receipts are given when saved out of the app. Whilst developing this feature we found we had a missing primary key in our mobile app's local database for receipt records (a primary key is a unique identifier for a record). We fixed this up by adding one – even though it had never caused an issue in previous releases, it's just good housekeeping to do so. When we released the new app, we needed to retrospectively ensure all customers with receipts would re-download them to get the new filenames (since we cache receipt records on the mobile device, we had to force them to re-sync).
- What we ran into, however, was that after this forced sync, users on older versions of the app were also triggered to re-sync. Because we had no primary key on receipts, the app saw the updated receipts but stored them as brand new records, since it did not know they were the same ones. This is why primary keys are useful: when the database sees an existing identifier, it updates the row rather than creating a duplicate. Consequently, we had caused duplicate receipts in the pre-1.2.0 apps. When those customers then upgraded to 1.2.0, they ran into the new primary key mandate and the app crashed instantly. Our fix in 1.2.1 is to strip out the duplicate receipt rows first.
- Another key learning is that we must run all migrations (i.e. the update of receipt timestamps used to trigger a re-sync) on UAT too, to give ourselves the best chance of finding these complex issues before release.
- Finally, we did not release 1.2.0 to TestFlight, a process we should have followed but skipped. We shall return to doing so, as it would have further increased our chances of catching this issue.
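To illustrate the failure mode described above, here is a small sketch using Python's sqlite3 module. The table and column names are invented for illustration – this is not the app's actual schema, and the mobile code is not Python – but the database behaviour is the same: without a primary key a re-sync creates duplicates, with one it upserts, and a cleanup in the spirit of our 1.2.1 fix strips the duplicates that already exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without a primary key, a forced re-sync has no way to recognise an
# existing receipt, so the "updated" record is stored as a brand new row.
conn.execute("CREATE TABLE receipts_no_pk (remote_id INTEGER, filename TEXT)")
for name in ("receipt-1.pdf", "receipt-1-renamed.pdf"):  # initial sync, then re-sync
    conn.execute("INSERT INTO receipts_no_pk VALUES (1, ?)", (name,))
dupes = conn.execute("SELECT COUNT(*) FROM receipts_no_pk").fetchone()[0]
print(dupes)  # 2: the same receipt now exists twice

# With a primary key, the same re-sync can upsert: the existing row is
# updated in place rather than duplicated.
conn.execute("CREATE TABLE receipts_pk (remote_id INTEGER PRIMARY KEY, filename TEXT)")
for name in ("receipt-1.pdf", "receipt-1-renamed.pdf"):
    conn.execute("INSERT OR REPLACE INTO receipts_pk VALUES (1, ?)", (name,))
kept = conn.execute("SELECT COUNT(*) FROM receipts_pk").fetchone()[0]
print(kept)  # 1: one row, now carrying the new filename

# A cleanup like the 1.2.1 fix: before enforcing the new primary key,
# strip duplicate rows, keeping the oldest copy of each receipt.
conn.execute(
    "DELETE FROM receipts_no_pk WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM receipts_no_pk GROUP BY remote_id)"
)
remaining = conn.execute("SELECT COUNT(*) FROM receipts_no_pk").fetchone()[0]
print(remaining)  # 1: duplicates removed
```

The crash in 1.2.0 corresponds to the third step being missing: the new primary key constraint was applied while duplicate rows were still present.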
Quality is super important to us at Coconut and is something we try to do really well. We are disappointed that we did not capture the issue before the release hit the App Store, and we apologise to our affected customers.

Our processes were set up to give us a good chance of catching the issue, but the complexity of this specific issue found gaps. The lessons will prepare us for next time:
- Pay special attention to databases, syncing and primary keys!
- Improve the UAT deployment process by ensuring all migrations required for live are considered for UAT, too.
- Always use TestFlight.
- Consider using the App Store's Phased Release rollout to limit the speed of exposure to potential issues.
We hope our transparency on this issue has been insightful to those of you who got this far, and we welcome any and all questions and suggestions.