Yesterday we released Turris OS 6.0, and it didn’t go as smoothly as we had hoped. In this post we would like to explain what happened, how we reacted, and what we want to do in the future to avoid the same problems.
We have been working on Turris OS 6.0 for quite some time. As there were some huge changes, like the new network configuration, we ran countless tests to make sure it would all end well. Because it took quite some time, we got the community to test it even in the hbl branch, and when we moved to the testing branch, we had even more community testers. Thanks to all those tests we were able to fix problems in various scenarios, and we thank everyone who took part.
At 9:45 AM we released Turris OS 6.0. From that moment on, routers started updating automatically. So that they do not all hit our infrastructure at the same time, updates are pulled at random points within a time window. We knew about a few low-priority issues. One of them was that the LED migration script reported an error. It was a trivial problem with no real consequences apart from being a little bit ugly. We told our tech support team that people might notice and report it, and decided to release anyway.
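To illustrate the idea of spreading update checks over a time window, here is a minimal sketch in Python. It is not the actual updater code: the function name `pick_update_delay` and the one-hour window in the usage note are assumptions made up for this example.

```python
import random
from typing import Optional


def pick_update_delay(window_seconds: int, rng: Optional[random.Random] = None) -> float:
    """Return a random delay in seconds between 0 and window_seconds.

    Hypothetical helper: each router would wait this long before
    pulling an update, so that requests are spread over the window
    instead of arriving all at once.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    rng = rng or random.Random()
    return rng.uniform(0, window_seconds)
```

A router using a one-hour window would call `pick_update_delay(3600)` and sleep for the returned number of seconds before contacting the update server.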
Around 1 PM we started receiving error reports. There were two main issues: Turris 1.X routers not seeing their Wi-Fi cards and being limited in functionality in some ways, and the firewall being open to the Internet. We immediately started working on reproducing both issues and trying to find a solution. Meanwhile we got some more reports that completed the picture.
The issue with Turris 1.X was quite obvious once we tested it. For some reason the kernel on the boot partition wasn’t updated successfully. Updating it manually fixed the issue. What needed to be done was to run the script that deploys the kernel once more before the router reboots. We had an idea how to fix it; all that remained was to implement and deploy it.
The firewall problem was trickier, and much graver. As part of the upgrade the firewall was restarted, but it somehow wasn’t able to assign interfaces to the individual firewall zones. That left the router completely open to the Internet. On top of that, not all services restarted correctly, which was both bad and lucky at the same time. The bad part is that our new common login gateway wasn’t running, and thus anybody had access to the reForis interface, even people from the Internet. The lucky part is that other services were broken as well: both ssh and LuCI were broken and thus inaccessible from the Internet. That meant the only thing people could do was mess with reForis, and luckily there is no way to install any custom software or a backdoor through it. During additional investigation we also found out that everything was back to normal and working after a reboot, which might partially explain why nobody from our community noticed it during testing.
At 16:45 we had a first version of the fix for the firewall issue ready to send out. It was a simple script that automatically checked whether the firewall contains a rule that sends some traffic to a WAN firewall chain. If there is no such rule, it blocks all incoming connections to the router. This way nobody could try to access the router and abuse it. It was not perfect, but while in damage control, it was the simplest and fastest thing we could release to minimize the damage. It didn’t fix the Turris 1.X issue, but it closed the firewall, while also locking people out of their routers until the next reboot.
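The check described above can be sketched roughly as follows. This is not the hotfix we shipped, only an illustration in Python of the same logic. It assumes the rules are available as text in the format printed by `iptables -S`, and it assumes the WAN input chain is called `zone_wan_input`; both the chain name and the function names are assumptions for this example.

```python
# Assumed name of the WAN input chain; adjust to the real configuration.
WAN_CHAIN = "zone_wan_input"


def has_jump_to_chain(rules_text: str, chain: str = WAN_CHAIN) -> bool:
    """Return True if any rule outside `chain` jumps to `chain`.

    `rules_text` is expected to look like the output of `iptables -S`,
    where each rule is printed as "-A <chain> ... -j <target>".
    """
    for line in rules_text.splitlines():
        tokens = line.split()
        # Only look at appended rules that live in some other chain.
        if len(tokens) < 2 or tokens[0] != "-A" or tokens[1] == chain:
            continue
        # Look for "-j <chain>" or "-g <chain>" anywhere in the rule.
        for flag, target in zip(tokens, tokens[1:]):
            if flag in ("-j", "-g") and target == chain:
                return True
    return False


def emergency_action(rules_text: str) -> str:
    """Decide what to do: keep the firewall as is, or block all input."""
    return "keep" if has_jump_to_chain(rules_text) else "block-input"
```

The sketch only makes the decision. How incoming connections are then actually blocked is deliberately left out here.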
At 19:45 we had a solution for Turris 1.X as well and released it. The solution was simple: we just needed to run the script that deploys the kernel once again before the router reboots. But as we were talking about it and about people rebooting, we realized that rebooting the router would actually be an even better solution than the previous firewall fix. We considered two alternatives: blocking all incoming connections to the router and leaving it at that, or restarting all the routers without people’s consent and getting them up and fully functional again. We decided to go for what was hopefully the better alternative and rebooted all the routers with a broken firewall, getting them back to a state with a fully working firewall and all services, including ssh, LuCI and the common authentication gateway, working.
We are still investigating what went wrong. We suspect that some post hooks in the updater didn’t fire, so lighttpd was not restarted with the new configuration that includes the new authentication gateway. That would also explain why the kernel was not deployed on Turris 1.X. But we need to investigate more deeply what happened in the OpenWrt firewall. The hotfixes mentioned above are by no means the final solution; they are there to prevent the worst and give us more time to investigate further and fix this properly.
There were multiple independent mishaps that led to the release being so disastrous. I would like to mention some countermeasures that we are going to take, starting with the easiest and thus fastest to implement and ending with those that will take quite some time and effort.
6.0 was a huge release. Apart from the migration to a newer OpenWrt base, we dropped Foris, introduced a new authentication gateway and a new diagnostics page, moved Pakon to a standalone application, incorporated Nextcloud into reForis, and much more. Many of those changes (retiring Foris, separating Pakon, …) were driven by underlying changes in the distribution. We couldn’t release 6.0 without them. But in the end, we could have released those changes ahead of time. We need to do feature releases more often but keep them smaller, so we can hunt one bug at a time. Next time, when we release Turris OS 7.0, it will contain just a migration to a newer OpenWrt and no other new features.
We have a system for automated tests. It tests various functions of the router to prevent failures. It is still a work in progress, has to be run manually, and only a few people know how to access and run it. This is something we have been working on for quite some time, but the test suite is not feature complete yet, and on top of that, it is a different system from the rest of our development.
We need to spread the knowledge internally so people can experiment with it more and integrate it better into our processes. The long-term goal was to make it part of CI. This incident shows that depending on manual runs and on passing information between systems is not that foolproof, and that explicit, automatic integration into the release process is quite important.
There are two options to minimize the impact of errors. But both will take quite some time and require big changes to how updates are done.
The obvious thing people do when they grow big enough is to split devices into groups and send updates to a small portion of the audience first. If everything goes well, the update then goes to a larger audience, with some time left between the stages so that when errors appear, they can be fixed before the update hits everybody. We have testing branches, and apart from us, plenty of volunteers on our forum test the updates. But the errors mentioned above slipped through unnoticed.
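One common way to split devices into groups is to hash a stable device identifier into a bucket and only offer the update to buckets below the current rollout percentage. The following Python sketch shows that idea. It is not something we have implemented; the identifiers, the release string and the function names are all hypothetical.

```python
import hashlib


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in the range 0..99 for a release.

    Including the release name in the hash means that a different set
    of devices goes first for each release.
    """
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(device_id: str, release: str, percent: int) -> bool:
    """Return True if the device falls into the first `percent` buckets."""
    return rollout_bucket(device_id, release) < percent
```

Because the bucket of a device does not change, raising the percentage from 5 to 50 only adds devices; every router that was eligible in the first stage stays eligible in the later ones.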
We do feature releases and bugfix/security releases. One of the cool features of our routers is that we constantly improve your user experience and continue to add new features even to devices sold years ago. But bringing new features inherently means bringing new potential bugs. We should somehow differentiate between those two types of releases: while a bugfix/security release needs to be installed as soon as possible, ideally without any user interaction or notice, a bigger feature release can wait a little, giving users time to decide when to install it.
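The distinction between the two types of releases could be expressed as a very small policy. The Python sketch below only illustrates the idea; the `ReleaseKind` type and the `should_install_now` function are hypothetical and do not exist in our updater.

```python
from enum import Enum


class ReleaseKind(Enum):
    SECURITY = "security"
    BUGFIX = "bugfix"
    FEATURE = "feature"


def should_install_now(kind: ReleaseKind, user_approved: bool) -> bool:
    """Decide whether an update is installed without waiting.

    Security and bugfix releases are installed right away. Feature
    releases wait until the user has approved them.
    """
    if kind in (ReleaseKind.SECURITY, ReleaseKind.BUGFIX):
        return True
    return user_approved
```

With this policy, `should_install_now(ReleaseKind.FEATURE, user_approved=False)` returns `False`, so a large feature release would wait until the user picks a convenient time.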
Both of the features mentioned above would be quite some work, and we have to decide how to implement them. Even designing the workflow will take some time, not to mention the implementation itself. But we believe it makes sense to start thinking about how to achieve that.
Yes, we messed up. But in the end we were able to roll out hotfixes quite fast. We are going to learn from this experience and take measures to avoid similar problems in the future. We are sorry for all the trouble we caused, and we thank you for the great support that we receive even in times like this.