Don brings up the subject of chaos, based on a book he's been reading, Antifragile, by Nassim Nicholas Taleb. We discuss the merits of test-driven development, unpredictability, and how technical managers can work towards a more resilient product in the face of inevitable failures.
- Antifragile represents a system that improves under chaos
- In a well-tuned org, with test cases and adaptive practices, software systems can improve under chaos
- Netflix and Amazon built their systems with chaos in mind. Expect failure and build with failovers.
- For a system to be antifragile, its parts must be fragile
- Building a resilient backend can be very costly
- Asking the executive team about what's feasible for downtime vs the costs of a completely redundant system is a good starting point
- Understand what the business needs are during a failure event
- Test-driven development can help a firm quickly visualize how changes made cause butterfly effects across a system
- The conversation moved towards the subject of complacency, which will be expanded upon in a later episode
- Nassim Nicholas Taleb website
- Antifragile: Things That Gain from Disorder (on Amazon)
- Nassim Nicholas Taleb on Amazon
- Chaos Monkey
- Amazon Web Services EC2
Thanks for listening to the CTO Think Podcast. If you liked what you heard, please share a link to the podcast with your friends.
Reviews on iTunes are always appreciated and help us spread the word about the podcast.
Shownotes and previous episodes can be found on our website at www.ctothink.com
For questions, comments, or things you'd like to hear on future shows, please email us at firstname.lastname@example.org
For notifications of future episodes, please sign up to the CTO Think newsletter on www.ctothink.com
We'll keep talking next week!
- Transcripts by Rachel VanDemark
Randy: Welcome to CTO Think. I’m Randy Burgess.
Don: And I’m Don VanDemark. Randy, what’s been going on in your world this week?
Randy: Normal work for the most part. Nothing extravagant. Mainly for the two clients I am working for, getting their new projects kind of lifted off. One is adding a new nationwide shipping. I talked about that before. And then a different client is reestablishing their ability to push their product to kind of a third party distributor and hoping to gain revenues with that, so that’s a lot of communications with people more than code at this point, which is kind of part of the gig. So, doing that, and then I have gotten back to working on the HOA Done application, which is a small tool for small condo associations, homeowners’ associations, and so that’s kind of been the side project I have been working on. What about you?
Don: Same as you - same old, same old. We’re working through everything we have been talking about the past few weeks. Right now, for AspirEDU we’re in the middle of discussions with another developer, looking to bring someone on who is working his way into the development field. Been doing a lot of side projects, been doing a lot of small stuff. But his skill set that he is learning is right in with what we need, so been talking to him and that may be something we go forward with here in the future.
Don: So, we’re going to do book club this week. So, I read most of - I am not going to say I read all of it and we’ll talk about that in a minute - read most of a book called Antifragile by Nassim Nicholas Taleb, who is the same author who wrote Black Swan. So, the term “antifragile” is meant to mean a system, a person, anything that gets better, that thrives under chaos. So, it is different than resilient, which can stand up to chaos and absorb it. Antifragile is the opposite of fragile in that it thrives under chaos. The problem I had with the book was essentially that he made the same point 10 different ways in the first half of the book and it didn’t seem to be progressing, so I didn’t finish the book, but the point came across just very clear in a very interesting concept. So, while I was reading it I am sitting there going “you can equate this to the software systems we build”, and it sounds weird to say that software systems get better under chaos; but in a well-tuned organization, where you’re writing a lot of test cases and when you’re adapting your system based on what is being thrown on it, software systems DO get better under chaos. So, I wanted to talk a little bit about that today. How do you improve your software systems when things come at it, and how much do you invest in testing and that is the general direction I wanted to go with this today, so talk a little bit about how you approach those things.
Randy: Well, I’m curious about the book. Does he bring up Netflix at all?
Don: Not in the part I read, no.
Randy: The reason I ask is Netflix - they’re kind of at the forefront, at least publicly, with chaos testing, chaos management, because they introduced a library called “Chaos Monkey” which actually takes down pieces of the production environment randomly, and it forced them - I guess it was one of those ideas of we expect for pieces of our production to go down randomly, and therefore we will build a system where it just happens naturally and we will cause it. We will be part of the cause of that randomness, in theory. So, I have not used this library, but it has come up a number of times on some other podcasts I have listened to and just other things I have read that the idea is expect things to fail in kind of what the book is talking about, I assume, is building a system that can handle that and at the same time build the most resilient system.
And I think Amazon built their EC2 platform on the same idea because at the time, before we got AWS and The Cloud going full steam, the approach of all enterprise was to build huge racks of very expensive servers and build their entire platform on that with the idea that these expensive servers wouldn’t go down, and Amazon came in and Google as well and said that cheap servers that are expected to eventually fail is the more cost-productive way to build a cloud or a large server back end. So, their approach, like they just started building out tons of cheap server boxes and they build a redundancy level system that would take over when one dropped or one failed, and the life expectancy of these servers wasn’t very long and they found it to be much more cost effective. So, without having read the book that’s the first thing that comes to my mind - are those two scenarios of Amazon and Netflix and how they have approached Chaos.
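The pattern Randy describes here, cheap instances that are expected to fail plus a redundancy layer that routes around them, can be sketched in a few lines of Python. This is a toy simulation only, not the real Chaos Monkey tool or anything from AWS: the `Instance`, `Pool`, and `chaos_monkey` names are invented for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One cheap, replaceable server (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class Pool:
    """A redundancy layer that routes requests around failed instances."""
    instances: list = field(default_factory=list)

    def handle(self, request: str) -> str:
        # Try each instance in order and use the first healthy one.
        for inst in self.instances:
            if inst.healthy:
                return f"{inst.name} served {request}"
        raise RuntimeError("total outage: no healthy instances left")


def chaos_monkey(pool: Pool, rng: random.Random) -> Instance:
    """Take down one randomly chosen healthy instance, on purpose."""
    candidates = [i for i in pool.instances if i.healthy]
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    pool = Pool([Instance(f"web-{n}") for n in range(3)])
    for _ in range(2):  # kill two of the three instances
        print("killed", chaos_monkey(pool, rng).name)
    print(pool.handle("GET /"))  # still served by the survivor
```

The individual parts are fragile by design; the pool as a whole keeps answering until the last instance is gone, which is the failure a real setup would page someone about.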
Don: And it’s like he did study them and I didn’t come across the mentions of them, or they read - Well, they couldn’t have read the book. The book is not that old. But, one of the core tenets is for a system to be antifragile, most of its parts have to be fragile. So, pieces have to be able to break and you have to be able to work around that and be able to deal with that in order for the whole system to survive. So, those were two awesome examples. How do you put this idea together in testing? You can use Chaos Monkey but how have you done it in the past, just with plain testing, and how much time do you dedicate to test cases and making sure you’ve got test coverage and is 100% test coverage what you aim for or what are you aiming for when you’re building systems?
Randy: Well, I am going to admit that I don’t spend a ton of time on failure management. I feel like I outsource that to Heroku, who handles it on the AWS side for me. But what I have done, and this is what I did in the past - one example was I was working for a company that had a client, and they basically said we are going to put up this very small, simple app that will run for the weekend, but if it goes down the entire thing will fall apart. You have to keep this thing up. And so I approached it with the idea that I was using Heroku for the hosting of this particular app, and my approach was I am going to build a second, use a backup platform, that I expect Heroku will fail. I have no reason to expect it, but I just said it is going to fail and I am going to randomly switch the domain name to point to this other server and have that take over, and I had it do that like three or four times, just kind of like within the next hour pick a time, transfer, and then I was watching both sides - the Heroku and I think I used Engine Yard at the time. So, that weekend everything ran smooth. Everything ran on Heroku. Heroku didn’t have any problems, and this is maybe like three or four years ago, but that was how I approached that one scenario, and if you were to say, okay now take that one scenario and now you’ve got a longstanding system you need to deal with, there are a lot of big questions that come out of that because what I did for a weekend with Engine Yard and Heroku hosting was pretty cheap, but to build a resilient back end that completely is failover and sitting there ready is a completely different question on the cost factor.
So, that’s I think, as everything does, it starts at the very top. I think you go to your operations team or the executive level and say, in the scenario that our platform goes down for a specific amount of time, what are our thresholds as a company for that downtime? And the first instinct of every non-technical person is, it has to be up all the time. You can’t have downtime, but that is when it is upon you to say, okay, if we build this entire second platform to back up the first one, these are the costs, and I have had this discussion with executives that I have worked with, and they immediately change the tone of, well, I don’t want to spend double all the time for what may be an hour of downtime or even a day. That doesn’t fit, so it changes the conversation to what is the backup plan? Do we have a website that just states “we are down. Here is how you can reach us manually to get business done.”? Very cheap to do that kind of backup. Do you have a system that only has half the features or doesn’t persist data but allows interaction. There are a number of ways to mitigate downtime or things that are going wrong and still keep business going, and that is the discussion you have to have at the very top, I think, versus the whole idea of, let’s just have two systems running all the time. But it changes based on the business. Emergency systems don’t have that same kind of leeway. They have to be running all the time for a large part of what they do, but that is how I would say that that discussion with executives is how I first approach it - before I touch any code, before we set up any systems. Understanding what the business needs are at the level of failure on the main product and the main line, what is the business’s ability to keep operations going without a completely redundant system? So, that’s my long answer.
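The conversation Randy describes having with executives comes down to simple arithmetic: the standing cost of each backup option plus the business still lost while the main platform is down. A rough sketch follows, with every name and figure made up for illustration; none of these numbers come from the episode.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of handling an outage (all numbers are made-up examples)."""
    name: str
    yearly_cost: float             # what it costs to keep this option ready
    hours_of_lost_business: float  # business hours still lost per year


def total_yearly_cost(option: Option, loss_per_hour: float) -> float:
    """Standing cost plus the revenue lost during the remaining downtime."""
    return option.yearly_cost + option.hours_of_lost_business * loss_per_hour


def cheapest(options: list[Option], loss_per_hour: float) -> Option:
    """Pick the option with the lowest total yearly cost."""
    return min(options, key=lambda o: total_yearly_cost(o, loss_per_hour))


if __name__ == "__main__":
    options = [
        Option("static 'we are down' page", 500, 8),
        Option("reduced-feature fallback", 15_000, 2),
        Option("fully redundant second platform", 120_000, 0.1),
    ]
    # A small shop versus a business that loses a lot for every hour down.
    for loss in (1_000, 100_000):
        best = cheapest(options, loss)
        print(f"loss per hour {loss}: {best.name}")
```

With a low hourly loss the cheap status page wins, and with a very high hourly loss the fully redundant platform does, which matches the point that the answer changes with the business.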
Don: That’s great. So, I want to step back to the beginning of that answer, and you said you don’t do a lot of testing on your own, you outsource a lot of that, and I want to make sure we’re delineating here. The type of testing, and the discussion we just had was around hosting, back end, hardware, that sort of thing. The question is also relevant to the test-case side of the code itself. So, how much do you commit to testing on the code side? And that is where we start to write test cases before we even write the code with test-driven development or behavior-driven development. How much of that do you do with your projects? How much do you invest in testing in order to build what’s essentially an antifragile system?
Randy: Let me just put it this way: I am a TDD guy. Part of that comes out of me now working in the Ruby on Rails community, which is really strong on that, but I actually have a complete contrast in my career. For the first seven years, when I did a lot more PHP.net type of development, I didn’t do any test-driven development. I didn’t write any automated tests in my code. If it came to, like, having an automated process walk through the code, like a person on a browser would, and test things, I didn’t do any of that. And I wrote code that broke. It took me forever to build and it wasn’t very maintainable. A change I would make on one part of the system would have a butterfly effect through the rest and I had no way of knowing unless I tested everything to go along with it, so that is the contrast to what you’re talking about now, which is a more automated software approach to development. And so now what I do, and I pretty much feel - I don’t believe in 100% coverage. I think it is almost impossible, especially with the use of APIs that you don’t handle, which are difficult to test. But what I definitely do is strive for like an 80% coverage, where if I am writing a method or I am adding a feature there is some automated process that makes sure that, in most scenarios, this code keeps running. And then every time I push a change, we test. We run all the tests, like all of them, even if it doesn’t pertain to that change necessarily, the idea is to catch any butterfly effects that you may cause and then every time we push to our server to staging, same thing. All the tests run. So I am huge on that and it is still a big debate in the software community that I am a part of about the value of it and how hard it is, and almost every CTO, technical manager that is in a discussion about it, always if they haven’t come up in that world, the pain of starting learning how to do it is really steep.
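A minimal sketch of that workflow, written with Python's built-in `unittest` rather than the Rails tooling Randy mentions (the language is chosen only for illustration). The `order_total` function and its rules are hypothetical. In TDD the three tests would be written first and fail, and only then would the function be filled in; `run_all_tests` stands in for the "run everything on every push" step.

```python
import unittest


def order_total(prices, shipping=0.0):
    """Sum item prices and add shipping; reject negative prices."""
    if any(p < 0 for p in prices):
        raise ValueError("prices must be non-negative")
    return round(sum(prices) + shipping, 2)


class OrderTotalTest(unittest.TestCase):
    # In TDD these tests exist, and fail, before order_total is written.
    def test_empty_order_costs_nothing(self):
        self.assertEqual(order_total([]), 0)

    def test_adds_items_and_shipping(self):
        self.assertEqual(order_total([10.00, 5.50], shipping=4.25), 19.75)

    def test_rejects_negative_price(self):
        with self.assertRaises(ValueError):
            order_total([-1.00])


def run_all_tests() -> bool:
    """Run the whole suite, as on every push, and report success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderTotalTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("all tests passed:", run_all_tests())
```

Running the full suite, even for a change that seems unrelated, is what surfaces the butterfly effects: a later edit to `order_total` that breaks shipping would fail a test nobody thought to rerun by hand.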
Don: It’s interesting if you think about it, and I think this is where age, experience plays a factor, you and I both Gen X-ers, mid career, that sort of thing. So, we did come up in the development world before there was test-driven development, before it was as widespread as it was. Your early years are very much like mine. “I don’t write software that has bugs in it; therefore why would I test for it ahead of time, right?” We have the luxury of living in that world and going “wait, I see so many benefits out of this test-driven development, and I do understand the pitfalls of not doing it.” Whereas, developers that have come up and have learned in this world and don’t know another world, it is ingrained in them to do test-driven development, so in one way it is easier for them because they are not fighting biases, they are not fighting all those bad habits. But they also necessarily don’t have the battle wounds, the scars…
Randy: So what has been the philosophy at AspirEDU with testing? Because you are not the one doing the development, so what is your team’s philosophy?
Don: We do the same thing. We have automated tests that we go out there and run every time something gets committed. We’ve got a code coverage robot that sits there and checks how much our code is covered, and you’re right, the 100% code coverage is unattainable, likely, because of all the external factors. And it comes back to what you were discussing earlier.
Randy: Does that bother you?
Don: It’s also a cost benefit ratio, right?
Randy: Does it bother you, that whole lack of 100% coverage?
Don: Does it bother me? I won’t say it bothers me. Certainly when you have something fail that was not covered you start to Monday-morning quarterback it a bit and go “well, should we have had it?” That’s an easy answer - yes, probably. There are the “should we have had it?” to where the answer is “we might have been able to foresee that and we might have been able to write a test for that,” and then there are the ones that are like “I am not sure how we would have envisioned putting that together in order for it to fail in just that way.” So, I think we learn from every time we have an issue pop up, and it gives a different flavor to the test cases that are written in the future but certainly there is no drive to get to 100%, and no, I don’t know that I lose sleep that we don’t have 100% coverage.
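The "code coverage robot" Don mentions can be thought of as a simple gate: measure what share of the executable lines the tests actually ran, and compare it with a target such as the 80% Randy aims for. The sketch below is a hand-rolled illustration with invented function names; real coverage tools collect the executed lines automatically.

```python
def coverage_percent(executable_lines: set[int], executed_lines: set[int]) -> float:
    """Share of executable lines that at least one test actually ran."""
    if not executable_lines:
        return 100.0
    covered = executable_lines & executed_lines
    return 100.0 * len(covered) / len(executable_lines)


def passes_gate(percent: float, threshold: float = 80.0) -> bool:
    """True when coverage meets the team's target (80% here, not 100%)."""
    return percent >= threshold


if __name__ == "__main__":
    executable = set(range(1, 11))        # a 10-line module (made up)
    executed = {1, 2, 3, 4, 5, 6, 7, 8}   # lines the tests reached
    pct = coverage_percent(executable, executed)
    print(f"{pct:.0f}% covered, gate passed: {passes_gate(pct)}")
```

A threshold below 100% encodes the cost-benefit call both hosts describe: the gate catches a drop in coverage without demanding tests for every external factor.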
Randy: So then, going back to the book. You’ve read it, so going away from test-driven. In a way, test-driven development is to prevent chaos from happening. Which I think it does. I feel like the impact on the code that my teams have built since I implemented TDD as a philosophy in product development has been huge. I have no measurable way to talk about it. I can just say the code quality that my teams have been able to build with it is immense, and problems on the back end, the chaos, have been minimized a tremendous amount compared to what I worked on prior. I don’t know. What else did the book talk about that happens; like, you can’t test for it, things still happen out of your control, what did the book talk about, and then what have you been able to gather that maybe you should do more of or - I don’t know - maybe you’re like “this is out of our reach” kind of thing? I am not sure.
Don: The book doesn’t talk about it because the book didn’t really draw the parallel between being antifragile and software systems. Didn’t really hone in on that a whole lot. But it was that revelation in my head that yes, test-driven development aims to prevent chaos, but the way I see it is, when we’re building these systems and we’re using test-driven development and we’re building these test cases that come up, we are making our system more antifragile as it goes because we start with a set of test cases, we believe we’ve got good coverage, something happens. A bug happens, we have an issue; therefore, chaos has entered the system, we’ll say. We then go fix the problem. So that’s the resiliency part of it and just fix the problem and then analyze how the problem arose. Write test cases to cover that area plus maybe peripheral areas around that, and in that way the software system thrived from the chaos. So, as long as we didn’t have a hard outage, to where whatever the issue was completely took our system down and we’ve got clients that are unhappy because of all that, as long as that didn’t happen and it was more of a “Hey, we had this bug when we hit this little thing so it popped a 404 page” or whatever the issue was. As long as it’s a minor impact that is certainly a good example of a software system was in one state, chaos entered the system, and upon completion that system is now stronger. That’s the parallel I am trying to draw here, so you’re absolutely right that test-driven development reduces chaos, but that’s kind of the point. It also helps you react to chaos and helps make the system stronger in the end, so that’s what I was trying to draw from it.
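The loop Don walks through (a bug appears, it gets fixed, and tests are added for that spot plus the areas around it) might look like the following sketch. The scenario is invented for illustration: a hypothetical site where a trailing slash on a URL used to pop a 404 page. `PAGES`, `normalize`, and `status_for` are assumed names, not anything from the episode.

```python
import unittest

PAGES = {"/": "home", "/pricing": "pricing"}  # hypothetical site map


def normalize(path: str) -> str:
    """The fix: '/pricing/' used to miss the lookup and return a 404."""
    stripped = path.strip().rstrip("/")
    return stripped or "/"


def status_for(path: str) -> int:
    """HTTP status the site would answer with for this path."""
    return 200 if normalize(path) in PAGES else 404


class TrailingSlashRegressionTest(unittest.TestCase):
    def test_the_reported_bug(self):
        # The exact request that produced the 404 in production.
        self.assertEqual(status_for("/pricing/"), 200)

    def test_peripheral_cases(self):
        # Neighbouring inputs nobody reported, but likely to fail the same way.
        self.assertEqual(status_for("/pricing//"), 200)
        self.assertEqual(status_for(" /pricing "), 200)
        self.assertEqual(status_for("/"), 200)

    def test_unknown_page_is_still_a_404(self):
        # The fix must not turn real misses into false hits.
        self.assertEqual(status_for("/nope/"), 404)
```

Saved to a file, the suite can be run with `python -m unittest` on that file. After the fix the test suite is larger than it was before the bug, which is the sense in which the system came out of the incident stronger.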
Randy: Makes sense.
Don: So like I said, book club for this week. I wanted to introduce the book. I will pull a quote out of the book because I think this is a topic we will come back to. We can spend maybe five minutes on it here and then decide if it needs its own episode later. A quote out of the book is: “The three most harmful addictions are heroin, carbohydrates, and a monthly salary.” So, what he is driving at there is - and I have experienced this in my career - that monthly salary allows you to get comfortable. It allows you to not worry about improving yourself. Now, this is not necessarily the attitude you and I take towards things, but it does allow someone to say, “hey, I am feeding my family, have a decent lifestyle. I don’t need to worry about doing anything else. I do not need to worry about improving.” But when that monthly salary is under attack or you are a freelancer or consultant and you are having to earn every penny, you as a person start to become a little more antifragile because you sit there and you go out and you learn more things, you investigate more avenues. Like I said, this is probably something we are going to want to take a totally separate episode on, but I was sharing with you just last week - there was a sale on books from a couple of different publishers and I picked up five different ones and I shared that list with you, and it was a varied list. It was not five books of all the same one topic. It was five books about five very different topics. Because they were all things that interested me. They were all things that will make me better, make it better for me to understand different things that I have touched on in my jobs. But they were all very different. That’s just another way of getting away from software systems being antifragile, more to people being antifragile.
Randy: The term is complacency. That’s the fear, I think - I mean, this is beyond technology. It’s leadership and management for a strong part. The fear of every manager or business owner is that their team gets complacent. Maybe it’s the result of the salary and the structure or the stability that comes as part of that, but the idea that people will quit pushing the limits that a business needs to compete and for a software team it would be, “hey you are getting used to these error messages coming through the log system and they are not a big deal” so you just kind of keep letting stuff go past until one day you just let a huge issue go through, and because you just quit paying attention or caring about “hey, the framework we are using is upgraded, but that is a steep task to upgrade to a new version every year. We’ll just let this one ride.” And I think people will knock a manager - I don’t manage through chaos. I aim for stability of people’s anxieties and stuff, but there are managers that their philosophy is a chaotic environment breeds, like you said, resiliency and innovation, and that’s when they promote an environment that fights complacency in that case. Like you said, I think complacency on an engineering team is a huge topic that we can definitely tackle in the future. I think it’s a great subject that I have talked to managers about. I haven’t talked to really engineer peers about it too much, but I have seen it discussed at the management level of things many times. And sometimes it comes out in just a frustrated comment about “nobody cares. This week the energy is low,” and I am not always certain if complacency is a cause for that. Sometimes, well last week was crazy town. This week is, everyone is catching their breath. But it is easy for a manager to confuse those, so I think it is a pretty deep topic we can definitely hit in the near future.
Don: Yeah, let’s tease the listeners to say we’ll come back, because another angle I want to put on this, and it could end up being just too huge to even do in one episode, there has been - I won’t call it a movement, but I will call it a movement just for ease of use - there has been a mini movement lately among the people that I read and I follow towards not contributing to open source in your spare time, not working on your job in your spare time, to have different interests in your spare time, to get completely away from your job in your spare time. And that the idea is valid and there is validity to the point, but there is a balance there as well, and when we talk about this in the future I want to tie that into it as well because to me that’s a huge part of it as well.
Randy: Makes sense. Well, we hit the 30-minute mark. I think we have got some great things to talk about coming up next week too. What’s your next week looking like?
Don: So, haven’t mentioned this here yet, but I had shoulder surgery a few weeks ago, so next week I get to get out of the sling, so that’s what I’m looking forward to next week. That sling is a pain to sleep in.
Randy: I asked you about this yesterday through Slack because we had an Uber driver because, you know, you get in a car with a rideshare and sometimes the stories go all over the place, but this particular driver had rotator cuff surgery and he talked to us about all he went through, and I was like “Oh man, Don had this surgery two weeks ago. That sounds worse than even he said. I need to ask him about this.” So I definitely sympathize tremendously and look forward to you getting rid of that sling as well.
Don: Yeah, the physical therapist said that the rotator cuff would be a more painful one in some ways, but this one’s more difficult in other ways, so they are two different ones. So, anyway, enough medical 101 on CTO Think. So what’s coming up in your week?
Randy: I am going to the gym so I don’t have to have rotator cuff surgery in the future. That’s next week every day.
Don: Do it properly then!
Randy: Right now I am trying my best to reduce meetings and keep coding and working with the folks on my team that are doing coding in the sense that I do need to be doing networking and marketing, but right now I am in that mode of I am just getting things done really well and I want to stay as much as I can in the office, reduce as many meetings as possible and keep cranking out the work for people right now that I need to get done instead of doing that usual “oh, the holidays kind of keep you in. I want to go out and spread out and network.” Every January seems to be that whole new resolution of networking, and maybe it’s due to the cold too. I just want to stay inside, but I am really focused on kind of cranking out work without a ton of discussion if not necessary and then pushing that kind of network burst that I typically do at the beginning of every year to later. So next week, I think, I am just going to be managing and development and minimizing meetings. That’s the goal.
Don: That is a worthy goal, so let’s see how well you achieve it.
Randy: Alright, man, we will talk next week.
Don: Very good. Thank you. See ya.
Thanks for listening to the CTO Think podcast.
If you like what you heard, please share the link to the podcast with your friends.
The show music is “Dumpster Dive” by Marc Walloch, licensed by Premiumbeat.com.
Show notes and previous episodes can be found on our website at ctothink.com.
For questions, comments, or things you would like to hear on future shows, please email us at email@example.com.
For notifications of future episodes, please sign up for the CTO Think newsletter, also on our website.
We’ll keep talking next week.