
I could never have designed WhatsApp

I finished a project with a wonderful group of people, and part of it involved working directly with WhatsApp (unofficially, though). Through that I think I have gotten a good idea of how it works (I could tell from metrics when their daily deployments happened… if that is enough confidence), and the design is brilliant and stupid all at once. Importantly, though, it is not something I would have designed, so I thought it might be nice to share how I think it works. Before we start, to be clear: I have no proof of any of this (except a few thousand hours of working with it) and I never spoke to anyone from Meta… this could all be wrong.

Getting to the first tick

Let’s say Alice is sending a message to Bob. The message goes from her device to the WhatsApp server, where the metadata on the message is checked and ultimately the message is stored on their server. The message itself is encrypted, so if you could access the database of messages it would be useless, as only Bob can decrypt it; but the metadata is not encrypted. That includes information like who is involved in the conversation, the message type (for example image, voice note, edit, or just text) and version info (and you can get rejected for incorrect version info). This is (I assume) stored encrypted at rest on WhatsApp’s side, but if you could use their tools, it would be available to you.
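To make that split concrete, here is a sketch of my mental model of such an envelope, written in Rust. To be loud about it: the names and shape here are entirely mine for illustration, not WhatsApp’s actual wire format.

// Purely illustrative: my mental model, not WhatsApp's real format.
struct MessageEnvelope {
    // Metadata: not end-to-end encrypted, visible to the server.
    sender: String,
    participants: Vec<String>,  // who is involved in the conversation
    message_type: MessageType,  // text, image, voice note, edit...
    client_version: String,     // incorrect version info can get you rejected
    // The message itself: encrypted so only the recipient can decrypt it.
    payload: Vec<u8>,
}

enum MessageType {
    Text,
    Image,
    VoiceNote,
    Edit,
}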

The second tick

WhatsApp now uses the established connection with Bob’s device and pushes the message to it. The message and any attachments are now on the device, which responds to WhatsApp, triggering the second tick being sent to Alice. At this point, WhatsApp removes the message from their servers. This is why you cannot restore your messages from WhatsApp unless you have a backup: they do not keep them. This is brilliant because it lowers disk usage on the WhatsApp side and reduces the risk of a breach, or of someone demanding messages from WhatsApp, because they simply do not have them.

WhatsApp Web can restore your messages

If you have ever used WhatsApp Web, you know that when you connect you see all your existing conversations and messages; it is live info! Yet I just told you that WhatsApp does not have this info on their servers… so where does it come from? It comes from the device! This blew my mind: when WhatsApp Web starts, it uses the WhatsApp servers to establish a proxy connection to the device. The device then does an export of its data, WhatsApp Web imports it, and that is how you get the data.
(I think Signal works the same way, because if you run a transfer to a new device, it needs to connect to the original device to ingest the messages.) What did surprise me, and is the stupidest part: the export format for Android and iOS is different! In other words, what you get (and how many messages) depends on what device you have and what version of WhatsApp is on it. This means WhatsApp Web needs to support multiple import formats, because there is no standardisation. Once the import is done, the source device can be turned off and everything is still available, which tells us that WhatsApp Web stores all your messages in your browser (likely encrypted at rest).

End-to-End encryption

One of the most common questions I got when I talked about my work was “Isn’t WhatsApp end-to-end encrypted?”, and it is. There is transport-level encryption, like TLS in your browser, and the message is encrypted in a way that only the recipient can decrypt; but once the message is delivered, it is in plain text for a bit while it is shown to the reader. It is also encrypted for storage in a way the recipient can decrypt. This means end-to-end encryption prevents a man-in-the-middle attack but does not prevent a man-at-the-end attack. A man-at-the-end is where you either build something that works directly with the WhatsApp app (accessibility tools are a great legitimate example) or you build your own WhatsApp client, which lets you do anything you want.

FAQ

  1. How does this work with multiple devices? WhatsApp stores the message on the server until it has been delivered to all devices. This is likely why there is a limit on how many connected devices you can have at once, which is 5 at the time of writing… if you add a sixth, one of the others is disconnected.

  2. Are groups any different? No, they are not; again, the message seems to remain on the WhatsApp server until every member of the group has received it.

  3. Do messages ever die on the server? My gut says yes, since there is some wording in WhatsApp about logging in every 30 days, but I never tested this. It would make sense for them to expire.

You should always be “Open to Work”

In 2022, one of the best senior Java developers I knew reached out to me about leaving his company, where he had been for a decade, and asked if I knew anyone hiring. I instantly let the recruitment team where I work know and got him into the process to work with us. I had worked with him before, I had seen him on stage at events, and I knew he had the right personality to make a big impact on the business. I KNEW he would get hired.

Except he failed the coding interview.

I was genuinely shocked; so shocked that I did something that in 20 years of work I had never done before… I assumed the interviewers were wrong, and escalated so that he could get a second chance. And he did get a second chance, and wouldn’t you know it, I was proven to be a fool, because he failed the coding interview again… it was the same question too!

I spoke to my friend about it, read the interviewer’s notes, and concluded that despite everything I knew about him, the skill he lacked was the ability to interview. We were his first (two) interviews in a decade, and he had just handled them the way one would handle a meeting.

As I am a fool, I should have expected this, since I have held a belief since 2013: always be open to an interview. You can love your job, but if you get an offer to interview, take it. I have lost track of how many interviews I have done in the 11 years since that realisation, and I have had the full gamut of experiences from rejection to offer letters… in that time I have taken only 2 roles. It sounds like a waste of time to do hundreds of interviews and get nothing from them, even if you love your job, but I do not think it is.

First, as with my friend, interviewing is a skill, and like any skill it needs practice to stay sharp. That skill includes not being stressed; doing more interviews helps with that, which leads to a better understanding of what is being asked and lets you demonstrate your abilities better. In tech especially, the language we use and the tools change often, and being able to speak to the state of the art is important, particularly if you have been at one company, working on one tech stack or architectural design, for a long time. Equally, as you grow, you need to know what questions will be asked of you, so you can practise your answers and find the anecdotes and examples you will share to show your experience.

The second reason is market data: like anything, your skill is only worth what someone will pay for it, and interviewing gives you real-world data on how what you earn today compares to the market. The same is true for demand, as you can see how many roles exist for your preferred skill or level. When it comes to discussions about salary in your own organisation, being armed with real-world data will help you have realistic expectations.

The third reason is improving your own company’s hiring process. I cannot count how many terrible interview processes I have seen (each one has put me off working with that company), but I have seen many good interviews too. I have taken those ideas and shared them where I work, which has allowed the companies I work with to have amazing interview processes.

This obviously is not without risk; people you work with may jump to incorrect conclusions about your happiness when they hear you are interviewing. To solve this, I have always been open with my managers and told them that they will never be blindsided by me leaving; if I ever find something better, I commit to discussing it with them ahead of time. Some of my best managers have bought into this completely, and I think that speaks strongly to their confidence in providing a great environment to work in and the effort they put into building trust with their reports.

So, in summary, I do encourage you to get your CV on OfferZen, set yourself to “Open to Work” on LinkedIn and sharpen that interview skill… just tell your manager first.

Thoughts on Rust

I finally caved to the pressure from all the smart people I know, who have told me for at least the last 2 years that the Rust programming language is wonderful and we should all jump on it. I followed some tutorials over a week and then attempted to build a small program with it.

If you do not know my background: I have never coded in C/C++. The closest I got was Turbo Pascal, and later Delphi, but I have mostly worked with memory-managed runtimes for 15 years (.NET, JVM, JS), so caring deeply about memory is a big change for me. For interest, I did all my coding either in the Rust Playground (which is nice) or VS Code, which worked flawlessly.

Before I get to my thoughts on it, I want to call out 5 interesting aspects which were front of mind for me.

Included tooling

The first thing that jumped out was the lovely tooling, called Cargo, which is included as standard. In theory you could avoid it and use the pure Rust language, but literally every tutorial and discussion assumes it is there, and it makes life so easy. It is beautiful and handy to have a universal tool stack. Cargo does several things: first, it is a capable package manager, similar to npm or Yarn. It also has a formatting tool and a linter built in, which means the bikeshedding around the language is nerfed, and that is wonderful. Rust/Cargo has its own testing framework too. It lowers barriers and makes it so easy to get started. Cargo also has a documentation generator, which combines the comments in your code with the code itself to build very useful docs.
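To give a feel for it, these are the day-to-day subcommands, all part of the standard toolchain:

cargo new hello_world   # scaffold a new project
cargo build             # compile, fetching and resolving dependencies
cargo fmt               # format the code (rustfmt)
cargo clippy            # lint the code
cargo test              # run tests, including documentation tests
cargo doc               # build docs from your code and comments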

Delightful error messages

When I screwed up the code, which happened often as I got to grips with the language, the error messages it raised were absolutely beautiful and useful! For example, I tried to use ++ (increment), which does not exist in Rust, and the error message very clearly told me so and what I could use as an alternative.

Standard Error
   Compiling playground v0.0.1 (/playground)
error: Rust has no postfix increment operator
 --> src/main.rs:3:10
  |
3 |     hello++;
  |          ^^ not a valid postfix operator
  |
help: use `+= 1` instead
  |
3 |     { let tmp = hello; hello += 1; tmp };
  |     +++++++++++      ~~~~~~~~~~~~~~~~~~~
3 -     hello++;
3 +     hello += 1;
  |

Documentation unit tests

Having “built-in” tooling, especially for testing, means you can do amazing things knowing it is there, and Rust does one of the most interesting things I have seen in any language: unit tests included in the documentation. The tests are executable in VS Code and they run with Cargo, but they do not need to live far from the code, which is awesome, and they end up visible in the documentation that Cargo builds too. It is brilliant.
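Here is a minimal sketch of what that looks like (the add_one function and the my_crate name are mine for illustration); cargo test compiles and runs the example inside the comment:

/// Adds one to the given number.
///
/// # Examples
///
/// ```
/// assert_eq!(my_crate::add_one(41), 42);
/// ```
pub fn add_one(x: i32) -> i32 {
    x + 1
}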

Shadowing immutable variables & redeclaring variables

Moving from the good to the bad: redeclaring variables is supported, and it baffles my mind that it is even allowed. A small primer: variables are immutable by default (this is great); you can make them mutable if you want, and changing the value of an immutable variable is not possible. For example, this fails:

fn main() {
    let a_message = "This is the initial message";
    a_message = "this will error"; // error[E0384]: cannot assign twice to immutable variable
    println!("{}", a_message);
}

But we can redeclare the variable with a second let, even though it is immutable. I understand there are scenarios where it is useful, but it feels dirty:
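fn main() {
    let a_message = "This is the initial message";
    let a_message = "this will error"; // a second `let` shadows the immutable variable
    println!("{}", a_message); // prints "this will error"
}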

If this were limited to mutable variables it would make sense, but 🤷‍♂️. The above example prints “this will error”, which is at least logical, but when we add changes of scope, all of that goes out the window.
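A minimal sketch of the scope behaviour:

fn main() {
    let a_message = "outer";
    {
        let a_message = "inner"; // shadows only inside this block
        println!("{}", a_message); // prints "inner"
    }
    println!("{}", a_message); // prints "outer" again; the shadow ended with the scope
}

It feels like a mess to me, and I hope the linter gets stricter about disallowing it.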

Passing ownership of variables to functions

This is not bad, but it is easily the biggest shift in thinking needed; if you can grok this, you will be OK in Rust. In C#/Java/JS, if you are in a function, create a variable, and pass that variable to another function… the variable still exists in the original function. Passing by reference or pointer or anything else… it does not matter.

In Rust, unless you opt in (and meet some other requirements), if you pass a variable to a function, that function owns it and it goes away from the original function. Here is an example of what I am talking about:

fn writeln(value: String) {
    // takes ownership of `value`; the String is dropped when this function returns
}

fn main() {
    let a_message = String::from("Hello World");
    println!("{}", a_message);
    writeln(a_message); // ownership of `a_message` moves into `writeln`
    println!("{}", a_message); // error: borrow of moved value, because `writeln` now owns `a_message`
}

It is confusing initially, but there are solutions, and I do like that it helps push the right design patterns.
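Here is a minimal sketch of the usual fix, borrowing the value instead of moving it (my writeln now takes &str):

fn writeln(value: &str) {
    // borrows the string; the caller keeps ownership
    println!("{}", value);
}

fn main() {
    let a_message = String::from("Hello World");
    writeln(&a_message); // lend it out...
    println!("{}", a_message); // ...and we can still use it afterwards
}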

Overall thoughts

Rust is excellent. It is modern and smart. It makes a lot of sense for high-performance systems or where you need to run on bare metal. I think the intentional limits will keep it a specialised tool, compared to something like Kotlin or TypeScript, which let you mix & match FP/OOP/shitty code styles.

Build a Google Calendar Link

I recently created an event website and needed to create links for people to add the events to their calendars. The documentation for how to do this for Google Calendar is a mess, so this is what I eventually worked out.

Start with https://www.google.com/calendar/render?action=TEMPLATE; each piece we add from here is an additional query parameter.

  1. The title is specified with the text parameter, so append that if you want a title. For example, if I wanted the title to be Welcome, I would add &text=Welcome, e.g. https://www.google.com/calendar/render?action=TEMPLATE&text=Welcome
  2. Date/time is next, using the key dates. Dates are formatted as YYYYMMDD, e.g. 1 July 2022 is 20220701, and times as HHMMSS, e.g. 8:30 am is 083000. The date and time are separated by the letter T, and the start and end are separated by /. For example, if the Welcome starts at 8:30 am on 1 July 2022 and ends at 10 am, the value would be 20220701T083000/20220701T100000.
  3. Timezone is optional; without it you get GMT. If you want to specify a timezone, use ctz with a tz database entry as the value. For example, South Africa is Africa/Johannesburg.
  4. Location is also optional: the key is location and the value is free-form text.

If we put the above together as an example, you get https://www.google.com/calendar/render?action=TEMPLATE&text=Welcome&dates=20220701T083000/20220701T100000&ctz=Africa/Johannesburg&location=Boardroom

Notes:

  1. You must URL-encode the values you use. For example, a title of Welcome Drinks needs to be Welcome%20Drinks
  2. There are other parameters for description etc… but I never used them so I do not have them documented anymore.
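Putting it all together in code, here is a minimal sketch in Rust; the calendar_link function is mine, and it assumes the urlencoding crate for percent-encoding:

// Build a Google Calendar "add event" link from its pieces.
fn calendar_link(title: &str, dates: &str, timezone: &str, location: &str) -> String {
    format!(
        "https://www.google.com/calendar/render?action=TEMPLATE&text={}&dates={}&ctz={}&location={}",
        urlencoding::encode(title),
        urlencoding::encode(dates),
        urlencoding::encode(timezone),
        urlencoding::encode(location),
    )
}

fn main() {
    let link = calendar_link(
        "Welcome Drinks",
        "20220701T083000/20220701T100000",
        "Africa/Johannesburg",
        "Boardroom",
    );
    println!("{}", link); // the title comes out as Welcome%20Drinks, as per the notes above
}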

How do you, and other companies, handle tech debt?

I was asked a while ago, “How do you handle tech debt?”, and I am not sure I have ever put my answer down in a form that makes sense; so this is an attempt to convey several of the tools I use.

Pay back early

The initial thinking on handling tech debt is not my idea, but stolen from Steve McConnell’s wonderful Code Complete. Steve shows in his book that if you can tackle tech debt earlier in the lifecycle of a project, its cost is a lot less. Basically, the quicker you get to repaying the debt, the cheaper it is.

One aspect to keep in mind is that while “the early part of a project” may suggest greenfield/new projects, it is not limited to that. The early part could also mean epic-sized pieces of work started on an existing project, so “catch it early and it is cheaper” applies to existing teams just as much as to teams starting a new project.

Organisation

When I think of what to do specifically to handle tech debt, the first thing that comes to mind is also the only one that will not fit into an existing team easily, and that is team organisation.

I’ve seen this at Equal Experts, and previously when I worked at both AWS and Microsoft. The simple answer is that teams above 10 people fail. Why is 10 the magic number though?

  1. First, it is partly how we are wired: 15 is roughly the number of close relationships we can maintain, and a closer team performs better. The eagle-eyed reader will note that 10 and 15 are not the same; that is because team members also need room to develop close relationships across the organisation, not just within their team.
  2. The second reason why 10 is the magic number is that, as we develop increasingly complex systems, holding all the information in our heads gets increasingly difficult. Keeping teams at about 10 forces a limit on the volume of work they can build. That natural limit on size limits complexity too, meaning you end up with many small teams.

When I say team, I am not referring just to engineers but to the entire team: POs, QAs, BAs, and any other two-letter-acronym role you may come up with. The entire team is 10 or fewer, so you may find you have only 4 engineers in a team.

Cross-skilled teams

There is also something just right about having teams of 8 to 10 people who have the majority of the skills they need to deliver their work end-to-end. The idea of a dedicated front-end team or a dedicated back-end team, owning only part of a feature, should be rare in an organisation. The bulk of teams in a healthy organisation should own features end-to-end. This forces teams to work together, and when teams hold each other responsible for deadlines and deliverables, that helps trim fat from many aspects of delivery.

Having a team responsible to other teams equally empowers that team to push back on new work, making sure their debt does not overwhelm them and prevent them from actually meeting the demands of the teams they are responsible to.

The focus on one area also helps people become masters of the tech they use, rather than generalists. With that mastery, the trade-offs that lead to tech debt are better understood, compensated for, and implemented right, and that will, over time, lower the tech debt.

DevOps

I always encourage clients to adopt the DevOps mindset. The cross-skilled, end-to-end ownership above hints at that, but one of the best pillars of DevOps to get in early is “you build it, you run it”: simply the idea that a team can write code, deploy it, monitor it and support it.

This mindset might feel like it goes against the earlier idea of a team having all the skills it needs and being empowered for a whole feature, because how can a team own everything from first principles? We will get to that solution later on.

Where I want to focus on how DevOps lowers tech debt is this: while a lot of serious outages are not caused by tech debt, how long it takes to recover from an outage is often directly related to the team’s tech debt. Recovery from an outage is not just “the website is back”; it includes all the post-incident reviews, working out how many clients were impacted, and so on. Nothing I have found motivates a product owner and an empowered team to cut down on tech debt more than the risk of being woken up for an incident at 2 am and then spending a lot of time afterwards trying to root-cause it.

I am happy that I can share more specific information from my latest project, as the client made a YouTube video with us on this very aspect.

Rein in tech

A powerful tool in large organisations is to limit technology choices across the organisation: tech that is kept up to date and in active use is far less of a tech-debt issue than an old system written in a language or tooling that few understand.

Bleeding edge is called that because it hurts

A small solution to tech debt is to pick stable tech for your organisation. Nothing builds up tech debt faster, or is more painful to deal with, than bleeding-edge tech.

Horizontal scaling teams

That cost of getting started with bleeding-edge tech, and the frustration it brings, just leads teams not to invest, which is yet another major worry. Horizontal teams are also the solution to the question from earlier: how does a small team own everything from first principles?

To solve this, we often build horizontally focused teams that own a single feature or set of features that other teams build on top of. IDPs (internal developer platforms) are a great example of this. Another example is having a team that handles all the web routing, bot detection, caching, etc., with other teams plugging into their offerings. In this case, the team building the web tech might say “we only support caching with React ESIs”, so teams in the business can choose to use React and get the benefit of speed and support. They are still empowered to choose something else, but they then need to justify the trade-off of lost speed compared to using the “blessed tech” from the horizontal teams.

A great example of this is covered in the Equal Experts playbook on Digital Platforms.

Trickle down updates

An interesting side effect of horizontally scaled teams is that they also become mini forcing functions that keep other teams’ tech up to date. This happens naturally as the horizontal team updates and pushes new versions to the consumers of their solutions.

I saw this recently when the team responsible for the deployment pipeline runners we use issued new runners, which meant we needed to take on operational work to migrate from the old runner system to the new one. This forced work meant cleaning up and fixing how we worked with the runners; it was not a great time for the team, but the system that came out the end is in better shape.

Bar raisers

An aspect that was unique to AWS, which I loved, and which is also easier to adopt than organisational change, is the concept of bar raisers.

The idea is that bar raisers are a group of people who give guidance to others to improve their work, but they are not responsible for the adoption of that guidance.

For example, at AWS, if you wanted to do a deployment that was higher than normal risk, you were required to complete a form explaining how it would happen, how you would test it, and how you would recover if it went wrong. You would then take that to the bar raisers, who would review the document and give you feedback. This is great because they are not gatekeepers; they are not there to prevent you from doing a deployment (again, teams need to own what they build), but they bring guidance and wisdom to the teams.

We had set times for bar raising, and set days when each of us would do it, which helped the senior people not be overwhelmed with requests. The concept of bar raisers was used in all aspects, including security and design. This sharing helped teams find out about each other’s capabilities, spread knowledge, and kept teams from falling into holes others had already found, all without bringing in the dreaded micromanagement.

Tech debt is normal work

The last two concepts are two of the easiest to adopt in any organisation. The first is simply to capture all tech debt as normal work in your backlog. This helps teams prioritise and understand the lifecycle of their projects better.

We have done some experiments recently to measure average ticket time, and when coupled with operational tickets (as we call them), those tickets drag your average down if they are not being attended to. This helps the product owner prioritise correctly and understand the impact.

Even if a team does not pick up the work immediately, that is OK, because an important aspect of teams that adopt “you build it, you run it” is that they have natural ebbs and flows in their work. For example, the festive season might be very quiet: you will have someone on call in case something goes wrong, but a lot of the team is not there. This quiet time becomes a great opportunity to get tech debt resolved.

Lastly, on capturing it: you cannot fix what you cannot see, so shining a light on it and going “well, that is worse than we expected” is a great first step.

Tech debt sprints

The last one is an idea from my days at Microsoft: tech debt sprints. I spoke about this at Agile Africa, in case you want to watch a video. The idea is to add an extra sprint at the end of every feature and just let the team tackle tech debt. At Microsoft, this let us go fast, ship MVPs to customers, get feedback and make trade-offs, all knowing we were piling up tech debt, but it also gave the team confidence that it would be fixed sooner rather than later (or never).

Keeping dependencies up to date

If you work with JavaScript or TypeScript today, you have a package.json with all your dependencies in it, and the same is true on the JVM with build.gradle… in fact, every ecosystem has a package-management system like this, and you can easily use it to keep your dependencies up to date.

In my role, every time I add a new feature or fix a bug, I update those dependencies to keep the system alive. This pattern comes from my belief that part of being a good programmer is following the boy scout rule.

I was recently asked whether I believe these dependency upgrades are risky, and whether we should rather batch them up and do them later, since that would make code reviews smaller and our code would not break from a dependency change.

I disagreed, but saying “the boy scout rule” is not enough of a reason to disagree… that is a way of working. The reasons I disagreed are…

Versions & Volume

All dependency version upgrades have the chance to fail. By fail I mean they break our code in unexpected ways.

There is a convention (semantic versioning) that minor version changes should be safe to upgrade, which is why I will often do them all at once with minimal checks, while major version changes I approach with more care and understanding, and normally on their own. The major version change is the way the dependency’s developer tells you and me that there are things to be aware of.

Major vs. minor will never be rules you can rely on perfectly; rather, they are guidance on how to approach the situation. Much like when you drive a car, a change in the speed limit is a sign that you need more or less caution in the next area.

As an example that neither the type of version change nor the volume of changes is a guarantee, let me tell you about last week. I did two minor version updates on a backend system as part of normal feature work. They broke the testing tools, because one of the dependencies had a breaking change. A minor version, with a breaking change.

It was human error on the part of the dependency’s developer to ship it as a minor and not a major change; that shaped how I approached the update, and that mismatch will always increase the chance of issues.

Software is built by humans. Humans, not versions, will always be the cause of errors.

Risk & Reward

I do like the word “risk” when discussing whether you should update, because risk never lives alone; it lives with reward.

How often have you heard people saying updating is too risky, focusing on the chance of something breaking… but not mentioning the reward if they did update?

Stability is not a reward; stability is what customers expect as the minimum.

When we do update, we gain code that performs better, is cheaper and easier to maintain, and is more secure. The discussion is not “what will it break?”; it is “why do we not want faster and safer code for cheaper?”

I have inherited a piece of code from a team that did not update their versions, and it has a lot of out-of-date dependencies. There is a high chance of breakage when we start to upgrade those dependencies, because it was left uncared for.

However, if I look at the projects my team has built, where we all update versions every time we make a change, we only ever do one or two small updates each time. It is easy to see when issues appear, which makes fixing them easy too.

Death, taxes and having to update your code.

As a developer, there is only one way of escaping updating your code: hand it to someone else to deal with and change teams. Otherwise, eventually, you will need to upgrade, and doing it often and in small batches is cheaper and easier for you.

Using the backend system example from above: I had only two small dependency changes, so my issue had to be in one of them. I could quickly check both, and within 15 minutes I was in the release notes for one of them, where the docs clearly showed the change in logic. That let me fix our code to work with it, and thus we could stay on the new version. If I had had 100 changes… I would have rolled it all back and gone to lunch, and future me would hate past me for that.

Architects & Gardeners

Lastly, our job is not to build some stable monument and leave it to stand the test of time. I deeply believe in DevOps, and thus in the truth that software is evolutionary in nature and needs to be cared for.

We are gardeners of living software, not architects of software towers.

In our world, when things stop… they are dead. Maintenance, and fixing things that break, is core to our belief that living software is the best way to deliver value to customers.

Tenets of stable coding

  1. Build for sustainability
    We embrace proven technology and architectures. This will ensure that the system can be operated by a wide range of people and experience can be shared.
  2. Code is a liability
    We use 3rd-party libraries to lower the amount of code we directly need to create. This helps us go fast and focus on the aspects which deliver value to the business.
  3. Numbers are not valuable by themselves; we focus on meaningful goals and use numbers to help our understanding
    We do not believe in 100% code coverage as a valuable measure
  4. We value fast development locally and a stable pipeline
    We should be able to run everything locally, with stubs/mocks, if needed. We use extensive git push hooks to prevent pipeline issues.
  5. We value documentation, not just the “what” but also the “why”
  6. We avoid bikeshedding by using tools built by experts, to ensure common understanding.

We acknowledge that there are physics of software which we cannot change:

  1. Software is not magic
  2. Software is never “done”
  3. Software is a team effort; nobody can do it all
  4. Design isn’t how something looks; it is how it works
  5. Security is everyone’s responsibility
  6. Feature size doesn’t predict developer time
  7. Greatness comes from thousands of small improvements
  8. Technical debt is bad but unavoidable
  9. Software doesn’t run itself
  10. Complex systems need DevOps to run well

From Tom Limoncelli; his post goes into great detail

OWASP Top 10

I have recently been talking a lot about the OWASP Top 10 and have created some slides and a 90-minute talk on it!

So if you want to raise your security, this is a great place to start.