Robert MacLean

16 November 2019

Here is the story of all the work I did this week, and it is so odd I feel it needs to be shared… but lets talk about the world the problem can be found in.

The world

It is a µservice (that is the ohh, look how smart I am to use a symbol for micro) written with DropWizard and deployed in a Docker container, with Traefik in front of that. To hit it you go through a ingress controller and a load balancer.

 +---------------+
 |               |
 |   Internet    |
 |               |
 +---------------+
         |
         |
         |
 +---------------+
 |               |
 | Load Balancer |
 |               |
 +---------------+
         |
         |
         |
 +---------------+
 |               |
 |    Ingress    |
 |               |
 +---------------+
         |
         |
         |
 +---------------+
 |               |
 |    Traefik    |
 |               |
 +---------------+
         |
         |
         |
 +---------------+
 |               |
 | Microservice  |
 |               |
 +---------------+

The microservice acts a BFF (backend for front end), so it does some auth fun and makes calls to an internal API and manipulates them (e.g. changes the data structure). We have a number of different REST style calls across GET, POST, PUT and DELETE.

In terms of environments, we obviously have a production environment and we have an integration environment which are setup the same way. We have a stubs environment where we fake out the internal API. Lastly we can run the microservice on our laptops, but there is just the microservice… no traefik etc…

The Problem

When we run our load test from outside everything works except 1 call which fails 98% of the time with a HTTP 500. Other calls (even the same method) all work. Load tests run against the stubs environment and the same call works perfectly on our laptops, in integration and production.

We can even run the fake internal API on the laptops, with the load tests and it works fine there. Basically, one call fails most of the time in one environment… 🤔

Grabbing logs

When we pull the logs for the microservice in stubs things get weirder… it is returning an HTTP 200 😐 This is the same experience we get everywhere else, it works… except in stubs to the load tests it gets a 500

+---------------+   500 Here
|               |
|   Internet    |
|               |
+-------+-------+
        |
        |
        /
        |
        |
+-------+-------+   200 Here
|               |
| Microservice  |
|               |
+---------------+

Further pulling of logs show 500s in the load balancer and the ingress so somewhere between the microservice and traefik the HTTP 200 becomes a HTTP 500… but we don’t have logs on traefik we can pull which makes this a bit harder to determine…

Logging onto Traefik

Next we logged onto Traefik and decided to curl the microservice directly and lo and behold… we get a 500 🤯 and to make it more interesting that the microservice logs still show it is returning a 200 - like what the actual?! Could there be an network issue or magic?

Interesting the 500 came with an error saying “insufficient content written”

Insufficient content written

This led me to looking at the content and looking at what we are sending and I see we are sending Content-Length and the body and guess what… the length of the body does not equal the Content-Length… oh 💩

This is a client side HTTP error… where the server sends the incorrect amount of body, so the client goes, well the server is wrong and raises a 500. I’ve always thought 500 errors were server error and thus could only be raised on the server.

The fix

The fix is simple, in our server we were using the Response.fromResponse to map internal API to the public API and so it was copying the Content-Length from the internal API and we were sending that along.

This meant the fix was to delete the Content-Length header before we call fromResponse to ensure it would rebuild the header and be correct.

The reason why it didn’t fail else where, the version of the mock API we use added Content-Length but newer versions and the real APIs used chunked encoding which never set the header so there is no issue there.