Don’t fail silently, scream!

Your business logic most certainly has the error, not someone else’s library. Use error messages, your best friend.

While working on a recent project, while I was making remote calls from a client application to a server running on the JVM everything worked until after certain number of requests to the server, the application kept failing.

The server got irresponsive after too many requests, irrespective of the frequency. It just kept stopping.

I was presented with a very specific and helpful message: TSocket: timed out reading 4 bytes from {hostname}:{port}

Initially I thought I may have a very short timeout indeed, so I just bumped up the read and write timeouts to 60 seconds to see if it made any changes. It didn’t. It kept stopping after a certain number of simultaneous requests, and it wouldn’t start until I restarted the server.

Something didn’t make sense, I had tested every edge case of every unit properly - yet while trying to integrate it with the rest of the modules, something simply stopped working.

I kept looking at the client side, and never bothered to look at the server side - because everything seemed “just fine”. Needless to say, I failed and failed hopelessly for a week.

I ended up asking the mailing list, opening a Jira ticket with a critical bug report and had long discussions with developers and users. I was surprised no one had this issue, and they couldn’t come up with a proper answer to this.

Let’s just say the community support isn’t that great for this specific framework.

After a week of forgetting it, working on the business logic, writing more tests I finally decided to fix this issue. I kept reproducing the bug. Then, I noticed a pattern.

It wasn’t just a random number of requests - it was exactly 10 requests. It kept failing after 10 requests, which was odd. Any server should be able to handle 10 requests without any sweat. And why 10? It seemed too human-istic.

For a human being to choose the number 10 (in decimal), very common. For a computer? Not so much.

Maybe I wrote a magic number 10 somewhere? Probably.

Then I started checking out the logs my java server was appending, and the persistence layer likes egesting the SQL queries being executed to the logs for some reason. Which ended up being very useful.

For every request that I made, 2 queries were being executed. Wrongly, they shouldn’t be executed in the first place (which really wasn’t the cause of the problem, but it did help me fix the problem).

So, 2 queries, multiplied by the 10 requests gave me 20 queries in total. That number seemed familiar.

I started checking the connections in the server, if I had left anything open. Nope, every socket connection was closed properly.

Then maybe the database was blocking my connections? 20 queries are too much? Is there a MAX_CONNECTIONS limit somewhere? I was pretty sure I didn’t specify a limit like that anywhere, yet. It’s my development machine.

I checked out my MySQL server with the handy command:

mysql> show processlist;
+-----+------+-----------------+---------------------+---------+------+----------+------------------+
| Id  | User | Host            | db                  | Command | Time | State    | Info             |
+-----+------+-----------------+---------------------+---------+------+----------+------------------+
| 831 | root | localhost       | NULL                | Query   |    0 | starting | show processlist |
| 847 | root | localhost:34782 | Application Service | Sleep   |   54 |          | NULL             |
| 848 | root | localhost:34784 | Application Service | Sleep   |   54 |          | NULL             |
| 849 | root | localhost:34786 | Application Service | Sleep   |    2 |          | NULL             |
| 850 | root | localhost:34788 | Application Service | Sleep   |   54 |          | NULL             |
| 851 | root | localhost:34790 | Application Service | Sleep   |   54 |          | NULL             |
| 852 | root | localhost:34844 | Application Service | Sleep   |   52 |          | NULL             |

AH-HA!

My connection pooler has been spawning way too many processes.

Then I went to my persistence config file specifically looking for my connection pooling config. And here it was - maximum pool size was set to, guess what? 20. All the dots fell into the their correct places, they finally connected.

My connection pooling agent was spawning (or leasing) a new connection for every query I made - and that was expected. What was not expected was that the connection pooler never released them.

Then, I finally looked at my business logic. And guess what? All that pain for a week was for a silly, silly error.

I didn’t commit or rollback the transactions I started. Uncommitted transactions don’t leave the pool, and hence for every request- I had 2 connections.

As soon as I reached 20 connections, things broke. Fixing the transaction worked like a charm.

Lesson Learned:

Close your connections.
Commit your transactions.
Your business logic has the error, not the library.

Here’s where error messages could have been super helpful.

Thrift doesn’t like to carry forward the exceptions thrown by the server to the client which it doesn’t know about.

You have to explicitly tell it, “hey, these are the exceptions that might occur if something goes wrong. Ask the client to handle them”. And if you don’t, well it throws a generic exception with generic messages. Not very helpful.

When I check the server console, when it’s being run with Thrift, it doesn’t fail with an exception either! Java is usually super verbose, but in this case - it failed silently. Surprising.

If I hadn’t run it by the server remotely, but with a custom caller within the server, maybe it would have failed while screaming something like “MaxConnectionsReachedException” (I don’t know what it really throws) with an elaborate stack trace. Now, that would have easily led me to the path I really needed to walk on.

That’s why exceptions are your best friends. Just a simple error message or a stack trace could have pointed me to exactly what went wrong, but it simply wasn’t visible to me.

Well, this is one of the shortcomings of my RPC framework. There are downsides to everything, but it taught me a valuable lesson:

In a development environment, always fail while screaming. NEVER fail silently.

Also, I need to look at my business logic more and integration test more often. And you should too. Stop blaming.

Dedipyaman

Don't fail silently, scream!

About The Author