Connection timeout for HTTP requests to one particular server causes connection timeouts for all requests to other, healthy servers
Tech Stack:
Deployed as a Docker container using OpenJDK 17 on an Ubuntu machine
Application code written in Kotlin using the Ktor library
Observations
We observe connection timeouts every 2-3 days
Deployed using Docker (we have tried both the bridge network and the host network)
HTTP endpoints are called using the Ktor client with the CIO engine (uses java.nio socket channels under the hood)
At any time, the maximum number of socket FDs (file descriptors) used by the deployed service is ~250 (ls -la /proc/1474228/fd | grep 'socket' | wc -l)
Restarting the deployment (just docker restart ps_id, not a system restart) fixes the issue, but it resurfaces every 2-3 days
System FD limit (ulimit -n) is set to 65565
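To watch for a slow socket leak between restarts, the same FD count can also be taken from inside the JVM. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function name countSocketFds is hypothetical and not part of our service. It counts the entries under /proc/self/fd whose symlink target starts with "socket:", which is roughly what the ls | grep | wc pipeline above reports.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper: counts entries in an fd directory whose symlink
// target starts with "socket:" (on Linux, /proc/<pid>/fd lists sockets
// as links of the form "socket:[inode]").
fun countSocketFds(fdDir: Path = Paths.get("/proc/self/fd")): Int =
    Files.newDirectoryStream(fdDir).use { entries ->
        entries.count { entry ->
            try {
                Files.readSymbolicLink(entry).toString().startsWith("socket:")
            } catch (e: Exception) {
                // The fd may have been closed between listing and readlink,
                // or the entry may not be a symlink at all.
                false
            }
        }
    }

fun main() {
    println("open socket fds: ${countSocketFds()}")
}
```

Logging this value periodically would show whether the count creeps up over the 2-3 days before the timeouts start, or stays flat around 250.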
So far we have checked
The weird thing is that we can still reach our logging service from the same service (it’s over HTTP only, though hosted on the same machine). PS: It uses Ktor + CIO as well, but through an independent instance.
Another weird thing is that we are still able to reach the Postgres database, which is on a different machine. So process-level TCP buffer saturation can be ruled out, I suppose.
Moreover, we tried reproducing the issue on our QA machine(s) with 10x the load we have in production (1500 concurrent HTTP requests, 120000 in aggregate).
We didn’t get a single connection timeout.
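For anyone who wants to try a similar reproduction, the shape of such a load run (a fixed total number of requests with a cap on how many are in flight) can be sketched as below. This is an illustration only: it uses the JDK's java.net.http.HttpClient instead of Ktor + CIO so that it has no dependencies, and the names runLoad and LoadResult as well as the URL http://localhost:8080/health are placeholders, not anything from our setup.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Duration
import java.util.concurrent.CompletableFuture
import java.util.concurrent.Semaphore
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical result holder for the sketch.
data class LoadResult(val ok: Int, val failed: Int)

// Issues `total` asynchronous calls, never allowing more than `maxInFlight`
// to be outstanding at once. `send` returns a future that completes with
// true on success; false or an exception counts as a failure.
fun runLoad(total: Int, maxInFlight: Int, send: () -> CompletableFuture<Boolean>): LoadResult {
    val permits = Semaphore(maxInFlight)
    val ok = AtomicInteger()
    val failed = AtomicInteger()
    val futures = (1..total).map {
        permits.acquire() // blocks while maxInFlight calls are outstanding
        send().handle { success, error ->
            if (error == null && success == true) ok.incrementAndGet() else failed.incrementAndGet()
            permits.release()
        }
    }
    CompletableFuture.allOf(*futures.toTypedArray()).join()
    return LoadResult(ok.get(), failed.get())
}

fun main() {
    val client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(5)) // placeholder value
        .build()
    val request = HttpRequest.newBuilder(URI.create("http://localhost:8080/health")) // placeholder URL
        .timeout(Duration.ofSeconds(10)) // placeholder value
        .GET()
        .build()
    val result = runLoad(total = 120_000, maxInFlight = 1_500) {
        client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
            .thenApply { it.statusCode() in 200..299 }
    }
    println(result)
}
```

Note that a run like this only exercises healthy endpoints. It would not reproduce a case where one target stops answering, which is the situation we see in production.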