Irrational Number About

Using select(2)

This started out as a Hacker News comment about a blog post about select being fundamentally broken After that the same author wrote why epoll is broken as well. I don’t know what problem author is trying to solve but let’s pick this one:

Let’s say you run an HTTP server, serving a large number of short lived connections. You want to accept() as many connections per second as possible. Doing accept() in only one process will surely be CPU-bound. How to fix it?

Now I’m no network programming expert but I have written enough of networking code in C (and C++, Python, Java) over the last 17+ years and I have thought about the C10K problem and read papers on architecture of servers, proxies, load balancers and using socket API efficiently. People working on libev and HAProxy definitely have a better understanding and might find flaws in my arguments.

Using select the correct way

Literature usually says that fd_set is limited to 1024 descriptors but some mention that you can increase that by defining FD_SETSIZE to a larger number. Last time I checked it did not work on Linux although it did on SunOS and AIX. However fd_set is just an array of bits and you can allocate memory for as many bits as you need. And turns out that all select implementations I worked with accepted larger arrays of bits than the standard fd_set allocated. I still had to put #define FD_SETSIZE 65536 in my code for SunOS to use select_large_fdset function instead of select.

fd_set *wfds_out = calloc(FDS_BYTES(65536), 1);

To find out which sockets have events some loop through data structures and check each socket they contain. But a much better way is to iterate through fd_set as array of long and check 32-64 bits at a time. If any of bits is set only then check them individually in a nested loop. Some implementations even check read, write and exception sets at the same time by ORing bits - Linux kernel does that in fs/select.c.

unsigned long *wfds = (unsigned long *)wfds_out;
for (unsigned i = 0; i < FDS_LONGS(maxfd); i++, *wfds++) {
  unsigned long bits = *wfds;
  if (bits != 0) {

    unsigned long bit = 1;
    for (unsigned j = 0; j < FDS_BITPERLONG; j++, bit <<= 1) {

      if (bits & bit) {
        int fd = i * FDS_BITPERLONG + j;

Once a notification is received for listener socket the best approach is to call accept multiple times up to certain limit (10? 50?) or until EAGAIN or EWOULDBLOCK error is returned indicating there are no connections pending. This way more useful work is done per iteration amortizing the cost of “expensive” select call. I found this reading “Scalability of Linux Event-Dispatch Mechanisms” by Abhishek Chandra and David Mosberger.

while (1) {
  int client = accept(server, (struct sockaddr *)&sin4, &sin4_len);
  if (client == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    break;
  }

Another useful idea is to trade latency for throughput with interval control. It works by limiting how many select calls are performed per second and enforce it by sleeping before the next call to select. During that time more events will occur on sockets and the next select call will return more events to process. Again, more useful work will be done per iteration. The source for this is paper “Efficient Network I/O Polling with Fine-Grained Interval Control” by Eiji Kawai, Youki Kadobayashi, Suguru Yamaguchi.

while (1) {
  tnow = usec();
  if (tnow - tprev < 1000) {
    struct timespec spec;
    spec.tv_nsec = 1000 - (tnow - tprev);
    nanosleep(&spec, NULL);
  } else {
    tprev = tnow;
  }

  int nready = select(server + 1, &rfds_out, NULL, NULL, NULL);
 

All the small things

select modifies descriptor sets passed in. A common mistake is to create descriptor set from zero before calling select by looping through data structures and adding all sockets they contain. I’ve seen that to take huge amount of time due to locking and synchronization between threads. The correct way is to maintain a descriptor set all the time and create a copy of it and pass the copy to select.

select returns number of file descriptors in resulting sets. You should count events you have processed and stop once you reach number returned by select.

Most programs will need to store some state associated with the socket. Map seems to be a good choice at first but you should use an array indexed by socket (it’s an int). The POSIX API says that kernel always has to use the lowest unused file descriptor when creating a new one. So the array will be as big as many connections your program handles and kernel will take care of finding a free slot in your array.

And if you use the hard limit returned by getrlimit you can always pre-allocate both the array for state of connections and descriptor sets for select. Which means no re-allocations during runtime and no synchronization between threads. If the state of your connections is really big, allocate array of pointers.

Results

So I wrote sample code using most of the ideas I described above. I wrote it from scratch not using any of my previous work so it “works on my computer” and there are bugs. I used a a small Linux virtual machine to run it with just 1 CPU and 512 Mb of RAM. This is how I started the server:

./accept-server 1025 > accept-server.log

And client in a different console:

./accept-client 100 127.0.0.1 1025 > /dev/null 

And it gave me following results:

awk '{count[$1]++} END {for (word in count) print word, count[word]}' accept-server.log | sort -n
1496608453 40154
1496608454 46292
1496608455 44843
1496608456 47250
1496608457 48005
1496608458 47204
1496608459 45685
1496608460 45992
1496608461 41870
1496608462 44020
1496608463 45392
1496608464 44592
...

First column is the timestamp (seconds since epoch) and second is number of connections accepted.

accept-server was consuming 24% of CPU time and accept-client 71% of CPU time. I think results are pretty good: 40,000 accept calls per second. Passing new sockets to worker threads would take some time but it’s still possible to handle tens of thousands of accept calls with a single thread.

select vs others

poll transfers 8-16 bytes per descriptor to the kernel compared up to 3 bits for select. poll is more efficient for very sparse descriptor set and I would use it if I had a listener thread (poll for few sockets) and separate thread for connections (select for thousands of sockets).

epoll uses system calls (“slow”) to add or remove descriptor to the set. select does the same thing in userspace. If you have to add and remove many descriptors per iteration which is typical for short-lived connections then select might be more efficient. In fact it was for synthetic tests I did years ago. Just search for EVBACKEND_EPOLL in libev documentation

Closing words

I usually ignore topics about the fastest networking code since for problems I work on application logic and database access code is the bottleneck. I have seen badly written networking code that caused problems but not code where the accept call is the bottleneck. The other thing is once you reach certain number events per second you should distribute work among several processes and machines anyway. That may differ if you work for Google or large ISP or try to develop a load balancer or a proxy.

Martin Thompson often talks about mechanical sympathy - that you have to understand how the hardware works and take that into consideration when you design software. Well, you should have sympathy for your Operating System and it’s APIs and find the best way to use it.

Use libev, Boost.Asio or other libraries. I do. Just because I know how to use select efficiently does not mean I want to write low-level code.