Meditation, The Art of Exploitation: It's hard to write safe and sound C++ code

Recently I found an issue with my C++ network socket library. It's leaking file descriptors. After an application linked to this
library runs for a while, it starts to fail to create new file descriptor. lsof indicates that there are 1000+ sockets opened in
the state of 'can't identify protocol'. Using valgrind file descriptor leak check shows that there are open file descriptors in
the state of <-> . This kind of socket is typically a result of non-connected socket. i.e. A socket is created
on the client side but it's not connected to any server or failed to connect to any server.

Now this would seem to be a strange thing at first given that a C++ class destructor will always release the resource acquired
through its constructor. But as I investigated deeper into the issue, the revelation came to me 'It's hard to write safe and
sound C++'.

Let's first check the code that's not working:
socket.hpp:



#ifndef MY_SOCKET_HPP
#define MY_SOCKET_HPP

#include "utils/network/exceptions.hpp"
#ifndef THROW_SE
#define THROW_SE(cond, msg) do {\
if(cond) throw socket_connect_exception(msg); \
} while(0)
#endif

class basic_socket {
private:
    int sockfd;
public:
    basic_socket(int sockfd) : sockfd(sockfd) {}
    ~basic_socket(){
        if(sockfd > 0) ::close(sockfd);
        sockfd = -1;
    }
};

template 
class client_socket {
private:
    socket_type m_sock;
    int connect(string, int) throw(socket_connect_exception);
public:
    client_socket(string ip, int port) : m_sock(connect(ip, port)) {}
};

template 
int client_socket::connect(
    const std::string & hostname, unsigned int port){
    int sockfd = -1;
    struct sockaddr_in serv_addr;

    sockfd = ::socket(AF_INET, SOCK_STREAM, 0);
    THROW_SE(sockfd < 0, "client_socket open socket");

    int ilen = sizeof(int);
    int itrue = 1;
    int r = ::setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &itrue, ilen);
    THROW_SE(r == -1, "basic_socket::basic_socket Failed to set socket for reuse");

    struct hostent server;
    struct hostent *hp;
    char hsbuf[4096];
    int herr;
    ::bzero(&server, sizeof(server));
    THROW_SE((::gethostbyname_r(hostname.c_str(), &server, hsbuf, 4095, &hp, &herr) != 0),
        "ERROR client_socket::connect::gethostbyname_r no such host");

    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    bcopy((char *)server.h_addr,
         (char *)&serv_addr.sin_addr.s_addr,
         server.h_length);
    serv_addr.sin_port = htons(port);
    int ret = 0;
    ret = ::connect(sockfd, (const sockaddr *)&serv_addr, sizeof(serv_addr));
    THROW_SE(ret < 0, "ERROR client_socket::connect::connecting");
    return sockfd;
}

#endif

As you can see, once you leave the ideal and conceptual language realm and delve into the real world, in
this case, Unix networking, things start to get more complex to manage. First as a library writer, you
generally shouldn't just terminate the program if your client fails to establish a connection to another host.
Instead you should report the error the client and let the client user decide based on the information the
library provided. Since a connection failure is not a logical error (as opposed to use assert), ideally it should
not cause a program shutdown.

In the spirit of that, there are 2 ways to propogate the error condition back to the client, exception or
error code. I've opted to use exception since the very approach can also handle signal as detailed in my previous
blog entry.

Now at this point, the bug is hidden in the code. Why isn't basic_socket cleaning up the socket if there is a failure
in client_socket establishing a connection? Think about it before you scroll down for the answer.

As it turns out, it's possible that a basic_socket is never constructed if there was failure in any of those system
calls to operate the socket. For example, ::socket succeeds, but connect fails and generates an exception. And once
the exception leaves the connect subroutine, the C++ runtime will look for its handler. But neither client_socket
or basic_socket should handle it because they can't make decisions on an connection failure thus there is no exception
handling code for them. As socket_connect_exception leaves basic_socket constructor, m_sock is never constructed. And
when the default client_socket constructor starts to destroy its members, the default m_sock object has no socket to
destroy. Thus leading to a file descriptor leak.

To fix this, I wrapped the statements in client_socket::connect in a try and rethrow block, but just before rethrow,
I test and close the socket. Viola, this fixed the file descriptor leak issue. During investigation of this issue,
I found that the following sutle points dealing with network sockets in a multi-thread application in C++.

1) double close can cause sutle and difficult to track bug. The way unix handles file descriptor creation is that once
a file descriptor is closed, the same file desciptor number will be immediately reused. e.g. int fd = 10; close(fd); fd = open(...);
at this point fd will equal to 10 normally. This has implication in multi-thread application, because a double close may end
up closing a working file descriptor on the 2nd close. Thread #2 may open the file descriptor number closed on the first
close by thread #1. As you can well imagine, this leads to inconsistent, non-producible bug behavior.

2) pay close attention to copy constructors, copy assignment operators the compilers may create for your system resource
management class. The solution, never allow direct copy semantic on this kind of class. Inherit from boost::noncopyable. If
you absolutely need copy semantic, wrap your class creation through boost::shared_ptr and copy the pointer around, this way
the resource is reference count managed and released automatically. This also provides a solution to the double close fiasco.
Since the resource is never double released.

Meditation, The Art of Exploitation

About Me

Wednesday, August 22, 2007

It's hard to write safe and sound C++ code