Thursday, September 12, 2013

Realloc, wherefore art thou realloc?

A few years ago, I found a pretty significant hole in the C standard that could allow strange behavior in a real-world program.

The program in question was a worklog. It received a string over the network from its sister program, and then, after doing some trivial processing on that string, stuck the result wholesale into a database for later retrieval and display.

The implementation was in straight C (C99 standard with GNU extensions), and used the unixODBC package to communicate with an SQLite database. All pretty straightforward, and fully functional... unless someone gave me an empty log book entry.

The logbook backend was coded using defensive programming tactics, where every possible input was checked, null pointers were checked, input lengths were validated, arguments were properly verified to be known valid values, etc. Also, to save space in the programs that were calling this library, I was very careful not to waste memory. I never allocated anything on the heap that I didn't free as soon as I was finished with it, and whenever possible I used stack allocation. But my library wasn't the point of allocation of the user's log entry... that was always allocated outside of my library's control, so as to share the same memory buffer with other operations.

Here's how things got crazy.

On first initialization, when no logbook operations (or any operations) had been conducted yet, the network client program would query my library through a series of API calls that were standardized years before I joined the project, with null arguments to verify connectivity. This ultimately would result in an attempt to insert a record into the database that was of length zero, and would cause a call to realloc a buffer of length zero.

Realloc would happily oblige, and give me back a pointer that I would never reference, because the size of the buffer was zero. So the function would zoom through the rest of the logic of my library skipping basically everything, and then return a success to the client program.

But the trick here is that the pointer returned by realloc WASN'T a null pointer. It was a special pointer that points to a region of memory of length 0.

Subsequent calls to my library with that buffer would barf, because they were acting on the assumption that the buffer was non-null, and thus valid to write to. This caused all kinds of havoc that took weeks to track down.

#include <stdlib.h>

void * foo(void)
{
    void * ptr = NULL;
    ptr = realloc(ptr, 0);
    ptr = realloc(ptr, 0);
    ptr = realloc(ptr, 0);
    return ptr;
}

What's returned by foo()?

Did you guess NULL? On glibc, at least, you're wrong.

The behavior of realloc() is very peculiar.

If realloc() is given NULL as the first argument, then it behaves exactly like the implementation of malloc() does on your platform (in all likelihood, it just calls malloc() internally).

If the first argument isn't null and the second argument is 0, then (on glibc, for example) it calls free() on the first argument and returns NULL.

However, according to the C standard, it is left up to the implementation of the standard library whether malloc returns NULL or something else when it's given 0 as its argument.

So in this case: realloc(NULL,0) -> malloc(0) -> returns a special address with a size of 0.

A second call: realloc(specialaddress, 0) -> free(specialaddress) -> returns NULL

A third call: realloc(NULL,0) -> malloc(0) -> returns a special address

My advice is to guard your calls to realloc so that situations where you would have called it with NULL and 0 result in the behavior that you expect.
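One way to write such a guard is sketched below. The name guarded_realloc is hypothetical (it is not from the original library), and the sketch assumes that "a request for zero bytes means release the buffer and hand back NULL" is the behavior you want. Because it never passes a size of 0 to realloc, it doesn't depend on what your platform's malloc(0) or realloc(ptr, 0) happens to do.

```c
#include <stdlib.h>

/* Hypothetical helper: resize a buffer, but treat a zero-byte request
 * as "free the buffer and return NULL", on every platform. */
static void *guarded_realloc(void *ptr, size_t size)
{
    if (size == 0) {
        free(ptr);   /* free(NULL) is a no-op, so ptr may be NULL here */
        return NULL;
    }
    /* size > 0: ordinary realloc semantics. On failure this returns
     * NULL and leaves the original block untouched. */
    return realloc(ptr, size);
}
```

With this in place, a NULL result for a zero-length request always means "no buffer", so later code that checks for NULL before writing stays safe. A NULL result for a non-zero request still means the allocation failed and the old pointer is still valid, so callers need to keep the old pointer around for that case.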

Trials and tribulations of not freezing your Qt event loop via a lengthy destructor.

I ran into an issue not too long ago where the application that I was writing needed to allocate a huge number of very tiny data structures and then delete them all at once. Now, obviously, there's something wrong with my program's design that it needs to accomplish this... but the point of this particular program was to solve a very specific problem without spending a huge amount of time on development.

The problem here was that even though the performance of the program was acceptable in all other respects, deleting all of that data took a long time. It took so long that the event loop froze, and the rest of the program locked up until it was finished.

Here's my quick 15 minute solution. Absolutely do not use this technique in any mission critical code... but for a quick one-off, well, it's your project.

Firstly, to save on having to manually manage cleaning up after myself, I initially create the data containers as a QSharedPointer:
QSharedPointer<QStandardItemModel> data(new QStandardItemModel);


Because my data was being read from disk and the data structures created on another thread (so as to avoid freezing the gui), I was already moving the data containers from that alternate thread into the main thread using

data->moveToThread(QApplication::instance()->thread());

From there, the QSharedPointers are returned to the main thread as the contents of QFutures using the QtConcurrent module:
QFutureWatcher<QSharedPointer<QStandardItemModel> > * future = new QFutureWatcher<QSharedPointer<QStandardItemModel> >;
future->setFuture(QtConcurrent::run(dataParserClass, dataParserFunction));

This runs the parsing function on the parsing object that has been set up previously, on another thread to avoid blocking the event loop. Ultimately it returns, through the future, the QSharedPointer to the container that we've reassigned to the main thread.

Now, create a thread to hold the containers you wish to asynchronously delete:

QThread * deleterThread = new QThread;
deleterThread->start();

Define a new object to hold the shared data, probably put this in a .h file:
class DataContainer : public QObject
{
    Q_OBJECT
public:
    explicit DataContainer()
        : QObject()
    {
    }

    QSharedPointer<QStandardItemModel> data;

public slots:
    void populate(QSharedPointer<QStandardItemModel> var)
    {
        data = var;
        emit populated();
    }

signals:
    void populated();
};


Then move the data to this new thread:
DataContainer * container = new DataContainer();
container->moveToThread(deleterThread);
connect(deleterThread, &QThread::finished, container, &DataContainer::deleteLater);

And finally, put your data onto the new thread when it's finished being generated. Due to strange behavior of cross-thread signals, I ended up settling on this method of calling the function... but I rewrote enough of the rest of the code that this might not be necessary anymore. Who knows.

QMetaObject::invokeMethod(container, "populate",
Q_ARG(QSharedPointer<QStandardItemModel>, future->result()));

Now your data is referenced by the DataContainer object, which will keep your QSharedPointer active until you stop the deleterThread with deleterThread->quit(). As long as the container holds the last remaining reference, the deletion will happen on the deleter thread, preventing your event loop from freezing!

Like I said initially, this is a complete architectural hack, and you should avoid it, but sometimes we all make compromises for the sake of expediency.

Systemd and NTP on Gentoo

Gentoo currently doesn't have systemd support for the NTP package. So I built some unit files to tide me over until it's officially available.

It's not perfect, but at least it's functional!

Here's the main long-running service:
[Unit]
Description=NTP Daemon
After=network-online.target ntp-oneshot.service

[Service]
ExecStart=/usr/sbin/ntpd -n -x -g -u ntp:ntp -c /etc/ntp.conf
Type=simple

[Install]
WantedBy=multi-user.target

In case your system doesn't have a hardware clock, here's a one-shot service that will override the system clock to match the value of the network time:
[Unit]
Description=NTP Daemon
After=network-online.target

[Service]
ExecStart=/usr/sbin/ntpd -q -g -x -u ntp:ntp -c /etc/ntp.conf
Type=oneshot

[Install]
WantedBy=multi-user.target

Don't forget to tell systemd-timedated about your NTP service:
# echo "ntpd.service" > /usr/lib/systemd/ntp-units.d/60-ntpd.list
# systemctl enable ntpd.service && systemctl start ntpd.service

Running Atlassian Jira, Confluence, and Stash using systemd unit files

One of the things that has always annoyed me when trying to run the Atlassian products is the lack of proper integration into the Linux init systems. They provide some support for basic sysvinit, but nothing for systemd or others.

Here are some unit files that I whipped up for them. Obviously nothing fancy but hopefully it helps some wayward souls.

Jira:
[Unit]
Description=Atlassian Jira
After=network-online.target

[Service]
Type=simple
ExecStart=/opt/atlassian/jira/bin/start-jira.sh -fg
ExecStop=/opt/atlassian/jira/bin/stop-jira.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target

Confluence:
[Unit]
Description=Atlassian Confluence
After=network-online.target

[Service]
Type=simple
ExecStart=/opt/atlassian/confluence/bin/start-confluence.sh -fg
ExecStop=/opt/atlassian/confluence/bin/stop-confluence.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target

Stash:
[Unit]
Description=Atlassian Stash
After=network-online.target

[Service]
Type=simple
ExecStart=/opt/stash/bin/start-stash.sh -fg
ExecStop=/opt/stash/bin/stop-stash.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target


P.S. Don't forget to enable the appropriate network-online.target on your system. The Atlassian products don't appreciate being started with no network available.  If you use NetworkManager, you can do this with "# systemctl enable NetworkManager-wait-online.service"