We have created a Proof Of Concept (POC) to transform DOS-style C programs into web services for the French National Museum of Natural History. This very first article of our new “Case studies” series explains what we did and why.

A Second Life for (very old) C programs

At TailorDev, we are polyglot developers and scientists: we are comfortable with numerous programming languages, and we use our Le lab sessions to play with new programming languages and alternative paradigms. Our respective backgrounds also help when talking with researchers, because we understand the underlying science as well as their needs.

In the following, we describe the xper-tools project, carried out in collaboration with a research team (the LIS team) hosted at the French National Museum of Natural History in Paris. The XPER tools are a set of very old programs (~30 tools), written in C, which are used for taxonomy purposes. By old, we mean almost 30 years old. You read that right: that is older than the first C standard published by ANSI!

VOID main(argc, argv)
int argc;
char **argv;
{
   VOID loadkb();
   VOID freekb();
   VOID help();
   VOID evalxp();
   char *gets();
   fprintf(stderr, "EVALXP V1.02 (02/08/1987) J. LEBBE & R. VIGNES.\n");

   ...
}

Even though these programs work pretty well and are still used on a daily basis, there are some limitations. First, this software was written for MS-DOS and therefore requires a rather old computer to run. This leads to two more issues: not everyone can easily use the tools, and it is nearly impossible to interface them with other software.

We were asked to build a Proof Of Concept (POC) to transform these programs into web services in a week (it was a nice-to-have at the end of a bigger project that we will likely present in another blog post). Challenge accepted! :sunglasses:

Analysis

We started by analyzing the different programs and decided on a first program to use in our POC. The source code of these numerous tools also bundles several Makefiles and some documentation. Luckily, these programs are well written even though some parts are cryptic. All tools are designed to be run from the command line, use the same kind of input data (knowledge bases), and some of them have options (flags) with the following DOS-like syntax: /B. In addition, all programs respond to the /H help option, providing interesting information for each program:

$ bin/chkbase
CHKBASE V1.06 (22/05/1988) J. LEBBE & R. VIGNES.
Syntax: CHKBASE name-of-base [/H] [/V]
/H Help
/V Verbose mode


Nom de fichier absent

Every time we work for/with a customer, we make sure that what we produce is easily reusable afterwards. In this context, we designed the POC as the foundation of production-ready software that could leverage all the existing programs. Hence, we decided to focus on two main tasks:

  • being able to compile and run the programs on different platforms;
  • proposing a unified solution to expose the programs over HTTP.

Hello Autotools!

Instead of having to deal with many Makefiles and other files to build the different tools, why not use a common tool that would do most of the job for us? Wouldn’t it be super cool if we only had to run make to build all the tools at once? The Autotools (not to be confused with the Autobots) are the solution!

If you do not know what the Autotools are, you may already have installed software from source with the following commands:

$ ./configure
$ make
$ (sudo) make install

The first line executes a shell script that first determines whether all the requirements to build the software are met, and then creates a Makefile from a template (Makefile.in). If a mandatory dependency is missing on your system, the script aborts, forcing you to install that dependency. That is very useful to ensure reproducibility. The configure script is not written by hand but generated by autoconf from yet another template file named configure.ac:

AC_INIT([xper-tools], [1.0.0], [author@example.org])
AM_INIT_AUTOMAKE                    # use `automake` to generate a `Makefile.in`
AC_PROG_CC                          # require a C compiler
AC_CONFIG_FILES([Makefile])         # create a `Makefile` from `Makefile.in`
AC_OUTPUT                           # generate `config.status` and the files declared above

The Makefile.in template is itself generated by automake from a Makefile.am template. That is why we had to use the AM_INIT_AUTOMAKE macro in the configure.ac file above.

A Makefile.am template usually starts by defining the layout of the project, which should be foreign if you are not using the standard layout of a GNU project (which is likely the case). In the example below, we provide global compiler flags with the AM_CFLAGS and AM_LDFLAGS directives. Next, we tell automake that the Makefile should build the different programs using the bin_PROGRAMS directive:

# Makefile.am
AUTOMAKE_OPTIONS = foreign

# Global flags
AM_CFLAGS = -W -Wall -ansi -pedantic
AM_LDFLAGS =

# Target binaries
bin_PROGRAMS = chkbase \
               makey \
               mindescr

...

The bin prefix tells automake to “install” the listed files into the directory defined by the variable bindir, which should point to /usr/local/bin by default (/usr/local being the “prefix” directory).

The PROGRAMS suffix is called a primary and tells automake which properties the listed files have. For instance, PROGRAMS are compiled. Hence, we must tell automake where to find the source files (we also add per-program compilation flags):

# Makefile.am
...

# -- chkbase
chkbase_CFLAGS = -D LINT_ARGS
chkbase_SOURCES = xper.h det.h loadxp.c detool.c chkbase.c

By adding more similar lines to the Makefile.am, we can support all the existing programs, leveraging a simple and uniform way to build all the tools. Now that the configuration templates/files have been written, we can use the Autotools to generate the ready-to-use files. Let’s start with the configure script:

$ autoreconf --verbose --install --force

Various files have been generated, but the most important one is the configure script, which will be useful to generate the final Makefile. You can pass some options to this script such as --prefix to specify the prefix directory. For instance, to install all the files into your current directory, you could run:

$ ./configure --prefix=$(pwd)

We can run make to compile all the tools at once, and make install to “install” the binaries into the <PREFIX>/bin folder. But we also get a distribution solution for free by using make dist. This target builds a tarball of the project containing all of the files we need to distribute. End users could download this archive and run the commands below without having to worry about the Autotools:

$ ./configure
$ make
$ (sudo) make install

After having successfully ported one tool to this new(-ish) build system, we wrote a procedure to port the other programs and tested it by asking someone else to port another program. Fortunately, compiling these programs was not too difficult once we had figured out which encoding was used (hello CP 850), found all the required header files, and made minor code changes. These changes included adding proper exit codes and removing a case '/': line used for parsing the (DOS-style) options, because it was incompatible with UNIX absolute paths.
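To illustrate the encoding point, here is a minimal sketch of how legacy sources could be re-encoded, assuming you want UTF-8 files on the new systems. The article does not say whether the sources were actually converted, and the function names and file patterns below are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "cp850") -> bool:
    """Re-encode one file to UTF-8 in place; return True if its bytes changed."""
    raw = path.read_bytes()
    # cp850 is the assumed original encoding of the DOS-era sources.
    converted = raw.decode(source_encoding).encode("utf-8")
    if converted == raw:
        return False  # pure ASCII content, nothing to rewrite
    path.write_bytes(converted)
    return True


def convert_tree(root: Path, patterns=("*.c", "*.h")) -> list:
    """Convert every matching file under root and list the files that changed."""
    changed = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            if convert_to_utf8(path):
                changed.append(path)
    return changed
```

Note that this conversion is not idempotent: running it twice on the same tree would decode UTF-8 bytes as CP 850 and corrupt accented characters, so it should be applied once, ideally on a fresh copy of the sources.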

Naturally, we added some smoke tests to ensure the compiled binaries were behaving correctly (based on the outputs given by the old computer in the research team’s lab) and automated the building and testing phases with GitLab CI. With little effort, the different XPER tools can now be compiled and executed on any new system. The first goal is therefore satisfied and we can now present how we designed an API to expose these tools over HTTP in the next section.
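A smoke test of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names are hypothetical, the reference output is assumed to have been captured from the old machine (hence the CRLF handling), and the tools are assumed to print CP 850 text.

```python
import subprocess


def run_command(command, timeout=10):
    """Run a command and return (exit code, combined stdout/stderr as text)."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # banners may be printed on stderr
        timeout=timeout,
        check=False,
    )
    # Assumption: the legacy tools emit CP 850 encoded text.
    return completed.returncode, completed.stdout.decode("cp850")


def normalise(text):
    """Ignore DOS/UNIX line-ending differences and surrounding blank space."""
    return text.replace("\r\n", "\n").strip()


def smoke_check(command, expected_output, expected_code=0):
    """Return True when the command exits and prints as the reference says."""
    code, output = run_command(command)
    return code == expected_code and normalise(output) == normalise(expected_output)
```

In a CI job, each compiled binary would get one or more such checks, with the expected outputs stored next to the tests.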

RPC-style HTTP API

The source code of the different tools is very application-oriented rather than library-oriented, which prevented us from compiling C libraries that we could have imported in Go or Python. Hence, we decided to “wrap” the C tools to integrate them with the API code. We chose the Python programming language as it is usually a good choice in Academia (and also because it is fast).

We wrote a generic yet smart wrapper that is able to:

  • execute any C program and return its output thanks to the Python subprocess module;
  • determine the options of any C program it wraps by invoking the program with the help (-H) flag (cf. the Analysis section);
  • validate the supplied options. Since the wrapper knows which options a program can accept, it can easily reject invalid options and prevent invalid calls;
  • provide a nice and simple programmatic API:
from api.wrappers import ToolWrapper

makey = ToolWrapper('makey')
cp = makey.invoke('/path/to/data', B=True)
# cp.stdout contains the output result
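
The project's actual ToolWrapper is not shown in this article. As a rough sketch of the ideas listed above (option discovery through the help flag, validation, execution with subprocess), one could write something like the following. The class name, the regular expression, the injectable runner, and the assumption that flags are passed as -X on the ported binaries are all illustrative choices.

```python
import re
import subprocess

# Matches help lines such as "/V Verbose mode" or "-V Verbose mode".
OPTION_LINE = re.compile(r"^[ \t]*[/-]([A-Za-z])[ \t]+(\S.*)$", re.MULTILINE)


def parse_help_options(help_text):
    """Return {flag letter: description} for every option line in the help text."""
    return {
        match.group(1).upper(): match.group(2).strip()
        for match in OPTION_LINE.finditer(help_text)
    }


class SketchToolWrapper:
    """Hypothetical stand-in for the project's ToolWrapper (not the real code)."""

    def __init__(self, executable, help_flag="-H", runner=subprocess.run):
        self.executable = executable
        self._runner = runner  # injectable, so the class can be tested without a binary
        help_run = self._execute([help_flag])
        # The banner and help may be written to stderr, so look at both streams.
        self.options = parse_help_options(help_run.stdout + help_run.stderr)

    def _execute(self, args):
        return self._runner(
            [self.executable, *args],
            capture_output=True,
            text=True,
            check=False,
        )

    def build_args(self, data_path, **flags):
        """Validate the flags against the discovered options and build the arguments."""
        unknown = sorted(name for name in flags if name.upper() not in self.options)
        if unknown:
            raise ValueError("unsupported option(s): " + ", ".join(unknown))
        enabled = ["-" + name.upper() for name, value in flags.items() if value]
        return [data_path, *enabled]

    def invoke(self, data_path, **flags):
        """Run the tool on a knowledge base and return the CompletedProcess."""
        return self._execute(self.build_args(data_path, **flags))
```

Rejecting unknown flags before spawning a process keeps invalid calls away from the C programs entirely.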

Hat tip to Julien for this clever wrapper. Once we were able to call an XPER tool from Python, we started to write an HTTP API using a Python web framework such as Flask. At TailorDev, we like to write pragmatic HTTP APIs, and we always adopt a documentation-first approach. Apiary and API Blueprint are our favorite tools for that.

We drafted an HTTP API that speaks JSON and exposes two main endpoints:

  • /knowledge-bases to manage the data for the different XPER tools;
  • /tools.run to call the XPER tools.

The former responds to the GET and POST methods to return a set of data (a knowledge base) and create such knowledge bases respectively. The latter is a Remote Procedure Call (RPC) endpoint, which is perfectly fine for representing what we want to achieve: calling a function (over HTTP).

Each knowledge base is identified by a UUID, and the bases are persisted on the filesystem (which may evolve in the future). With both the tools ready to be executed and the data on the server, we only had to glue them together through the /tools.run endpoint, which is triggered with the POST method:

POST /tools.run/chkbase
Content-Type: application/json
Accept: application/json

{
  "knowledge_base_id": "27d99b56-9327-4d28-a69c-31229bf971aa"
}

Nevertheless, the different programs do not output JSON content but formatted plain text. To achieve interoperability, keeping the output as is was not an option, hence the concept of parsers. Each program gets its own parser for transforming the plain text output into a Python data structure we can later serialize as we wish. Using this approach, we were able to write a lot of unit tests based on different realistic outputs, and to keep the application flexible. We then created a configuration file for the supported tools and their associated parsers and options:

from .parsers import MindescrParser, ChkbaseParser

supported_tools = {
    'mindescr': {
        'parser': MindescrParser(),
        'options': []
    },
    'chkbase': {
        'parser': ChkbaseParser(),
        'options': [
            ('verbose', 'V'),
        ]
    }
}
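
The parser classes referenced in this configuration are not shown here. As a hedged illustration of the idea, the sketch below parses the only output format visible in this article: the version banner printed by the tools, followed by free-form message lines. The class name, the returned keys, and the day/month/year reading of the date are assumptions.

```python
import json
import re

# Example banner: "CHKBASE V1.06 (22/05/1988) J. LEBBE & R. VIGNES."
BANNER = re.compile(
    r"^(?P<tool>[A-Z0-9]+) V(?P<version>\d+\.\d+) "
    r"\((?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})\) (?P<authors>.+?)\.?$"
)


class BannerParser:
    """Hypothetical parser: turns raw tool output into a JSON-friendly dict."""

    def parse(self, raw_output):
        lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
        if not lines:
            raise ValueError("empty output")
        match = BANNER.match(lines[0])
        if match is None:
            raise ValueError("unexpected banner: " + lines[0])
        return {
            "tool": match["tool"].lower(),
            "version": match["version"],
            # Assumption: dates are written day/month/year.
            "released": "{}-{}-{}".format(match["year"], match["month"], match["day"]),
            "authors": [name.strip() for name in match["authors"].split("&")],
            "messages": lines[1:],
        }


if __name__ == "__main__":
    sample = "CHKBASE V1.06 (22/05/1988) J. LEBBE & R. VIGNES.\n\nNom de fichier absent\n"
    print(json.dumps(BannerParser().parse(sample), indent=2))
```

Because a parser is a plain function of text, it can be unit tested against recorded outputs without running any binary.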

The controller logic behind the /tools.run/<name> endpoint relies on this configuration to determine which tools (and options) are allowed, but also which parser to use. When all conditions are met, it runs the program on the knowledge base through the wrapper, parses the output with the appropriate parser, and returns the result as a JSON response.

Adding support for a new program only requires writing a parser for the output of that program and updating the configuration. As you may have noticed, the options array contains tuples (option_name, tool_option) that map more meaningful option names (e.g., verbose) to their corresponding tool options (e.g., -V). That way, we can completely hide the program details behind the API, which might also be handy in the future.
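
The following framework-free sketch shows how such a controller could combine the configuration, the option mapping, and a parser. It is not the project's code: the function names, the "options" key in the request payload, the injected invoke callable, and the HTTP status codes are assumptions made for illustration.

```python
import json


def translate_options(requested, allowed):
    """Map API option names to tool flags, e.g. {'verbose': True} -> {'V': True}."""
    mapping = dict(allowed)  # [('verbose', 'V')] -> {'verbose': 'V'}
    unknown = sorted(set(requested) - set(mapping))
    if unknown:
        raise ValueError("unsupported option(s): " + ", ".join(unknown))
    return {mapping[name]: bool(value) for name, value in requested.items()}


def handle_tool_run(name, payload, supported_tools, invoke):
    """Return (status code, JSON body) for a POST on /tools.run/<name>.

    `invoke(name, knowledge_base_id, flags)` is a hypothetical callable that
    runs the wrapped tool and returns its raw text output.
    """
    config = supported_tools.get(name)
    if config is None:
        return 404, json.dumps({"error": "unknown tool: " + name})
    if "knowledge_base_id" not in payload:
        return 400, json.dumps({"error": "knowledge_base_id is required"})
    try:
        flags = translate_options(payload.get("options", {}), config["options"])
    except ValueError as exc:
        return 400, json.dumps({"error": str(exc)})
    raw_output = invoke(name, payload["knowledge_base_id"], flags)
    return 200, json.dumps(config["parser"].parse(raw_output))
```

Keeping the mapping in one place means a tool flag can be renamed or replaced later without changing the public option names.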

We ended this part by writing a small Node.js CLI to demonstrate how this API could be used, but also to give non-technical people a way to consume this API and understand what has been done.

Conclusion

Tackling technical challenges is usually not a problem. In this case, the most interesting yet complicated task was to strike a happy medium between a good software architecture and an easy way to upgrade all the existing XPER tools. All in all, it took us 8 days to design, implement, test and document this solution, including the CLI. We ported three programs to the new build system, and exposed two different tools on the HTTP API.

This project was awesome because we felt really proud of giving a second life to these very old C programs. It was challenging to come up with a production-ready Proof Of Concept that could be easily improved in the future, in a short amount of time.

That is the kind of thing we do and like to do! :wink: