Friday, March 31, 2017

Documenting With Types

I've said this before: elegant code is pedagogical. That is, elegant code is designed to teach its readers about the concepts and relationships in the problem domain that the code addresses, with as little noise as possible. I think types are a fundamental tool for teaching readers about the domain that code addresses, but heavy use of types tends to introduce noise into the code base. Still, types are often misunderstood, and, as such, tend to be under-utilized, or otherwise misused.

Types are means of describing to the compiler what operations are valid against data. Different programming languages will have their own concept of data type and their own means of defining them. However, the type system in every language will have these two responsibilities: to allow the programmer to define what operations are valid against typed data, and to indicate failure when the source code contains a prohibited operation. (Of course, type definitions in most languages do more than this, such as also defining memory organization for typed data. These details are critical for translating source code into an executable form, but are not critical to the type-checker.)

The strength of a programming language's type system is related to how much it helps or hinders your efforts to discourage client error. It's my general thesis that the most important kind of help is when your types correspond to distinct domain concepts, simultaneously teaching your users about these concepts and their possible interactions, and discouraging the creation of nonsensical constructs. But this doesn't usually come for free. I won't go further into this here, but some hindrances are:

  • Error messages that don't help users understand their mistakes.
  • Excessively verbose type signatures.
  • Difficulty of representing correctness constraints.
  • More abstraction than your users will want to process.

Your types should pay for themselves, by some combination of keeping these costs low, of preventing catastrophic failure, or of being an effective teacher of your domain concepts.1

How does one document code with types? I have a few guidelines, but this blog will run long if I go into all of them here, so I'll start with one:

Primitive types are rarely domain types.

Consider integers. A domain concept may be stored in an integer, but this integer will usually have some unit bound to it, like "10 seconds", or "148 bytes". It almost certainly does not make domain sense to treat a value you got back from time(2) like a value you got back from ftell(3), so these could be different types, and the type could be used to prevent misuse. Depending on language, I might even do so, but consider the options:

In C, you can use typedef to create a type alias, as POSIX does. This serves to document that different integers may be used differently, but does not actually prevent mis-use:

#include <inttypes.h>
typedef uint64_t my_offset_t;
typedef int64_t my_time_t;

void ex1() {
  my_offset_t a = 0;
  /* variable misuse, but not an error in C's type system. */
  my_time_t b = a;
}

You could use a struct to create a new type, but these are awkward to work with:

#include <inttypes.h>
#include <stdio.h>
typedef struct { uint64_t val; } my_offset_t;
typedef struct { int64_t val; } my_time_t;

void my_func() {
  my_offset_t offset_a = { .val=0 };
  my_offset_t offset_b = { .val=1 };
  my_time_t time_c = { .val=2 };

  /*
   * variable misuse is a compile-time error:
   *   time_c = offset_a;
   * does not compile.
   */

  /*
   * cannot directly use integer operations:
   *   if (offset_b > offset_a) { printf("offset_b > offset_a\n"); }
   * does not compile, but can use:
   */

  if (offset_b.val > offset_a.val) { printf("offset_b.val > offset_a.val\n"); }

  /*
   * but the technique of reaching directly into the structure to use
   * integer operations also allows:
   */

  if (time_c.val > offset_a.val) { printf("BAD\n"); }

  /*
   * which is a domain type error, but not a language type error
   * (though it may generate a warning for a signed / unsigned
   * comparison). One could define a suite of functions against the
   * new types, such as:
   *   int64_t compare_offsets(restrict my_offset_t *a, restrict my_offset_t *b)
   *   {
   *     return (int64_t) a->val - (int64_t) b->val;
   *   }
   * and then one could use the more type-safe code:
   *   if (compare_offsets(&offset_a, &offset_b) > 0) {
   *     printf("GOOD\n");
   *   }
   * but, in no particular order: this isn't idiomatic, so it's more
   * confusing to new maintainers; even once you're used to the new
   * functions, it's not as readable as idiomatic code; depending on
   * optimization and inlining, it's plausibly less efficient than
   * idiomatic code; and it is awkward and annoying to define new
   * functions to replace the built-in integer operations we'd like
   * to use.
   */
}

As far as I can tell, C does not provide ergonomic options for using the type system to prevent integer type confusion. That said, the likelihood of user error in this example (misusing a time as a file size, or vice versa) is pretty low, so I would probably make the same choice that POSIX did in this circumstance, and just use type aliases to document that the types are different, and give up on actually preventing mis-use.

On the other hand, at Circonus we maintain a time-series database that must deal with time at multiple resolutions, each represented as 64-bit integers: Aggregate data storage uses units of seconds-since-unix-epoch, while high-resolution data storage uses units of milliseconds-since-unix-epoch. In this case, the likelihood of user error working with these different views of time is very high (we have a number of places where we need to convert between these views, and have even needed to change some code from using time-in-seconds to using time-in-milliseconds). Furthermore, making mistakes would probably result in presenting the wrong data to the user (not something you want in a database), or possibly worse.

If we were strictly using C, I would probably want to follow the approach Joel Spolsky outlined here, and use a form of Hungarian notation to represent the different views of time. As it happens, we're using C++ in this part of the code base, so we can use the type system to enforce correctness. We have an internal proscription against using the STL (to keep our deployed code relatively trace-able with, say, dtrace), so std::chrono is out. But we can define our own types for working with these different views of time. We start by creating our own [strong_typedef][strong_typedef] facility (no, we don't use BOOST, either):

#define ALWAYS_INLINE __attribute__((always_inline))

// bare-bones newtype facility, intended to wrap primitive types (like
// `int` or `char *`), imposes no run-time overhead.
template <typename oldtype, typename uniquify>
  class primitive_newtype
{
public:
  typedef oldtype oldtype_t;
  typedef primitive_newtype<oldtype, uniquify> self_t;

  primitive_newtype(oldtype val) : m_val(val) {}
  ALWAYS_INLINE oldtype_t to_oldtype() { return m_val; }
private:
  oldtype m_val;
};

With this facility, we can define domain types that are incompatible, but which share a representation, and which should impose no overhead over using the primitive types:

class _uniquify_s_t;
typedef primitive_newtype<int64_t, _uniquify_s_t> seconds_t;
class _uniquify_ms_t;
typedef primitive_newtype<int64_t, _uniquify_ms_t> milliseconds_t;

Or, better, since time types all share similar operations, we can define the types together with their operations, and also split up the types for "time point" from "time duration", while enforcing a constraint that you can't add two time points together:

template <class uniquify>
  class my_time_t
{
private:
  class _uniquify_point_t;
  class _uniquify_diff_t;
public:
  typedef primitive_newtype<int64_t, _uniquify_point_t> point_t;
  typedef primitive_newtype<int64_t, _uniquify_diff_t> diff_t;
  static ALWAYS_INLINE point_t add(point_t a, diff_t b)
  {
    return point_t(a.to_oldtype() + b.to_oldtype());
  }
  static ALWAYS_INLINE diff_t diff(point_t a, point_t b)
  {
    return diff_t(a.to_oldtype() - b.to_oldtype());
  }
  static ALWAYS_INLINE point_t diff(point_t a, diff_t b)
  {
    return point_t(a.to_oldtype() - b.to_oldtype());
  }
  static ALWAYS_INLINE diff_t diff(diff_t a, diff_t b)
  {
    return diff_t(a.to_oldtype() - b.to_oldtype());
  }
  static ALWAYS_INLINE diff_t add(diff_t a, diff_t b)
  {
    return diff_t(a.to_oldtype() + b.to_oldtype());
  }
};

class _millisecond_uniquify_t;
typedef my_time_t<_millisecond_uniquify_t> my_millisecond_t;
class _second_uniquify_t;
typedef my_time_t<_second_uniquify_t> my_second_t;

This is just the primitive basis of our time-management types, and (to help the example fit in a blog post, and because I write the blog to a different audience than I write production code) is implemented a little differently than what we actually have in our code base. With these new types, we can perform basic operations with time in seconds or milliseconds units, while preventing incorrect mixing of types: an attempt to take a difference between a time-point based in seconds and a time-point based in milliseconds, for example, will result in a compilation error. Using these facilities made translating one of our HTTP endpoints from operating against seconds to operating against milliseconds into an entirely mechanical process of converting one code location to use the new types, starting a build, getting a compilation error from a seconds / milliseconds mismatch, changing that location to use the new types, and repeating. This process was much less likely to result in errors than it would have been had we been using bare int64_t's everywhere, relying on code audit to try and ensure that everything that worked with the old units was correctly updated to use the new. These types are more annoying to work with than bare integers, but using them helped avoid introducing a very likely and very significant system problem under maintenance. The types paid for themselves.

(Thanks to Riley Berton for reviewing this post. A form of this post also appeared on the Circonus blog, here.)


  1. C++ is an interesting case of weighing costs and benefits, in that, while the benefits of using advanced C++ type facilities can be very high (bugs in C++ code can be catastrophic, and many of C++'s advanced facilities impose no runtime overhead), the maintenance costs can also be extremely high, especially when using advanced type facilities. I've seen thousands of characters of errors output due to a missing const in a type signature. This can be, ahem, intimidating.

No comments:

Post a Comment