Design, Dependencies, and Developer Responsibility

Software developers are in a position of power. Most of the software that gets professionally developed is proprietary or is hidden away on remote servers, so end users are not allowed to see or modify its behavior; even were they allowed to change its behavior, we often use complicated tools that require years of experience to use effectively. While there may be a ‘market’, end users’ choice is limited: an end user may be able to choose between several implementations, but there are no software choices developed outside of the software-developer monoculture. In practice, user choice is even more limited: users may be forced by Metcalfe’s law to choose software that interoperates with some large walled-garden system despite its having no other redeeming qualities, or they may be limited by hardware or bandwidth constraints, or they may be completely free to choose the software they use in their limited free time but spend eight hours a day using software chosen by distant and anonymous managers. As developers, we should not kid ourselves that people who do not like our technical decisions can “just use something else”. Instead, just as a manager must be careful not to abuse power over subordinates, we must be careful not to abuse power over potentially millions of anonymous strangers.

Users are forced by circumstance to put their trust in us. They are forced to trust us to make technical decisions they do not have the background to make, in the same way people are forced to trust doctors and lawyers, but they are also forced to trust us to make social and political decisions that they are not permitted to make. Interoperability and accessibility are political decisions, technically enforced. Many apparently-internal, apparently-purely-technical decisions are actually political decisions — and while it is often easy for us to ignore their political ramifications in the short term, our users often cannot.

Performance, for instance, is a political decision. If your software is not suitably performant, you have chosen to exclude people with slow hardware from using it, which is to say, you are excluding the poor. Similarly, many UI/UX decisions are also political: if your software doesn’t work with screen readers, or works awkwardly with them, it is not accessible to the blind; a choice between text labels and icons is a choice between catering to the literate and catering to people with poor vision; different tasks take different numbers of steps, and this privileges one path over another. Difficult political trade-offs cannot be avoided in software development any more than technical ones can, and responsibility for them cannot be passed off to management any more than technical responsibility can.

Some specific types of apparently-technical decisions that are really political are dependencies and formats/protocols. These two concerns are entangled in an interesting way.

Software developers often treat dependencies as black boxes. Surely, they think, a widely-used package must be of professional quality — created, audited, and tested by people who know what they are doing! Actually, libraries tend to be written when a suitable library is not found, and actually being qualified to correctly solve a problem is not correlated at all with having written a library to solve it. Since there are norms in place against “reinventing the wheel” (seen as a waste of developer time and investor money), even very poor quality off-the-shelf solutions are typically preferred over an in-house implementation. The more popular a dependency is, the harder it is to change the exposed interface (or, in fact, any behavior that some existing code might conceivably depend upon), so even though the original developers of some library may have, through experience, learned how to solve the problem better, they are often locked in by other people’s expectations and cannot effectively apply that knowledge.

It has been my experience (in about ten years of auditing both open and closed source packages for capability and stability — not even correctness) that for many common problems, there is no production-quality third party implementation, although there may be literally hundreds of high-profile amateur-quality ones. It has also been my experience that writing and maintaining an edifice of glue code around a third party dependency is often more difficult and expensive than “reinventing the wheel”.

I have a couple of dependency horror stories, from maintaining legacy code. In one case, I discovered that not only was a service our company paid thousands of dollars a month for timing out on 60% of our calls, but even when it was functioning, it was doing nothing more than splitting strings by space and removing stopwords, so we stopped calling it altogether, speeding up all calls by a substantial amount. In another case, a popular database layer that purported to be “SQL-compatible” did not, in fact, support subqueries, as a result of an ad-hoc parser implementation that betrayed general unfamiliarity with SQL and how it is used. In more than a few cases, we used popular third party packages and obeyed “best practices” only to discover that abandoning them in favor of a straightforward implementation was many times faster, smaller, and more maintainable.

Even if we eliminate the obviously-amateurish ones (cases where key features are missing, say), dependencies are still dangerous. Non-vendored dependencies are bundled with your code but are controlled by complete strangers (who may decide to change behavior drastically or withdraw the package entirely, as we saw with left-pad a few years ago and as we continue to see with the trend of the developers of popular browser extensions being paid to inject spyware). Even vendored dependencies can be difficult to independently audit: if you fully understand the intended behavior in all cases, then it is often easier to implement the intended behavior yourself than to read someone else’s code and become confident that they have implemented it correctly. Vendoring and auditing third party code often ends up with a project-specific fork of that code, for this very reason.

When a developer decides to use a dependency, he is taking some of the trust a user has placed in him and pawning it off to a stranger. This is risky for the user, but it is not (generally) risky for the developer. Sometimes, problems have a high inherent complexity and one has little choice but to defer to experts: for instance, rolling your own crypto is almost always a bad idea, because cryptography is hard to get right, easy to screw up, hard to check for correctness, and involves a great deal of specialized knowledge. Regular developers cannot even productively audit crypto implementations; it is safer to blindly trust a cryptography library vouched for by security experts than it is to implement one yourself. However, most problems we use dependencies to solve do not require any specialized knowledge to implement or audit beyond that provided to a mediocre student by a low-ranked four year CS degree program. Using a third party dependency at all involves basically the same ethical concerns as loaning out someone else’s car, and you should be just as discriminating about who you are willing to loan someone’s CPU to as you would be about their car.

Choice of formats and protocols is a similarly complex and politically-charged issue.

Through one lens, this is a matter of data interoperability: the protocols your software supports determine your user’s social network, because they determine who your user can share data with; they can lock users into your product (by ensuring that the only way they can share data with their own future selves is through “brand loyalty”); and choosing a standard format may mean that a large number of already-existing tools can be used to manipulate the data received and emitted by your code. Within an organization, interoperability is often not fully under the control of a single developer (though software architects may be able to push for particular protocols), but the formats and protocols in use are often decided by the first developer to implement two ends of a pipeline, without any discussion with the rest of the organization, and this gives developers a lot of control over how protocols evolve in popularity.

However, protocol and data formats also have less visible political concerns. A bloated or difficult-to-parse format affects performance (and poor performance limits access to people who can afford the waste); a format that is not human-readable limits access to people who have the tools to read it.

Most importantly, for my purposes: a format that is not cleanly designed — one that has many complex corner cases or is internally inconsistent — is difficult to correctly implement. Most organically-grown ‘general purpose’ formats (even standardized ones) are like this: HTTP has endless status codes that can never be used, misspellings in the spec, and path dependent messes like ‘user-agent’ values and the ‘x-’ prefix on MIME; any HTML parser, in order to be useful, must make the best of completely-invalid HTML; XML, though strict, has inherited much of its complexity from SGML, and that complexity serves no purpose in a world where SGML support is nearly unknown.

A complete web browser (which is to say: HTTP + HTTPS + HTML parsing and rendering + CSS + javascript + javascript DOM library + XML) is now so complex that only a few organizations in the world are capable of making one. This is not an ‘essential complexity’: the core function of HTML+HTTP (to send static text documents with special metadata that allows particular words to be ‘hyperlinks’ to other static text documents) is handled suitably by gopher, which any beginner programmer can implement in an afternoon. This is, instead, the worst case scenario of path dependence: an ugly hack is invented that allows some feature to be bolted onto an existing data format or language and barely work, and then other features must be bolted onto other languages in order to support the first feature. The end result: a set of formats and protocols that is used by the whole world but whose only ‘standard’ implementation is controlled entirely by Google.

A difficult-to-implement format or protocol must be handled by a dependency. We must pass the end user’s trust over to the dependency author. I personally do not necessarily trust Google, nor can I audit them, nor can I in good faith say that it’s a good idea to pass an end user’s trust over to them. The same is true of Microsoft and Amazon.

This is where the tradeoff between interoperability and dependency comes in. Many standard formats and protocols are so complicated as to be impossible to reliably implement in house, but most use cases do not make full use of standard formats anyhow. Corner cases only matter if you are capable of hitting them. Often, when a ‘standard’ format is chosen, it is chosen out of familiarity and not because it actually matches what needs to be emitted.

It used to be possible to live your whole professional life in a Java-and-XML world, where all code was written in Java and all data was stored in XML. XML is a very complicated format, with very strange and specific features. It consists of a tree structure with both named nodes (“tags”) and unnamed ones (“text nodes”), and the named nodes have a second tier of non-nested key-value data (“attributes”). It is possible to use this format to represent flat key-value data (in several ways). It is also possible to use this format to represent nested key-value data pretty straightforwardly, so long as you do not have arrays. Once you have arrays in your data, your life gets complicated: do you use a string with some delimiter? Do you repeat your tag? Tags can be repeated, but attributes can’t (at least not within the same tag). You can have multiple text nodes not divided by any elements, but only in an in-memory representation of an XML tree: they all get merged together if you actually serialize. You can come up with many different consistent ways to store a nested key-value structure (a non-cyclic object), but all of them will look arbitrary and confusing to someone reading example output in order to understand how to process it. Meanwhile, all existing XML parsers, because they are concerned with exposing the XML specification rather than exposing useful data, are awkward to use for any concrete task. Once someone pointed out that JSON existed implicitly in the JavaScript object definition structure, people flocked to it over XML because it was so much better at handling this common reasonably-general-purpose form of data. JSON has its own corner cases and ambiguities, but they are usually easily avoided. Even better is msgpack, which is easier to parse and generate than JSON while covering the same kinds of data.
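To make the array problem concrete, here is a small sketch using Python's standard library. The record, the tag names, and the comma delimiter are all made up for illustration; the point is only that the same list has at least two equally defensible XML encodings, each needing its own reading code, while JSON has one obvious encoding.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: one scalar field and one array field.
record = {"name": "build", "targets": ["lint", "test", "package"]}

# Convention A: repeat a child tag once per array element.
job_a = ET.Element("job", {"name": record["name"]})
for target in record["targets"]:
    ET.SubElement(job_a, "target").text = target
xml_a = ET.tostring(job_a, encoding="unicode")

# Convention B: pack the array into one attribute with a delimiter.
# This breaks as soon as an element contains the delimiter itself.
job_b = ET.Element(
    "job",
    {"name": record["name"], "targets": ",".join(record["targets"])},
)
xml_b = ET.tostring(job_b, encoding="unicode")

# Each convention needs different code on the reading side.
targets_a = [node.text for node in ET.fromstring(xml_a).findall("target")]
targets_b = ET.fromstring(xml_b).get("targets").split(",")

# JSON has a native array type, so there is nothing to decide.
json_text = json.dumps(record)
targets_json = json.loads(json_text)["targets"]
```

A reader handed only `xml_a` or only `xml_b` has to guess which convention the writer picked; a reader handed `json_text` does not.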

Why does someone choose to represent something like build rules in XML (as ANT does), as opposed to in a line-based format (as make does)? The sunk cost of learning the details of a complex format is probably a factor, along with the sunk cost of already having a dependency on the parser. But still, that decision was not ‘free’. There is a developer cost, in terms of the time spent understanding an ANT build file (which is much harder to understand and edit than a makefile; even though in the wild most build files for both ANT and make are generated through absurdly complicated toolchains, it is much easier to write a portable makefile by hand even with complex build logic than it is to write an equivalent ANT file). There is CPU cost in the amount of time it takes to process XML, and memory cost in the amount of space the ANT file takes up in RAM and on disk. There is a social cost, because projects that use ANT builds are locked into them until someone translates them into another build system. There is a security cost, because ANT has much more complex logic than make even just at the parsing stage, and therefore there’s a greater vulnerability surface.

Another place where formats become political is that understanding a format can change how you think about data. This can be bad: if you spend all your time thinking about how to represent things in XML, you may find it natural to represent everything in XML and fail to understand the cost of that decision. It can also be good: if you are familiar with a wide variety of diverse formats, you can be very specific in your thinking about the attributes of data and make informed decisions about how to organize, represent, store, and serialize it.

The design of formats and protocols can hide unjustified and limiting assumptions. For instance: on the web, URLs generally begin with a hostname. Since URLs point to pieces of data, this is a little bit dangerous: the same piece of data can be on many different hosts, and too much traffic to a single host can take it down. So, when URLs were introduced, something else called the URN (Uniform Resource Name) was proposed: a permanent name for a piece of data that would resolve to a whole bunch of more-temporary URLs. This URN resolution system was never widely deployed, and developers took advantage of the absence of any guarantee that a file would remain static in order to provide dynamic pages and web services. However, as the web grew in popularity, it became easier to catch enough attention to saturate a residential internet connection (or even to saturate professional sites); most big sites these days have a complex and very expensive to maintain structure of load balancers, selective caches of static files (as provided by Cloudflare), and round-robin DNS rules. All of this would have been avoided if static files were referenced by hash and fetched based on who had them, something done in a centralized way by Napster and tracker-based BitTorrent, and in a decentralized way by DHT-based BitTorrent, IPFS, and Dat. Similarly, because HTML has a tree-based structure, it is impossible for tags to overlap, even though, conceptually, we might want to link the first half of a sentence to one document but the entire sentence to a different document (something that pre-web hypertext systems like Xanadu and Intermedia supported).
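The idea of referencing static files by hash and fetching them from whoever has them can be sketched in a few lines. This is a toy model under loud assumptions: the "hosts" are plain dictionaries standing in for remote peers, SHA-256 is my choice of hash, and the function names are hypothetical. It is not how BitTorrent, IPFS, or Dat actually work internally.

```python
import hashlib


def address_of(data: bytes) -> str:
    """Name a blob by the SHA-256 hash of its bytes."""
    return hashlib.sha256(data).hexdigest()


def publish(host: dict, data: bytes) -> str:
    """Store a blob on one (hypothetical) host and return its address."""
    address = address_of(data)
    host[address] = data
    return address


def fetch(hosts: list, address: str) -> bytes:
    """Ask each host in turn; accept the first copy whose hash matches."""
    for host in hosts:
        data = host.get(address)
        if data is not None and address_of(data) == address:
            return data
    raise KeyError(address)


# Two stand-in hosts: one holds a good copy, one holds a corrupted copy.
mirror_a, mirror_b = {}, {}
address = publish(mirror_b, b"hello, world")
mirror_a[address] = b"tampered"  # fails the hash check, so it is skipped
document = fetch([mirror_a, mirror_b], address)
```

Because the address is derived from the content, the name says nothing about where the data lives, and any host can serve it without being trusted: a bad copy simply fails verification.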

When we think of ‘standard formats’, we often think of relatively complex ones like XML and JSON. We actually have standardized formats for fairly general-purpose data fit to all sorts of complexity levels.

For tabular or key-value data, line-based delimited formats are convenient. CSV is great if you never have commas or newlines in your data (or else you will need to quote, and quoting rules are a hassle); TSV is the same, but (for small fields or large tabstop values) your tabular data can be made to line up for visual inspection. If you want to support arbitrary ASCII text in your tabular data, use the non-printable ASCII standard separator characters 0x1C (file separator), 0x1D (group separator), 0x1E (record separator), and 0x1F (unit separator). Unless your data is meaningfully nested, these formats ought to be expressive enough, and with standard C escape codes, you can support arbitrary text (and even arbitrary data).
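As a rough illustration of how little code the separator-character approach needs, here is a sketch that uses 0x1F between fields and 0x1E after each record. The function names are my own, and the sketch simply rejects fields that contain a separator rather than escaping them.

```python
UNIT_SEP = "\x1f"    # ASCII unit separator: between fields
RECORD_SEP = "\x1e"  # ASCII record separator: after each record


def dump_table(rows):
    """Serialize rows (non-empty lists of strings) into one string."""
    out = []
    for row in rows:
        if not row:
            raise ValueError("a row needs at least one field")
        for field in row:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
        out.append(UNIT_SEP.join(row) + RECORD_SEP)
    return "".join(out)


def load_table(text):
    """Parse a string produced by dump_table back into rows."""
    # Every record ends with RECORD_SEP, so the final split piece is empty.
    records = text.split(RECORD_SEP)[:-1]
    return [record.split(UNIT_SEP) for record in records]


rows = [
    ["name", "note"],
    ["Ada", "commas, tabs\t and\nnewlines need no quoting"],
]
round_tripped = load_table(dump_table(rows))
```

Commas, tabs, and newlines pass through untouched, so there are no quoting rules to get wrong; the cost is that the file is no longer pleasant to read in a plain text editor.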

For tree structures, JSON (and its binary brother msgpack) and s-expressions are suitable.
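For a sense of scale, a reader and writer for a bare-bones s-expression dialect fits in about thirty lines. This sketch assumes atoms are plain strings containing no whitespace or parentheses (no quoting, no numbers); the sample tree is made up.

```python
def dumps(tree):
    """Write a nested list of atom strings as an s-expression."""
    if isinstance(tree, list):
        return "(" + " ".join(dumps(item) for item in tree) + ")"
    return tree  # an atom: a string with no whitespace or parentheses


def loads(text):
    """Read one s-expression back into nested lists of atom strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token == ")":
            raise ValueError("unexpected )")
        if token != "(":
            return token
        items = []
        while pos < len(tokens) and tokens[pos] != ")":
            items.append(read())
        if pos >= len(tokens):
            raise ValueError("missing )")
        pos += 1  # consume the closing parenthesis
        return items

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree


tree = ["rule", ["target", "app"], ["deps", "main.o", "util.o"]]
text = dumps(tree)  # "(rule (target app) (deps main.o util.o))"
```

Anything this small can be read, audited, and tested in one sitting, which is the property that matters here.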

Once you need to represent DAGs or cyclic data structures, the set of simple standard formats dries up, and you will need to invent your own. XML wasn’t going to help you here anyway. If you are using a library to serialize cyclic data structures, either you’re using a language-specific one (like Python’s pickle) or you are already extremely aware of the problems you’re trying to solve (because you’re using ASN.1 or protobuf or something).

The best format for your data is the simplest one that supports all of its meaningful properties. I have found that, with the data I work with, almost everything can be handled as TSV or msgpack — which is great, because both of those are quite easy to correctly implement.

Very often, the easiest way to eliminate a dependency is to consider whether or not you are making full use of it. An XML library is only necessary if you use XML — so while you may be locked into it by a remote peer, you may also have simply chosen to store something as XML and could just as easily switch to a different format & mass-convert the existing data. A JSON library is only necessary if you use JSON — consider if you can switch to msgpack (or better yet, a subset of msgpack that only supports the kinds of data you need) and roll your own implementation. A web browser is only necessary if you need a remote server backend — so consider binding against a cross-platform GUI toolkit instead and distributing a desktop application.
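As a sketch of what rolling your own msgpack subset might look like: the code below handles nil, booleans, integers in the signed 64-bit range, UTF-8 strings, lists, and maps. The choice of subset is my assumption, not a recommendation from any spec. The bytes it emits should be valid msgpack, though not always the smallest encoding, and the decoder only understands what this encoder produces, not arbitrary msgpack from other writers.

```python
import struct


def _head(fix_base, fix_max, wide_tag, n):
    """Length header: one 'fix' byte when small, else tag + 32-bit length."""
    if n <= fix_max:
        return bytes([fix_base | n])
    return bytes([wide_tag]) + struct.pack(">I", n)


def pack(obj) -> bytes:
    if obj is None:
        return b"\xc0"
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                   # positive fixint
        return b"\xd3" + struct.pack(">q", obj)   # int 64 (errors if too big)
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return _head(0xA0, 31, 0xDB, len(raw)) + raw   # fixstr / str 32
    if isinstance(obj, list):
        body = b"".join(pack(item) for item in obj)
        return _head(0x90, 15, 0xDD, len(obj)) + body  # fixarray / array 32
    if isinstance(obj, dict):
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return _head(0x80, 15, 0xDF, len(obj)) + body  # fixmap / map 32
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def _length(data, pos, tag, wide_tag, fix_mask):
    """Decode a length header written by _head; returns (length, new pos)."""
    if tag == wide_tag:
        return struct.unpack_from(">I", data, pos)[0], pos + 4
    return tag & fix_mask, pos


def _unpack(data, pos):
    tag = data[pos]
    pos += 1
    if tag <= 0x7F:
        return tag, pos
    if tag in (0xC0, 0xC2, 0xC3):
        return {0xC0: None, 0xC2: False, 0xC3: True}[tag], pos
    if tag == 0xD3:
        return struct.unpack_from(">q", data, pos)[0], pos + 8
    if 0xA0 <= tag <= 0xBF or tag == 0xDB:
        n, pos = _length(data, pos, tag, 0xDB, 0x1F)
        return data[pos:pos + n].decode("utf-8"), pos + n
    if 0x90 <= tag <= 0x9F or tag == 0xDD:
        n, pos = _length(data, pos, tag, 0xDD, 0x0F)
        items = []
        for _ in range(n):
            item, pos = _unpack(data, pos)
            items.append(item)
        return items, pos
    if 0x80 <= tag <= 0x8F or tag == 0xDF:
        n, pos = _length(data, pos, tag, 0xDF, 0x0F)
        result = {}
        for _ in range(n):
            key, pos = _unpack(data, pos)
            result[key], pos = _unpack(data, pos)
        return result, pos
    raise ValueError(f"unsupported tag: {tag:#x}")


def unpack(data: bytes):
    obj, pos = _unpack(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes")
    return obj
```

The whole thing is short enough to audit line by line against the format description, which is exactly what a general-purpose library does not let you do cheaply.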

Note: this piece is a response to some discussion on lobste.rs of a previous piece.