Will Ware's blog: specifications

Thursday, September 16, 2010

Tim BL's talk on Linked Data

I learned a lot from Tim Berners-Lee's February 2009 TED talk about Linked Data. He talks a bit about his motivation for inventing the Web: the data he encountered at CERN was in many different formats on many different computer architectures, and he spent a huge fraction of his time writing code to translate one format to another. He talks about how much of the world's data is still locked up in information silos -- a million disconnected little islands -- and how many of the world's most urgent problems require that data be made available across the boundaries between corporations, organizations, laboratories, universities, and nations. He has laid out two sets of guidelines for linked data. The first is for the technical crowd:

Use URIs to identify things.
Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
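
To make "dereferencing" a little more concrete, here is a minimal sketch in Python of a user agent asking for RDF/XML when it looks up a thing's URI. The URI and the helper name build_rdf_request are made up for illustration (example.org is a placeholder domain), and the request is only built, not sent.

```python
from urllib.request import Request

# Hypothetical URI naming a "thing"; example.org is a placeholder domain.
THING_URI = "http://example.org/id/alice"


def build_rdf_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks for RDF/XML."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


request = build_rdf_request(THING_URI)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup. Whether a given server honors the Accept header depends on that server.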

The second set is for a less technical crowd:

All kinds of conceptual things, they have names now that start with HTTP.
I get important information back. I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
When I get back that information, it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship, then the other thing that it's related to is given one of those names that starts with HTTP.

It's a very eloquent talk, reminding me in places of David Gelernter's prophetic book Mirror Worlds.

What's remarkable about the Linked Data idea is that, as much as people tend to dismiss the whole semantic web vision, it really is making real progress. The diagram above shows several interlinked websites with large and mutually compatible data sets.

DBpedia aims to extract linked data from Wikipedia and make it publicly available.
YAGO is a huge semantic knowledge base. Currently, YAGO knows more than 2 million entities (like persons, organizations, cities, etc.). It knows 20 million facts about these entities.
Lexvo.org brings information about languages, words, characters, and other human language-related entities to the Linked Data Web and Semantic Web.
The Calais web service is an API that accepts unstructured text (like news articles, blog postings, etc.), processes it using natural language processing and machine learning algorithms, and returns RDF-formatted entities, facts and events. A call takes about 0.5 to 1.0 seconds, depending on how big a document you send and the size of your pipe.
Freebase is an open repository of structured data about more than 12 million entities. An entity is a single person, place, or thing. Freebase connects entities together as a graph.
LinkedCT is a website full of linked data about past and present clinical trials.

Berners-Lee has recommended a very small set of Linked Data principles.

Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
Include links to other URIs so that they can discover more things.
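
Here is a small Python sketch of how those four principles fit together. Everything in it is hypothetical: the example.org URIs, the vocabulary terms, and the in-memory DATA dictionary, which stands in for real HTTP lookups. Each thing is named by an HTTP URI, looking it up returns statements about it, and any statement whose value is another URI is a link that can be followed.

```python
from collections import deque

# Hypothetical data: each HTTP URI names a thing and maps to
# (predicate, object) pairs. Objects that are URIs act as links.
EX = "http://example.org/"
DATA = {
    EX + "id/alice": [
        (EX + "vocab/name", "Alice"),
        (EX + "vocab/worksAt", EX + "id/lab"),
    ],
    EX + "id/lab": [
        (EX + "vocab/name", "Example Lab"),
        (EX + "vocab/locatedIn", EX + "id/city"),
    ],
    EX + "id/city": [
        (EX + "vocab/name", "Example City"),
    ],
}


def dereference(uri):
    """Stand-in for an HTTP lookup: return the statements about a URI."""
    return DATA.get(uri, [])


def discover(start_uri):
    """Follow links breadth-first and return every URI reached, in order."""
    seen = [start_uri]
    queue = deque([start_uri])
    while queue:
        current = queue.popleft()
        for _predicate, obj in dereference(current):
            is_link = isinstance(obj, str) and obj.startswith("http://")
            if is_link and obj not in seen:
                seen.append(obj)
                queue.append(obj)
    return seen


for uri in discover(EX + "id/alice"):
    print(uri)
```

Starting from one name, the program finds the lab and then the city without knowing about either in advance, which is the "discover more things" part of the fourth principle.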

Saturday, April 10, 2010

Learning to live with software specifications

We software developers have a knee-jerk hatred of specifications. Rather than write a document describing work we plan to do, we would rather throw together a quick prototype and grow it into the final system. We sometimes feel like specs are for liberal-arts sissies and pointy-haired bosses. Our prehistoric brains want us to dismiss specifications as a waste of time or even an intentional misdirection of energy.

The truth of it is that specs build consensus between developers, testers, tech writers, managers, and customers. They make sure everybody agrees about what to build, how to test it, how to write a user manual for it, and what the priorities are.

The Agile guys talk about the exponentially increasing cost of fixing a bug. The later in the process you find that bug, the more troublesome and expensive it is to fix it. Fixing bugs in code, even prototype code, is hard; fixing text is easy.

Let's learn to trick our brains to work around our reluctance. The Head-First books always start with a great little explanation about how our prehistoric brain circuitry divvies up our attention, classifies things as interesting or boring, and determines what sticks in our memories. Sesame Street learned how to make stuff sticky by:

repetition
lighting up more brain circuitry
infusing the topic with emotional content
relating it to things that were already sticky

One way to infuse your spec with emotional content would be to make it a turf war. That hooks into all our brain circuitry for tribes and feuds. But turf wars are traumatic and damaging to people and projects, so let's not do this.

To light up more brain circuitry, sketch out pieces of the spec on a big whiteboard. Draw a lot of pictures and diagrams. Use different colored markers. Get a few people together and generate consensus (not a turf war), and ask them to help identify issues that you forgot. That meeting is called a design review, like a code review for specs.

Who should write and own the spec? Part three of Joel Spolsky's great four-part (1, 2, 3, 4) article answers this question, drawing on his experience at Microsoft. One person should write and own the spec, and the programmers should not report to that person. At Microsoft, that person is a program manager.

It's important to differentiate between

a functional spec (what the user sees and experiences, what the customer wants) dealing with features, screens, dialog boxes, UI and UX, work flow
and a technical spec (the stuff under the hood) dealing with system components, data structures and algorithms, communication protocols, database schemas, tools, languages, test methodologies, and external dependencies which may have hard-to-predict schedule impacts

Write the functional spec first, then the technical spec, then the code. If you love test-driven development, then write the specs, then the tests, then the code.

Joel's article includes some great points on keeping the spec readable.

Use humor. It helps people stay awake.
Write simply, clearly, and briefly. Don't pontificate.
Re-read your own spec, many times. Eat your own literary dogfood. If you can't stay awake, nobody else will either.
Avoid working to a template unless politically necessary.

How do you know when the spec is done?

The functional spec is done when the system can be designed, built, tested, and deployed without asking more questions about the user interface or user experience.
The technical spec is done when each component of the system can be designed, built, tested, and deployed without asking more questions about the rest of the system.

This doesn't mean that these documents can never be updated or renegotiated. But the goal is to aim for as little subsequent change as possible.

I am still sorely tempted by the idea of a quick prototype, an "executable spec" that exposes bugs in design or logical consistency. Maybe it's OK to co-develop this with the spec, or tinker with it on one's own time, or consider it as a first phase of the coding. I'm still sorting this out. The basic rationale of a spec, that fixing bugs in text is easier and cheaper than fixing bugs in code, still needs to be observed.
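
For what it's worth, here is one very small sketch of what I mean by an "executable spec", in Python. The username rule, the example table, and the function names are all invented for illustration. The idea is that the examples are still text that anyone can review, and the throwaway prototype is just a way to check them for logical consistency.

```python
import re

# Hypothetical spec rule, written as examples anyone can review:
# "A username is 3 to 20 characters: letters, digits, or underscores."
SPEC_EXAMPLES = [
    ("bob", True),
    ("ab", False),        # too short
    ("a" * 21, False),    # too long
    ("will_ware", True),
    ("no spaces", False),
]


def is_valid_username(name: str) -> bool:
    """Throwaway prototype of the rule above."""
    return re.fullmatch(r"[A-Za-z0-9_]{3,20}", name) is not None


def check_spec(examples, implementation):
    """Return the examples where the prototype disagrees with the spec."""
    return [
        (value, expected)
        for value, expected in examples
        if implementation(value) != expected
    ]


print(check_spec(SPEC_EXAMPLES, is_valid_username))
```

An empty list means the prototype and the examples agree. Anything else is a disagreement to settle in the text of the spec first, while it is still cheap.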