Courting Eliza: Controlled Vocabularies

In my mind, computer software is a type of tool created to help people perform some task. I use the word "task" in a broad sense to include non-work activities such as shopping, entertainment, etc. The design and creation of any tool requires an understanding of the end user, the task, the domain, and the context of use.

More fundamentally, the designer must ask, "What problem is this tool trying to solve and what aspects of the world are relevant?" Earlier, I posted Things, Properties, Actions and Relationships in which I listed a number of fields of study relevant to answering this question. In this post, I will focus on one such field, controlled vocabularies.

According to Wikipedia, a controlled vocabulary is "a carefully selected list of words and phrases, which are used to tag units of information. ... Controlled vocabularies solve the problems of homographs, synonyms and polysemes by ensuring that each concept is described using only one authorized term and each authorized term in the controlled vocabulary describes only one concept. In short, controlled vocabularies reduce ambiguity inherent in normal human languages where the same concept can be given different names and ensure consistency."
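
The one-term-per-concept rule in this definition can be illustrated with a small lookup structure. What follows is a minimal Python sketch, not a standard implementation: the class name ControlledVocabulary, its methods, and the sample football terms are all hypothetical and chosen only for illustration.

```python
class ControlledVocabulary:
    """Maps every variant term (synonym) to exactly one authorized term."""

    def __init__(self):
        # normalized variant -> authorized term
        self._authorized = {}

    @staticmethod
    def _normalize(term):
        # Ignore case and extra whitespace when comparing terms.
        return " ".join(term.lower().split())

    def add_term(self, authorized, synonyms=()):
        """Register an authorized term together with its synonyms."""
        keys = [self._normalize(v) for v in (authorized, *synonyms)]
        # Check first, so a conflict leaves the vocabulary unchanged.
        for key in keys:
            existing = self._authorized.get(key)
            if existing is not None and existing != authorized:
                raise ValueError(f"{key!r} already maps to {existing!r}")
        for key in keys:
            self._authorized[key] = authorized

    def resolve(self, term):
        """Return the authorized term for a variant, or None if unknown."""
        return self._authorized.get(self._normalize(term))


# Hypothetical sample entries, for illustration only.
vocab = ControlledVocabulary()
vocab.add_term("football", synonyms=["soccer", "association football"])
vocab.add_term("referee", synonyms=["match official"])

print(vocab.resolve("Soccer"))  # prints: football
```

Rejecting a variant that already points at a different authorized term is what enforces "one authorized term per concept" in this sketch. Homographs would need extra handling, such as a qualifier added to the term, which is left out here.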

Consistency is a good thing, but it can also become a rigid trap. Clay Shirky makes this point clearly in his article, Ontology is Overrated. Nevertheless, the act of writing computer software code typically imposes its own fairly rigid, formal specification and I would argue that this type of thinking is better done explicitly in the analysis / knowledge engineering phase rather than implicitly during the implementation phase. I'd like the systems that I design to be logical, consistent and embody concepts and language that map closely to that of the domain and end user.

Methods for Constructing Vocabularies
This section will discuss how controlled vocabularies are generated. I'm most familiar with end user interviewing techniques such as contextual inquiry. Concept mapping is another interactive end user technique, and Boxes and Arrows has a nice, web-focused description in their article, Creating a Controlled Vocabulary. When available, written documents and books can be mined. Essentially, these methods all boil down to:

1. Gather samples of domain language use (through verbal interviewing and other end user techniques or finding written documents and books).
2. Extract terminology.
3. Review and refine with subject matter experts.

At this moment, I'm particularly interested in step #2 above. I'd like to learn more about manual and automated terminology extraction.

I'd love to have a piece of software that could analyze a set of documents and produce a set of candidate terms to start my controlled vocabulary. I spent a few hours today surfing the web but didn't find exactly what I need. TermeXtractor is close to what I'm looking for, but it seemed to miss some important and obvious terms from my test case. When I ran the FIFA Laws of the Game PDF through TermeXtractor, it extracted useful terms like "goal line", "free kick" and "official", but it didn't extract some obvious terms like "football", which appears quite frequently and in the title. I also tried a similar online tool for terminology extraction by translated.net, but it had real problems with the hyphenation in my PDF and only returned the top 20 terms.
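
To make the idea of automated candidate extraction concrete, here is a minimal Python sketch of one very simple approach: counting frequent words and two-word phrases that neither start nor end with a stopword. This is an assumption about what a basic extractor could look like, not how TermeXtractor or the translated.net tool actually work. The function name candidate_terms, the short stopword list, and the sample text are invented for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the to with".split()
)


def candidate_terms(text, max_words=2, min_count=2):
    """Return (term, count) pairs for words and phrases seen at least min_count times."""
    counts = Counter()
    # Split at punctuation so that phrases never span clause boundaries.
    for segment in re.split(r"[.,;:!?()]+", text.lower()):
        words = re.findall(r"[a-z]+(?:-[a-z]+)*", segment)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip candidates that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, count) for term, count in counts.most_common() if count >= min_count]


# Invented sample text, not a quotation from the Laws of the Game.
sample = (
    "A free kick is awarded to the opposing team. "
    "The free kick is taken from the goal line. "
    "The ball crosses the goal line."
)

terms = candidate_terms(sample)
for term, count in terms:
    print(count, term)
```

A sketch like this only produces raw candidates. Words hyphenated across line breaks in a PDF would have to be rejoined before counting, and the resulting list would still need the review step with subject matter experts.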

The Unstructured Information Management Architecture (UIMA) looks promising. In my experience, most software from the Apache Software Foundation turns out to be worthwhile. However, the documentation is for developers rather than end users and I'm not ready to spend days hacking code yet. Similarly, while GATE provides more of a graphical user interface, it still assumes a level of technical knowledge (or time commitment to develop it) that I just don't have.

Interestingly, while I can find quite a bit of material on how this is done by a computer, I'm having a hard time finding detailed descriptions of how a human would go about doing this by hand. I guess it just falls under the broad category of reading. In a future post, I'm going to delve into this deeper with some more exhaustive searching and perhaps by attempting to roll my own methodology.

Tools for Organizing Vocabularies
The boxesandarrows.com article suggests the following tools:

a thesaurus maintenance program like Multites, Term Tree, ThManager or Lexico,
Microsoft Excel,
Post-it Notes, or even
a wiki or semantic wiki

I checked out the recommended thesaurus software and wasn't too impressed. I've also tried ontology tools like Protege in the past and found most of them to be baffling. Update: I just went to the Protege website and it looks like they've made improvements. However, I don't think that Protege supports some controlled vocabulary concepts such as synonyms.

Honestly, the best tool that I've found so far is Wikipedia, but I don't think that you can get that software as a tool for personal use. Luckily, it looks like I'm not the only one that's thought of this. A quick search on Google for "how to develop your own wikipedia" lists at least four articles on the topic. Next week I'm going to check into these and perhaps build my own wikipedia as part of my new controlled vocabulary methodology. Update: I just found that you can get the open source software that runs Wikipedia. It's called MediaWiki and you can even find free sites where you can create your own hosted wiki. I also found an extension of MediaWiki that sounds even more appropriate, called Semantic MediaWiki. More on this in a later post.

Stay tuned!

-Keith

1 comment:

John laPlante, October 23, 2008 at 1:48 PM
The bible of controlled vocabularies has to be the Library of Congress Subject headings.

Wikipedia says they "comprise a thesaurus (in the information technology sense) of subject headings, maintained by the United States Library of Congress, for use in bibliographic records".
http://en.wikipedia.org/wiki/Library_of_Congress_Subject_Headings

LCSH has several huge books that provide hierarchical subjects for all of human knowledge. Imagine that :).

A big drawback of this series is that it isn't free. I'd expect an organization funded by the US govt to put this out for free. It would be great if my Delicious Firefox plugin had an LCSH browser.

Wednesday, October 22, 2008
