Wednesday, October 22, 2008

Controlled Vocabularies

In my mind, computer software is a type of tool created to help people perform some task. I use the word "task" in a broad sense to include non-work activities such as shopping, entertainment, etc. The design and creation of any tool requires an understanding of the end user, the task, the domain, and the context of use.

More fundamentally, the designer must ask, "What problem is this tool trying to solve and what aspects of the world are relevant?" Earlier, I posted "Things, Properties, Actions and Relationships," in which I listed a number of fields of study relevant to answering this question. In this post, I will focus on one such field: controlled vocabularies.

According to Wikipedia, a controlled vocabulary is "a carefully selected list of words and phrases, which are used to tag units of information. ... Controlled vocabularies solve the problems of homographs, synonyms and polysemes by ensuring that each concept is described using only one authorized term and each authorized term in the controlled vocabulary describes only one concept. In short, controlled vocabularies reduce ambiguity inherent in normal human languages where the same concept can be given different names and ensure consistency."
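To make the "one authorized term per concept" rule concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `ControlledVocabulary` class, its method names, and the soccer terms are hypothetical and do not come from any particular tool or standard.

```python
class ControlledVocabulary:
    """Hypothetical sketch: every variant maps to exactly one authorized term."""

    def __init__(self):
        # lowercased variant -> authorized term
        self._lookup = {}

    def add_term(self, authorized, synonyms=()):
        """Register an authorized term together with its accepted variants."""
        variants = [v.strip().lower() for v in (authorized, *synonyms)]
        # Check every variant first, so a conflict leaves the vocabulary unchanged.
        for key in variants:
            existing = self._lookup.get(key)
            if existing is not None and existing != authorized:
                raise ValueError(f"{key!r} already maps to {existing!r}")
        for key in variants:
            self._lookup[key] = authorized

    def normalize(self, word):
        """Return the authorized term for a variant, or None if it is unknown."""
        return self._lookup.get(word.strip().lower())


vocab = ControlledVocabulary()
vocab.add_term("free kick", synonyms=("freekick", "free-kick"))
print(vocab.normalize("Free-Kick"))  # -> free kick
```

Rejecting a variant that already points at a different authorized term is what keeps each term tied to a single concept. A real vocabulary would also need scope notes and broader/narrower relationships, which this sketch leaves out.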

Consistency is a good thing, but it can also become a rigid trap. Clay Shirky makes this point clearly in his article, Ontology is Overrated. Nevertheless, the act of writing computer software code typically imposes its own fairly rigid, formal specification and I would argue that this type of thinking is better done explicitly in the analysis / knowledge engineering phase rather than implicitly during the implementation phase. I'd like the systems that I design to be logical, consistent and embody concepts and language that map closely to that of the domain and end user.

Methods for Constructing Vocabularies
This section will discuss how controlled vocabularies are generated. I'm most familiar with end user interviewing techniques such as contextual inquiry. Concept mapping is another interactive end user technique, and Boxes and Arrows has a nice, web-focused description in their article, "Creating a Controlled Vocabulary." When available, written documents and books can be mined. Essentially, these methods all boil down to:
  1. Gather samples of domain language use (through verbal interviewing and other end user techniques or finding written documents and books).
  2. Extract terminology.
  3. Review and refine with subject matter experts.
At this moment, I'm particularly interested in step #2 above. I'd like to learn more about manual and automated terminology extraction.

I'd love to have a piece of software that could analyze a set of documents and produce a set of candidate terms to start my controlled vocabulary. I spent a few hours today surfing the web but didn't find exactly what I need. TermeXtractor is close to what I'm looking for, but it seemed to miss some important and obvious terms from my test case. When I ran the FIFA Laws of the Game PDF through TermeXtractor, it extracted useful terms like "goal line", "free kick" and "official", but it didn't extract some obvious terms like "football", which appears quite frequently and in the title. I also tried a similar online tool for terminology extraction by translated.net, but it had real problems with the hyphenation in my PDF and only returned the top 20 terms.
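For a sense of what a bare-bones candidate term extractor might look like, here is a rough Python sketch. It is not how TermeXtractor or the translated.net tool work (I don't know their internals); it simply rejoins words hyphenated across line breaks, then counts single words and two-word phrases that don't start or end with a stopword. The function names, the tiny stopword list, and the thresholds are all my own assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the to with".split()
)


def dehyphenate(text):
    """Rejoin words split across a line break, e.g. 'foot-\\nball' -> 'football'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def candidate_terms(text, max_len=2, min_count=2):
    """Return (term, count) pairs for frequent words and short phrases."""
    tokens = re.findall(r"[a-z]+", dehyphenate(text).lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip phrases that begin or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


sample = (
    "The goal line marks the end of the field. A free kick is awarded.\n"
    "The foot-\nball must cross the goal line. A free kick restarts play. "
    "The football is round."
)
for term, count in candidate_terms(sample):
    print(count, term)
```

This is only a starting point. It ignores sentence boundaries, so a phrase can straddle two sentences, and the dehyphenation step will wrongly glue together a genuine compound such as "free-kick" if it happens to break at the end of a line. The output is a list of candidates that still needs the review in step 3.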

The Unstructured Information Management Architecture looks promising. In my experience, most software from the Apache foundation turns out to be worthwhile. However, the documentation is for developers rather than end-users and I'm not ready to spend days hacking code, yet. Similarly, while GATE provides more of a graphical user interface, it still assumes a level of technical knowledge (or time commitment to develop it) that I just don't have.

Interestingly, while I can find quite a bit of material on how this is done by a computer, I'm having a hard time finding detailed descriptions of how a human would go about doing this by hand. I guess it just falls under the broad category of reading. In a future post, I'm going to delve into this deeper with some more exhaustive searching and perhaps by attempting to roll my own methodology.

Tools for Organizing Vocabularies
The boxesandarrows.com article suggests the following tools:
I checked out the recommended thesaurus software and wasn't too impressed. I've also tried ontology tools in the past like Protege and found most of them to be baffling. Update: I just went to the Protege website and it looks like they've made improvements. However, I don't think that Protege supports some of the controlled vocabulary concepts such as synonyms, etc.

Honestly, the best tool that I've found so far is Wikipedia, but I don't think that you can get that software as a tool for personal use. Luckily, it looks like I'm not the only one who's thought of this. A quick search on Google for "how to develop your own wikipedia" lists at least four articles on the topic. Next week I'm going to check into these and perhaps build my own wiki as part of my new controlled vocabulary methodology. Update: I just found that you can get the open source software that runs Wikipedia. It's called MediaWiki, and you can even find free sites where you can create your own hosted wiki. I also found an extension of MediaWiki that sounds even more appropriate, called Semantic MediaWiki. More on this in a later post.

Stay tuned!

-Keith

Tuesday, October 21, 2008

Things, Properties, Actions and Relationships

I'm at a nice place right now on a number of my projects: the beginning.

Right after the proposal is accepted and before you really realize how hard the problem is going to be and how little you'll actually be able to accomplish in the grand scheme of things, it's a wonderful time of hope and promise. This is also known as the analysis phase.

In this blog post, I'm going to delve into a particular aspect of the analysis phase where you try to capture the important things, properties, actions and relationships. In other words, you try to capture the taxonomy/glossary/schema used by the important stakeholders in your problem space.

A number of fields have defined processes to address this problem: HCI uses contextual inquiry, information architecture or cognitive task analysis (CTA), software engineering uses requirements analysis or object oriented analysis, artificial intelligence uses CTA or knowledge engineering, etc. As a side note, I'm continually amazed by how each field uses terms and methods that are so eerily similar, yet often without seeming to realize it. I'll have to go deeper on this topic in another post.

So what are we really talking about here? At a deeper level, we're really talking about language and meaning. What are the words and symbols that people use and what is the underlying conceptual meaning that they attach to those words?

In philosophy, ontology is the study of what things exist and of the basic categories and relationships between those things. In information science and artificial intelligence, an ontology is a formal representation of a set of concepts and their interrelationships. This is closely related to concept learning from psychology and the concepts of taxonomy and controlled vocabulary.

I don't really want to go into depth on each of these topics right now, but I will say that after a bit of exploration, controlled vocabulary seems to be the closest match to what I'm looking to develop for each of the projects that I'm working on. I found a nice series of articles on Boxes and Arrows, starting with "What is a controlled vocabulary?", that I'm currently reading.

In my next blog entry, I want to explore some specific methods and tools for capturing and sharing controlled vocabularies. I currently have the following two specific leads:
  1. Creating a Controlled Vocabulary, and
  2. Concept Maps
See you next time!

-Keith