Tags and semantics
Michał Moroz

Today I started thinking about the following problem: given 100 questions, how can we help a human extract the real needs behind them? The problem sounds simple enough, but I don't know of any application that focuses on helping people with data categorization and meaning extraction. As usual in this kind of situation, I began to wonder how I'd implement such an application. I decided to start by getting some useful information out of the tags assigned to each question. Here's what I found.

A typical tag implementation is a simple many-to-many relationship between some kind of data (in this example, questions) and a set of terms, each represented by a Tag entity. This lets us attach many terms to each piece of data and then search by tag when we are interested in one particular kind of data.

For the sake of this article, let's assume that tag terms are separated by commas and that they are allowed to contain spaces.

1. What is the sense of life?
tags: life

2. What should I do to be happy throughout my life?
tags: life, future

A query for the life tag would then return two results, and a query for the future tag would return one.
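To make the many-to-many relationship concrete, here is a minimal in-memory sketch in Python. The names `raw_tags`, `parse_tags` and `build_index` are my own illustrations, not part of any particular library, and a real application would most likely keep this in a database instead of dictionaries.

```python
from collections import defaultdict

# Hypothetical input: question id -> comma-separated tag string.
raw_tags = {
    1: "life",
    2: "life, future",
}


def parse_tags(text):
    """Split a comma-separated tag string; terms may contain spaces."""
    return {term.strip() for term in text.split(",") if term.strip()}


def build_index(raw):
    """Build both directions of the many-to-many relationship."""
    tags_by_question = {qid: parse_tags(text) for qid, text in raw.items()}
    questions_by_tag = defaultdict(set)
    for qid, terms in tags_by_question.items():
        for term in terms:
            questions_by_tag[term].add(qid)
    return tags_by_question, dict(questions_by_tag)


tags_by_question, questions_by_tag = build_index(raw_tags)
print(sorted(questions_by_tag["life"]))    # ids of questions tagged "life"
print(sorted(questions_by_tag["future"]))  # ids of questions tagged "future"
```

Keeping both directions (tags per question and questions per tag) makes the lookups in the rest of the article cheap, at the cost of storing the relationship twice.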

Extracting knowledge

However, there is a problem with using tags on data we are given. When writing a blog, by contrast, we are pretty sure what kinds of topics the whole blog is going to cover: our interests, life observations, photos, or maybe technical posts such as this one. This constitutes the domain of our tags, which stays mostly unchanged from the beginning.

The problem starts when we don't know our domain. Let's get back to our 100 questions. To get a grasp of how to categorize them, we'd need to read all of them at least once, assigning some terms to each. After the first reading, the terms may be either too specific or too general to capture the real information stored in the questions. To fix that, we'd revise the terms, looking for similarities and proposing a similar set that applies better to all the questions.

This is an iterative process, and after a few iterations we have a pretty good idea of how to segregate the questions and which terms make the most sense. By "most sense" I mean the following:

a term specifies a category which can be described by words;
a term is not too general (for example, one that applies to all questions);
a term does not group too many concepts, thus resulting in information loss;
there are as few terms as possible.

Most of that is, as of 2014, a human process that cannot be replaced by a machine. However, computers were built to help, so let's see what kind of help we can get from them.

Tag sets and hierarchy of tag sets

Let $T$ be the set of all tag terms and $Q$ the set of all questions. At the beginning $T$ will be empty, but when we start tagging particular questions it will quickly fill up with different terms.

Then, let $S_q$ be the set of all tag terms attached to the question $q \in Q$, and let $\mathcal{P} (S_q)$ be the power set (the set of all subsets) of $S_q$.

Example for question 2:

$S_2$: { life, future }
$\mathcal{P} \left(S_2\right)$ is a set of four sets:
1. { life, future }
2. { life }
3. { future }
4. $\emptyset$
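The same enumeration can be sketched in a few lines of Python using the standard `itertools.combinations`. The function name `power_set` is my own choice for illustration.

```python
from itertools import combinations


def power_set(terms):
    """Return every subset of `terms`, from the empty set up to the full set."""
    items = sorted(terms)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


s_2 = {"life", "future"}
subsets = power_set(s_2)
for subset in subsets:
    print(sorted(subset))
```

A set of $n$ terms has $2^n$ subsets, so enumerating the power set like this is only practical for small term sets.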

The power set $\mathcal P\left(T\right)$ includes every possible set of terms from $T$, and we can describe a simple hierarchy on it, where $S_x$ is a child of $S_y$ iff $S_y \subset S_x$. At the top of our hierarchy we'll have all the single-term sets, lower levels will contain more and more terms per set, and the lowest level is $T$ itself. Visually it looks like this:

{ life }    { future }
   { life, future }
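The child relation above translates directly into Python's proper-subset operator on sets. The sketch below uses my own helper names (`is_child`, `ancestors`) and assumes, following the picture, that the empty set is not part of the hierarchy.

```python
from itertools import combinations


def is_child(child, parent):
    """True if `parent` is a proper subset of `child` (S_y ⊂ S_x)."""
    return frozenset(parent) < frozenset(child)


def ancestors(term_set):
    """All non-empty proper subsets of `term_set`, smallest sets first."""
    items = sorted(term_set)
    return [
        frozenset(combo)
        for size in range(1, len(items))
        for combo in combinations(items, size)
    ]


print(is_child({"life", "future"}, {"life"}))
print([sorted(a) for a in ancestors({"life", "future"})])
```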

A very interesting thing starts to unfold when we introduce a size function

$$s(S_q) = \left|\left\{q_x: q_x \in Q, S_q \subseteq S_{q_x} \right\}\right|$$

that counts how many questions are described by this term set. With this, we can measure how much a more specific set differs from its parent:

$$\alpha_{x,y} = \frac{s(S_x)}{s(S_y)} \; \rm{where}\; S_y \subset S_x$$

For example, let's assume that:

$S_{{\rm{life}}}$ has a size of 30
$S_{{\rm{future}}}$ has a size of 7
$S_{{\rm{life, future}}}$ has a size of 5

$\alpha_{S_{\{\rm{life, future}\}}, S_{\{\rm{life}\}}} = \frac{5}{30}$, whereas $\alpha_{S_{\{\rm{life, future}\}}, S_{\{\rm{future}\}}} = \frac{5}{7}$. A high ratio means that the two term sets are probably semantically redundant. Such pairs should be analyzed.

This computation is simple enough for a computer to perform and to present the results to a human.
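As a rough sketch of that computation, the Python below implements the size function and the ratio $\alpha$. The data is hypothetical and constructed only to reproduce the sizes 30, 7 and 5 from the example. I also assume that a question tagged with exactly the term set counts towards its size.

```python
from fractions import Fraction

# Hypothetical data: 25 questions tagged only "life", 5 tagged with both
# terms, and 2 tagged only "future".
tags_by_question = {}
for qid in range(1, 26):
    tags_by_question[qid] = {"life"}
for qid in range(26, 31):
    tags_by_question[qid] = {"life", "future"}
for qid in range(31, 33):
    tags_by_question[qid] = {"future"}


def size(term_set, tags_by_question):
    """Count the questions whose tags include every term in `term_set`."""
    wanted = frozenset(term_set)
    return sum(1 for tags in tags_by_question.values() if wanted <= tags)


def alpha(child, parent, tags_by_question):
    """Ratio s(child) / s(parent); `parent` must be a proper subset of `child`."""
    child, parent = frozenset(child), frozenset(parent)
    if not parent < child:
        raise ValueError("parent must be a proper subset of child")
    parent_size = size(parent, tags_by_question)
    if parent_size == 0:
        raise ValueError("parent term set describes no questions")
    return Fraction(size(child, tags_by_question), parent_size)


print(alpha({"life", "future"}, {"life"}, tags_by_question))
print(alpha({"life", "future"}, {"future"}, tags_by_question))
```

I used `Fraction` so the ratios stay exact and can be compared against the fractions in the text. Plain floats would work just as well for thresholding.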

Analysis

A computer should perform this analysis on every tag set from the power set of $T$. When it finishes the computations, the human should be able to see the affected pairs and lower or raise the threshold for the similarity ratio to narrow down the result set.
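Here is one way this pass could look in Python. Walking the whole power set of $T$ grows exponentially with the number of terms, so this sketch makes a simplifying assumption of mine: it only visits term sets that describe at least one question. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def occurring_term_sets(tags_by_question):
    """Non-empty term sets that describe at least one question."""
    found = set()
    for tags in tags_by_question.values():
        items = sorted(tags)
        for size in range(1, len(items) + 1):
            found.update(frozenset(c) for c in combinations(items, size))
    return found


def affected_pairs(tags_by_question, threshold):
    """Return (ratio, parent, child) tuples with ratio >= threshold, highest first."""
    term_sets = occurring_term_sets(tags_by_question)
    sizes = {
        s: sum(1 for tags in tags_by_question.values() if s <= tags)
        for s in term_sets
    }
    pairs = []
    for child in term_sets:
        for parent in term_sets:
            if parent < child:  # parent is a proper subset of child
                ratio = sizes[child] / sizes[parent]
                if ratio >= threshold:
                    pairs.append((ratio, parent, child))
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


# Hypothetical data matching the sizes 30, 7 and 5 used earlier.
sample = {qid: {"life"} for qid in range(1, 26)}
sample.update({qid: {"life", "future"} for qid in range(26, 31)})
sample.update({qid: {"future"} for qid in range(31, 33)})

for ratio, parent, child in affected_pairs(sample, threshold=0.5):
    print(f"{sorted(parent)} -> {sorted(child)}: {ratio:.2f}")
```

Raising or lowering `threshold` is the knob the human would turn to shrink or widen the list of pairs to review.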

To make the results readable for a human, the following context should be presented when analyzing a single term set:

The term set in question.
All affected pairs regarding the term set and its ancestors.
All ancestors of the term set.
Data attached to each term set.
A list of all terms in the system.

This information is enough to make a decision about what to do with the pair. The following actions are available:

Leave the sets alone and mark the pair as resolved.
Rename a tag term.
Extract the term set into a new tag.
Merge the term set into an existing tag.
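Two of these actions can be sketched as small Python functions. This is only one possible reading of them: I assume that extracting or merging a term set means that every question carrying all of its terms gets a single target tag in their place. The function names are hypothetical.

```python
def rename_term(tags_by_question, old, new):
    """Return a copy in which the term `old` is replaced by `new` everywhere."""
    return {
        qid: {new if term == old else term for term in tags}
        for qid, tags in tags_by_question.items()
    }


def collapse_term_set(tags_by_question, term_set, target):
    """Return a copy in which questions carrying all of `term_set`
    get the single tag `target` instead of those terms.

    With a brand-new `target` this acts as extraction; with an
    existing tag it acts as a merge (an assumption of this sketch).
    """
    term_set = frozenset(term_set)
    result = {}
    for qid, tags in tags_by_question.items():
        if term_set <= tags:
            result[qid] = (set(tags) - term_set) | {target}
        else:
            result[qid] = set(tags)
    return result


tags = {1: {"life"}, 2: {"life", "future"}}
print(rename_term(tags, "future", "plans"))
print(collapse_term_set(tags, {"life", "future"}, "life goals"))
```

Both functions return new dictionaries instead of modifying the input, which makes it easy to preview a change before committing to it.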

To make the decision about point 3 simpler, the computer should present a view with two columns: the first displays all questions under the term set, and the second displays all questions that are in the parent term set but not in the term set in question.
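The data behind such a two-column view is easy to compute. The sketch below reuses the two example questions from the beginning of the article. The function name `two_columns` and the plain-text layout are only illustrative.

```python
from itertools import zip_longest

questions = {
    1: "What is the sense of life?",
    2: "What should I do to be happy throughout my life?",
}
tags_by_question = {1: {"life"}, 2: {"life", "future"}}


def two_columns(term_set, parent_set, tags_by_question):
    """Return (ids under the term set, ids under the parent set only)."""
    term_set, parent_set = frozenset(term_set), frozenset(parent_set)
    under_term_set = sorted(
        qid for qid, tags in tags_by_question.items() if term_set <= tags
    )
    parent_only = sorted(
        qid
        for qid, tags in tags_by_question.items()
        if parent_set <= tags and not term_set <= tags
    )
    return under_term_set, parent_only


left, right = two_columns({"life", "future"}, {"life"}, tags_by_question)
for left_id, right_id in zip_longest(left, right):
    left_text = questions[left_id] if left_id is not None else ""
    right_text = questions[right_id] if right_id is not None else ""
    print(f"{left_text:<50} | {right_text}")
```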

Conclusion

I've only shown one aspect of the process that could be easily computed by a program. I believe there are many more ways to provide hints to support humans in data categorization and extraction, and I'd like to hear your suggestions about that.

The role of the UI for this kind of task would be tremendous. Most of the information should fit on a single screen so that a human can grasp it without having to remember too much. When done properly, I think it could greatly improve the process of extracting information from data, especially for small data sets (up to 1000 entries).