Sunday, March 27, 2011

It's better to search than to be searched!

Consistent with the headline, this post is about enhanced search and information retrieval achieved through text analytics. If you haven't read the Introduction post, please do so before reading this one.

In the following demonstration, we will use RSS feeds from a few (credible) information sources from Slovakia. This is because I am preparing for the upcoming PosAm TechDay, where I will give a presentation in Slovak. Translations are provided throughout the post and in all pictures, so I hope this will not discourage you from reading further. For the record, here is the list of sources whose RSS feeds I downloaded and indexed. From each source, two channels were used: one publishing domestic and the other foreign news stories:

SME.sk - internet portal of a print newspaper
Pravda.sk - internet portal of a print newspaper
Aktuality.sk - internet-only news portal
TA3.sk - internet portal of a TV news channel

The reason only RSS content (and not full articles) was used is simple: using RSS is legal under the copyright policies of the content publishers (as you can see, I really do stick to the message in the headline and will keep my web spiders locked down this time :-). For the purpose of this demo, more than 400 article previews were downloaded and indexed. First of all, a very basic and very common full-text search capability is presented with the query "japonsk*", which is the equivalent of "japan*" in English:

 

To quickly translate a few results from the result set:
  • The first result reports on a way, established by the Embassy of Japan, to help the people of Japan.
  • The second is about Japanese heroes who volunteered for service (the details of the service are not clear from the headline).
  • The third is about radioactive contamination of water in nine prefectures of Japan, and so on.
Of course, by reading all the titles and bodies of the RSS records, a user will gain a detailed understanding of what exactly is going on. But let's project this onto an everyday situation: when an employee is searching for something he is not deeply familiar with, gaining a really good overview can take a lot of time.
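To make this baseline more concrete, here is a minimal sketch of how the RSS previews could be fetched and queried with a prefix ("wildcard") search. This is plain Python with the feedparser library, not the tool shown in the screenshots, and the feed URLs are placeholders:

```python
# Rough sketch: fetch RSS previews and run a simple prefix ("wildcard") search.
import re
import feedparser

FEEDS = [
    "http://www.sme.sk/rss",        # placeholder URL, not the exact channel used in the demo
    "http://spravy.pravda.sk/rss",  # placeholder URL
]

def fetch_previews(feed_urls):
    """Download RSS entries and keep only title + summary (the article 'preview')."""
    docs = []
    for url in feed_urls:
        for entry in feedparser.parse(url).entries:
            docs.append({"title": entry.get("title", ""),
                         "body": entry.get("summary", "")})
    return docs

def tokenize(text):
    # \w+ is Unicode-aware in Python 3, so Slovak diacritics are preserved
    return re.findall(r"\w+", text.lower())

def prefix_search(docs, prefix):
    """Return documents containing a token that starts with the prefix (e.g. 'japonsk')."""
    prefix = prefix.lower()
    return [d for d in docs
            if any(tok.startswith(prefix) for tok in tokenize(d["title"] + " " + d["body"]))]

if __name__ == "__main__":
    docs = fetch_previews(FEEDS)
    for hit in prefix_search(docs, "japonsk"):
        print(hit["title"])
```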

Now let's have a look at how text analytics can fill the gap here. After running the first analysis, we might be presented with data like that in the next picture:



We can see a variety of topics emerging (the large pop-up window in the background, with the table and the yellow highlighted row). A quick examination of the first line shows several common words (from the left: "inform" (many stories come from news agencies and are rendered as "agency xyz informed about..."), "which" and "agency"), but after those comes the first word of interest: "reactor". We take it and create a related topic (the smallest pop-up window on the left). From now on, only news items containing the word "reactor" ("reaktor" in Slovak) will match this topic (see the preview, the smaller pop-up window in front of the large one, slightly to the right). In a similar way we will create more topics. The topics will range from domestic to foreign affairs, as we are processing all articles together. To name a few topics from foreign affairs (since a much wider audience should be familiar with them): the UN resolution on the no-fly zone in Libya, the problems at the Fukushima power plant, radiation measurements across Europe, fear of radioactive contamination of food in Japan, and so on.
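What this step boils down to can be sketched in a few lines, assuming (my assumption, suggested by the screenshots rather than documented) that a topic is simply a keyword rule matched against the indexed previews; the stop-word list and stems below are illustrative:

```python
# Sketch: surface frequent words as topic candidates, then define a topic as a keyword rule.
import re
from collections import Counter

# Illustrative stop list: the "common words" seen in the first line of the analysis
# ("informoval", "ktory", "agentura" correspond roughly to inform / which / agency).
STOPWORDS = {"informoval", "ktory", "agentura"}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def candidate_terms(docs, top_n=20):
    """Rank words by document frequency; the analyst then picks the interesting ones,
    e.g. 'reaktor', and turns them into topics."""
    counts = Counter()
    for d in docs:
        counts.update(set(tokenize(d["title"] + " " + d["body"])) - STOPWORDS)
    return counts.most_common(top_n)

def make_topic(name, stems):
    """A topic as a keyword rule: a document matches if any of its tokens starts with any
    of the stems (an OR-rule, like the 'radioactivity OR radiation' topic used later)."""
    stems = [s.lower() for s in stems]
    def matches(doc):
        tokens = tokenize(doc["title"] + " " + doc["body"])
        return any(tok.startswith(stem) for stem in stems for tok in tokens)
    return {"name": name, "matches": matches}

# Usage with the docs fetched in the previous sketch (stems are illustrative):
# print(candidate_terms(docs)[:10])
# topics = [make_topic("Reactor", ["reaktor"]), make_topic("Radiation", ["radioakt", "radiác"])]
# reactor_news = [d for d in docs if topics[0]["matches"](d)]
```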

During the following analysis, the bottom table is also used; it shows suggestions for relationships between topics that are not yet connected. From it we can derive a standard "HAS" relationship. The example below is the connection between Japan and the Fukushima nuclear power plant, but the table also suggests that "Japan" should be linked with "food from Japan". Please note that in our example there are two standard relationship types: generalization ("IS") and aggregation ("HAS"):
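How these suggestions are computed internally is not shown, but a simple co-occurrence heuristic conveys the idea: topics whose matching document sets overlap heavily are probably related. A rough sketch (the scoring is my guess, not the tool's documented behaviour):

```python
# Sketch: suggest links between not-yet-connected topics from document co-occurrence.
from itertools import combinations

def topic_doc_sets(docs, topics):
    """Map each topic name to the set of document indexes it matches."""
    return {t["name"]: {i for i, d in enumerate(docs) if t["matches"](d)} for t in topics}

def suggest_relationships(docs, topics, existing_links, min_overlap=0.3):
    """Score not-yet-linked topic pairs by the overlap of their document sets.
    Whether an accepted link becomes 'IS' (generalization) or 'HAS' (aggregation)
    remains the analyst's decision, as in the demo."""
    doc_sets = topic_doc_sets(docs, topics)
    suggestions = []
    for a, b in combinations(doc_sets, 2):
        if (a, b) in existing_links or (b, a) in existing_links:
            continue
        smaller = min(len(doc_sets[a]), len(doc_sets[b])) or 1
        overlap = len(doc_sets[a] & doc_sets[b]) / smaller
        if overlap >= min_overlap:
            suggestions.append((a, b, round(overlap, 2)))
    return sorted(suggestions, key=lambda s: -s[2])

# Usage with the docs and topics from the previous sketches:
# links = {("Japan", "Fukushima nuclear power plant")}   # accepted as a HAS relationship
# print(suggest_relationships(docs, topics, links))      # might also suggest Japan <-> Food from Japan
```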



The next step focuses back on suggestions for new topics, for example "radioactivity" OR "radiation":



In the diagram we can see the topic, the create-new-topic dialog, and a test preview of the articles dealing with radiation and the spread of radioactive pollution into Europe. The steps of creating new topics and linking existing ones are repeated until we are satisfied with the "network" of topics (or until the time allotted for analyzing the particular domain runs out). Now we have a nice topic network overlaid on the indexed articles. How can we use it? A picture is worth a million words, so here we go: the very same search for "japan*"-related articles (left table) can bring up the best-related topics (upper right table) and also a somewhat more intelligent visualization of the associated topics:
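Under the hood, such a topic-aware search can be approximated by combining the earlier sketches: run the full-text query, then rank the topics by how many of the hits they cover (again an assumption about the ranking, not the tool's documented behaviour):

```python
# Sketch: the same prefix search, now also returning the best-related topics
# (uses the keyword-rule topics defined in the earlier sketch).
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def prefix_search(docs, prefix):
    prefix = prefix.lower()
    return [d for d in docs
            if any(tok.startswith(prefix) for tok in tokenize(d["title"] + " " + d["body"]))]

def search_with_topics(docs, topics, prefix):
    """Return the matching articles (left table) plus topics ranked by how many of the
    hits they cover (upper-right table); topic links can then drive the visualization."""
    hits = prefix_search(docs, prefix)
    ranked = sorted(((t["name"], sum(1 for d in hits if t["matches"](d))) for t in topics),
                    key=lambda pair: -pair[1])
    return hits, [pair for pair in ranked if pair[1] > 0]

# hits, related = search_with_topics(docs, topics, "japonsk")
# for name, covered in related:
#     print(f"{name}: covers {covered} of {len(hits)} matching articles")
```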



I hope you can see the difference from a basic full-text search. We now have a summary of all the topics related to our search, which can be used for further navigation. For example, from the topic visualization we can see that something is going on with the Fukushima nuclear power plant (and the link with radioactivity suggests it is nothing good), which is not obvious from the first batch of relevant full-text results. Of course, as the next step of the search, the user will click on the Fukushima topic and watch the results table, the suggested-topics table and the visualization change to provide more targeted information about the situation in Fukushima, along with the hint that both the nuclear reactor and radiation are covered. All this is obtained before reading a single news story (which in this example are quite short, one-sentence articles, but it is easy to imagine a more common situation, usually in a business environment, where the source documents are larger and full of unstructured data):
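In terms of the sketches above, this drill-down is just a filter by one topic plus a lookup of that topic's neighbours in the network; the topic names below are illustrative:

```python
# Sketch: "clicking" a topic = filtering to its documents and listing its neighbours
# in the topic network built during the analysis.
def drill_down(docs, topics, links, topic_name):
    topic = next(t for t in topics if t["name"] == topic_name)
    filtered = [d for d in docs if topic["matches"](d)]                 # updated results table
    neighbours = sorted({a if b == topic_name else b
                         for a, b in links if topic_name in (a, b)})    # e.g. Reactor, Radiation
    return filtered, neighbours

# filtered, neighbours = drill_down(docs, topics, links, "Fukushima nuclear power plant")
```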



Congratulations to those who held out until the end of this first demonstration. To sum up, we have seen how text analytics/text mining can enrich the user's experience when searching through unstructured data. The benefits are twofold:

  1. The user is navigated from the known (the "japan" query) to the unknown (the nuclear reactor problem at Fukushima).
  2. The user's search is faster and more likely to return the needed results. The speed gain might not be obvious in our demonstration, because we went through a fairly detailed description of how the topic network is built. We can rectify that right now by simply imagining that the network building and the final search are done by different users of the system (with different objectives). The first user (doing the analysis) knows the domain and can quickly distinguish what in the results is important and what is not. The other user (only searching) receives all the relevant knowledge immediately, with no extra effort, during his search.
I hope the first demonstration was interesting for you. If there is anything you liked or disliked, please leave a comment so that I can improve next time; I will appreciate any response. Thank you for your time.

Friday, March 25, 2011

Introduction

This post will serve as a general introduction to all other posts on this blog, which is dedicated to demonstrations, examples and impacts of text analytics techniques in various scenarios. To gain a quick insight into what text analytics is, please have a look at Wikipedia. The short definition is:

"The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.[1] The term is roughly synonymous with text mining; indeed, Prof. Ronen Feldman modified a 2000 description of "text mining"[2] in 2004 to describe "text analytics."[3] The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s,[4] notably life-sciences research and government intelligence"

(If you would like a more detailed story, an excellent one comes from Seth Grimes.)

Text analytics is about employing raw computing power to process large sets of unstructured textual data (documents, records, ...) in order to mine out structural information (categories, tags, associations, etc.) that can be used in a variety of ways. For example:

  • Organizing huge document sets to achieve better retrieval capabilities.
  • Recognizing patterns in text records, which leads to the definition of new (for example, business) rules.
  • ...

There are, of course, more situations where it is handy to have a structural representation on top of a huge pile of unstructured data; these will be explored in later posts. Each post will be dedicated to one demonstration and there will be no particular ordering, so you can jump right into the ones you like. The next post will demonstrate enhanced information retrieval achieved via text analytics.