Sunday, August 18, 2013

User-Defined Query Term Weighting in Lucene

I am sharing a short piece of code, with an explanation, showing how Lucene (PyLucene, to be specific) can be used for query expansion.

I will not discuss how to devise a strategy for finding new terms for query expansion (you can implement that on your own). What I will explain is how to assign different weights to query terms for a retrieval task.

Consider four documents with the following content:
D1 -> 'pagerank pagerank algorithm'
D2 -> 'pagerank algorithm algorithm'
D3 -> 'pagerank'
D4 -> 'algorithm'

The vocabulary of this corpus is just 'pagerank' and 'algorithm'. Each term has a corpus frequency of 4 and a document frequency of 3, so idf and cf are identical for both terms and do not influence the relative scores.
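
The counts above can be checked with a few lines of plain Python. This is only a sketch for verifying the arithmetic; it does not use Lucene, and the variable names (docs, collection_freq, document_freq) are my own choices for illustration.

```python
from collections import Counter

# The four toy documents from the example above.
docs = {
    "D1": "pagerank pagerank algorithm",
    "D2": "pagerank algorithm algorithm",
    "D3": "pagerank",
    "D4": "algorithm",
}

collection_freq = Counter()  # total occurrences of each term in the corpus
document_freq = Counter()    # number of documents that contain each term

for text in docs.values():
    tokens = text.split()
    collection_freq.update(tokens)     # count every occurrence
    document_freq.update(set(tokens))  # count each term once per document

print("corpus frequency:", dict(collection_freq))
print("document frequency:", dict(document_freq))
```

Both terms end up with the same corpus frequency and the same document frequency, which is why any difference in the scores must come from term frequency and the query-term weights.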

In the attached source code, the term 'pagerank' is given a boost 10 times that of the term 'algorithm'. The query is 'pagerank algorithm'.

On retrieval, D3 scores 10 times higher than D4. Likewise, D1 scores higher than D2, but not 10 times higher: unlike D3 and D4, these two documents each contain both query terms among their three tokens, so both terms contribute to each score. Please run the source code and observe the results.
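
To see where these ratios come from without installing PyLucene, here is a rough pure-Python sketch of a Lucene-style TF-IDF score. It is an approximation under stated assumptions, not Lucene's actual code: it uses sqrt(tf), a length norm of 1/sqrt(number of tokens), and a coord factor (matched query terms / total query terms), and it leaves out idf and the query norm because they are the same for both terms here. The function name score and the boosts dictionary are hypothetical. Absolute values will differ from what Lucene prints; only the ratios are of interest.

```python
import math

docs = {
    "D1": "pagerank pagerank algorithm",
    "D2": "pagerank algorithm algorithm",
    "D3": "pagerank",
    "D4": "algorithm",
}

# Query 'pagerank algorithm' with 'pagerank' weighted 10 times higher.
boosts = {"pagerank": 10.0, "algorithm": 1.0}


def score(text, boosts):
    """Simplified TF-based score with per-term boosts (idf omitted)."""
    tokens = text.split()
    length_norm = 1.0 / math.sqrt(len(tokens))  # shorter documents score higher
    total = 0.0
    matched = 0
    for term, boost in boosts.items():
        tf = tokens.count(term)
        if tf > 0:
            matched += 1
            total += boost * math.sqrt(tf)
    coord = matched / float(len(boosts))  # fraction of query terms matched
    return coord * length_norm * total


scores = {name: score(text, boosts) for name, text in docs.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))

print("D3 / D4 =", scores["D3"] / scores["D4"])
print("D1 / D2 =", scores["D1"] / scores["D2"])
```

In this sketch D3 and D4 each match a single term with identical tf and length, so the boost is the only difference between them. D1 and D2 each sum a boosted and an unboosted contribution, which pulls their ratio well below 10.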

Source code: http://codeviewer.org/view/code:35db

Version: pylucene-3.6

Saturday, July 27, 2013

EuroHCIR 2013: Work Towards a New Search Interface, Namely Perspective-Aware Search

Recently an updated version was demoed at SIGIR 2014: http://dl.acm.org/citation.cfm?id=2611184

There are occasions when search results do not satisfy the information need and instead return a set of results quite different from what the user is looking for. One possible reason lies in the returned documents themselves: while covering the topic, they carry certain perspectives, and the user may perceive these perspectives as bias.
Let's take the following example scenarios:
  • Consider a case where a user wishes to find information about a certain event (say, a bomb attack in a certain region). The returned search results consist mostly of news reports that, through their implicit writing style, blame Islam and relate it to terrorism in most cases. This prompts the user to explicitly examine how strongly Islam is related to terrorism in the returned set of search results.
  • Consider another case where a user wishes to find information about the roles and rights of women in Islam, but the search engine returns articles that tend to highlight oppression of women instead of women's rights and roles. In this case the user observes a correlation between women and oppression rather than the factual position on rights.
In the above cases, the user's information need may lead him or her to investigate the underlying document collection explicitly and to want to observe how strongly different perspectives appear in the search results (e.g., news reports). Current search engines do not serve this need, since they do not highlight perspectives when displaying search results. Hence, we propose the concept of "perspective-aware search." The proposed search interface enables the user to explicitly analyze search results with perspective awareness.

The following presentation contains some screenshots of the proposed search interface; I will be giving a demo of this system at the EuroHCIR workshop, which is co-located with SIGIR 2013.


The system is built on top of the WikiMadeEasy API, an API for mining Wikipedia data that is an output of the work I am doing towards my PhD thesis. Feel free to contact me for more details of the API. The full paper describing the system can be found here.

Tuesday, February 5, 2013

Python: Reading large bz2 file with bz2.BZ2File()

Reading a bz2 file in Python can sometimes result in a partial (incomplete) read of the file.

The workaround is simple: decompress the bz2 file with an extraction utility (Ubuntu ships a graphical one by default), then compress the result back to bz2 and try reading it again. This may well solve the problem.

Reason for the problem: the party that produced the bz2 file may have built it from multiple files, which bz2.BZ2File() in Python does not handle well.
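
If re-compressing the file is not convenient, another option is to decompress it stream by stream with bz2.BZ2Decompressor. The sketch below assumes the partial read really is caused by several bz2 streams concatenated into one file; the function name iter_bz2_chunks and its chunk_size parameter are my own for illustration. It relies only on decompress(), the unused_data attribute and EOFError. Note that in Python 3.3 and later, bz2.BZ2File reads multi-stream files by itself, so this is mainly useful on older versions.

```python
import bz2


def iter_bz2_chunks(path, chunk_size=1024 * 1024):
    """Yield decompressed byte chunks from a bz2 file that may contain
    several concatenated bz2 streams."""
    decompressor = bz2.BZ2Decompressor()
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            while chunk:
                try:
                    data = decompressor.decompress(chunk)
                except EOFError:
                    # The previous stream ended exactly at a chunk boundary:
                    # start a new stream and retry with the same bytes.
                    decompressor = bz2.BZ2Decompressor()
                    continue
                if data:
                    yield data
                # Bytes left over after the end of a stream belong to the
                # next stream, so hand them to a fresh decompressor.
                chunk = decompressor.unused_data
                if chunk:
                    decompressor = bz2.BZ2Decompressor()
```

A typical use would be to iterate over iter_bz2_chunks('dump.bz2') (the file name is just a placeholder) and write or process each chunk as it arrives, so the whole file never has to fit in memory.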