I am sharing a simple piece of code, with an explanation, showing how Lucene (PyLucene, to be specific) can be used for Query Expansion.
What I will not discuss here is how to devise a strategy for finding new terms for Query Expansion (you can implement this on your own). What I will explain is how to assign different weights to query terms for a retrieval task.
Consider four documents with the following content:
D1 -> 'pagerank pagerank algorithm'
D2 -> 'pagerank algorithm algorithm'
D3 -> 'pagerank'
D4 -> 'algorithm'
So the vocabulary of our corpus is just 'pagerank' and 'algorithm'. Each term has a corpus frequency (cf) of 4 and a document frequency (df) of 3. Since idf and cf are identical for both terms, they cannot favor one term over the other in scoring.
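The counts above can be checked with a few lines of plain Python. This is only a sketch: it does not use Lucene, it tokenizes by whitespace, and the idf formula is the classic one I am assuming for Lucene 3.x's default similarity, 1 + ln(numDocs / (docFreq + 1)).

```python
import math
from collections import Counter

# The four toy documents from the example above.
docs = {
    "D1": "pagerank pagerank algorithm",
    "D2": "pagerank algorithm algorithm",
    "D3": "pagerank",
    "D4": "algorithm",
}

cf = Counter()  # corpus frequency: total occurrences of a term
df = Counter()  # document frequency: number of documents containing a term
for text in docs.values():
    tokens = text.split()      # assumption: whitespace tokenization
    cf.update(tokens)
    df.update(set(tokens))     # count each term once per document


def classic_idf(doc_freq, num_docs):
    """Assumed classic idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


idf = {term: classic_idf(df[term], len(docs)) for term in df}

for term in sorted(cf):
    print(term, "cf =", cf[term], "df =", df[term], "idf =", idf[term])
```

Both terms end up with the same cf, df and idf, which is why any difference in the final scores has to come from the query-time boost and the term frequencies.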
In the attached source code, the term 'pagerank' is boosted by a factor of 10 relative to the term 'algorithm'. The query is 'pagerank algorithm'.
On retrieval, D3 scores 10 times higher than D4, and likewise D1 scores higher than D2 (but not 10 times higher: both documents contain both query terms, so the boost applies to each of them, and the gap comes only from the differing term frequencies within their three terms, unlike the single-term case just discussed). Please run the source code and observe the results.
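To see where these ratios come from without running Lucene, here is a rough pure-Python simulation of the classic TF-IDF scoring. It is a sketch under assumptions, not the linked source code: I am assuming the Lucene 3.x default formula, score = coord * queryNorm * sum(sqrt(tf) * idf^2 * boost * lengthNorm), and I ignore the lossy one-byte encoding Lucene uses for length norms, so absolute scores from PyLucene will differ. The function name `classic_scores` is my own.

```python
import math

DOCS = {
    "D1": "pagerank pagerank algorithm",
    "D2": "pagerank algorithm algorithm",
    "D3": "pagerank",
    "D4": "algorithm",
}

# Query 'pagerank algorithm' with 'pagerank' boosted 10x.
BOOSTS = {"pagerank": 10.0, "algorithm": 1.0}


def classic_scores(docs, boosts):
    """Approximate classic Lucene TF-IDF scores (hypothetical helper)."""
    num_docs = len(docs)
    tokenized = {name: text.split() for name, text in docs.items()}

    # Document frequency and assumed classic idf per query term.
    df = {t: sum(1 for toks in tokenized.values() if t in toks) for t in boosts}
    idf = {t: 1.0 + math.log(num_docs / (df[t] + 1.0)) for t in boosts}

    # queryNorm = 1 / sqrt(sum of squared query weights).
    query_norm = 1.0 / math.sqrt(sum((idf[t] * boosts[t]) ** 2 for t in boosts))

    scores = {}
    for name, toks in tokenized.items():
        length_norm = 1.0 / math.sqrt(len(toks))   # unencoded, see lead-in
        matched = [t for t in boosts if t in toks]
        coord = len(matched) / len(boosts)          # fraction of query terms hit
        total = sum(
            math.sqrt(toks.count(t)) * idf[t] ** 2 * boosts[t] * length_norm
            for t in matched
        )
        scores[name] = coord * query_norm * total
    return scores


scores = classic_scores(DOCS, BOOSTS)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))

print("D3 / D4 =", scores["D3"] / scores["D4"])
print("D1 / D2 =", scores["D1"] / scores["D2"])
```

Under these assumptions the D3/D4 ratio is the boost itself (10), while D1/D2 reduces to (10*sqrt(2) + 1) / (10 + sqrt(2)), roughly 1.33. Within each pair the two documents have the same length, so the length norm cancels out of both ratios.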
Source code: http://codeviewer.org/view/code:35db
Version: pylucene-3.6