I am PC, powered by Linux: 2010

Monday, November 22, 2010

Why Is Writing Important?

Do I need to learn how to write? What does that mean? Does it only mean that I should write with correct grammar? The answer to 'only grammar' is no.

Let's go back to the fundamentals: why do we write? Um... one would say, for communication. Then why do we communicate? To make others understand our viewpoint. So understanding is the most important concept.

Now, what's special about writing, given that we can talk too? The key idea is the preservation of clear thoughts, and that's the magic of writing. Writing in perfect grammar is less important than writing clear and unambiguous thoughts (thoughts that can be interpreted in only one way). Grammar is a side hero in this area, but the main hero is the transfer of clear thoughts.

In this article, I take grammar as a general representative of syntax (i.e., spelling, tenses, etc.). Some people feel that writing is about showing off their fluency in a particular language, but this is a common myth.

Let me give some examples of unambiguous writing so that I can convey my thought more clearly.

1- This is a sentence where a/b is true.

Ans- How do I read that? This is a sentence where a or b or both are true. A common mistake in interpretation would be to read it as just 'a or b' (at the reader’s end, at the writer’s end, or at both ends).

2- This is an example: This supports the example strongly.

Ans- How do I read that? This is an example, and because of the colon the following sentence strongly supports it. A common mistake would be for the reader or writer to ignore the ‘:’.

3- I am writing; I am sharing myself.

Ans- How do I read that? I am writing, and one reason is to share myself. A common mistake would be for the reader or writer to ignore the ‘;’ or to tie it too strongly to the previous sentence, i.e., to read it as 'I am writing because I want to share myself.'

To conclude my short post, writing is a very effective way of communicating. However, it brings complications because it does not involve body language (as we have when talking). Therefore, it's important to understand the idea behind this style of communication in order to avoid being misreported or misread. It is an important skill for both readers and writers.

If you find this post interesting, please share it with your friends. This post applies to both general and technical writing.

Tuesday, September 14, 2010

Steps of utilization of the human mind

The following are general steps that define the utilization of the human mind in an activity.

1) Question your surroundings
2) Question beyond your surroundings
3) Think about what to adopt
4) Follow what you adopt as fully as possible

If required, go back to previous steps.

People may apply this to maximize anything that interests or affects them, be it education and research, paid professional research, business needs, scientific advancement, or religious/non-religious thinking. Activities that do not involve the above general steps fall into norms, culture and control. If an activity expects utilization of the mind but the above steps are not followed, then the person is under an illusion. Questioning always breaks illusion, but borrowed questions may introduce a stronger illusion; if questions are borrowed, then the expansion through step 2 should not be borrowed in full, in order to keep it reasonably natural. A point to note here: thinking is natural, but illusion is a sleeping drug for the human mind.

Friday, September 3, 2010

[Paper]: Blogosphere Topic Clustering and Ranking

The following are the slides of my paper presentation at the COLING 2010 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources.

Identifying and ranking topic clusters in the blogosphere from M Atif Qureshi

The paper deals with the identification and ranking of topic clusters in the blogosphere. In this paper, a topic cluster is a group of blogs that share a common interest, i.e., a topic. The algorithm takes into account both the hyperlinked social network of blogs and the content of the blog posts. Topic-specific ranks are assigned to each blog in the cluster using a metric called “Topic Discussion Rank,” which helps in identifying the most influential blog for a specific topic. Experiments show that the presented method reaches a high level of accuracy.

The proposed method is a hybrid approach: first, the content words of the blog posts are used to determine the relevance of a blog to a given concept, and thus its assignment to a "cluster". Second, the links originating from the blog are used to rank the relevance of the blogs within the "cluster".
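The two stages can be sketched roughly as below. This is only an illustration and not the paper's actual Topic Discussion Rank metric: the keyword-count test for cluster membership, the scoring rule (each link originating from a cluster member adds one point to the member it points to), the function names `assign_cluster` and `rank_cluster`, and the sample blogs are all assumptions made for the example.

```python
from collections import Counter

def assign_cluster(blogs, topic_words, min_hits=2):
    """Stage 1 (assumed rule): keep blogs whose posts mention the topic words often enough."""
    members = set()
    for name, blog in blogs.items():
        words = Counter(w.lower() for post in blog["posts"] for w in post.split())
        hits = sum(words[w] for w in topic_words)
        if hits >= min_hits:
            members.add(name)
    return members

def rank_cluster(blogs, members):
    """Stage 2 (assumed rule): follow links originating from each member blog
    and credit the member blogs they point to."""
    score = {name: 0 for name in members}
    for name in members:
        for target in set(blogs[name]["links"]):
            if target in members and target != name:
                score[target] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy data: four blogs, one of them off-topic.
blogs = {
    "a": {"posts": ["linux kernel tips", "more linux notes"], "links": ["b"]},
    "b": {"posts": ["linux desktop review", "ubuntu linux"], "links": ["a", "c"]},
    "c": {"posts": ["linux server setup linux"], "links": ["b"]},
    "d": {"posts": ["cooking pasta"], "links": ["b"]},
}
members = assign_cluster(blogs, {"linux"})
ranking = rank_cluster(blogs, members)
print(ranking)
```

In this toy data, blog "d" never enters the cluster, so its link to "b" is not counted in the ranking; only links between cluster members matter.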

In the age of Web 2.0 the blogosphere has assumed a very significant role, and it serves as a medium for disseminating opinion. This research is part of a long-term project on the blogosphere, and we would like to invite students and researchers who are interested in this area to collaborate with us. Anyone interested can contact me by email at atifms@kaist.ac.kr or matifq@yahoo.com, or contact my colleague at arjumandms@kaist.ac.kr or arjumand_younus@yahoo.com.

The full text of the paper can be downloaded from this link: paper.

Additionally, a discussion of the workshop (Collaboratively Constructed Semantic Resources) can be found here, and your comments are also welcome.

Tuesday, August 31, 2010

Visit to Microsoft Research Asia: Details on Discussion

August 27, 2010 was a memorable day for me: my colleague, who also happens to be my wife, and I toured Microsoft Research Asia in Beijing, China.

Mr. Yunbo Cao, who works there, hosted the two of us. He was extremely humble, and his humility and kindness were evident throughout the session. First we were shown around the research labs and work area; he showed us that, instead of a name plate, each employee's office at Microsoft Research has the name of the city that employee comes from written outside it.

Microsoft Research Asia is Microsoft’s fundamental research arm in the Asia Pacific region, and it was founded on November 5, 1998. In 2004, MIT Technology Review named Microsoft Research Asia “the hottest computer lab in the world.” Many technologies that have had a huge impact on the technological community have emerged from Microsoft Research Asia. Over 200 innovations from the lab have been transferred to Microsoft products, including Office XP, Office System 2003, Windows XP, Windows Server 2003, Windows XP Media Center Edition, Windows XP Tablet PC Edition, Xbox, MSN, and Windows Live. In addition, technologies from the lab have been adopted by international standards bodies such as MPEG4 (error-resilient video transmission), IETF (TCP/IP header compression), and ITU/ISO (video-compression technology).

The prominent labs at Microsoft Research Asia which Mr. Yunbo talked about are:

Web Search and Mining: the goal of this group is to drive the next generation of Web search by leveraging data mining, machine learning, and knowledge discovery techniques for information analysis, organization, retrieval, and visualization. Its core areas are structuralizing the Web, vertical search, a large-scale experimental web search platform, mobile search and multimedia search.
Information Retrieval and Mining: the goal of this group is to develop advanced technologies to help users accurately, quickly, and easily find information. Currently, the group is working on three projects: algorithms for improving web search, enterprise search, and community search. The following research areas are being intensively investigated: search relevance and learning to rank, link analysis and web graph mining, anti-spam and adversarial information retrieval, document information extraction, and search log data mining.
Natural Language Computing: this group is focusing its efforts on a variety of research topics, including multi-language text analysis, machine translation, cross-language information retrieval, and question answering. Over the years, the group has made significant contributions to Microsoft products, including a Japanese and Chinese Input Method Editor (IME), an English writing assistant for Office 2007, a Chinese couplet game for Windows Live, a Chinese word breaker, and pinyin search.
Web Intelligence: the aim of this group is to enable synergetic collaboration between people and between people and computers to enlighten them and enrich their lives. For this mission researchers of this group develop scalable automatic content analysis methods and quality metrics to analyze a huge amount of online text such as blogs, community-based question answering, forum discussions, news, reviews, Twitter, Wikipedia, etc. and to harvest explicit and implicit knowledge from these media.

Mr. Yunbo himself is part of the Web Intelligence group, and prior to this he was part of the Natural Language Computing group. The areas of focus of his group are expert and social search, user intent/activity recognition and prediction, inarticulate user assistance, information access evaluation, social question answering and summarization, and sentiment analysis. His research centers heavily on community-based question answering services.

He told us that the third and fourth floors of Microsoft Research Asia are called the Microsoft Search Technology Center, and the main focus of all research there is Bing, Microsoft's newly released search engine. This year's SIGIR had 15 papers by Microsoft Research, and one can see that researchers at Microsoft are pushing hard day and night to make Bing better and better.

We could not take photographs of the labs as it was not allowed, but photography was permitted in some places. The picture below shows the Microsoft Research Asia recreation area, where the employees enjoy some time off from work. When we reached it, it was fruit time. Mr. Yunbo offered us some, but we could not take any as we were fasting.

The discussion then moved to our research focus at the Database and Multimedia Lab of KAIST. We told him about a state-of-the-art project by our professor that relates closely to search engines; the paper won the best demonstration award at ICDE 2005. I further explained to him how my professor considers it an offense to traditional databases if researchers worldwide believe that MapReduce-based systems are the answer for massively scaled information retrieval tasks. The research idea is to incorporate an information retrieval architecture into existing parallel databases to provide the best of both worlds: the scalability of MapReduce-based systems and the additional functionality of databases (SQL, schemas, etc.). The Microsoft researcher appreciated these ideas. I told him about my thesis problem statement on improving the quality of web search results by combating spam, and my wife told him about her work on scalable, massive architectures for parallel web crawling. He was surprised to hear that we were the only ones in our lab working on these huge project modules, and said that at Microsoft they work in groups even at the module level (for the Bing project). We also exchanged some thoughts on how the nature of search systems is moving from static search to dynamic search.

We then raised the final topic of our talk: academic collaborations between Pakistani universities and Microsoft Research Asia, and Microsoft Research Asia's future plans for Pakistan. Mr. Yunbo said that Microsoft Research Asia has a great many collaborations with universities throughout Asia, and that Microsoft Research has two research centers in Asia, in China and India, but the China center heads Asia as a whole since it is Microsoft Research Asia. He added that Microsoft Research Asia is always looking for more engineers. I asked why academic/research collaborations exist with universities throughout Asia but not in Pakistan. Mr. Yunbo replied that Microsoft has some collaborations with Pakistani universities, but he agreed that those are only technology-oriented and that none exist at the applied research and academic level. He also pointed out that Microsoft Research Asia would love to have academic/research collaborations with Pakistani universities, and that so far there has been no serious consideration because there is no appropriate channel through which to initiate them. He added that student-researchers like us can serve as the bridge between Pakistani universities and Microsoft Research Asia, and he suggested that we contact the Microsoft Research University Relations Team. Mr. Yunbo believes that both Pakistani universities and Microsoft Research can benefit greatly from such collaborations, which could open a whole new door for researchers in Asia, as intelligence can never be confined to certain regions.

In a passage along our way, he showed us souvenirs given to Microsoft Research Asia by different universities of Asia. Chinese universities were the most heavily represented among them, but the collection showed how open Microsoft Research Asia is to visiting different universities.

Readers are invited to leave their comments, questions and suggestions, as we intend to carry these plans forward to promote and improve Computer Science research in Pakistan.

Tuesday, July 27, 2010

[Paper]: Revisiting Crawlers’ Role in a Search Engine

The following are the slides of my paper presentation at ICISA 2010 in Seoul, Korea.

Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework from M Atif Qureshi

This paper considers tradeoffs in web crawler design, especially from the perspective of events versus threads [1,2]. The paper also makes some recommendations for better OS support for web crawling. It points out that the two principal problems with web crawling are:

Choosing the right pages to crawl
Basic architecture for performing the crawl

The focus of the work is on the second problem. Our proposition is that events are the ideal way to implement web crawlers, as events give better throughput while crawling the web. Furthermore, we argue that the growing usage of search engines calls for a careful redesign of the constituents of a search engine from the perspective of systems software, and we conclude that the exokernel [3] is the right answer for removing some of the limitations of today's search engines. We recommend a future operating system dedicated to search engines.
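To illustrate the event-driven style in general terms, here is a minimal sketch using Python's `asyncio`. It is not the crawler from the paper: the `fetch` function only simulates network latency with `asyncio.sleep` instead of issuing real requests, and the URLs, the function names and the `max_in_flight` limit are hypothetical choices for the example.

```python
import asyncio
import time

async def fetch(url, delay=0.1):
    """Simulated download (assumption): awaiting yields control to the event loop."""
    await asyncio.sleep(delay)  # stands in for waiting on network I/O
    return url, f"<html>{url}</html>"

async def crawl(urls, max_in_flight=10):
    """Fetch all URLs on a single thread, capping the number of concurrent fetches."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(url):
        async with gate:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))

# Hypothetical URLs; nothing is actually requested over the network.
urls = [f"http://example.invalid/page{i}" for i in range(20)]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Because the simulated waits overlap, the 20 fetches finish in roughly two delay periods rather than twenty, and no thread is blocked while a page is "downloading".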

If you are interested in more details, please contact me by email at atifms@kaist.ac.kr or matifq@yahoo.com. You can also request a copy of the paper by personal email.

References

[1] von Behren, R., Condit, J., and Brewer, E. Why Events are a Bad Idea (for High-concurrency Servers). In 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, Hawaii, May 2003.

[2] Ousterhout, J. Why threads are a bad idea (for most purposes). Invited talk presented at the 1996 USENIX Annual Technical Conference, San Diego, CA, October 1996.

[3] Engler, D. R., Kaashoek, M. F., and O'Toole, J. 1995. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (Copper Mountain, Colorado, United States, December 03 - 06, 1995). M. B. Jones, Ed. SOSP '95. ACM, New York, NY, 251-266.

Wednesday, June 2, 2010

Blog Research: 19 GB data processing with 2 GB active RAM

A few days ago I was performing blogosphere analysis using my crawler "VisionerBot", which I recently presented at ICISA 2010 in Seoul. I had quite a tough time, not to mention a number of sleepless nights, while I was on this task, as the amount of data was huge :). My task was to process blogs for the interests of users over the blogspot domain. In this post I present the problem I faced and the technique I used to overcome it:

Input: 69 Blogs

Objective:

1) Find 5,000 Blogs

2) Process at most 2,000 Blog posts per blog

Achievement:

Total Blogs found: 5,067

Total Alive Blogs: 4,552

Total Number of Posts: 1,704,587

Problems:

Size of data that needs to be processed: 19 GB+

Size of available active RAM: 2 GB

Now what to do from here... When I started working on this, I never expected to find 1,704,587 posts with a data size of 19 GB; I worked almost day and night to make this data fit inside my desktop machine. If I had used a database for this experiment, it would have taken me months to download the data, whereas I had a deadline to complete this task within 10 days. So I came up with a new algorithm that I call the "Rack Algorithm": it downloads data into RAM until RAM is full, then flushes the data to disk and clears RAM for the rest of the download, and this cycle continues until the data is downloaded completely. After the download comes the process of finding the meaningful data in those 19 GB and loading it into RAM for processing; there I managed to shrink its size enough to fit inside RAM. In this process I used (key, value) pairs along with lists.
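The flush-when-full idea can be sketched roughly as below. This is an illustrative sketch and not the original VisionerBot code: the `RackBuffer` class, the character-count size estimate, the tiny `max_size` limit and the JSON-lines file layout are all assumptions made for the example.

```python
import json
import os
import tempfile

class RackBuffer:
    """Collect (key, value) records in RAM and spill them to disk when a limit is hit."""

    def __init__(self, directory, max_size):
        self.directory = directory
        self.max_size = max_size   # assumed limit, measured in characters
        self.records = []          # the in-RAM list of (key, value) pairs
        self.used = 0
        self.files = []            # paths of the batches already on disk

    def add(self, key, value):
        self.records.append((key, value))
        self.used += len(key) + len(value)   # rough size estimate
        if self.used >= self.max_size:
            self.flush()

    def flush(self):
        if not self.records:
            return
        path = os.path.join(self.directory, f"rack_{len(self.files):05d}.jsonl")
        with open(path, "w", encoding="utf-8") as out:
            for key, value in self.records:
                out.write(json.dumps([key, value]) + "\n")
        self.files.append(path)
        self.records = []   # free the RAM for the next batch
        self.used = 0

    def scan(self):
        """Stream records back from disk one at a time, so RAM use stays small."""
        for path in self.files:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    key, value = json.loads(line)
                    yield key, value

workdir = tempfile.mkdtemp()
rack = RackBuffer(workdir, max_size=50)
for i in range(10):
    rack.add(f"blog{i}", "x" * 20)   # stand-in for a downloaded post
rack.flush()                          # write whatever is still in RAM
restored = list(rack.scan())
print(len(rack.files), "files,", len(restored), "records")
```

Reading the batches back through a generator is what lets a later pass keep only the meaningful part of each record in memory instead of the whole download.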

Finally, a sigh of relief: I have accomplished what seemed nearly impossible within 10 days. I am happy that I managed to find opinion clusters of 1,704,586 posts with my upcoming algorithm, which I call TDR (Topic Discussion Rank).

Friday, May 14, 2010

Welcome Post

So what should I write here? I think it's time to write something random; something that gives me the freedom to write about whatever I do in the small world of daily science.

If computers were human, I would have three more close friends in my life: a desktop machine, a laptop and a notebook. Why? Because I spend most of my time with computers :)

I plan to share whatever I think, no matter how scattered or how confined.

I also intend to write about things related to Linux; yeah! I am PC, happily powered by Linux. So those of you who are looking to try Linux, feel free to get in touch with me. My prescription: Windows is a cancer for researchers, and the cure is Linux. Even if you don't know how to operate Linux, don't be afraid; Linux is now friendlier than ever before. Try Ubuntu.