Wednesday, June 2, 2010

Blog Research: 19 GB data processing with 2 GB active RAM


A few days ago I was performing a blogosphere analysis using my crawler "VisionerBot", which I recently presented at ICISA 2010 in Seoul. I had quite a tough time, not to mention a number of sleepless nights, because the amount of data was huge :). My task was to process blogs on the blogspot domain to find the interests of their users. In this post I present the problem I faced and the technique I used to overcome it:

Input: 69 Blogs
Objective:
1) Find 5,000 Blogs
2) Process at most 2,000 Blog posts per blog
Achievement:
Total Blogs found: 5,067
Total Alive Blogs: 4,552
Total Number of Posts: 1,704,587
Problems:
Size of data that needs to be processed: 19 GB+
Size of available active RAM: 2 GB
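
To get a feel for the scale, here is a rough back-of-the-envelope calculation in Python. It uses only the figures listed above; treating 1 GB as 10**9 bytes and "19 GB+" as exactly 19 GB are assumptions, so the results are lower bounds and averages rather than measurements.

```python
import math

# Figures quoted in the post; 1 GB is taken as 10**9 bytes (an assumption).
DATA_BYTES = 19 * 10**9
RAM_BYTES = 2 * 10**9
TOTAL_POSTS = 1_704_587
ALIVE_BLOGS = 4_552

# Minimum number of times RAM must be filled and emptied to see all the data.
min_ram_fills = math.ceil(DATA_BYTES / RAM_BYTES)

# Average raw size of one post and average number of posts per alive blog.
avg_post_bytes = DATA_BYTES / TOTAL_POSTS
avg_posts_per_blog = TOTAL_POSTS / ALIVE_BLOGS

print(f"RAM fills needed (at least): {min_ram_fills}")
print(f"Average post size: {avg_post_bytes / 1000:.1f} KB")
print(f"Average posts per alive blog: {avg_posts_per_blog:.0f}")
```

In practice the number of RAM fills would be higher, since the program itself and its bookkeeping also need memory.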

Now what to do from here... When I started working on this, I never expected to find 1,704,587 posts with a data size of 19 GB, and I worked almost day and night to make this data fit on my desktop machine. If I had used a database for this experiment, it would have cost me months to download the data, whereas I had a deadline to complete the task within 10 days. So I gave birth to a new algorithm that I call the "Rack Algorithm": it downloads data into RAM until RAM fills up, then flushes the data to disk and clears RAM for the rest of the download, and this cycle continues until the data is downloaded completely. After the download comes the process of finding the meaningful data in that 19 GB and loading it into RAM for processing; there I managed to shrink it enough to fit inside RAM. In this process I used (key, value) pairs along with lists.
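
The two phases described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Rack Algorithm or VisionerBot code: the `SpillBuffer` class, the `reduce_chunks` function, the JSON-lines chunk files, the `blog`/`text` field names and the choice of post length as the stored value are all assumptions made for the example.

```python
import json
import os
import tempfile
from collections import defaultdict


class SpillBuffer:
    """Holds records in memory and writes them to a numbered file on disk
    whenever their approximate size reaches a byte budget."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.records = []       # serialized records currently held in RAM
        self.used_bytes = 0     # approximate size of self.records
        self.chunk_paths = []   # files written so far, in order

    def add(self, record):
        line = json.dumps(record)
        self.records.append(line)
        self.used_bytes += len(line.encode("utf-8"))
        if self.used_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        """Write the buffered records to a new chunk file and empty the buffer."""
        if not self.records:
            return
        name = f"chunk_{len(self.chunk_paths):05d}.jsonl"
        path = os.path.join(self.directory, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(self.records) + "\n")
        self.chunk_paths.append(path)
        self.records = []
        self.used_bytes = 0


def reduce_chunks(chunk_paths):
    """Stream the chunks back one record at a time and keep only a compact
    (key, value) summary: blog name -> list of post lengths."""
    summary = defaultdict(list)
    for path in chunk_paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                post = json.loads(line)
                summary[post["blog"]].append(len(post["text"]))
    return dict(summary)


def demo():
    """Fill a tiny buffer with fake posts, then reduce the spilled chunks."""
    with tempfile.TemporaryDirectory() as directory:
        buffer = SpillBuffer(directory, max_bytes=100)
        for i in range(10):
            buffer.add({"blog": f"blog{i % 2}", "text": "x" * 30})
        buffer.flush()  # write whatever is still in memory
        return len(buffer.chunk_paths), reduce_chunks(buffer.chunk_paths)
```

Calling `demo()` spills ten small fake posts across several chunk files and then rebuilds a per-blog summary from disk without ever holding all the raw text in memory at once. A real crawler would set `max_bytes` close to the available RAM and keep whatever values its later analysis needs.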

Finally, a sigh of relief: I accomplished what seemed nearly impossible within 10 days. I am happy that I managed to find opinion clusters of 1,704,586 posts with my upcoming algorithm, which I call TDR (Topic Discussion Rank).

5 comments:

  1. That is a great job you did.
    Could you give me a bit of clarification about Active RAM?
    What exactly is active RAM? Is it the RAM that physically exists in the computer, as opposed to the virtual memory that is created when RAM fills up?
    Or is it something else?

  2. Sorry for the delay in responding. Thanks for liking the work.
    Active RAM means the real memory, without any involvement of disk or virtual memory. It is simply the memory available to the program in pure RAM. I didn't use the virtual memory provided by the OS or a third party because of the performance overhead. When doing such critical work, it is important to define your own policy for spilling memory to disk in order to cut that overhead as much as possible. So in the above case, I wrote my own algorithm called Rack, which is similar to GFS but optimized for the needs of the application. Let me know if you have any more questions.

  3. Now I got it. It's very clear to me. Thanks for answering. Good luck with your future research.
