Real Time Web & Real Time Search
There has been a lot of hype lately about real time web and search, and total Twitter mania. I have been working on these issues inside Yahoo! Search lately, and here are some of my observations.
News is published in real time now. By news I mean both traditional, global/official stories, and local/socially relevant updates. RSS feeds from news sites, blogs -- these get updated on the web as soon as something newsworthy happens and the writer finishes the article. Your friends and family update their Facebook and Twitter statuses all the time and upload photos straight from their phones. Newspapers can't really compete with the web in providing information this fresh, which is a big part of the reason for their demise.
However, news is not consumed in real time. In fact, it is not humanly consumable in real time. Unless you are a news junkie or a day trader, you do not look up every single story that goes on the wire. This is where Twitter breaks down, at least today. It is like a stock ticker. There is way too much content, and old stuff gets buried by the new, even if some of the older pieces are actually more relevant and interesting. Facebook also suffers from some of this information overload.
So how does one follow real time web updates? One example is mainstream news sites, like Yahoo! News or CNN.com: they highlight the most important stories on the front page. This browse model works pretty well for sites that have human editors who digest the stories and keep the good stuff for you to read. Even these editors, however, use the other methods below to determine what's hot. A key observation about news: there are not that many original and interesting events (core news stories, as it were), and most stories are duplicates (based on Reuters / AP syndicated feeds). Likewise, a huge proportion of Twitter traffic simply refers to these core news stories; it is not news by itself.
Real time search is the best way to consume the latest updates. Search does not mean that you actually type a query into a box. The same algorithms that rank results when you enter a query can also be used to keep news content fresh and ranked by relevance when you go to a news site or your Facebook home page.
However, there are challenges to building good real-time search. The main one is just the speed at which new updates are published, especially on sites like Twitter, which calls its feed the Firehose, with good reason. Normal web crawling techniques are too slow -- they leisurely construct a list of links and relationships in order to determine popularity and PageRank, and by the time a brand new web update has enough PageRank, it may be too stale to be relevant. The way forward is to ingest most of the fresh new content through feeds with a push mechanism. The search engine indexes the feed instead of trying to discover and crawl the content.
As for ranking, the level of duplication is a good proxy for popularity, and so is the number of clicks, appropriately weighted over time. Another important metric is the authority of the content publisher. Not all Twitter users are the same; a Tweet from President Obama or CNN is probably more interesting than Tweets from thousands of unknown authors chatting about the same topic. Likewise, you don't care equally about all your Facebook friends and what they are up to.
Last but not least, real time spam seems to be a growing problem, and identifying its sources and patterns is more akin to the techniques for email spam fighting (which tends to be real time and message based) than to the traditional fight against web spam and link farming.
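To make the ranking signals above a bit more concrete, here is a rough Python sketch of how duplication, time-weighted clicks and publisher authority might be combined into one score. Everything in it is an assumption made for illustration: the Update record, the one-hour half-life, the default authority value and the way the signals are multiplied together are not how any particular search engine does it.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

HALF_LIFE_SECONDS = 3600.0   # assumed: a signal loses half its weight every hour
DEFAULT_AUTHORITY = 1.0      # assumed weight for publishers we know nothing about


@dataclass
class Update:
    """Hypothetical record for one item that arrived through a pushed feed."""
    text: str
    author: str
    published_at: float                 # seconds since the epoch
    duplicate_count: int = 0            # near-identical copies seen so far
    click_times: List[float] = field(default_factory=list)


def decay(age_seconds: float, half_life: float = HALF_LIFE_SECONDS) -> float:
    """Exponential decay: 1.0 at age zero, 0.5 after one half-life."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life)


def decayed_clicks(click_times: List[float], now: float) -> float:
    """Count clicks, with recent clicks counting more than old ones."""
    return sum(decay(now - t) for t in click_times if t <= now)


def score(update: Update, authority: Dict[str, float], now: float) -> float:
    """Combine popularity, publisher authority and freshness (illustrative weights)."""
    # log1p dampens very large counts so one viral item does not swamp the rest
    popularity = (math.log1p(update.duplicate_count)
                  + math.log1p(decayed_clicks(update.click_times, now)))
    trust = authority.get(update.author, DEFAULT_AUTHORITY)
    freshness = decay(now - update.published_at)
    return trust * (1.0 + popularity) * freshness


def rank(updates: List[Update], authority: Dict[str, float], now: float) -> List[Update]:
    """Return updates ordered from highest to lowest score."""
    return sorted(updates, key=lambda u: score(u, authority, now), reverse=True)
```

The multiplication means a stale item sinks no matter how popular it once was, which matches the point that freshness matters more here than in normal web search. A real system would tune or learn these weights rather than hard-code them.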
So the challenge is on -- let's see who will build the best real time search experience!