Monday, June 19, 2006

Microsoft researchers propose a not-so-new way to measure Web page importance

Three Microsoft researchers published an interesting paper last month titled "Beyond PageRank: Machine Learning for Static Ranking." The title sounds a bit scary and technical, but the paper revisits an SEO myth that has long held back the search engine optimization community.

Google's big claim to fame is that they supposedly incorporate PageRank into their search results. PageRank is a measurement of the probability that you will land on any given page simply by clicking on links you find as you surf. That probability is determined by how many links point to a page. But since pages with more inbound links have a higher probability of being found, links from those pages help other pages more than links from obscure pages do. It's very circular and very confusing.
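The circular part is easier to see in a few lines of code. What follows is only a minimal sketch of the random-surfer idea on a made-up four-page Web, not Google's implementation. The function name `pagerank`, the toy graph, and the iteration count are assumptions for illustration; the 0.85 damping value is the one usually quoted from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate random-surfer scores for a small link graph.

    links maps each page to the list of pages it links to.
    This is an illustrative sketch, not a production algorithm.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with every page equally likely

    for _ in range(iterations):
        # Every page gets a small baseline chance (the surfer jumps at random).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own score among the pages it links to,
                # so a link from a high-scoring page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: three pages link to "hub", and "hub" links to "a" and "b".
toy_web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
scores = pagerank(toy_web)
```

In this toy Web, "hub" ends up with the highest score because three pages link to it. Pages "a" and "b" each have only one inbound link, but it comes from "hub", so they still outrank "c", which nobody links to.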

But Google doesn't actually show you search results in PageRank order, except in their directory. When you go to Google and type in a query, say for "tolkien forum", Google looks at a lot of other data first. It computes what is called a Relevance score and then adds the page's PageRank score to that Relevance score. The combined Relevance + PageRank values are used to determine which pages are listed first.

That's a pretty simple concept, actually laid out in very simple language in Larry Page and Sergey Brin's original Google paper. Compute PageRank, compute Relevance, add the scores, and that is how you arrange search results.

You'd think a lot of people could grasp that simple concept, but Nooooooo! The vast majority of references to PageRank among people interested in or involved in search engine optimization are so far out in left field you have to wonder what they were drinking the day they started reading about PageRank. They come up with some of the stupidest explanations for how Google works, and now the mainstream news and business news media have picked up on their silly explanations.

Sheesh!

PageRank is what the Microsoft researchers call a static measurement of quality. It's static because it's computed once per document (within an undisclosed cycle). That static value is then added to a dynamic measurement of relevance. It's dynamic because it's computed each time the document is considered as a possible result for a query. So, if you run a query for "michael martinez blog pagerank" you'll get the static PageRank and a dynamic Relevance score A for this page. But if you run a query for "michael martinez blog google" you'll get the same PageRank with a different dynamic Relevance score B.
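Here is a rough sketch of that static-plus-dynamic arrangement. Everything in it is an assumption made for illustration: the static values are invented, the `relevance` function is a toy word-overlap measure, and plain addition simply follows the description above. A real engine's relevance formula and the way it weights the two scores are not public.

```python
# Static scores: computed once per page, ahead of time (values are made up).
STATIC_SCORE = {"blog-post": 0.30, "other-page": 0.10}


def relevance(query, text):
    """Toy dynamic score: fraction of query words that appear in the page text."""
    words = query.lower().split()
    if not words:
        return 0.0
    page_words = set(text.lower().split())
    return sum(word in page_words for word in words) / len(words)


def combined_score(query, page_id, text):
    """Add the precomputed static score to the per-query relevance score."""
    return STATIC_SCORE[page_id] + relevance(query, text)


# Hypothetical page text standing in for this post.
PAGE_TEXT = "michael martinez blog about pagerank and static ranking"

score_a = combined_score("michael martinez blog pagerank", "blog-post", PAGE_TEXT)
score_b = combined_score("michael martinez blog google", "blog-post", PAGE_TEXT)
```

The static 0.30 is looked up, not recomputed, for both queries. Only the relevance part changes: all four words of the first query appear in the page text, but only three of the four words in the second query do, so `score_a` comes out higher than `score_b`.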

Got all that? Don't worry. You've got plenty of company.

Still, Microsoft's researchers have proposed looking at some different measurements of importance. One such measurement is traffic to a Web page. They discuss several possible ways of collecting such data but they have settled on what I call the Alexa Solution. Alexa.com, now owned by Amazon, tracks surfing data from about 11 million people who have downloaded their toolbar. Every time you open a new Web page, if you have the Alexa toolbar installed in your browser, it sends a little message back to Alexa that says, "Just opened page X".

Microsoft (MSN) has its own toolbar, and the researchers proposed that tracking their toolbar users' data will give them an idea of which pages are important to surfers. PageRank tells you which pages are important to Webmasters, but Webmasters and surfers are not exactly the same group of people. There are more surfers than Webmasters, and Webmasters tend to be more sneaky and manipulative.

Anyone who has questioned the value of Alexa's rankings, which have been manipulated in various ways, can immediately spot the flaw in Microsoft's proposed data source. What can be done to Alexa can be done to MSN.

In fact, no system is perfect. PageRank can be faked, inflated, borrowed, and misused. The abuse of PageRank has become so bad that Google has been depriving major sites of their ability to pass PageRank to other pages they link to. That's a shame, but PageRank has become a sham and Google has invested a lot of its prestige and corporate ego in the concept. They are losing the battle for preserving the value of PageRank (as they see it -- in my opinion, it's always been a stupid idea anyway because most Webmasters don't link on the basis of quality, but rather on the basis of whom they know and like).

So what's the SEO myth I referred to above? Well, as many search engine optimizers have fumbled around the basic concepts and use of PageRank, so they have fumbled around Google's toolbar. A lot of silly ideas have been proposed for what Google does with toolbar data, if anything. People believe that Google is tracking where you surf and ranking sites on the basis of where you go. Some companies actually require their employees to surf to the corporate Web site every day (not realizing that if Google is tracking this data, it's also tracking IP addresses, and most corporate networks reach the Internet through a single shared IP address, so all of those visits would look like they came from one visitor).
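To see why the daily-visit trick would be weak even if toolbar data were used, here is a purely hypothetical sketch. No search engine has documented how, or whether, it processes toolbar pings; the `traffic_signals` function, the (IP address, URL) ping format, and the sample addresses and URLs are all invented for illustration.

```python
from collections import defaultdict


def traffic_signals(pings):
    """Summarize hypothetical toolbar pings.

    pings is an iterable of (ip_address, url) pairs.
    Returns {url: (raw_visit_count, distinct_ip_count)}.
    """
    raw_visits = defaultdict(int)
    sources = defaultdict(set)
    for ip_address, url in pings:
        raw_visits[url] += 1          # every ping counts toward raw traffic
        sources[url].add(ip_address)  # but each IP address is remembered only once
    return {url: (raw_visits[url], len(sources[url])) for url in raw_visits}


# Fifty employees behind one corporate IP address versus three independent surfers.
pings = [("203.0.113.7", "corp.example/home")] * 50 + [
    ("198.51.100.1", "fan.example/forum"),
    ("198.51.100.2", "fan.example/forum"),
    ("198.51.100.3", "fan.example/forum"),
]
signals = traffic_signals(pings)
```

The corporate page collects 50 raw visits but only one distinct source, while the forum collects 3 visits from 3 distinct sources. Anyone counting sources instead of raw visits would not be impressed by the office routine.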

Google has never indicated in any patent application, technical paper, Webmaster guideline, or public presentation or communication that it uses toolbar data to determine search results rankings. Matt Cutts, a Google engineer who frequently speaks about Google's technology, has said on his own blog (expressing his personal opinion independent of Google's undisclosed corporate view) that he would consider use of the toolbar data to be kind of spooky.

So, maybe this latest paper from Microsoft indicates they are using the toolbar data they collect to influence their search results. Maybe not. Toolbar data can be spoofed by software, and there are people who have claimed to do it through the years for Alexa and Google. If they can do it for Alexa and Google, then they should be able to do it for Microsoft, Yahoo!, and Ask (all of whom offer toolbars for your browser).
