Friday, 30 May 2008

How does Google's algorithm work?

Perhaps the eighth wonder of the world should be Google’s mysterious “Algorithm”. By the way, I do say that with tongue planted firmly in cheek (I've never really understood that turn of phrase; if anyone can tell me its origins, that’d be great).

But I do regularly get asked about it. What is it? What’s in it? What does it do?

Today, I’ll take a bit of a stab at what it is and what we know about it.

I think I mentioned many-a-blog-post-ago that there are three parts to a search engine: the robots, the index and the algorithm. I’ll do a very simple recap to put it into context.

Remember, the ‘bots’ go out over the internet and find and collect web pages. When they find a page, they scurry back and plonk it into Google’s massive storage system, the index. The third part of the ‘engine’ is the algorithm, which effectively analyses each page for relevance. When someone performs a search, it tries its best to sort all pages in the index, ranking the most relevant result highest, then the second most relevant page, well, second, then the third and so on.
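To make the index-then-rank idea a bit more concrete, here's a toy sketch. It is nothing like Google's real algorithm (which, as we'll see, nobody outside Google knows); it simply counts how often the search words appear on each page. The URLs, page text and function names are all made up for illustration.

```python
from collections import Counter

# Toy "index": page URL -> page text. All entries are invented examples.
index = {
    "example.com/a": "google algorithm explained google ranking",
    "example.com/b": "cooking with herbs and spices",
    "example.com/c": "how the google index works",
}

def score(query, text):
    """Count how often each query word appears in the page text."""
    words = Counter(text.lower().split())
    return sum(words[term] for term in query.lower().split())

def search(query):
    """Return URLs with a non-zero score, most relevant first."""
    scored = [(score(query, text), url) for url, text in index.items()]
    # Sort by descending score, then by URL so ties come out in a stable order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [url for page_score, url in ranked if page_score > 0]

results = search("google algorithm")
```

Pages that never mention the search words drop out entirely, and the rest are ordered by score. A real engine weighs far more signals than word counts.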

The problem for you and me is that Google doesn’t tell us exactly how the algorithm works. It can’t really, because for starters, if Google did “give it all away”, competitors would no doubt copy it, and there’d be an optimisation free-for-all by every website owner out there.

It’s Google’s own “11 secret herbs and spices” recipe.

I read recently that in 2007 Google changed or tweaked the algorithm around 450 times. That’s more than once a day! Talk about a moving target! That’s the main reason we will never guarantee a number 1 position at Google. Customer expectation management 101.

I think the most important thing to remember about the algorithm is that while it’s been written and updated by humans, there is no human involvement in a website’s ranking position. It’s best summed up by Udi Manber, the Google vice president who oversees search quality.

"If we find, for a particular query, that result No. 4 should be result No. 1, we do not have the capability to manually change it. We have to find what weakness in the algorithm caused that result and find a general solution to that, evaluate whether a general solution really works and if it's better, and then launch a general solution."

While Google might be constantly fiddling around the edges, there are things about the algorithm which tend to remain fairly constant. Over at SeoMoz (effectively the SEO industry’s version of Smartcompany), the world’s top SEO industry experts were invited to vote on what they believed were the most important factors influencing Google’s algorithm. The Title Tag came in first, followed by body text, Headings and so on. Certainly, links also play a huge part, and the anchor text of inbound links to a site was of “exceptional importance” to all respondents.
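One way to picture "some factors matter more than others" is a weighted score across the different parts of a page. The sketch below is purely illustrative: the survey ranked the factors but the numeric weights here are invented, and the field names and sample page are hypothetical.

```python
# Hypothetical weights. These numbers are made up for illustration and do
# not come from the survey or from Google.
WEIGHTS = {"title": 3.0, "anchor_text": 2.5, "headings": 2.0, "body": 1.0}

def weighted_score(query, page):
    """Sum query-word matches per field, scaled by that field's weight."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in WEIGHTS.items():
        words = page.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total

# An invented page, broken into the fields mentioned above.
page = {
    "title": "Future of search",
    "headings": "Search trends",
    "body": "Where search is heading next",
    "anchor_text": "future of search",
}

page_score = weighted_score("future search", page)
```

A match in the title counts three times as much as a match in the body here, which is the general shape of the idea even though the real weights are unknown.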

Even so, the algorithm doesn’t always get it spot on. The main inspiration for this post was the research I was doing for the AIMIA speech last week, “the future of search”. When I typed that key phrase into Google for some inspiration, the number 1 result’s content was written in 2004. So, there’s still some work left for Google to do!


Tuesday, 26 February 2008

Australian Financial Review - Digital Rights Management and SEO

One of the guys here at work today (Anthony) visited the Australian Financial Review website www.afr.com.au. Anthony was reading one of their articles and, as is his wont, was highlighting the page copy as he was reading.

He noticed the page text was 'switching out' every second character. It's probably best explained with some screenshots (I love screen shots). Here's the normal version of the site.



The next screen shot shows what happens if you swipe the text (to try and copy it):



Try it yourself here

This is an HTML form of Digital Rights Management. For the more technically minded, basically what AFR has done is use two floating div tags, each containing every second character, which, when overlaid, make the text read normally. It's only when you swipe the text that the system comes into play, because you're just swiping one layer.
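Here's a rough sketch of the every-second-character idea, just to show why one layer on its own is gibberish. It is a simplification and an assumption on my part: AFR's actual markup isn't reproduced here, and the function names are made up.

```python
from itertools import zip_longest

def split_layers(text):
    """Split text into two layers, each holding every second character."""
    return text[0::2], text[1::2]

def overlay(layer_a, layer_b):
    """Interleave the two layers again, as the eye does when the divs overlap."""
    pairs = zip_longest(layer_a, layer_b, fillvalue="")
    return "".join(a + b for a, b in pairs)

# Each layer would sit in its own div; swiping the text selects only one.
layer_a, layer_b = split_layers("Australian News")
combined = overlay(layer_a, layer_b)
```

Either layer alone is unreadable, but laid over each other they reproduce the original sentence.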

It potentially creates a strain on your server as it's working hard putting the whole thing together each time a page is called. It would also send your bandwidth through the roof!

It also has its SEO pros and cons, so let's go through the implications if you decide to protect your content with this system.

1. This technique is SEO unfriendly!

If we take a sneak peek at the source code, there's no way Google's going to index this page effectively.

Ever.

Here's an example of the source code:
I don’t think Google will like this… (or any other search engine for that matter)



If you don't want your content in Google's index, use a robots.txt file to keep the Googlebot out (very useful for subscriber / members only content).
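As a small illustration of the robots.txt approach, the sketch below uses Python's standard robots.txt parser to check what a rule would block. The rule itself and the /members/ path are assumptions for the example, not anything AFR actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /members/ path is an assumption.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether Googlebot may fetch a members-only page and a public page.
members_ok = parser.can_fetch(
    "Googlebot", "https://www.example.com/members/article.html"
)
news_ok = parser.can_fetch(
    "Googlebot", "https://www.example.com/news/article.html"
)
```

With that rule in place, the members-only page is off limits to the Googlebot while the public page is still crawlable.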

And given the way that Google handles duplicate content (it gives first 'dibs' to the web page where the Googlebot first found unique content and ignores any other page with the same or highly similar content), I don't really see the point in using this system.

To make sure your content gets indexed first, set up an XML sitemap in Google Webmaster Tools, and make sure that when you publish something new, Google knows about it seconds after it's gone live.
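If you haven't seen one, an XML sitemap is just a list of your page URLs in a standard format. Here's a minimal sketch of generating one; the URL, date and function name are placeholders, and submitting the file to Google is a separate step not shown here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from a list of (url, last_modified_date) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder page; swap in your own URLs and publish dates.
xml_text = build_sitemap(
    [("https://www.example.com/new-article.html", "2008-02-26")]
)
```

Regenerating the file whenever something new is published keeps the list of URLs current.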

In other words, even if someone else copied and published AFR content on their website or Blog, Google would simply ignore it.

Besides... it's easy to steal content. Ever heard of a 'screengrab'? (see above!).

2. This is a usability disaster. Not everyone online can see.

Beth (a colleague) also chipped in... What happens when someone with a screen reader comes through? They get this:

A s r l a n N w e l n B n i g r u f u m x d h m r e b f r C r s m s i h s r t g p e e t t o p o l i i g h t h b n 's s a b s n s w u d r w o 0 e c n o p o i s n i e e r . F o 7 e c n t d y. e t e e a n m j r c u s t o o t e o i o a d o e x s i g u i e s s o l e e b s l .

Mmmm.

Finally, one of our clever developers spent 10 minutes to create a script which renders the whole AFR digital rights 'thing' useless. Sorry, we won't be offering that for download!

Do you have any other thoughts?

