Google comes to town

Is it inevitable that everyone who writes about computers must eventually do a bit about Google? Well, why fight it? Here goes…

In June I was one of the privileged few hundred to pile into a lecture hall at Melbourne University to hear a talk called Google - Finding Needles in a 20 TB Haystack, 200 Million Times per Day. Yes, Google had come to town and they weren’t just sightseeing - rumour had it that they were recruiting, looking for the brightest minds Melbourne has to offer.

Unfortunately my PhD hadn’t arrived from dodgy-degrees-r-us.com, so I didn’t get an interview, but I was able to elbow my way into the public lecture. The speaker was Dr Craig Nevill-Manning, a Kiwi by origin and Director of Engineering at the New York office of Google. Here are the highlights as I saw them.

On the cheap

Google does not favour exotic, high-end server hardware. No, Google does things on the cheap. The 200 million hits per day on Google are handled by standard desktop hardware - the sort of thing that you or I might have, only multiplied thousands of times over. They keep their costs low by buying in the “sweet spot” where the price-performance ratio ensures a good bang for their bucks.

Cheaper hardware has its advantages but it also means lower reliability. Failures do occur and Google manages this (indeed they expect it) by having plenty of capacity and redundancy, so that the user experiences a reliable site. One presumes that there are dozens of people employed by Google to constantly replace parts or entire machines as they fail.

Doing it cheap has always been the Google way. According to Nevill-Manning the original setup at google.stanford.edu was a mixed bag of hardware left over from previous research work - complete with an external disk drive casing made of Lego (he even had photos to prove it). And later when they moved into dedicated hosting premises they took in their homemade server racks with motherboards mounted directly onto metal trays insulated with cork. Apparently the bemused management at their hosting facility was somewhat concerned about the fire risk!

And like all good IT start-ups, the early Google team served the obligatory time in a friend’s garage.

How they do it

Along with monitoring their vast farm of hardware, Nevill-Manning reckons that Google’s other main challenge is, not surprisingly, indexing the web. He outlined their approach to hypertext analysis, which includes taking into account:

  • web page text

  • relative font size of the text

  • the text of incoming links

  • and, of course, PageRank

The last one is the Google method of rating the “reputation” of every site on the web. PageRank is not influenced by traffic to a site, but by the number of links to a site and the reputation of the sites making those links. It all sounds a bit self-referential (i.e. we decide on the ranking of a site by looking at the ranking of other sites, which we ranked by looking at the ranking of other sites …) but you’ve got to admit it seems to work.
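For the curious, the self-referential part can be sketched in a few lines of Python. This is only a toy illustration of the idea as described in the talk, not Google’s actual code. The function name `pagerank`, the four-page `toy_web` and the damping value of 0.85 (the figure usually quoted from the published PageRank paper, not something mentioned in the lecture) are all my own assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to.

    Assumption: a simple repeated-averaging loop with a damping factor,
    which is the textbook formulation, not necessarily what Google runs.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start by giving every page the same reputation.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its reputation evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its reputation over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Run it and page c, which everyone links to, comes out on top. Page a follows close behind because its one incoming link is from the highly ranked c, while b and d, which nobody links to, share the bottom rung. Note that traffic never enters into it.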

And if you don’t believe that Google uses incoming link text to index pages, try to predict what will be at the top of a search for the term “click here”. I’ll leave it as an exercise for the reader to explain that result.
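As a hint for that exercise, here is a minimal sketch of what indexing by incoming link text might look like. It is a guess at the general idea only. The functions `build_anchor_index` and `search`, and the example.com addresses, are invented for illustration and say nothing about how Google actually stores its index.

```python
from collections import defaultdict


def normalise(phrase):
    """Lower-case a phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


def build_anchor_index(links):
    """Index target pages by the text of the links pointing at them.

    `links` is an iterable of (source_url, link_text, target_url) tuples.
    Returns {phrase: {target_url: number_of_links_using_that_phrase}}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _source, link_text, target in links:
        index[normalise(link_text)][target] += 1
    return index


def search(index, phrase):
    """Return the targets for a phrase, most-linked first (ties by URL)."""
    hits = index.get(normalise(phrase), {})
    return sorted(hits, key=lambda url: (-hits[url], url))


# Hypothetical crawl data: the target pages never need to contain
# the words "click here" themselves.
crawled_links = [
    ("http://example.org/a", "click here", "http://example.com/popular-download"),
    ("http://example.net/b", "Click  here", "http://example.com/popular-download"),
    ("http://example.org/c", "click here", "http://example.com/other"),
    ("http://example.org/d", "our homepage", "http://example.com/other"),
]
anchor_index = build_anchor_index(crawled_links)
results = search(anchor_index, "click here")
print(results)
```

The page at the top of `results` is simply the one that the most other pages point to with those words, whatever its own text says.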

Metadata is not used in Google indexing because they have found it to be unreliable and often deliberately misleading. (Metadata is extra data that provides information about other data, such as the description and keywords fields in an HTML document. See the metadata entry in the Free On-Line Dictionary Of Computing.) Also, metadata is usually invisible and Google ignores all “invisibles” when indexing. And the indexing algorithm is clever - it can tell if you’re trying something underhanded to get into the search engines, like hiding white text on a white background or making some text unreadably tiny. That sort of text is invisible, so Google ignores it.

But the cleverness also works in positive ways. Images, for example, are indexed using related information such as the surrounding text, text in the image tag ALT attribute, and the file name of the image to give clues to the image contents.
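A rough sketch of how those three clues could be pooled is below. The function `image_keywords`, its parameters and the example address are hypothetical and of my own making. The talk did not go into how Google weighs or combines the clues.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_keywords(src, alt="", surrounding_text=""):
    """Gather candidate keywords for an image from its related information.

    Combines the file name (without extension), the ALT text and the
    nearby page text, in that order, and drops repeated words.
    """
    # "/photos/lego-disk-drive.jpg" -> "lego-disk-drive"
    file_stem = PurePosixPath(urlparse(src).path).stem
    combined = " ".join([file_stem, alt, surrounding_text]).lower()

    keywords = []
    for word in re.findall(r"[a-z0-9]+", combined):
        if word not in keywords:
            keywords.append(word)
    return keywords


# Hypothetical image tag and caption.
clues = image_keywords(
    "http://example.com/photos/lego-disk-drive.jpg",
    alt="Disk casing built from Lego",
    surrounding_text="An early Google server",
)
print(clues)
```

None of these words comes from the picture itself, which is the point: the image is described entirely by the company it keeps.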

So search results depend on a combination of the relevance of search terms used and the reputation of web sites (represented by PageRank).

Advertising

Nevill-Manning rejected any suggestion that top ranking in Google can be bought - either by those who advertise with them or by others. This is a very difficult claim for an outside observer to prove or disprove, but a little thought suggests that it is highly unlikely, because the core value of Google is the ability to produce highly relevant results. Any weakening of the results also weakens the credibility of Google. And as Nevill-Manning pointed out, their “Sponsored Links”, though prominently featured, are clearly labelled and physically separated from the “real” results.

Google AdSense is the service that provides small advertisements to web sites, each targeted to the content of the current page. This provides some curly problems for Google - for example, in a page that mentions “Java”, is the topic Indonesia, coffee or programming?

Working for Mr Google

Apart from getting their name on one of the coolest business cards in the world, staff of Google have a very attractive perk: “20% time”. This allows Google folks to spend 20% of their time sitting around dreaming up stuff and trying things that they wouldn’t otherwise have the time to do. There are no strings attached - it’s free-form innovation time.

Some of the resulting work is released into the wild at Google labs, and a few end up as part of the main Google system. Products of 20% time include:

  • Froogle (“smart shopping through Google”),

  • local.google (search by geographic location - not available for Australia at the time of writing),

  • define (get word meanings simply by typing “define [your word here]” in the Google search box), and

  • spell checking (how many times has Google asked you “Did you mean: …”?). To illustrate the need for spell checking, Nevill-Manning showed a depressingly long list of the creative ways people have misspelled “Britney Spears”…

But then, some other ideas are just interesting solutions in search of a problem, like Google sets.

Google zeitgeist

Nevill-Manning closed with a story that illustrates Google’s place in the mainstream. (This anecdote has been well reported, for example see Ga-ga over Google.)

A couple of years ago on the US game show “Who Wants to Be a Millionaire?” a contestant reached the top prize question, which was:

In “The Brady Bunch” what was Carol Brady’s maiden name?

In accordance with the rules of the game, the contestant was allowed to phone a friend - a friend who was waiting with the Google search page open. In the time available the friend tried to search for “carol brady maiden name” but didn’t quite have enough time to return the correct answer.

However the contestant’s friend was not the only searcher that night - Google statistics showed a sharp peak of thousands of searches on “carol brady maiden name” - firstly at the time the show was broadcast on the east coast and then several times during the evening as the syndicated show went to air in different time zones across the country.

I’m really not sure what to make of that.

First published: PC Update August 2004