Google is known for the accuracy of its search engine and the speed with which it answers queries. There is no magic involved: the trick is performed by huge numbers of machines spread across data centers around the world. It is estimated that Google runs over a million servers worldwide, processes a billion search requests, and generates twenty petabytes of data per day.
Google is reticent about revealing exactly how its search works, because search is central to the company's strategy. So what follows is what we think we know about Google's infrastructure and the basic idea behind how Google distributes its traffic: by pooling IP addresses and performing several layers of load balancing.
Google maintains a pool of hundreds of IP addresses, all of which eventually resolve to its Mountain View, California, headquarters. When you initiate a Google search, your query is sent to a DNS server, which in turn queries Google's DNS servers. Google's DNS servers examine the pool of addresses, determine which are geographically closest to the query's origin, and use a round-robin policy to assign an IP address to the request. The request usually goes to the nearest data center, and the assigned address belongs to a cluster of Google servers. This DNS assignment is the first level of IP virtualization: a pool of network addresses load balanced by geography.
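To make this first step concrete, here is a minimal Python sketch of geography-aware, round-robin DNS assignment. The regions, pools, and addresses are invented for illustration (the addresses come from documentation ranges); real global load balancing also weighs server health and load, and Google's actual policy is not public.

```python
import itertools

# Hypothetical data: each entry maps a region to the pool of cluster
# IP addresses serving that region. Names and addresses are invented.
ADDRESS_POOLS = {
    "us-west": ["203.0.113.10", "203.0.113.11", "203.0.113.12"],
    "europe":  ["198.51.100.20", "198.51.100.21"],
}

# One independent round-robin cursor per regional pool.
_cursors = {region: itertools.cycle(pool)
            for region, pool in ADDRESS_POOLS.items()}

def resolve(client_region: str) -> str:
    """Pick the next address from the pool closest to the client.

    Falls back to "us-west" when the client's region is unknown.
    """
    region = client_region if client_region in _cursors else "us-west"
    return next(_cursors[region])

if __name__ == "__main__":
    # Three consecutive lookups from Europe rotate through that pool.
    for _ in range(3):
        print(resolve("europe"))
```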
A Google cluster can contain thousands of servers: racks of commodity (low-cost) 1U or 2U machines, 40 to 80 per rack, with one switch per rack. Each rack switch is connected to a core gigabit switch. The servers run a customized version of Linux and host applications of several types. Most Google machines are assembled PCs.
When the query arrives at its destination cluster, it is sent to a load balancer, which forwards the request to a Squid proxy server (a software load balancer and Web cache daemon). This is the second level of IP distribution, based on the current load of the proxy servers in the cluster. The Squid server checks its cache; if it finds a match for the query, the cached result is returned and the query is satisfied. If there is no match in the Squid cache, the query is sent to an individual Google Web Server chosen by current Web server utilization, which is the third level of network load balancing.
It is the Google Web Servers that run the query against the Google index and format the results into an HTML page that is returned to the requester. This procedure involves two more levels of load balancing, again based on utilization rates.
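The cache-then-forward logic of the second and third levels can be sketched as follows. The cache, server names, and utilization figures are all hypothetical stand-ins: Squid is really a separate process, not an in-process dictionary, and the HTML rendering is a placeholder.

```python
from typing import Optional

# In-process stand-ins for the Squid cache and the pool of
# Google Web Servers (GWS); all names and numbers are illustrative.
cache: dict[str, str] = {}

# Current utilization (0.0-1.0) of each web server, invented values.
web_servers = {"gws-1": 0.42, "gws-2": 0.17, "gws-3": 0.88}

def render_results(query: str, server: str) -> str:
    # Placeholder for the real work: run the query against the index
    # and format the results as an HTML page.
    return f"<html><!-- results for {query!r} via {server} --></html>"

def handle_query(query: str) -> str:
    # Second level of balancing: the proxy/cache tier.
    cached: Optional[str] = cache.get(query)
    if cached is not None:
        return cached  # cache hit: the query is satisfied here

    # Third level: pick the least-utilized web server.
    server = min(web_servers, key=web_servers.get)
    page = render_results(query, server)
    cache[query] = page  # populate the cache for future requests
    return page

if __name__ == "__main__":
    print(handle_query("cloud computing"))  # miss, goes to gws-2
    print(handle_query("cloud computing"))  # hit, served from cache
```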
Google's secret ingredients are its in-memory inverted index and its PageRank algorithm. GoogleBot (a spider, or robot) crawls the Web and collects document information. Some details of the search-and-store algorithm are known: Google looks at the title and the first few hundred words of each page and builds a word index from the result. Indexes are stored on index servers.
Some documents are stored as snapshots (PDF, DOC, XLS, and so on), but much of their information is not captured in the index. Each document is given a unique ID (a "docid"), and the content of the document is split into segments called shards, compressed, and stored on a document server. The entire index is kept in system memory, partitioned across the instances of the index's replicas. A PageRank is computed for each page based on the significant links pointing to it.
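A toy version of this store-and-index step is below, under loose assumptions: the shard size, the zlib compression, and the 200-word indexing cutoff are guesses standing in for details Google has not published.

```python
import zlib
from collections import defaultdict
from itertools import count

SHARD_SIZE = 1024        # bytes of compressed content per shard (assumed)
INDEX_WORD_LIMIT = 200   # "first few hundred words" (assumed)

# Stand-ins for the index servers and document servers.
doc_store: dict[tuple[int, int], bytes] = {}  # (docid, shard_no) -> bytes
word_index: dict[str, set[int]] = defaultdict(set)
_next_docid = count(1)

def store_document(title: str, body: str) -> int:
    """Assign a docid, index the leading words, shard and store the body."""
    docid = next(_next_docid)

    # Index the title plus the first few hundred words of the body.
    for word in (title + " " + body).lower().split()[:INDEX_WORD_LIMIT]:
        word_index[word].add(docid)

    # Compress the full content and cut it into fixed-size shards.
    compressed = zlib.compress(body.encode("utf-8"))
    for shard_no, start in enumerate(range(0, len(compressed), SHARD_SIZE)):
        doc_store[(docid, shard_no)] = compressed[start:start + SHARD_SIZE]
    return docid

if __name__ == "__main__":
    docid = store_document("Example page", "cloud computing " * 500)
    shards = len([k for k in doc_store if k[0] == docid])
    print(f"stored doc {docid} in {shards} shard(s)")
```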
Queries are divided into word lists, and the Google algorithm examines the words and their relationships to one another. Those words are mapped against the main index to produce a list of matching documents, using a structure called an inverted index. In an inverted index, words map to the documents that contain them, a lookup that is very fast when the index is kept entirely in memory.
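In code, an inverted-index lookup is essentially an intersection of posting lists. The index contents below are invented for illustration; a real engine would also rank the resulting docids, for example by PageRank.

```python
# A toy inverted index: each word maps to the set of docids that
# contain it. The words and docids are invented.
inverted_index = {
    "cloud":     {1, 2, 5, 9},
    "computing": {2, 5, 7},
    "bible":     {5, 9},
}

def lookup(query: str) -> set[int]:
    """Split the query into words and intersect their posting lists."""
    postings = [inverted_index.get(w, set()) for w in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

if __name__ == "__main__":
    print(lookup("cloud computing bible"))  # only doc 5 contains all three
```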
Google doesn't use hardware virtualization; it relies on server load balancing to distribute the processing load and to achieve high utilization rates. Workload management software transfers the work of a failed server to a redundant one, and the failed server is taken offline. Multiple instances of the various Google applications run on different hosts, and data is stored on redundant storage systems.
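A failover of this kind might look like the following sketch. The server names, spare pool, and simulated health check are all assumptions; a production workload manager would probe machines over the network and rebalance continuously.

```python
# Hypothetical primaries, spares, and work assignments.
servers = {"srv-1": "up", "srv-2": "up", "srv-3": "up"}
spares = ["spare-1", "spare-2"]
assignments = {"shard-a": "srv-1", "shard-b": "srv-2", "shard-c": "srv-3"}

def health_check(server: str) -> bool:
    # Simulated probe: a real check would issue an RPC or HTTP ping.
    return servers.get(server) == "up"

def rebalance() -> None:
    """Move work off failed servers and take them out of rotation."""
    for shard, server in list(assignments.items()):
        if not health_check(server) and spares:
            replacement = spares.pop(0)
            servers[replacement] = "up"
            assignments[shard] = replacement
            servers.pop(server, None)  # failed server goes offline
            print(f"{shard}: {server} -> {replacement}")

if __name__ == "__main__":
    servers["srv-2"] = "down"  # simulate a failure
    rebalance()                # shard-b moves to spare-1
```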
Basically, Google balances load on multiple criteria, and its in-memory inverted index and PageRank algorithm account for its accuracy and speed. Google keeps adding more machines every day as search grows more prolific. As for ads, they are served from AdSense servers located throughout the world. Google uses its own CDNs (Content Delivery Networks, the category of service offered by companies such as Akamai and BlueCoat) to reduce latency.
With Google Instant, Google has really proven how fast it is. As you may be aware, the front end issues AJAX requests, and the magic is performed by intelligence located in a node close to the user, a cluster, to be more precise. Google is perhaps the best example of an application running on a cloud.
(Some of the notes above are taken from the Cloud Computing Bible.)