Yahoo has found that outsourcing data storage and management has become expensive and can no longer provide the capacity it needs, and is consequently developing a multi-petabyte SQL database. Like Google's BigTable, the system will store data in distributed columns rather than in the typical table layout, which is organised by rows and columns. The difference from Google's BigTable is that Yahoo's database is designed for a SQL interface.
Google's BigTable method of using distributed columns employs a plurality of storage servers, which may include a database engine that partitions database tables into column chunks.
“A distributed column chunk data store may be provided by multiple storage servers operably coupled to a network. A storage server may include a database engine for partitioning a data table into the column chunks for distributing across multiple storage servers”
“Any data table may be flexibly partitioned into column chunks using one or more columns as a key with various partitioning methods.”
BigTable Process diagram taken from Google Patent application
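To make the idea of partitioning a table into column chunks a little more concrete, here is a minimal Python sketch. It is not taken from the patent application: the function name `partition_into_column_chunks`, the sample table, and the choice of hash partitioning on a single key column are all assumptions made for illustration, and the "servers" are just in-memory dictionaries.

```python
import zlib
from collections import defaultdict


def partition_into_column_chunks(rows, key_column, num_servers):
    """Split row-oriented records into per-server column chunks.

    Hypothetical helper: each "server" is a dict mapping a column name
    to the list of values (the column chunk) that landed on it.
    """
    servers = [defaultdict(list) for _ in range(num_servers)]
    for row in rows:
        # Hash partitioning on the key column (one of many possible methods).
        key_bytes = str(row[key_column]).encode("utf-8")
        server_id = zlib.crc32(key_bytes) % num_servers
        # Store the row column by column rather than as one record.
        for column, value in row.items():
            servers[server_id][column].append(value)
    return servers


# Made-up sample table, for illustration only.
rows = [
    {"user_id": 1, "country": "US", "page_views": 12},
    {"user_id": 2, "country": "UK", "page_views": 7},
    {"user_id": 3, "country": "US", "page_views": 30},
    {"user_id": 4, "country": "DE", "page_views": 4},
]

servers = partition_into_column_chunks(rows, key_column="user_id", num_servers=3)
for server_id, chunks in enumerate(servers):
    print(server_id, dict(chunks))
```

Each server ends up holding one list per column, so a later query can fetch a single column chunk without touching the others.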
Yahoo chose the distributed columns system because a query only has to read the columns that are relevant to it, which massively reduces the work involved in a given query. Another major advantage is that writing software to write to and query the database through SQL will be far cheaper than programming in C++ or Java, which BigTable requires.
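The saving from reading only the relevant columns can be sketched with a toy comparison. This is an illustration under simplifying assumptions, not a model of Yahoo's or Google's system: the table is made up, both layouts live in memory, and "values read" simply counts how many stored values each approach has to touch to total one column.

```python
# Made-up table, stored two ways, for illustration only.
rows = [
    {"user_id": 1, "country": "US", "page_views": 12, "referrer": "search"},
    {"user_id": 2, "country": "UK", "page_views": 7, "referrer": "email"},
    {"user_id": 3, "country": "US", "page_views": 30, "referrer": "search"},
]

# Column layout: one list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_from_rows(rows, column):
    """Row layout: every record is read in full to reach one field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_from_columns(columns, column):
    """Column layout: only the requested column is read."""
    chunk = columns[column]
    return sum(chunk), len(chunk)


row_total, row_reads = sum_from_rows(rows, "page_views")
col_total, col_reads = sum_from_columns(columns, "page_views")
print(row_total, row_reads)  # 49 12
print(col_total, col_reads)  # 49 3
```

Both approaches give the same answer, but the column layout touches a quarter of the values here, and the gap widens as tables gain more columns.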
I wonder if this database technology will help their indexing system. A Google patent for anchor text processing, covered by Bill Slawski at SEO by the Sea, suggests that indexed web pages are associated with the anchor text of external inbound links via a database-powered cataloguing system. This system is used in ranking the indexed pages for single-query search results. Yahoo may use their database applications similarly. A highly souped-up database querying method could enhance their indexing and ranking ability, or considerably speed up other functions such as the concept dictionary.
A petabyte is a very large amount of data, equal to roughly a thousand terabytes, or one million gigabytes. This capacious storage reserve will do nicely for the immediate future, allowing Yahoo flexibility when developing new products and applications that require such space and efficiency. It will also provide Yahoo employees and their families and friends with a place to store an endless reserve of MP3s, JPEGs and MPEGs 🙂
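For anyone wondering why that is "roughly" a thousand terabytes, the short calculation below compares the decimal units with the binary ones that operating systems often report. The constant names are just labels chosen for this sketch.

```python
# Decimal (SI) units: powers of 1000.
GB = 10**9
TB = 10**12
PB = 10**15

# Binary (IEC) units: powers of 1024.
GIB = 2**30
TIB = 2**40
PIB = 2**50

print(PB // TB)    # 1000 terabytes in a petabyte
print(PB // GB)    # 1000000 gigabytes in a petabyte
print(PIB // TIB)  # 1024 tebibytes in a pebibyte
print(PIB // GIB)  # 1048576 gibibytes in a pebibyte
```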