I attended last week a conference organized by UC Berkeley EECS (Electrical Engineering and Computer Science) School whose title was “Erasure Coding for Cloud Storage“. The speaker was Parikshit Gopalan who is a researcher in the Microsoft Azure team.
Under that apparently complicated title, there is the solution to a very interesting problem of storage efficiency and reliability.
These days nearly everyone stores things in the cloud and many users go for free storage on the various internet content providers (Goolge, Microsft, Dropbox, etc).
To estimate what it represents in terms of storage, let’s take a look at some raw figures:
One simple photo per day : 256kb; 1 year of photos: 100 Mb; 1 billion users over 1 year: 100Tb… call it a data deluge !!!
If 1Tb of data costs 100$, the estimated 100Tb of data, would cost 100 M$…. spent in reliable storage , partly “for free” for the users…
The conclusion is that storing such massive amount of data comes with costs, not only disk storage but also servers, power, airco and the associated professional services.
The challenge is then: “To manage data to help keep it safe and secure while minimizing the amount of storage space it requires”.
And this exactly what Parikshit and his colleagues at Microsoft Research and the Windows Azure Storage group have been doing for some time now , saving Windows Azure Storage millions of dollars and the work they have carried out have been recognized by the research community since they have been awarded several prizes for their research.
The easiest way to keep data integrity is duplication. Generally speaking three full copies of the data are enough to keep them safe, but in this case the “store cost overhead” is exactly three what is prohibitively costly.
One method to keep data accessible and durable , while using less space, is to “code” the data, i.e. create a shortened description of the data that can be reassembled and delivered safely to the user on request.
Parikshit and his colleagues have used “erasure coding” as the basis for their algorithms. Erasure Coding (EC) is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces and stored across a set of different locations or storage media.
A very nice description for “dummies” of Erasure Coding can be found at http://smahesh.com/blog/2012/07/01/dummies-guide-to-erasure-coding/. Erasure Code is also used in several distributed file systems like the now very popular Hadoop Distributed File System (HDFS)
A well understood way to perform the Erasure Coding (EC) operation is called Reed-Solomon coding (devised in 1960 !!!!) and used by the U.S. space program to reduce and correct problems in data communications. It also helped made compact discs possible by catching and correcting problems in the disc’s digital coding.
And why this coding method is important for the subject of this post ? For example, by using 6+3 Reed Solomon code, which converts three copies of data to nine fragments (six for data and three for parity), each 1/6 of the size of the original data, it cuts data footprint in half, to an “storage overhead cost” of 1.5 instead of the 3 with the full three copies of the data mentioned above.
But there is no such think like a “free meal”… coding data has a cost; it slows performance of servers due to the need to reassemble data from code (it would also happen to a person who is reading a text where every other letter is missing !!!). Data retrieval also can be slowed if a data fragment is stored on a hard disk that has failed or is on a server that is temporarily offline (for un upgrade or patching , for instance).
The goal of Parkshit and the rest of the team on their approach was to reduce the time and cost of performing data retrieval , especially during hardware failures (… yes High Volume Standard , hardware will fail !!!) or data-centre maintenance operations such software upgrades. In addition to reducing data-retrieval time , the goal of the new approach was to perform “lazy erasure coding” that enables greater data compression and thus reducing “storage overhead cost” to 1.33 … or lower !!! and all that with minimal performance losses.
“Storage overhead cost” of 1.33 can be achieved with a 12+4 Reed-Solomon codes. But there is a n undesirable effect with this approach: if a piece of data fails, all the 12 fragments will have to be read to reconstruct the data…. Meaning 12 disk I/O operations and 12 network transfers.. and that is expensive, doubling the dis I/O actions and network transfers of the 6+3 mentioned above.
As stated above, Reed-Solomon coding is designed for deep space communications with the objective of tolerating as many errors as possible given a certain overhead. But the error pattern in data centres behaves differently.
In fact, well managed data centres (and I bet that Microsoft knows how to do it ;-)…….) have low hard-failure rate; it means that most of the data chunks (called extents) are healthy with no failed data fragments. The extent with two or more failed data or parity fragments are rare and appear for a short duration and the repair rate in data centres is high and quick.
The approach used by Parikshit and the team is based on the rich mathematical theory of “locally decodable codes and probabilistic checkable proofs“. They have called it Local Reconstruction Codes (LRCs) and enables data to be reconstructed more quickly than with Reed-Salomon codes, because fewer data fragments (six instead of twelve) need to be read to recreate the data in most of the cases and the mathematical algorithms are also simpler leading to less complexity in the operations needed to combine data pieces and therefor the response time is quicker.
The “local” in the coding technique’s name refers to the concept that, in the event of a fragment being offline, the code needed to reconstruct the data is not spread across the entire span of data centre’s servers and therefore the fragments are retrieved faster.
In the implementation, they propose “lazy erasure coding”; but … why “lazy”?. In fact the term “lazy” comes from the way coding works in the background, not in the critical writing path. When a “data chunck”, called an “extent”, is opened and filled, it is duplicated with three full copies for the reasons mentioned earlier. When it is sealed, erasure coding is launched in the background when the data centre load level is low. The extent is split into equal-sized data segments , coded to generate the parity fragments, with each of data fragments and parity fragments being stored in different places. Once the data is erasure-coded and all the data parity fragments are distributed, all the three original three copies of the data can be deleted.
With the proposed coding approach the team has met two main criteria of the data storage. First, reliability that guarantees durability and be readily available. A data chinck can have three failures and still be rebuilt with 100% accuracy. In the unlikely event of four failures, the success rate drops to 86%. Second , the new coding results in a data overhead of 1.29. When one thinks about the amount of storage existing and provisioned regularly in a Cloud service a small decrease in overhead may mean millions in savings.
The proposed coding method was deployed in Azure some tile ago and is also present in Windows 8.1 and Windows Server 2012.
A very interesting presentation and a discovery, at least for me, of the storage design and management approach needed in the Cloud.
Stay tuned for more !!