Storage Architecture and Challenges
Faculty Summit, July 29, 2010
Andrew Fikes, Principal Engineer
Introductory Thoughts
Google operates planet-scale storage systems
What keeps us programming:
Enabling application developers
Improving data locality and availability
Improving performance of shared storage
A note from the trenches: "You know you have a large storage
system when you get paged at 1 AM because you only have a
few petabytes of storage left."
The Plan for Today
Storage Landscape
Storage Software and Challenges
Questions (15 minutes)
Storage Landscape: Hardware
A typical warehouse-scale computer:
10,000+ machines, 1GB/s networking
6 x 1TB disk drives per machine
What has changed:
Cost of GB of storage is lower
Impact of machine failures is higher
Machine throughput is higher
What has not changed:
Latency of an RPC
Disk drive throughput and seek latency
Storage Landscape: Development
Product success depends on:
Development speed
End-user latency
Application programmers:
Never ask simple questions of the data
Change their data access patterns frequently
Build and use APIs that hide storage requests
Expect uniformity of performance
Need strong availability and consistent operations
Need visibility into distributed storage requests
Storage Landscape: Applications
Early Google:
US-centric traffic
Batch, latency-insensitive indexing processes
Document "snippets" serving (single seek)
Current day:
World-wide traffic
Continuous crawl and indexing processes (Caffeine)
Seek-heavy, latency-sensitive apps (Gmail)
Person-to-person, person-to-group sharing (Docs)
Storage Landscape: Flash (SSDs)
Important future direction:
Our workloads are increasingly seek heavy
50-150x less expensive than disk per random read
Best usages are still being explored
Concerns:
Availability of devices
17-32x more expensive per GB than disk
Endurance not yet proven in the field
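To make the per-GB vs. per-seek trade-off concrete, here is a back-of-the-envelope sketch. Only the ratios (17-32x per GB, 50-150x per random read) come from this slide; the cost units and workload numbers below are invented for illustration.

# Toy flash-vs-disk comparison using the ratios above. All costs are in
# arbitrary units; only the ratios come from the slide.

DISK_PER_GB = 1.0
FLASH_PER_GB = 25 * DISK_PER_GB                         # ~17-32x more expensive per GB

DISK_PER_READ_PER_SEC = 100.0                           # cost to provision one random read/s on disk
FLASH_PER_READ_PER_SEC = DISK_PER_READ_PER_SEC / 100    # ~50-150x cheaper per random read

def cheaper_medium(gb, reads_per_sec):
    disk = gb * DISK_PER_GB + reads_per_sec * DISK_PER_READ_PER_SEC
    flash = gb * FLASH_PER_GB + reads_per_sec * FLASH_PER_READ_PER_SEC
    return "flash" if flash < disk else "disk"

print(cheaper_medium(gb=10_000, reads_per_sec=10))      # large and cold -> disk
print(cheaper_medium(gb=100, reads_per_sec=2_000))      # small and seek-heavy -> flash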
Storage Landscape: Shared Data
Scenario:
Roger shares a blog with his 100,000 followers
Rafa follows Roger and all other ATP players
Rafa searches all the blogs he can read
To make search fast, do we copy data to each user?
YES: Huge fan-out on update of a document
NO: Huge fan-in when searching documents
To make things more complicated:
Freshness requirements
Heavily-versioned documents (e.g. Google Wave)
Privacy restrictions on data placement
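A toy sketch of the two designs above: copying documents into a per-reader index (fan-out on write) versus merging per-author indexes at query time (fan-in on read). The structures and names are illustrative, not how our systems are actually built.

from collections import defaultdict

followers = defaultdict(set)            # author -> set of readers
per_reader_index = defaultdict(list)    # fan-out: reader -> docs copied to them
per_author_index = defaultdict(list)    # fan-in: author -> docs they published

def publish_fanout(author, doc):
    # Write amplification: one post is copied to every follower's index.
    for reader in followers[author]:
        per_reader_index[reader].append(doc)

def publish_fanin(author, doc):
    # Cheap write: one index to append to.
    per_author_index[author].append(doc)

def search_fanout(reader, term):
    # Cheap read: scan only the reader's own index.
    return [d for d in per_reader_index[reader] if term in d]

def search_fanin(followed_authors, term):
    # Read amplification: merge across every followed author's index.
    return [d for a in followed_authors for d in per_author_index[a] if term in d]

followers["roger"] = {f"fan{i}" for i in range(100_000)}
publish_fanout("roger", "clay season recap")   # 100,000 copies on write
publish_fanin("roger", "clay season recap")    # 1 copy; pay at query time instead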
Storage Landscape: Legal
Laws and interpretations are constantly changing
Governments have data privacy requirements
Companies have email and doc. retention policies
Sarbanes-Oxley (SOX) adds audit requirements
Things to think about:
Major impact on storage design and performance
Are these storage- or application-level features?
Versioning of collaborative documents
Storage Software: Google's Stack
Tiered software stack
Node
Exports and verifies disks
Cluster
Ensures availability within a cluster
File system (GFS/Colossus), structured storage (Bigtable)
2-10%: disk drive annualized failure rate
Planet
Ensures availability across clusters
Blob storage, structured storage (Spanner)
~1 cluster event / quarter (planned/unplanned)
Storage Software: Node Storage
Purpose: Export disks on the network
Building-block for higher-level storage
Single spot for tuning disk access performance
Management of node addition, repair and removal
Provides user resource accounting (e.g. I/O ops)
Enforces resource sharing across users
Storage Software: GFS
The basics:
Our first cluster-level file system (2001)
Designed for batch applications with large files
Single master for metadata and chunk management
Chunks are typically replicated 3x for reliability
GFS lessons:
Scaled to approximately 50M files, 10 PB
Large files increased upstream app. complexity
Not appropriate for latency sensitive applications
Scaling limits added management overhead
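A simplified sketch of the GFS read path described above: the single master maps (file, chunk index) to a chunk handle and replica locations, and the client then reads chunk data directly from a chunkserver. Leases, caching, RPC plumbing, and reads that span chunks are omitted.

CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses 64 MB chunks

class Master:
    """Single master: keeps only metadata, never the data itself."""
    def __init__(self):
        self.files = {}   # file name -> list of (chunk_handle, [replica ChunkServers])

    def lookup(self, filename, chunk_index):
        return self.files[filename][chunk_index]

class ChunkServer:
    def __init__(self):
        self.chunks = {}  # chunk_handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, filename, offset, length):
    # Assumes the read fits inside a single chunk.
    handle, replicas = master.lookup(filename, offset // CHUNK_SIZE)
    # Pick a replica (in practice the closest) and read directly from it,
    # so bulk data never flows through the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)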
Storage Software: Colossus
Next-generation cluster-level file system
Automatically sharded metadata layer
Data typically written using Reed-Solomon (1.5x)
Client-driven replication and encoding
Metadata space has enabled availability analyses
Why Reed-Solomon?
Cost. Especially w/ cross cluster replication.
Field data and simulations show improved MTTF
More flexible cost vs. availability choices
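A worked example of the overhead comparison: a Reed-Solomon layout with 6 data and 3 code blocks (an illustrative choice consistent with the 1.5x figure above; the slide does not give the actual parameters) versus plain 3-way replication.

def rs_overhead(data_blocks, code_blocks):
    return (data_blocks + code_blocks) / data_blocks

print(rs_overhead(6, 3))   # 1.5x raw bytes, survives loss of any 3 of the 9 blocks
print(rs_overhead(1, 2))   # 3.0x -- plain 3-way replication, survives 2 losses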
Storage Software: Availability
Tidbits from our Storage Analytics team:
Most events are transient and short (90% < 10min)
Pays to wait before initiating recovery operations
Fault bursts are important:
10% of faults are part of a correlated burst
Most small bursts have no rack correlation
Most large bursts are highly rack-correlated
Correlated failures impact benefit of replication:
Uncorrelated R=2 to R=3 => MTTF grows by 3500x
Correlated R=2 to R=3 => MTTF grows by 11x
Source: Google Storage Analytics team (D. Ford, F. Popovici, M. Stokely, V.-A. Truong, F. Labelle, L. Barroso, S. Quinlan, C. Grimes)
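A toy model of why correlation matters: replicas fail independently in quiet periods, but occasionally a burst takes out many machines together. The probabilities below are invented; this shows the intuition, not the Storage Analytics methodology.

def p_all_replicas_lost(r, p_indep=1e-3, p_burst=1e-2, p_fail_in_burst=0.3):
    normal = (1 - p_burst) * p_indep ** r    # quiet periods: independent failures
    burst = p_burst * p_fail_in_burst ** r   # a burst hits many replicas at once
    return normal + burst

for r in (2, 3):
    print(r, p_all_replicas_lost(r, p_burst=0.0), p_all_replicas_lost(r))
# Without bursts, going R=2 -> R=3 cuts the loss probability ~1000x;
# with bursts, the same step buys only ~3x -- the same shape as the
# 3500x vs. 11x MTTF numbers above.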
Storage Software: Bigtable
The basics:
Cluster-level structured storage (2003)
Exports a distributed, sparse, sorted-map
Splits and rebalances data based on size and load
Asynchronous, eventually-consistent replication
Uses GFS or Colossus for file storage
The lessons:
Hard to share distributed storage resources
Distributed transactions are badly needed
Application programmers want sync. replication
Users want structured query language (e.g. SQL)
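A sketch of the data model described above: a sparse, sorted map keyed by (row, column, timestamp), split into row ranges by key. This mimics the model only; it is not the actual Bigtable client API, and load-based splitting, tablet servers, and the GFS/Colossus file layer are all elided.

import bisect

class Tablet:
    """Holds one contiguous, sorted range of rows."""
    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value), newest first

    def write(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def read(self, row, column):
        versions = self.cells.get((row, column))
        return versions[0] if versions else None   # latest version

class Table:
    """Routes each row key to the tablet whose row range contains it."""
    def __init__(self, split_keys):
        self.splits = sorted(split_keys)
        self.tablets = [Tablet() for _ in range(len(self.splits) + 1)]

    def _tablet_for(self, row):
        return self.tablets[bisect.bisect_right(self.splits, row)]

    def write(self, row, column, timestamp, value):
        self._tablet_for(row).write(row, column, timestamp, value)

    def read(self, row, column):
        return self._tablet_for(row).read(row, column)

t = Table(split_keys=["g", "p"])
t.write("com.example/index.html", "contents:", 1, "<html>...</html>")
print(t.read("com.example/index.html", "contents:"))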
Storage Challenge: Sharing
Simple Goal: Share storage to reduce costs
Typical scenario:
Pete runs video encoding using CPU & local disk
Roger runs a MapReduce that does heavy GFS reads
Rafa runs seek-heavy Gmail on Bigtable w/ GFS
Andre runs seek-heavy Docs on Bigtable w/ GFS
Things that go wrong:
Distribution of disks being accessed is not uniform
Non-storage system usage impacts CPU and disk
MapReduce impacts disks and buffer cache
Gmail and Buzz both need hundreds of seeks NOW
Storage Challenge: Sharing (cont.)
How do we:
Measure and enforce usage? Locally or globally?
Reconcile isolation needs across users and systems?
Define, implement and measure SLAs?
Tune workload-dependent parameters (e.g. initial chunk creation)?
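One simple local answer to the measurement-and-enforcement question is a per-user token bucket over disk I/O operations on each storage node. The sketch below is a generic version of that idea, not the mechanism we actually use.

import time

class IoTokenBucket:
    def __init__(self, ops_per_sec, burst):
        self.rate = ops_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, ops=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= ops:
            self.tokens -= ops
            return True
        return False          # caller queues or rejects the request

# e.g. the latency-sensitive user gets a higher rate than the batch job:
buckets = {"gmail": IoTokenBucket(500, burst=100),
           "mapreduce": IoTokenBucket(50, burst=10)}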
Storage Software: BlobStore
The basics:
Planet-scale large, immutable blob storage
Examples: Photos, videos, and email attachments
Built on top of Bigtable storage system
Manual, access- and auction-based data placement
Reduces costs by:
De-duplicating data chunks
Adjusting replication for cold data
Migrating data to cheaper storage
Fun statistics:
Duplication percentages: 55% - Gmail, 2% - Video
90% of Gmail attach. reads hit data < 21 days old
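A sketch of chunk-level de-duplication as described above: blobs are split into chunks, each chunk is stored once under a content hash, and a blob is just a list of chunk references. The fixed 4 MB chunking and SHA-256 hashing here are illustrative choices, not BlobStore's actual design.

import hashlib

CHUNK = 4 * 1024 * 1024          # hypothetical fixed-size chunking
chunk_store = {}                 # content digest -> chunk bytes
blob_index = {}                  # blob id -> list of digests

def put_blob(blob_id, data):
    digests = []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        d = hashlib.sha256(piece).hexdigest()
        chunk_store.setdefault(d, piece)   # an identical attachment is stored once
        digests.append(d)
    blob_index[blob_id] = digests

def get_blob(blob_id):
    return b"".join(chunk_store[d] for d in blob_index[blob_id])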
Storage Software: Spanner
The basics:
Planet-scale structured storage
Next generation of Bigtable stack
Provides a single, location-agnostic namespace
Manual and access-based data placement
Improved primitives:
Distributed cross-group transactions
Synchronous replication groups (Paxos)
Automatic failover of client requests
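The slide pairs cross-group transactions with synchronously replicated (Paxos) groups. One common way to get that shape is two-phase commit layered across groups; the sketch below shows only that idea and is not Spanner's actual protocol.

class ReplicationGroup:
    """Stand-in for a Paxos group: pretend every mutation is synchronously
    replicated before it is acknowledged."""
    def __init__(self):
        self.data = {}
        self.locks = set()

    def prepare(self, keys):
        if any(k in self.locks for k in keys):
            return False                     # conflict: vote to abort
        self.locks.update(keys)
        return True

    def commit(self, writes):
        self.data.update(writes)             # replicated write (replication elided)
        self.locks.difference_update(writes)

    def abort(self, keys):
        self.locks.difference_update(keys)

def cross_group_commit(writes_by_group):
    """writes_by_group: {ReplicationGroup: {key: value}}"""
    # Phase 1: every participant group must vote yes.
    if all(g.prepare(w.keys()) for g, w in writes_by_group.items()):
        for g, w in writes_by_group.items():   # Phase 2: commit everywhere
            g.commit(w)
        return True
    for g, w in writes_by_group.items():       # otherwise release locks everywhere
        g.abort(w.keys())
    return False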
Storage Software: Data Placement
End-user latency really matters
Application complexity is less if close to its data
Countries have legal restrictions on locating data
Things to think about:
How do we migrate code with data?
How do we forecast, plan and optimize data moves?
Your computer is always closer than the cloud.
Storage Software: Offline Access
People want offline copies of their data
Improves speed, availability and redundancy
Scenario:
Roger is keeping a spreadsheet with Rafa
Roger syncs a copy to his laptop and edits it
Roger wants to see data on laptop from phone
Things to think about:
Conflict resolution increases application complexity
Offline code is often very application-specific
Do users really need peer-to-peer synchronization?
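A minimal illustration of why offline sync pushes complexity into the application: even simple per-cell, last-writer-wins merging (shown here) silently drops one of two concurrent edits. This is a generic example, not how Docs synchronization actually works.

def merge_lww(local, remote):
    """local/remote: {cell: (version, value)} replicas of a spreadsheet."""
    merged = dict(local)
    for cell, (r_ver, r_val) in remote.items():
        l_ver, _ = merged.get(cell, (-1, None))
        if r_ver > l_ver:
            merged[cell] = (r_ver, r_val)
        # If both sides edited at the same version, one edit wins silently --
        # the application must detect this and ask the user, keep both, etc.
    return merged

laptop = {"A1": (2, "38 aces")}      # Roger's offline edit
server = {"A1": (2, "37 aces")}      # Rafa's concurrent edit on the server copy
print(merge_lww(laptop, server))     # Rafa's edit is lost without a trace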
Questions
Round tables at 4 PM:
Using Google's Computational Infrastructure
Brian Bershad & David Konerding
Planet-Scale Storage
Andrew Fikes & Yonatan Zunger
Storage, Large-Scale Data Processing, Systems
Jeff Dean
Additional Slides
Storage Challenge: Complexity
Scenario: Read 10 KB from Spanner
1. Look up the names of the 3 replicas
2. Look up the location of 1 replica
3. Read data from the replicas
   3.1. Look up data locations from GFS
   3.2. Read data from the storage node
        3.2.1. Read from the Linux file system
Layers:
Generate API impedance mismatches
Have numerous failure and queuing points
Make capacity and perf. prediction super-hard
Make optimization and tuning very difficult
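A small illustration of why layered systems are hard to predict: availability composes roughly as the product of the layers, and tail latency accumulates across every lookup in the chain. All the numbers below are invented placeholders.

layers = [
    # (name, availability, median_ms, p99_ms) -- placeholder values
    ("spanner replica lookup", 0.9999, 1, 20),
    ("bigtable tablet read",   0.9999, 2, 40),
    ("gfs chunk lookup",       0.9999, 1, 25),
    ("storage node read",      0.999,  8, 80),
    ("linux file system",      0.9999, 3, 30),
]

availability, median, tail = 1.0, 0, 0
for _, a, m, p in layers:
    availability *= a    # any layer failing fails the request
    median += m
    tail += p            # pessimistic: assumes the slow cases line up

print(f"composed availability ~{availability:.4f}")
print(f"median ~{median} ms, worst-case-ish tail ~{tail} ms")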
Storage Software: File Transfer
Common instigators of data transfer:
Publishing production data (e.g. base index)
Insufficient cluster capacity (disk or CPU)
System and software upgrades
Moving data is:
Hard: Many moving parts, and different priorities
Expensive & time-consuming: Networks involved
Our system:
Optimized for large, latency-insensitive networks
Uses large windows and constant-bit-rate UDP
Produces smoother flow than TCP
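A sketch of constant-bit-rate UDP pacing as described above: fixed-size datagrams sent on a fixed schedule rather than TCP's burst-and-back-off. Loss recovery, large windows, and flow setup are omitted; this is an illustration, not the production transfer system.

import socket, time

def send_cbr(data, dest, bits_per_sec, packet_bytes=1400):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    interval = (packet_bytes * 8) / bits_per_sec     # seconds between packets
    next_send = time.monotonic()
    for off in range(0, len(data), packet_bytes):
        now = time.monotonic()
        if now < next_send:
            time.sleep(next_send - now)              # pace to the target rate
        sock.sendto(data[off:off + packet_bytes], dest)
        next_send += interval                        # constant bit rate, no bursts
    sock.close()

# e.g. send_cbr(open("index.shard", "rb").read(), ("10.0.0.2", 9000), 100_000_000)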