HPTS Day 2: How are databases used at big customers?
This day was focused on 'user' experiences with large databases and
services. The morning started with talks by people from Google, eBay and Amazon
about their architectures. The Google talk by Anurag Acharya gave an overview
of the overall system as we already know it, with some additional information
about how they deal with reliability issues. For example, each data structure
they use has a checksum associated with it to handle memory corruption, which
is an issue they have to deal with. The Amazon talk by Charlie Bell was mainly
about how they expose their store through different web services to partners
and the general public. There were interesting numbers about update rates and
message sizes between Amazon and partners such as Nordstrom.
The eBay talk by James Barrese was the most fascinating of the three. Lots of
detail about the backend, the software, the development process, etc.
Equally interesting were the afternoon talks by people from Schwab, NASDAQ and
Merrill Lynch, which all gave a look into their backend systems.
I took a lot of notes on all of these talks, and have no time/desire to
transcribe them into proper prose, so if you are interested in the raw notes,
read on ...
James Barrese - eBay
- Datacenters: 2 today, planned to grow to 5.
- Update the codebase every 2 weeks; 35K lines of code change every week.
- Rollback of a code update within 2 hours.
- Scalability is in the growth: 70% or more per year, every year, in both hardware and software.
- Database in 1999: a single clustered database on a Sun A3500.
- Too much load: moved to no triggers, started partitioning accounts, feedback, archive, later categories.
- Replication of user writes and reads, batch operations.
- Current state: few triggers, denormalized schema, application-enforced referential integrity, horizontal scale, application-level sorting, compensating transactions, application-level joins (see the sketch after these notes).
- Database only used for durable storage, everything else in application code.
- eBay data access layer: partitioning, replication, control (logical hosts), graceful degradation.
- NO 2PC: 'avoid it like the plague'.
- Index always available: if one category is out, still show the item, but unclickable.
- Online move of a table if it is getting hot.
- Centralized application log service: audit, problem tracking.
- Search: their own homegrown index, real-time updates (15 sec).
- V2: IIS-driven with C++; V3: J2EE container environment.
- 4.5M lines of code.
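Two of the patterns above, the application-level join and the compensating transaction instead of 2PC, are easiest to see in a small sketch. This is my own illustration, not eBay code: the tables and data are invented, and two sqlite in-memory databases stand in for partitioned 'logical hosts'. The database is used purely as durable storage; joins and cross-partition consistency live in the application.

    import sqlite3

    # Two in-memory databases stand in for two partitioned "logical hosts".
    # Schema and data are made up for illustration only.
    items_db = sqlite3.connect(":memory:")
    users_db = sqlite3.connect(":memory:")
    items_db.execute("CREATE TABLE items (item_id INTEGER, seller_id INTEGER, title TEXT)")
    items_db.execute("CREATE TABLE sales (item_id INTEGER, buyer_id INTEGER)")
    users_db.execute("CREATE TABLE users (user_id INTEGER, name TEXT, purchases INTEGER)")
    items_db.execute("INSERT INTO items VALUES (1, 42, 'vintage modem')")
    users_db.execute("INSERT INTO users VALUES (42, 'alice', 0), (7, 'bob', 0)")

    def items_with_sellers():
        """Application-level join: one query per partition, joined in memory."""
        items = items_db.execute("SELECT item_id, seller_id, title FROM items").fetchall()
        names = dict(users_db.execute("SELECT user_id, name FROM users").fetchall())
        return [(iid, title, names.get(sid)) for iid, sid, title in items]

    def record_sale(item_id, buyer_id):
        """No 2PC across partitions: write, then compensate if the second write fails."""
        items_db.execute("INSERT INTO sales VALUES (?, ?)", (item_id, buyer_id))
        try:
            cur = users_db.execute(
                "UPDATE users SET purchases = purchases + 1 WHERE user_id = ?", (buyer_id,))
            if cur.rowcount == 0:
                raise RuntimeError("unknown buyer")
            users_db.commit()
        except Exception:
            # Compensating transaction instead of a distributed rollback.
            items_db.execute("DELETE FROM sales WHERE item_id = ? AND buyer_id = ?",
                             (item_id, buyer_id))
            items_db.commit()
            raise
        items_db.commit()

    print(items_with_sellers())
    record_sale(1, 7)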
Adam Richards - Schwab
- 80% of revenue goes through computer channels.
- No downtime allowed.
- Real-time and stateful -> the back-end becomes a single point of failure.
- High peaks at market open & close -> tremendous over-provisioning; peak load easily 10:1.
- Much of the data is not partitionable, multiple relationships between data elements.
- Given the rate of change (1000/month), availability is 'pretty good'.
- Some technologies need to be supported because of acquisitions (archaeology).
- Lack of "desire to retire" anything - investment in new stuff -> complexity goes up.
- New technologies: organic IT.
- Isolated islands of data and compute (often religious, sometimes technological).
- Linkage of islands through messaging.
- 60% of a project is spent on mitigating complexity.
- Lower availability due to complexity.
- Slower time to market.
- Design: SOA with application bus, request/reply, pubsub and streams.
- Nobody wants distributed 2PC, only when forced by legal reasons. 2PC is the anti-availability product.
- Q: clients want data-grade access, but services want interfaces.
- Q: how to decouple platforms? 1) compute - asynchronous but no 2PC, 2) data - propagation & caching but no 2PC (see the sketch after these notes).
- Q: how to act when the system is not perfect?
- More and more unneeded information on the home page, peaks at logon (10s of transactions).
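The decoupling answer above (asynchronous compute, propagated and cached data, never 2PC) in a very rough sketch of my own: a local queue stands in for the application bus, and the island and account names are invented. The island that owns the data commits locally and publishes a change; other islands apply it asynchronously to their caches, so nobody ever waits in a distributed commit.

    import queue, threading

    bus = queue.Queue()        # stands in for the application bus (pub/sub)
    reporting_cache = {}       # the "reporting island" keeps a propagated copy

    def publish_position_change(account, delta):
        # The owning island commits its local transaction, then publishes.
        # It never blocks on the remote islands - that would be 2PC-style coupling.
        bus.put({"account": account, "delta": delta})

    def reporting_island():
        while True:
            msg = bus.get()
            reporting_cache[msg["account"]] = reporting_cache.get(msg["account"], 0) + msg["delta"]
            bus.task_done()

    threading.Thread(target=reporting_island, daemon=True).start()
    publish_position_change("ACCT-1", 100)
    publish_position_change("ACCT-1", -30)
    bus.join()
    print(reporting_cache)     # {'ACCT-1': 70} - eventually consistent, no 2PC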
Ken Richmond - NASDAQ
- High-performance data feed.
- Top 5 levels of the NASDAQ order book.
- Near real-time, short development time.
- Messages are self-describing.
- Windows gateway, MSMQ 'channels' partitioned alphabetically by symbol.
- SQL Server clusters, passive-active - failover in 1 minute.
- SAN storage (EMC).
- Multicast fan-out.
- Pipeline architecture, everything should do 'one thing', serialization through single threading (regulation requires FIFO).
- Sending blocks of messages to SQL Server improves performance (see the sketch after these notes).
- Biggest reason for application outages is overrun - flow control between the mainframe and the Intel boxes.
- Entire DB in memory all the time.
- Interface in C++, logging events to TIBCO Hawk.
- Test bed investment is crucial.
- Saturdays: functional testing; Sundays: performance testing.
- 10,800 msg/sec sustained in test, 30K database ops/sec in test; 1,800 msg/sec per channel in production, 7,800 msg/sec overall in production.
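Two points above - the single-threaded FIFO pipeline stage and sending blocks of messages to the database - sketched in my own toy version: sqlite stands in for SQL Server, and the quote layout and batch size of 500 are assumptions, not NASDAQ's values.

    import sqlite3
    from collections import deque

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE quotes (symbol TEXT, bid REAL, ask REAL, seq INTEGER)")

    BATCH_SIZE = 500
    pending = deque()          # FIFO: messages must be processed in arrival order

    def on_message(msg):
        """Called by the single pipeline thread; no locking, ordering preserved."""
        pending.append((msg["symbol"], msg["bid"], msg["ask"], msg["seq"]))
        if len(pending) >= BATCH_SIZE:
            flush()

    def flush():
        """One round trip per block instead of per message - the batching win."""
        batch = [pending.popleft() for _ in range(len(pending))]
        db.executemany("INSERT INTO quotes VALUES (?, ?, ?, ?)", batch)
        db.commit()

    for i in range(1200):
        on_message({"symbol": "MSFT", "bid": 26.10, "ask": 26.12, "seq": i})
    flush()   # drain the tail
    print(db.execute("SELECT COUNT(*) FROM quotes").fetchone())   # (1200,)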
Bill Thompson - Merrill Lynch
- GLOSS - 2-tier architecture, Sybase on Solaris, PowerBuilder GUI.
- All business logic in stored procedures.
- Uses tempdb as a key component.
- Use: standing data maintenance, transaction management, interface to external clearing services (messages), books & records.
- Some of the topology: front-office transaction manager, then to the settlement engine and to books & records; T+x updates from the clearing house.
- 70K trades and settlements, 60K prices, 40K cash movements ... 250,000 transactions (each = 100s of SQL statements) per day.
- Limits in 2001: transaction rates too high, overnight extracts overran the window, DB at 450 GB: unable to do maintenance tasks.
- Then the move to SQL Server.
- Challenges: migrate the stored procedures, maintain the Unix code, migrate the database in 48 hours.
- Unix - SQL Server connectivity: used the FreeTDS (open source) engine (Sybase ODBC and others didn't work or were a performance drag).
- Migrating the data was painfully slow using vendor tools; they wrote utilities themselves using the bcp functionality from FreeTDS.
- SQL Server 'bulk insert' rocks (see the sketch after these notes).
- Performance increase: transaction throughput up 30%, database extracts down from 235 minutes to 25 minutes.
- Future: 50-100% increase in transactions expected in 2004.
- SQL Server issues: stored procedure recompilation.
- Redesign: distributed transaction processing, consolidate stock & cash ledgers, data archival, re-architect the queue-based processing.
- Database now 1 TB; backup: full in 3-4 hours, incremental in 1 hour.
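A minimal sketch of the 'bulk insert rocks' point: loading a flat extract file with a single BULK INSERT statement instead of row-by-row inserts. Illustrative only - it needs a real SQL Server to run, and the driver, server, database, table and file path are invented, not Merrill Lynch's.

    import pyodbc   # assumes the pyodbc package and a SQL Server ODBC driver

    # Hypothetical server/database/table/file names, for illustration only.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=gloss-db;"
        "DATABASE=gloss;Trusted_Connection=yes;", autocommit=True)

    # One statement loads the whole extract; BATCHSIZE and TABLOCK keep the
    # load in large, minimally logged chunks instead of per-row inserts.
    conn.execute("""
        BULK INSERT dbo.trades
        FROM 'D:\\extracts\\trades_20031014.dat'
        WITH (FIELDTERMINATOR = '|',
              ROWTERMINATOR   = '\\n',
              BATCHSIZE       = 50000,
              TABLOCK)
    """)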
David Patterson - Berkeley
- Recovery Oriented Computing
- Margin of safety works for hardware; what does it mean for software?
- Tolerate human error in design, in construction and in use.
- RAID 5: operator removing the wrong disk, vibration/temperature failure before repair.
- Never take software RAID.
- ROC philosophy: "If a problem has no solution, it may not be a problem but a fact - not to be solved but to be coped with over time" - Shimon Peres.
- That people, hardware and software fail is a fact.
- How can we improve repair?
- MTTF is difficult to measure; MTTR is a better measurement.
- If MTTR is fast enough, it is not a failure.
- 'Crash-only software' - Armando Fox.
- Recognize, Rewind & Repair, Replay.
- An application proxy records input & output, feeds into an undo manager, combined with a no-overwrite store (see the sketch after these notes).
- Lesson learned: it takes a long time to notice the error, longer than to fix it.
- Operators are conservative, difficult to win their trust.
- Configuration errors are a rich source of failures.
- Automation may make the ops job harder - it reduces the operators' understanding of the system.
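The Recognize, Rewind & Repair, Replay loop - a proxy that records requests, an undo manager, a no-overwrite store - sketched very roughly. This is a toy illustration of my own, not the Berkeley prototype.

    import copy

    request_log = []     # every request, in order, recorded by the proxy
    versions = [{}]      # no-overwrite store: version i = state after i requests

    def apply_request(state, request):
        state[request["key"]] = request["value"]

    def handle(request):
        """The proxy records the request, the store keeps a new version."""
        request_log.append(request)
        new_state = copy.deepcopy(versions[-1])
        apply_request(new_state, request)
        versions.append(new_state)

    def rewind_and_replay(last_good, repaired_apply=apply_request):
        """Rewind to version `last_good`, then replay the logged requests,
        possibly through repaired code."""
        del versions[last_good + 1:]
        for request in request_log[last_good:]:
            new_state = copy.deepcopy(versions[-1])
            repaired_apply(new_state, request)
            versions.append(new_state)

    handle({"key": "color", "value": "blue"})
    handle({"key": "color", "value": "red"})   # suppose this exposed an operator error
    rewind_and_replay(last_good=1)             # undo it, then replay the log
    print(versions[-1])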
Paul Maglio - IBM
- Studies about system administrators
- important users, expensive, understudied
- Interfaces for system administrators are similar to those for regular users, but this may not be right because administrators are different.
- Field studies, long-term videos.
- Talks about how misunderstandings play a major role, and how admins control information while talking to others.
- All about saving face & establishing trust.
- Most of the time spent troubleshooting is spent talking to other people (60%-90%).
- 20% of the time is about who (else) to talk to (collaboration).
- Typical day: 21% planning, 25% meetings, 30% routine maintenance; half of all activities involved collaboration.
- DBAs always work in pairs or teams; meticulous rehearsal, attention to detail, but easy to panic.
David Campbell - Microsoft
- Supportability - Doing right in a Complex World
- Emerging technology: lots of knobs, high maintenance (testing, replacing, etc.). Everybody needs to be the expert.
- Evolving technology: some knobs, automation, good enough for the masses - some local gurus.
- Refined technology: no knobs, goal-directed input - it just works - user base explodes - small expert base.
- Design targets: non-CS disciplines have design targets, CS often does not, certainly not for supportability.
- Design for usability for common users: "virtual memory too low ..."
- Limited design for test, no standard integrated online diagnostics.
- Techniques similar to those of 20 years ago.
- Errors: don't be lazy, make explicit what you want the user to do.
- Never throw away an error context.
- 'Principle of least astonishment': do reasonable user actions yield reasonable results? Make it hard for people to hose themselves.
- With respect to reliability, how much of the complexity is actually spent on making the system more reliable?
- In automobiles today, faults are recorded to help the mechanics.
- Customers are willing to invest up to 20% to get to 'it just works'.
- In SQL Server they introduced more closed feedback loops to reduce operator involvement.
- Automated, adaptive control based on 'expert systems knowledge'.
- Results were surprising: better than the best expert could do.
- Examples: memory allocation, statistics collection, read-ahead, dynamically allocated system structures.
- Memory: replacement cost for each page, utilities 'bid' on memory, feedback on the cost of allocation.
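The last point, components 'bidding' for memory based on a replacement cost, could look roughly like this. A simplistic sketch of my own invention, not SQL Server internals; the component names and cost numbers are made up.

    # Cost-based memory arbitration: each component bids with the cost of
    # giving a page back; the cheapest pages are reclaimed first under pressure.
    def reclaim(pages_needed, bids):
        """bids: list of (component, page_id, replacement_cost). Returns victims."""
        return sorted(bids, key=lambda b: b[2])[:pages_needed]

    bids = [
        ("buffer_pool", "page-17", 0.2),   # clean, rarely used page: cheap to evict
        ("buffer_pool", "page-3",  5.0),   # hot page: expensive to re-read
        ("read_ahead",  "page-88", 0.5),
        ("proc_cache",  "plan-7",  3.0),   # recompiling a plan has a real cost
    ]
    print(reclaim(2, bids))   # evicts the two cheapest: page-17 and page-88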
Posted by Werner Vogels at October 15, 2003 01:18 PM