HPTS Day 2: How are databases used at big customers?
This day was focused on 'user' experiences with large databases and
services. The morning started with talks by people from Google, eBay and Amazon
about their architectures. The Google talk by Anurag Acharya gave an overview
of the overall system as we already know it, with some additional information
about how they deal with reliability issues. For example, each data structure
they use has a checksum associated with it to handle memory corruption, which
is an issue they have to deal with. The Amazon talk by Charlie Bell was mainly
about how they expose their store through different web services to partners
and the general public. There were interesting numbers about update rates and
message sizes between Amazon and partners such as Nordstrom.
The eBay talk by James Barrese was the most fascinating of the three. Lots of
detail about the backend, the software, the development process, etc.
Equally interesting were the afternoon talks by people from Schwab, NASDAQ and
Merrill Lynch, which all gave a look into their backend systems.
I took a lot of notes on all of these talks, and have no time/desire to
transcribe them into proper prose, so if you are interested in the raw notes,
read on ...
James Barrese - eBay
- Datacenters: 2 today, planned to grow to 5.
- Update the codebase every 2 weeks; 35K lines of code change every week.
- Rollback of a code update within 2 hours.
- Scalability is in the growth: 70% or more per year, every year, in both hardware and software.
- Database in 1999: a single clustered database on a Sun A3500.
- Too much load: moved to no triggers, started partitioning accounts, feedback, archive, later categories.
- Replication of user writes and reads, batch operations.
- Current state: few triggers, denormalized schema, application-enforced referential integrity, horizontal scale, application-level sorting, compensating transactions, application-level joins (see the sketch after these notes).
- Database only used for durable storage, everything else in application code.
- eBay data access layer: partitioning, replication, control (logical hosts), graceful degradation.
- NO 2PC: 'avoid it like the plague'.
- Index always available: if one category is out, still show the item, but unclickable.
- Online move of a table if it is getting hot.
- Centralized application log service: audit, problem tracking.
- Search: their own homegrown index, real-time updates (15 sec).
- V2: IIS-driven with C++; V3: J2EE container environment.
- 4.5M lines of code.
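Two of the patterns above, the application-level join and the compensating transaction instead of 2PC, are easiest to see in a small sketch. This is my own illustration, not eBay code: the tables and data are invented, and two sqlite in-memory databases stand in for partitioned 'logical hosts'. The database is used purely as durable storage; joins and cross-partition consistency live in the application.

    import sqlite3

    # Two in-memory databases stand in for two partitioned "logical hosts".
    # Schema and data are made up for illustration only.
    items_db = sqlite3.connect(":memory:")
    users_db = sqlite3.connect(":memory:")
    items_db.execute("CREATE TABLE items (item_id INTEGER, seller_id INTEGER, title TEXT)")
    items_db.execute("CREATE TABLE sales (item_id INTEGER, buyer_id INTEGER)")
    users_db.execute("CREATE TABLE users (user_id INTEGER, name TEXT, purchases INTEGER)")
    items_db.execute("INSERT INTO items VALUES (1, 42, 'vintage modem')")
    users_db.execute("INSERT INTO users VALUES (42, 'alice', 0), (7, 'bob', 0)")

    def items_with_sellers():
        """Application-level join: one query per partition, joined in memory."""
        items = items_db.execute("SELECT item_id, seller_id, title FROM items").fetchall()
        names = dict(users_db.execute("SELECT user_id, name FROM users").fetchall())
        return [(iid, title, names.get(sid)) for iid, sid, title in items]

    def record_sale(item_id, buyer_id):
        """No 2PC across partitions: write, then compensate if the second write fails."""
        items_db.execute("INSERT INTO sales VALUES (?, ?)", (item_id, buyer_id))
        try:
            cur = users_db.execute(
                "UPDATE users SET purchases = purchases + 1 WHERE user_id = ?", (buyer_id,))
            if cur.rowcount == 0:
                raise RuntimeError("unknown buyer")
            users_db.commit()
        except Exception:
            # Compensating transaction instead of a distributed rollback.
            items_db.execute("DELETE FROM sales WHERE item_id = ? AND buyer_id = ?",
                             (item_id, buyer_id))
            items_db.commit()
            raise
        items_db.commit()

    print(items_with_sellers())
    record_sale(1, 7)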
Adam Richards - Schwab
- 80% of revenue goes through computer channels.
- No downtime allowed.
- Real-time and stateful -> the back-end becomes a single point of failure.
- High peaks at market open & close -> tremendous over-provisioning; peak load easily 10:1.
- Much of the data is not partitionable, multiple relationships between data elements.
- Given the rate of change (1000/month), availability is 'pretty good'.
- Some technologies need to be supported because of acquisitions (archaeology).
- Lack of "desire to retire" anything - investment in new stuff -> complexity goes up.
- New technologies: organic IT.
- Isolated islands of data and compute (often religious, sometimes technological).
- Linkage of islands through messaging.
- 60% of a project is spent on mitigating complexity.
- Lower availability due to complexity.
- Slower time to market.
- Design: SOA with application bus, request/reply, pubsub and streams.
- Nobody wants distributed 2PC, only when forced by legal reasons. 2PC is the anti-availability product.
- Q: clients want data-grade access, but services want interfaces.
- Q: how to decouple platforms? 1) compute - asynchronous but no 2PC, 2) data - propagation & caching but no 2PC (see the sketch after these notes).
- Q: how to act when the system is not perfect?
- More and more unneeded information on the home page, peaks at logon (10s of transactions).
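The decoupling answer above (asynchronous compute, propagated and cached data, never 2PC) in a very rough sketch of my own: a local queue stands in for the application bus, and the island and account names are invented. The island that owns the data commits locally and publishes a change; other islands apply it asynchronously to their caches, so nobody ever waits in a distributed commit.

    import queue, threading

    bus = queue.Queue()        # stands in for the application bus (pub/sub)
    reporting_cache = {}       # the "reporting island" keeps a propagated copy

    def publish_position_change(account, delta):
        # The owning island commits its local transaction, then publishes.
        # It never blocks on the remote islands - that would be 2PC-style coupling.
        bus.put({"account": account, "delta": delta})

    def reporting_island():
        while True:
            msg = bus.get()
            reporting_cache[msg["account"]] = reporting_cache.get(msg["account"], 0) + msg["delta"]
            bus.task_done()

    threading.Thread(target=reporting_island, daemon=True).start()
    publish_position_change("ACCT-1", 100)
    publish_position_change("ACCT-1", -30)
    bus.join()
    print(reporting_cache)     # {'ACCT-1': 70} - eventually consistent, no 2PC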
Ken Richmond - NASDAQ
- High-performance data feed.
- Top 5 levels of the NASDAQ order book.
- Near real-time, short development time.
- Messages are self-describing.
- Windows gateway, MSMQ 'channels' partitioned alphabetically by symbol.
- SQL Server clusters, passive-active - failover in 1 minute.
- SAN storage (EMC).
- Multicast fan-out.
- Pipeline architecture, everything should do 'one thing', serialization through single threading (regulation requires FIFO).
- Sending blocks of messages to SQL Server improves performance (see the sketch after these notes).
- Biggest reason for application outages is overrun - flow control between the mainframe and the Intel boxes.
- Entire DB in memory all the time.
- Interface in C++, logging events to TIBCO Hawk.
- Test bed investment is crucial.
- Saturdays: functional testing; Sundays: performance testing.
- 10,800 msg/sec sustained in test, 30K database ops/sec in test; 1,800 msg/sec per channel in production, 7,800 msg/sec overall in production.
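Two points above - the single-threaded FIFO pipeline stage and sending blocks of messages to the database - sketched in my own toy version: sqlite stands in for SQL Server, and the quote layout and batch size of 500 are assumptions, not NASDAQ's values.

    import sqlite3
    from collections import deque

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE quotes (symbol TEXT, bid REAL, ask REAL, seq INTEGER)")

    BATCH_SIZE = 500
    pending = deque()          # FIFO: messages must be processed in arrival order

    def on_message(msg):
        """Called by the single pipeline thread; no locking, ordering preserved."""
        pending.append((msg["symbol"], msg["bid"], msg["ask"], msg["seq"]))
        if len(pending) >= BATCH_SIZE:
            flush()

    def flush():
        """One round trip per block instead of per message - the batching win."""
        batch = [pending.popleft() for _ in range(len(pending))]
        db.executemany("INSERT INTO quotes VALUES (?, ?, ?, ?)", batch)
        db.commit()

    for i in range(1200):
        on_message({"symbol": "MSFT", "bid": 26.10, "ask": 26.12, "seq": i})
    flush()   # drain the tail
    print(db.execute("SELECT COUNT(*) FROM quotes").fetchone())   # (1200,)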
Bill Thompson - Merrill Lynch
- GLOSS - 2-tier architecture, Sybase on Solaris, PowerBuilder GUI.
- All business logic in stored procedures.
- Uses tempdb as a key component.
- Use: standing data maintenance, transaction management, interface to external clearing services (messages), books & records.
- Some of the topology: front-office transaction manager, then to the settlement engine and to books & records; T+x updates from the clearing house.
- 70K trades and settlements, 60K prices, 40K cash movements ... 250,000 transactions (each = 100s of SQL statements) per day.
- Limits in 2001: transaction rates too high, overnight extracts overran the window, DB at 450 GB: unable to do maintenance tasks.
- Then the move to SQL Server.
- Challenges: migrate the stored procedures, maintain the Unix code, migrate the database in 48 hours.
- Unix - SQL Server connectivity: used the FreeTDS (open source) engine (Sybase ODBC and others didn't work or were a performance drag).
- Migrating the data was painfully slow using vendor tools; they wrote utilities themselves using the bcp functionality from FreeTDS.
- SQL Server 'bulk insert' rocks (see the sketch after these notes).
- Performance increase: transaction throughput up 30%, database extracts down from 235 minutes to 25 minutes.
- Future: 50-100% increase in transactions expected in 2004.
- SQL Server issues: stored procedure recompilation.
- Redesign: distributed transaction processing, consolidate stock & cash ledgers, data archival, re-architect the queue-based processing.
- Database now 1 TB; backup: full in 3-4 hours, incremental in 1 hour.
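A minimal sketch of the 'bulk insert rocks' point: loading a flat extract file with a single BULK INSERT statement instead of row-by-row inserts. Illustrative only - it needs a real SQL Server to run, and the driver, server, database, table and file path are invented, not Merrill Lynch's.

    import pyodbc   # assumes the pyodbc package and a SQL Server ODBC driver

    # Hypothetical server/database/table/file names, for illustration only.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=gloss-db;"
        "DATABASE=gloss;Trusted_Connection=yes;", autocommit=True)

    # One statement loads the whole extract; BATCHSIZE and TABLOCK keep the
    # load in large, minimally logged chunks instead of per-row inserts.
    conn.execute("""
        BULK INSERT dbo.trades
        FROM 'D:\\extracts\\trades_20031014.dat'
        WITH (FIELDTERMINATOR = '|',
              ROWTERMINATOR   = '\\n',
              BATCHSIZE       = 50000,
              TABLOCK)
    """)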
David Patterson - Berkeley
- Recovery Oriented Computing
- Margin of safety works for hardware; what does it mean for software?
- Tolerate human error in design, in construction and in use.
- RAID 5: operator removing the wrong disk, vibration/temperature failure before repair.
- Never take software RAID.
- ROC philosophy: "If a problem has no solution, it may not be a problem but a fact - not to be solved but to be coped with over time" - Shimon Peres.
- That people, hardware and software fail is a fact.
- How can we improve repair?
- MTTF is difficult to measure; MTTR is a better measurement.
- If MTTR is fast enough, it is not a failure.
- 'Crash-only software' - Armando Fox.
- Recognize, Rewind & Repair, Replay.
- An application proxy records input & output, feeds into an undo manager, combined with a no-overwrite store (see the sketch after these notes).
- Lesson learned: it takes a long time to notice the error, longer than to fix it.
- Operators are conservative, difficult to win their trust.
- Configuration errors are a rich source of failures.
- Automation may make the ops job harder - it reduces the operators' understanding of the system.
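The Recognize, Rewind & Repair, Replay loop - a proxy that records requests, an undo manager, a no-overwrite store - sketched very roughly. This is a toy illustration of my own, not the Berkeley prototype.

    import copy

    request_log = []     # every request, in order, recorded by the proxy
    versions = [{}]      # no-overwrite store: version i = state after i requests

    def apply_request(state, request):
        state[request["key"]] = request["value"]

    def handle(request):
        """The proxy records the request, the store keeps a new version."""
        request_log.append(request)
        new_state = copy.deepcopy(versions[-1])
        apply_request(new_state, request)
        versions.append(new_state)

    def rewind_and_replay(last_good, repaired_apply=apply_request):
        """Rewind to version `last_good`, then replay the logged requests,
        possibly through repaired code."""
        del versions[last_good + 1:]
        for request in request_log[last_good:]:
            new_state = copy.deepcopy(versions[-1])
            repaired_apply(new_state, request)
            versions.append(new_state)

    handle({"key": "color", "value": "blue"})
    handle({"key": "color", "value": "red"})   # suppose this exposed an operator error
    rewind_and_replay(last_good=1)             # undo it, then replay the log
    print(versions[-1])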
Paul Maglio - IBM
- Studies about system administrators
- important users, expensive, understudied
- Interfaces for system administrators are similar to those for regular users, but this may not be right because administrators are different.
- Field studies, long-term videos.
- Talks about how misunderstandings play a major role, and how admins control information while talking to others.
- All about saving face & establishing trust.
- Most of the time spent troubleshooting is spent talking to other people (60%-90%).
- 20% of the time is about who (else) to talk to (collaboration).
- Typical day: 21% planning, 25% meetings, 30% routine maintenance; half of all activities involved collaboration.
- DBAs always work in pairs or teams; meticulous rehearsal, attention to detail, but easy to panic.
David Campbell - Microsoft
- Supportability - Doing right in a Complex World
- Emerging technology: lots of knobs, high maintenance (testing, replacing, etc.). Everybody needs to be the expert.
- Evolving technology: some knobs, automation, good enough for the masses - some local gurus.
- Refined technology: no knobs, goal-directed input - it just works - user base explodes - small expert base.
- Design targets: non-CS disciplines have design targets, CS often does not, certainly not for supportability.
- Design for usability for common users: "virtual memory too low ..."
- Limited design for test, no standard integrated online diagnostics.
- Techniques similar to those of 20 years ago.
- Errors: don't be lazy, make explicit what you want the user to do.
- Never throw away an error context.
- 'Principle of least astonishment': do reasonable user actions yield reasonable results? Make it hard for people to hose themselves.
- With respect to reliability, how much of the complexity is actually spent on making the system more reliable?
- In automobiles today, faults are recorded to help the mechanics.
- Customers are willing to invest up to 20% to get to 'it just works'.
- In SQL Server they introduced more closed feedback loops to reduce operator involvement.
- Automated, adaptive control based on 'expert systems knowledge'.
- Results were surprising: better than the best expert could do.
- Examples: memory allocation, statistics collection, read-ahead, dynamically allocated system structures.
- Memory: replacement cost for each page, utilities 'bid' on memory, feedback on the cost of allocation.
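The last point, components 'bidding' for memory based on a replacement cost, could look roughly like this. A simplistic sketch of my own invention, not SQL Server internals; the component names and cost numbers are made up.

    # Cost-based memory arbitration: each component bids with the cost of
    # giving a page back; the cheapest pages are reclaimed first under pressure.
    def reclaim(pages_needed, bids):
        """bids: list of (component, page_id, replacement_cost). Returns victims."""
        return sorted(bids, key=lambda b: b[2])[:pages_needed]

    bids = [
        ("buffer_pool", "page-17", 0.2),   # clean, rarely used page: cheap to evict
        ("buffer_pool", "page-3",  5.0),   # hot page: expensive to re-read
        ("read_ahead",  "page-88", 0.5),
        ("proc_cache",  "plan-7",  3.0),   # recompiling a plan has a real cost
    ]
    print(reclaim(2, bids))   # evicts the two cheapest: page-17 and page-88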
Posted by Werner Vogels at October 15, 2003 01:18 PM