Home‎ > ‎

Database Tips

  • Stop being conservative
    Hard drive space is ridiculously cheap, stop trying to save bytes and make columns 50% bigger than you think you'll need

  • Mnemonics can be better than integers
    When you have a short list of things, or things that are naturally slow to set up, such as roles and workstation IDs, consider using a VARCHAR that stores names rather than integers coming from an identity (auto-increment) field. For example, you have bunch of user roles like "Order Taker", "Editor", "Negotiator", "Shipper", etc. You don't want to look at the "role" column in your tables and see "1", "2", "3", etc., when it would have a negligible impact on performance and storage to just use "ORDERTAKER", "EDITOR", and "NEGOTIATOR" as their keys instead

  • Don't cram all your tables into one DB
    Make as many separate domains, databases, and DBM instances as you need to keep tables grouped logically. Look into Service Oriented Architecture (SOA) as a way to create islands of related tables that have a high-level interface in front of them. One day you'll want to update one of those islands with a faster (or cheaper) storage method without forcing you to migrate all the data that didn't need to move with it

  • Use auxiliary tables to speed-up math intensive jobs
    Pre-compute the answer-space for a common function and store the results in a table (eg: the log or sin of a few thousand integers), then JOIN rather than calling the function. It will run faster. (Credit: Joe Celko)

  • Only pave where the grass has been trodden down
    Don't define any views or stored procedures in the beginning. Develop your schema, then your client apps, and then make views and stored procedures only after you catch yourself doing the same query in 3+ different places

  • Don't store configuration in a relational DB
    Your list of payment types and shipping methods will be 4-5 records long and change only once a year. Put them in an CSV or XML file instead so you can version them

  • Avoid putting database assumptions in libraries
    You write a function that takes a database connection and queries a table it assumes exists and with a structure it expects. This itself can be fine, but if you put it into a library you'll have to recompile everything linked to it whenever the table has to change

What to store where

In general, use the strengths of everything to solve your problem: dump raw data into a key=value store (hashtable) until it can be processed and massaged to fit the model of a relational DB, then use the modeled data in the relational DB to create summaries and precomputed views in another hashtable that you'll use to drive your high-volume web site. Use a version controlled filesystem (GIT, SVN, etc.) to store program configuration, workflow definitions, definitions of types and formal correspondence. Your unversioned filesystem will store rapidly changing log files.

Relational DB (OLTP / Transactional DB) Version Controlled Filesystem Key=Value Store
Events (orders, transactions, traffic violations...) Configuration (programs, contracts, workflows...) Summarized events and transactions
Names and addresses, product catalogs Definitions of types Cached views
Processed Measurements Images, digital products Raw data from probes, web sites...
Instant messages, Tweets Formal messages (EDI, Email) Snapshots
Relational DB (OLAP / Data Warehouse) Unversioned Filesystem Distributed Hashtable
Sanitized, deduped transactions / Cubes Logs "Facebook killers"


The Tamper-evident transactional database

 I've used this architecture for a successful Orders-2-Cash system (AKA order fulfillment or order management) after noticing that terabytes were getting cheaper, that most of the analytical work was now being done by data-warehouses, and that SOA-style Enterprise Architectures let me offload a lot of crap that used to be stored in our legacy O2C, such as CRM, product catalogs, inventory management, supply-chain and so-on. I wanted a system that would store the complete history of everything that happened to an order, would concern itself only with the facts of the order, and would be tamper evident.
 Rather than investing effort trying to prohibit unauthorized data-entry (which I knew would fail anyway--no system is unhackable), I wanted a system that could tolerate rogue transactions as long as there was something I could do about it after the problem was discovered.

 The principles were:
  1. All records are immutable: every table is INSERT-only, never UPDATE or DELETE unless it's an admin fixing a bug
  2. Every table that belongs to the data model includes a binary "signature" column and a GUID for "transactionID". The signature is an RSA-signed hash of the row's contents1
  3. There is a "meta_transactions" table with the transactionID, name of the table affected, user ID, and server's timestamp. There's one meta_transaction record for each table/transaction pair
  4. All data belonging to the model was submitted to a queue. The database was made read-only for all users except admins and the process that read the queue
  5. All surrogate keys ("IDs") were GUIDs instead of Identity/Auto-increment columns
 New records were formatted by the client to fit the relational model, then signed on the client machine with the user's own unique certificate (issued automatically by the Windows CA), then serialized to a collection of dictionary structures (each item in the collection was one row in a transaction, and those items were dictionaries of field=value pairs), and then submitted in whole transactions-worth at a time to the queue. The process reading the queue validated the messages, sorted the new rows according to foreign key relationships, inserted each row as a single distributed transaction, and then finished the job by inserting a meta_transaction row to describe what it did and when.

 We also didn't bother with any N-Tier object-request abstractions. Any client that was part of the "suite" connected to the database directly and used either plain SQL or whatever ORM it wanted. But communication between other systems in the Enterprise Architecture was done by a REST-style service, meaning that the Orders-2-Cash system never tried to do a SELECT on the Customer database, it just queried a web service that we built as the interface for it.

 The benefits were:
  • If someone's account was compromised and bad data submitted, we could identify every record created by that account and remove them
  • We could add that account's certificate ID to a CRL (Certificate Revokation List) and have downstream clients refuse to honor any records signed with them
  • A nightly process could trawl through the database checking the signatures on each record to discover tampering
  • Clients didn't need to have a connection to the database to create records and submit them--in fact they could be laptops in the field that submitted transactions when they got back to home office (using GUIDs meant there wouldn't be problems with rows having the same ID), or workstations and application servers that continued to take orders even when the DB was down
  • It was very easy to replicate the database at the application level (just crawl through meta_transactions to sync, and then distribute copies of each message submitted to the queue), which meant we could solve certain kinds of performance problems by creating departmental caches
 Some disadvantages:
  • No UPDATES meant the model was very normalized and required lots of joins
  • It also meant the client had to query a lot and compute a lot to figure out what the state of an order was in. We couldn't do "UPDATE orders SET status = 'shipped'". Some of this was solved by creating a few views after a bit of analysis
  • The performance of queries that summarized multiple orders wasn't great, but we were counting on running those kinds of queries on a data-warehouse anyway
1 - All columns in the table except the signature itself, plus the table name (Horton principle) were serialized to an escaped and delimited string containing key=value pairs. The string was then RSA signed to a 128-byte value.