A Day with Ralph Kimball, Part 2
This continues the blog entry begun in Part 1 of A Day with Ralph Kimball.
So, on to the seminar! Please remember that this isn’t what Ralph said as much as it’s my interpretation of what Ralph said. I’m trying to explain what you’ll get from his seminar, but it’s through my eyes, not yours. That’s why this is not a replacement for his seminar. Hopefully, I will induce you to attend if he makes it to your town. If for nothing else, you should go just so that you can tell me where I got it wrong.
When we last saw our intrepid heroes they were devouring their lunch and asking Ralph inane questions. After the lunch break, Ralph covered Star Schema design and ETL. I felt the ETL portion was a little light but as all of this was covered in a single day, with some of the time consumed by vendors, I think Ralph did a good job. I’m not going to lay this out in the same order he did. I’m going to do this more in a topic outline fashion. I think this will flow better in a blog.
1. Keys

A primary key (PK) uniquely identifies a single row in a table.
A foreign key (FK) is a link to a different table, i.e. a child or parent record.
A surrogate key is a sequential number that is meaningless. In Oracle, it’s usually created by a sequence.
A natural key is a key that has meaning. An employee’s social security number might be used to uniquely identify employees in an HR system (I would NOT recommend this but I have seen it). That would be a natural key.
Warehouse systems MUST use surrogate keys. It might seem like a pain but it removes administrative overhead when it comes time to fix production errors in the data.
A note is needed here though. In an OLTP system, an ORDER_ID may be a surrogate key. By the time that ORDER_ID gets to the warehouse, it is a natural key. It has meaning in the OLTP system. Warehouses need to take control of keys or they will fail. Do not use surrogate keys from an OLTP system as the keys in a warehouse. It sounds redundant but create a new sequence that will have no meaning to the OLTP systems. Remove that dependency.
Kimball recommends key fields be 4 bytes in length. This is primarily to conserve space. When a table can get into the billions of records, a few bytes per record really adds up.
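As an illustration of taking control of keys (the class and names here are hypothetical, not anything Ralph presented), here is a minimal sketch of a load process that treats the OLTP key purely as a lookup value and assigns warehouse-owned surrogate keys from its own counter:

```python
# Sketch: assign warehouse-owned surrogate keys during a load.
# The OLTP ORDER_ID is treated as a natural key and used only for lookup;
# the warehouse key comes from the warehouse's own counter
# (an Oracle sequence, in practice).
from itertools import count

class SurrogateKeyMap:
    def __init__(self):
        self._seq = count(1)       # stands in for an Oracle sequence
        self._by_natural = {}      # natural key -> surrogate key

    def key_for(self, natural_key):
        """Return the existing surrogate key, or draw a new one."""
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._seq)
        return self._by_natural[natural_key]

orders = SurrogateKeyMap()
wh_key = orders.key_for("OLTP-ORDER-1001")   # first sighting: draws a new key
```

The same natural key always maps back to the same surrogate key, so the warehouse key never changes even if the source system is reloaded or renumbered.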
2. Fact Tables
The fact table is a table that holds facts, or measurements. An event happens and a measurement is taken. When a measurement is taken, a fact record is created.
Facts are uniquely identified by a set of FKs to a set of dimension tables. The PK of the table may be all of the FKs but is usually a subset.
This key implements the grain of the fact table. You can think of the grain as the frequency of the data, although that is somewhat inaccurate and very simplistic. A grain can be daily or monthly, order detail or order header, etc.
Fact tables can be aggregated into rolled up facts but the base fact table should be detail records at the declared grain.
Besides the PK, the fact table should contain numerical measurements. Sales quantity, sales amount, total profits, etc. are all measurements.
In special cases the fact table may contain information other than FKs and measurements. A fact may contain a degenerate dimension. An example of this might be an order number. Such a dimension would be almost one-to-one with the fact table, and in a warehouse this data would usually be display only. Instead of creating a monster dimension, you can carry this information in the fact table. Other examples might be a check number in a retail POS transaction or a risk score in financial transactions.
Indexing fact tables is a difficult proposition. If you only have one or two dimensions you can easily index them. What if you have 10 or 20? In this case you’ll probably want a bitmap index on each FK. The actual facts in the table are almost never indexed.
To build a fact table you need to figure out where you’re getting the data from, what’s your grain, what fields describe the fact and what exactly is the fact.
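Pulling those pieces together, here is a sketch of what a fact row looks like, using a hypothetical retail-sales grain (the field names are mine, not from the seminar): dimension FKs that implement the grain, a degenerate dimension carried in the fact itself, and numeric measurements.

```python
# Sketch (hypothetical retail-sales grain): a fact row is a set of
# dimension FKs plus numeric measurements. The FKs together implement
# the grain; the order number rides along as a degenerate dimension.
from dataclasses import dataclass

@dataclass(frozen=True)
class SalesFact:
    date_key: int        # FK to the date dimension
    product_key: int     # FK to the product dimension
    store_key: int       # FK to the store dimension
    order_number: str    # degenerate dimension: no dimension table behind it
    quantity: int        # measurement
    amount: float        # measurement

row = SalesFact(date_key=20080105, product_key=42, store_key=7,
                order_number="A-1001", quantity=3, amount=29.97)
```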
3. The three grains of a fact table
Above I said that you could consider the grain the frequency of the data. In addition to that, there are three types of grain: Transaction, Periodic Snapshots and Accumulating Snapshots.
a. Transactions – This is the basic detail record of the warehouse: a POS transaction, an order detail line, or a web “click”. Kimball points out, and my experience agrees, that this grain type usually has a lot of dimensions and only exists if the detail event occurs. If the data from production is correct, there is no reason to change it.
b. Periodic Snapshots – This is a set of detail records that are repeated over time. I believe the example Kimball used was temperature in a room. If a system polls the temperature every fifteen minutes, the measurement would be the temp and the only dimensional change would be time.
c. Accumulating Snapshot – This is a fact that will change over time. The common example is student enrollment. The student’s data doesn’t change, but the dates for enrollment, admission, etc. do.
In my opinion, type c is useful only when all possible date changes are known in advance and there are only a set number of possible dates. I have run into examples where this is not feasible because the change, i.e. the state, is dependent on other factors in the system. A transaction may have one state change and a life span of a single day, or it may have 10 or 15 state changes over a lifetime of months. This is one of the things I will cover in my future star schema entries.
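To make the caveat concrete, here is a sketch of an accumulating snapshot for the enrollment example (milestone names are mine). It only works because the set of milestones is fixed in advance; an open-ended set of state changes would not fit this shape, which is exactly the limitation noted above.

```python
# Sketch: an accumulating-snapshot fact row whose milestone dates are
# filled in as the process advances. This only works because the set of
# milestones (applied/admitted/enrolled) is known and fixed up front.
MILESTONES = ("applied_date", "admitted_date", "enrolled_date")

def new_enrollment_fact(student_key):
    fact = {"student_key": student_key}
    fact.update({m: None for m in MILESTONES})  # unknown until the event occurs
    return fact

def record_milestone(fact, milestone, event_date):
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    fact[milestone] = event_date                # update the row in place
    return fact

fact = new_enrollment_fact(student_key=1)
record_milestone(fact, "applied_date", "2008-01-15")
record_milestone(fact, "admitted_date", "2008-03-02")
```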
I think that covers the gist of fact tables. Ralph covered a bit more than that but he had all day. Heh
4. Dimensions

A dimension describes a fact. Dimensional data is usually textual and descriptive: the date of a transaction, the product type, the location, etc.
A fact record should connect to only one dimension record in a particular dimension.
Dimensions are independent of other dimensions. If you want to find data in a dimension using criteria in a different dimension, you must join through the fact table.
As I said above (well, really as Ralph said, I’m just regurgitating), a warehouse uses surrogate keys. The PK of a dimension should be a sequence. The one exception Ralph lists is the date dimension. He says that the key should be the physical date. This is doable because the date is independent of application data. The date dimension should be a chronological table of dates with no gaps. It’s highly doubtful that the dates will change.
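Since the date dimension is just a fixed, gapless calendar, it can be generated rather than loaded from a source system. A minimal sketch (the attribute list here is my own, illustrative choice):

```python
# Sketch: build a gapless date dimension keyed by the physical date --
# the one exception to surrogate keys noted above.
from datetime import date, timedelta

def build_date_dimension(start, end):
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # the date itself is the key
            "full_date": d,
            "day_name": d.strftime("%A"),
            "month": d.month,
            "year": d.year,
        })
        d += timedelta(days=1)                      # one row per day, no gaps
    return rows

dim = build_date_dimension(date(2008, 1, 1), date(2008, 1, 31))
```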
Something Ralph said that resonated with me is that the dimensions implement the User Interface (UI). That is very true. If you think of a tool like BRIO or BO, the users are defining their interest using dimensions. It’s very important to name the dimensions with meaningful descriptions of what is in them.
Dimensions should implement a NULL or Unknown record. The fact FKs should never be null. If the descriptive data is unknown, the FK should point to the Unknown dimension record. You can implement this with a -1 or with another entry from your sequence but the fact is, pun intended, that a fact should not have nulls in FKs. Since a fact is made of FKs and measurements, I think it is safe to say a fact table should never contain nullable columns.
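A sketch of that rule at load time (the -1 convention and the lookup names are illustrative): any value that fails the dimension lookup gets pointed at the reserved Unknown row instead of producing a NULL FK.

```python
# Sketch: never load a NULL FK. Unresolvable values point at a reserved
# "Unknown" dimension row (key -1 here, by convention).
UNKNOWN_KEY = -1

product_dim = {"WIDGET": 101, "GADGET": 102}   # natural key -> surrogate key

def product_fk(natural_key):
    """Resolve a product code to its dimension key, defaulting to Unknown."""
    return product_dim.get(natural_key, UNKNOWN_KEY)

fk = product_fk("DOOHICKEY")   # code not in the dimension -> Unknown row
```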
A conformed dimension is just a dimension shared by two or more marts that matches either in structure or in the values it contains. Basically it means making sure your definitions conform across different marts. This is also called Master Data Management and is the basis of Kimball’s Bus architecture.
5. Changing Dimensions
There are three ways to handle changes to a dimension table: Over-write, Create New, or Save Old values.
a. Type 1 – Overwrite: This type is appropriate to correct errors. No history of the change is kept. You will probably have to recompute all aggregates that use this dimension.
b. Type 2 – Create New: This type is the most common in my experience. When you make a change, you create a new record with a new surrogate key and a new date. All future records will use this new record. When selecting facts, you can see the history of the values. Aggregates should not be affected.
c. Type 3 – Save Old Values: I have never had a use for this type. You would add previous_value columns and when there is a change move the old data to these columns and replace the normal columns with new data.
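The Type 2 approach above can be sketched in a few lines (function and column names are mine): a change expires the current row by stamping its end date, then inserts a replacement row under a fresh surrogate key.

```python
# Sketch: a Type 2 slowly-changing-dimension update. A change expires the
# current row and inserts a new row with a new surrogate key and date,
# so history is preserved and future facts pick up the new key.
from itertools import count

_seq = count(1)   # stands in for the dimension's key sequence

def scd2_change(dim_rows, natural_key, new_attrs, change_date):
    """Expire the current row for natural_key and append its replacement."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["end_date"] is None:
            row["end_date"] = change_date            # close out the history row
    dim_rows.append({
        "surrogate_key": next(_seq),
        "natural_key": natural_key,
        **new_attrs,
        "start_date": change_date,
        "end_date": None,                            # marks the current row
    })
    return dim_rows

dim = []
scd2_change(dim, "CUST-1", {"city": "Denver"}, "2008-01-01")
scd2_change(dim, "CUST-1", {"city": "Boulder"}, "2008-06-01")
```

Old facts keep pointing at the Denver row, new facts get the Boulder row, and no aggregates need recomputing.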
Ralph also talked about a Hybrid Type 2/3. His example was based on yearly changes to sales districts. I’ll let you attend the seminar or his class to find out more. I have not run into this particular type.
Another topic he covered briefly was monster dimensions. If you have a dimension, the example was customer dimension, that is huge and has some often changing data, it is appropriate to split that dimension into two dimensions, one for static data, i.e. name, address, date of birth, and a different dimension for volatile data, income, number of children, etc.
Something else he covered that I cannot find in my notes is junk dimensions. If you have a set of flags that are always set one way or another, instead of creating a bunch of yes/no dimensions, create a junk dimension containing all of the flags. The final data set would be a Cartesian product of all the individual flags. You can also do this for other unrelated data that would make sense to combine. If all of the possible values result in a fairly small data set, say up to 10,000 records, you should probably pre-load the data. If it’s more than that, you can populate it at load time. If a junk dimension might hit 1 million records, I would recommend splitting it apart. That’s just too big for a junk dimension.
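The pre-load case is exactly a Cartesian product, so it is a one-liner to generate. A sketch with a made-up set of flags (2 × 2 × 3 = 12 combinations, well under the pre-load threshold mentioned above):

```python
# Sketch: pre-load a junk dimension as the Cartesian product of a handful
# of unrelated low-cardinality flags, one row per combination.
from itertools import product

FLAGS = {
    "is_gift": ["Y", "N"],
    "is_rush": ["Y", "N"],
    "payment_type": ["CASH", "CREDIT", "CHECK"],
}

def build_junk_dimension(flags):
    names = list(flags)
    rows = []
    for key, combo in enumerate(product(*flags.values()), start=1):
        row = dict(zip(names, combo))
        row["junk_key"] = key           # surrogate key for this combination
        rows.append(row)
    return rows

junk = build_junk_dimension(FLAGS)      # 2 * 2 * 3 = 12 rows, loaded up front
```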
6. ETL – Delivering the data
Like I said above, the ETL portion was a pretty quick run through. We were running out of time. It’s also hard to talk about specifics without talking about a specific tool.
Ralph started out by defining E, T, and L, and added M and R.
a. E – Getting the data into the warehouse
b. T – Clean and Conform
c. L – Prepare for presentation
d. M – Manage all the processes
e. Optionally R – Adapt to real-time
The basic ideas here are that E is for figuring out where to go for the data and how to extract it: Profiling, Change Data Capture or Extracts. T is data quality, i.e. cleaning, mapping to conformed dimensions and error handling. L generates keys, makes changes to SCDs, populates your fact tables, handles any late arriving data and does any aggregation. M is your workflow, job scheduling, backups, etc. The optional R is to handle loading in real-time. As data volumes increase and batch windows decrease, it’s up to the warehouse professionals to figure out how to stream the data into the warehouse instead of batching it in.
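Without tying it to a specific tool, the E/T/L breakdown above can be sketched as a bare pipeline skeleton (all function names and fields here are hypothetical):

```python
# Sketch (hypothetical names): the E/T/L breakdown as a minimal pipeline.
def extract(source_rows):
    """E: pull rows out of the source (here, already in memory)."""
    return list(source_rows)

def transform(rows, product_dim, unknown_key=-1):
    """T: clean and conform -- map natural keys to conformed dimension keys."""
    out = []
    for r in rows:
        out.append({
            "product_key": product_dim.get(r["product_code"], unknown_key),
            "amount": round(float(r["amount"]), 2),   # basic cleaning
        })
    return out

def load(fact_table, rows):
    """L: deliver the conformed rows into the fact table."""
    fact_table.extend(rows)
    return fact_table

facts = []
source = [{"product_code": "WIDGET", "amount": "19.990"},
          {"product_code": "BAD", "amount": "5"}]
load(facts, transform(extract(source), {"WIDGET": 101}))
```

The M piece would wrap calls like these in scheduling, logging, and error handling; R would replace the batch `extract` with a streaming feed.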
Wow! I still have several pages of notes but I think that’s enough for today. I am working on several documents on building star schemas, starting with getting requirements on through design, development and implementation. I’ll put those out individually over the next couple of months with entries on unrelated topics.
I’ll say this again: if you can attend this seminar and you’re involved in your data warehouse at all, I recommend you attend. You’ll get some fairly interesting marketing literature from Informatica, Sybase, and Trillium, but more importantly, you’ll get to hear this stuff from the guy who invented it. Having the chance to grill him on topics near and dear to your heart is fun too.
I guess I should also say I am in no way associated with Kimball, the Kimball Group, Sybase, Informatica, or Trillium. I’m just a geek who enjoys these kinds of seminars and I never seem to get to enough of them.
If you have any questions, comments or concerns, please create a comment. I know you’re out there. I can hear you breathing. heh