Designing the Data Mart – Part 2
So now we have our transactional model and a basic user story:
Our first request from the business for our data mart is that they want to be able to query all of the orders by date, by customer and/or by region (state, city or country). They want to be able to aggregate (sum and average) across those items.
The first thing we need to do is talk to the business and find out exactly what that request means. Do the users want to see information about daily orders in general? Do they want to see what the status of orders are over time? Do they want to get dollars amounts?
We’ll have to pretend that we have had that conversation but in real life, make sure you have it with them. We’ll go with “sale amounts over time by customer” as being the most important requirement at the moment.
We have the requirement, now what? What is a data mart and what are we going to do with it? I will start with some definitions, for our purposes:
The Data Mart
A data mart is
- a specialized set of data structures that are easy to query, obvious as to meaning and capable of scaling.
- a subset of a businesses operational data store, usually covering only part of a specific business area
- made up of measurements and the attributes that describe those measurements (facts and dimensions)
- easy to enhance over time
- implemented as partially denormalized star schemas (and occasionally snowflakes)
A data mart is not
- intended to answer only a specific question
- an aggregated data store
A data mart may be part of a larger data warehouse but that is not necessarily true. The business need should determine the scale and scope of a warehouse. Data mart may equal data warehouse, if the data mart(s) provide the business with the solution required.
A star schema is a fact table surrounded by one or more dimensions. It’s called a star schema because if you draw a line around the outside points it looks something like a star. Whatever. Really, it’s just a series of measurements attached via foreign keys to the measurement’s descriptive attributes. So what is a fact table and what is a dimension? Read on.
A fact is a measurement. A count of something. A average or a summation. A trend over time. An actual physical testing of a value. A fact is usually a numeric value that can be added, summed, etc.
If we refer to our user requirement above, that would lead me to believe that our measurements for the first part of our data mart will come from the orders information. Referring back to our OE data model, I can see that orders are made up of two tables: sales orders and sales order lines.
What measurements can we get from those two tables? I see prices, amounts, number of order lines and a discount in the sales orders table. The lines table has the same sort of information as well as unit price and cost. From this information, we can also calculate orders per time period, orders by customer, etc. Plenty of measurements to start with.
A dimension is a data structure that defines a fact. A dimension stores attributes of descriptive information that directly relate to the measurement being reported.
Looking at the OE data model, my first pass at identifying important dimensions would be that orders obviously have a customer, the customer orders products, the sold product comes from a vendor and the product ships as a unit of some kind. I will label those ideas as my first attempt at dimensions.
It’s not important that I get it right on the first pass. For the moment, I just want to start building the overall approach in my head.
While it is not on the model itself, there is a very obvious dimension that is missing. That would be some kind of date/time dimension. Orders are sold at a point in time. Looking at the definition of of an order (and maybe reviewing some data and confirming with the users), I see that an order only stores dates and not times.
We will need a date dimension and that leads me to declared grain.
Grain and granularity
The lowest level of detail in your data mart is referred to as its grain and/or its granularity. In our model, we can’t get any lower than day (in time) because that is the lowest level of time based detail stored in out source system (for the data we have looked at so far). However, our grain is not time based. We also store records at the order lines level. That is truly our grain (called a transactional grain). By storing data at the lowest level, we can provide aggregated information at any higher level.
If we only stored data at the day level, we could never say exactly how each order was created. If we only stored data at the customer level, we can still not say anything about the individual orders.
Our declared grain is the order lines level. Now that we have declared the grain of the first star schema, it’s time to discuss conformed dimensions.
A conformed dimension seems to give some people trouble. I think that, as IT people especially, we tend to make things harder than they need to be. A conformed dimension is nothing more than saying what you mean and sticking to it.
When I define a dimension, say customer for example, I give that dimension a set of attributes. A customer will have a name, an address, a credit rating, etc. From now on, when I refer to customer, that is what I mean. If I need a dimension that stores customer data but will also include other attributes, I will either conform my original dimension so that I still end up with a single customer dimensions, or I will give the new dimension a different name.
When I speak about a “customer” it is important that everyone understand what I mean. When I include the customer dimension is a query, that customer dimension should be the exact same dimension for every fact table in my data mart (and preferably for any data mart in my company).
A conformed dimension means that I have defined a dimension and I will make sure that the dimension maintains that same meaning regardless of who is using it. That’s all.
Part 2 Summary
So, I have identified my fact table and some dimensions. The next step is to start defining the attributes that will live in my new objects.
What I have so far is:
You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.