Slowly Changing Dimensions
ThoughtSpot with slowly changing dimensions.
In the data loading module, at about minute 44 of the bootcamp presentation video, the presenter talks about slowly changing dimensions, i.e., dealing with duplicate primary keys with history. She says that in slowly changing dimensions, a primary key to foreign key relationship is not used.
In a slowly changing dimension scenario, a database applies a surrogate key to the fact table that links to the dimension, where that surrogate key is unique but the primary key can be duplicated due to the history. My question is: is this data layout supported in ThoughtSpot? I would think it should be, because it would just be an import with a modified structure where the dimension table is modified. It would still be a one-to-many from the surrogate key (unique) in the dimension to the surrogate key (many) in the fact table.
Who knows the answer to this question?
Yes, ThoughtSpot supports slowly changing dimensions, and it does so with a relationship join. The TQL docs are here, in the second-to-last section: https://docs.thoughtspot.com/6.0/reference/sql-cli-commands.html
Whenever I've done this in the past, I've followed the same syntax as the example:
ALTER TABLE "wholesale_buys" ADD RELATIONSHIP "REL_fruit" WITH "retail_sales" AS "wholesale_buys"."fruit" = "retail_sales"."fruit" AND ("wholesale_buys"."date_order" < "retail_sales"."date_sold" AND "retail_sales"."date_sold" < "wholesale_buys"."expire_date");
As you can see, the join combines an equality on the key column with a range condition on the date columns. Does that fit what you are looking for?
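To make the semantics of that relationship concrete, here is a small sketch of the same range join in SQLite (via Python). The table and column names follow the docs example; the rows themselves are made up:

```python
import sqlite3

# Sketch of the range-join semantics from the TQL example, using SQLite.
# Table and column names follow the docs example; the data is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE wholesale_buys (fruit TEXT, date_order TEXT, expire_date TEXT)")
con.execute("CREATE TABLE retail_sales (fruit TEXT, date_sold TEXT, qty INTEGER)")
con.executemany("INSERT INTO wholesale_buys VALUES (?, ?, ?)", [
    ("apple", "2024-01-01", "2024-02-01"),
    ("apple", "2024-02-01", "2024-03-01"),
])
con.executemany("INSERT INTO retail_sales VALUES (?, ?, ?)", [
    ("apple", "2024-01-15", 10),  # falls in the first buy window
    ("apple", "2024-02-15", 20),  # falls in the second buy window
])
rows = con.execute("""
    SELECT w.date_order, r.date_sold, r.qty
    FROM wholesale_buys w
    JOIN retail_sales r
      ON w.fruit = r.fruit
     AND w.date_order < r.date_sold
     AND r.date_sold < w.expire_date
""").fetchall()
# Each sale matches exactly one buy window because the windows do not overlap.
```

Because the two date windows do not overlap, each sale lands in exactly one buy window, so the join stays effectively one-to-many.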
Thanks Aaron. I'm not sure that it does; I'm still trying to work through that logic. Unless I'm missing something, a range join on dates could cause duplicates, whereas the normal slowly changing dimension with a surrogate key would be a simple one-to-many, with the surrogate key being the new primary key and all the joins being equijoins. The other thing I'm not clear on in your example is which is the fact table and which is the dimension? Also, how would it work if you're trying to track the history of a customer where previous information would be relevant?
Adding a timestamp would make the rows unique, but my question was more towards the feasibility of bringing in a table from a database that already supports history in the traditional way, i.e., the dimension having a surrogate key and a primary key, with the surrogate key being the foreign key in the fact table. I fully understand that ThoughtSpot is only a reporting tool. It's not meant to actually implement the slowly changing dimension strategy. But surrogate key (dimension) to surrogate key (fact) should work, shouldn't it? Why do we use date range keys?
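The traditional layout described here can be sketched with a plain equijoin (shown in SQLite via Python; all table, column, and supplier names are hypothetical):

```python
import sqlite3

# Minimal sketch of the traditional SCD Type II layout: the dimension carries
# both the natural key (product) and a unique surrogate key (sk); the fact
# table stores only the surrogate key. All names here are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_product (sk INTEGER PRIMARY KEY, product TEXT, supplier TEXT)")
con.execute("CREATE TABLE fact_sales (sk INTEGER, amount REAL)")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "apple", "Fuji"),     # historical version
    (2, "apple", "Gala Co"),  # current version
])
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 10.0), (2, 25.0), (2, 5.0)])
# Plain equijoin on the surrogate key: one row in the dimension, many in the fact.
rows = con.execute("""
    SELECT d.supplier, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d ON f.sk = d.sk
    GROUP BY d.supplier
""").fetchall()
```

Each fact row resolves to exactly one dimension version, so history is reported correctly with no date logic at all.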
In this example I just posted, assume the company wants one supplier for apples, only one, so the primary key is the product "apple" in this case. We want to track the history of the changes in the supplier of this product. (Normally I would put Product ID, but ThoughtSpot likes seeing words instead of codes.)
Remember that ThoughtSpot does not do the logic of tracking the history; the database does. ThoughtSpot merely reports on the results.
We see the dates a product supplier was effective and ceased to be effective (end date), and finally a flag of whether or not it's current. The dates and the current indicator are not relevant to reporting, so they would be excluded from the import into ThoughtSpot.
Whereas product used to be unique, here it is repeated, but upon each change of current status, a new surrogate key has been applied. It is the primary key in the dimension we see here but would be repeated in the fact table. If, upon reporting, one would like to see the transactional history of when the primary supplier was Fuji, they would use surrogate key 1, whereas all the current products would, in this case, use surrogate key 3 in the fact table.
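The dimension described here might look something like the sketch below (the suppliers behind surrogate keys 2 and 3 are invented for illustration):

```python
# A sketch of the dimension described above; the second and third suppliers
# are invented for illustration.
dim = [
    # (surrogate_key, product, supplier, is_current)
    (1, "apple", "Fuji",       0),
    (2, "apple", "Acme Farms", 0),
    (3, "apple", "Best Fruit", 1),
]
# History for the Fuji era is fetched via surrogate key 1; current facts
# carry surrogate key 3.
current = [row for row in dim if row[3] == 1]
```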
In answer to your question about the SK: for all intents and purposes, it becomes the new primary key. In fact, knowing this is important, because if an upload is done using the "former" PK, you would get a bunch of upserts, retaining only the most current record for that PK.
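That upsert hazard is easy to demonstrate. The sketch below uses SQLite's INSERT OR REPLACE as a stand-in for an upsert-style load; the supplier names are hypothetical:

```python
import sqlite3

# Sketch of why the surrogate key must be declared as the primary key:
# if the old natural key (product) were still the PK, an upsert-style load
# would keep only the latest row per product. Names are hypothetical.
con = sqlite3.connect(":memory:")

# Wrong: natural key as PK -- the second load replaces the first.
con.execute("CREATE TABLE dim_bad (product TEXT PRIMARY KEY, supplier TEXT)")
for product, supplier in [("apple", "Fuji"), ("apple", "NewCo")]:
    # INSERT OR REPLACE stands in for an upsert keyed on the former PK.
    con.execute("INSERT OR REPLACE INTO dim_bad VALUES (?, ?)", (product, supplier))
rows_bad = con.execute("SELECT * FROM dim_bad").fetchall()
# Only the most recent supplier survives -- history is lost.

# Right: surrogate key as PK -- every version is retained.
con.execute("CREATE TABLE dim_good (sk INTEGER PRIMARY KEY, product TEXT, supplier TEXT)")
for sk, product, supplier in [(1, "apple", "Fuji"), (2, "apple", "NewCo")]:
    con.execute("INSERT OR REPLACE INTO dim_good VALUES (?, ?, ?)", (sk, product, supplier))
rows_good = con.execute("SELECT * FROM dim_good").fetchall()
```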
All this behavior is managed in the database, but we're bringing this data into TS for reporting, so we would not need the extra fields like the date fields; however, we would need to tell TS that the SK is the new primary key.
I'm not asking whether it would work; I'm quite confident that it would, because a reporting tool sees the SK as unique in the dimension and repeated in the fact.
The reason for the question that started this thread is that, coming from the database industry, I like to protect the integrity of the data, and it seems that joining on date ranges would at some point lend itself to returning multiple records, some of which should not be returned, unless this were somehow managed with code somewhere. Since I don't know some of the inner workings of TS, I'm wondering whether there is something in TS that is managing this, or whether the logic of the statement presented actually does guarantee uniqueness where needed.
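The integrity concern is real if the effective-date ranges are allowed to overlap. The sketch below (SQLite via Python, with hypothetical names and dates) shows a single fact row matching two dimension versions when their windows overlap:

```python
import sqlite3

# Sketch of the integrity concern: if two versions of a dimension row have
# overlapping effective-date ranges, a range join returns the same fact row
# twice. All names and dates are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim (sk INTEGER, product TEXT, eff_from TEXT, eff_to TEXT)")
con.execute("CREATE TABLE fact (product TEXT, sale_date TEXT, amount REAL)")
con.executemany("INSERT INTO dim VALUES (?, ?, ?, ?)", [
    (1, "apple", "2024-01-01", "2024-02-15"),  # overlaps the next version
    (2, "apple", "2024-02-01", "2024-12-31"),
])
con.execute("INSERT INTO fact VALUES ('apple', '2024-02-10', 9.99)")
rows = con.execute("""
    SELECT d.sk FROM fact f
    JOIN dim d ON f.product = d.product
              AND d.eff_from <= f.sale_date AND f.sale_date < d.eff_to
""").fetchall()
# The single sale matches both sk 1 and sk 2 -- a duplicate.
```

With strictly non-overlapping windows (each version ending exactly where the next begins), the same join returns each fact row once, which is the uniqueness guarantee the date-range approach relies on.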
I totally understand your viewpoint and need for data integrity.
That final clarification is definitely the key to understanding context here. I'll let a Spotter address the question, but I don't see any reason why you'd have to worry about the model integrity here.
I'm personally left unconvinced that you'd need a date join in this situation. As long as you have a unique SK and define the relationship appropriately, I'm thinking you'll be just fine. If you have an available cluster, this would be cheap to test out on a small dataset.
I have modelled SCD Type II with both options discussed in this thread. Both are valid solutions. Some points to consider:
- A date range join at high data volumes will be slower than a primary key to foreign key relationship join. For a model I built with 16 billion records, this was the reason we went with a surrogate key.
- You should not need a bridge table to resolve SCD Type IIs.
- A date range solution ensures that you can easily model an SCD Type I with the same model, i.e. the current values. This is because you can create a separate table with the current values that joins to the fact on the ProductID rather than the surrogate key. Therefore it is useful to have both values on the fact.
- If modelled correctly, i.e. no overlapping dates, then you do not get duplicate records.
- I would not go with a bridging table, as at volume this will degrade performance.
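The point about keeping both values on the fact can be sketched as follows: with both the ProductID and the surrogate key on the fact, one model serves Type II reporting (join on SK) and Type I reporting (join a current-values table on ProductID). All names here are hypothetical:

```python
import sqlite3

# Sketch: the fact carries both the natural key (product_id) and the
# surrogate key (sk), so it can join to a history dimension (Type II)
# or a current-values dimension (Type I). Names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (product_id TEXT, sk INTEGER, amount REAL)")
con.execute("CREATE TABLE dim_history (sk INTEGER PRIMARY KEY, supplier TEXT)")
con.execute("CREATE TABLE dim_current (product_id TEXT PRIMARY KEY, supplier TEXT)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [("apple", 1, 10.0), ("apple", 3, 20.0)])
con.executemany("INSERT INTO dim_history VALUES (?, ?)",
                [(1, "Fuji"), (3, "Best Fruit")])
con.execute("INSERT INTO dim_current VALUES ('apple', 'Best Fruit')")

# Type II view: each fact keeps the supplier that was effective at the time.
hist = con.execute(
    "SELECT h.supplier, f.amount FROM fact f "
    "JOIN dim_history h ON f.sk = h.sk ORDER BY f.sk").fetchall()

# Type I view: everything restated under the current supplier.
curr = con.execute(
    "SELECT c.supplier, SUM(f.amount) FROM fact f "
    "JOIN dim_current c ON f.product_id = c.product_id "
    "GROUP BY c.supplier").fetchall()
```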