Choosing a shard key

An important aspect of sharding is choosing a shard key. A shard key is a field that exists in every every shard, so the choice of a shard key is an important one. Key considerations in choosing a shard key are the following:

  • A shard key must be part of the table's primary key.  This requirement is to prevents scenarios where the data with the same primary key ends up in different nodes because the shard key changed.
  • Tables that are sharded should use the same shard key if they are joined together.  This choice leads to the best performance.
  • Choose a shard key that results in a wide variety of keys.  For example, if you have only two values for the shard key, but 128 partitions, then you will get all of the data into only two partitions and performance will be slowed.  
  • Choose a shard key so that the data is distributed well across the keys.  Do not choose a shard key where the data falls heavily into a couple of shard keys; this results in data in only a few partitions and degrades performance.

Any other recommendations or best practices that you would add?

10replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Bill Back - Details around how create table, create fact table, create dimension table impacts sharding strategy. There is some mention here: ·      https://help.thoughtspot.com/02_Administration/Administrator_Guide_4.2/020/040/030

    (TQL syntax for create fact table, create dimension table does not exist in documentation. We caught it in our DDL script which lead us to inquire. Pankaj also gave us some insight). 

    Like
  • Bill, is there a way to check current Shard Keys?

    Like
  • In TQL, if you enter

    SCRIPT TABLE table_name;

    at the bottom you will see the current sharding.

    Like
  • Thanks Siva!

    Like
  • This process is extremely useful when only considering sharding a single table or a single Fact-Dimension relationship.  Could you elaborate on this process when it gets more complex.  For example, sharding multiple dimension tables to the same fact table (e.g. 3 to 1),  or multiple dimensions to multiple facts (e.g. 2 to 2).  How do you choose shard keys when dimensions don't share similar fields except with the fact table?

    Like
  • Curtis Balusek you should look at co-sharding documentation 

    Like
  • Marc Price Are you talking about the information in the 5.3 documentation here: https://docs.thoughtspot.com/5.3/admin/loading/sharding.html
    Because it does not go into much detail about many dimensions to many facts.  Would you mind sharing the documentation you are referring to?

    Like
  • Curtis Balusek it really depends on how big your dimension tables are 

     

    If there are quite small it will be ok to not shard them at all

    Like
  • Marc Price Let me contextualize:  Let's assume I have an [Orders] table of 10B rows (2Tb), a [Customers] table of 300M rows (5Gb) and a [Products] table of 500M rows (8Gb).  I have a PK/FK relationship on custid for the [Customers]-[Orders] relationship and productid for the [Product]-[Orders] relationship.  How would you recommend co-sharding the tables involved? This is probably the 'simplest' example for explaining multiple co-sharding necessary to enhance overall performance. 

    Documentation discussing best practices around sharding multiple dimensions & facts would be extremely useful as I have yet to find something comprehensive.

    Like 2
  • Hi

    I have also have question about sharding configuration. That all about distribution data cross cluster nodes. Right?

    So why according doc FOR 1 node installation we need 32 shards?

     

    Like 1
Like3 Follow
  • 3 Likes
  • 9 mths agoLast active
  • 10Replies
  • 1531Views
  • 7 Following