Choosing a shard key

An important aspect of sharding is choosing a shard key. A shard key is a field that exists in every every shard, so the choice of a shard key is an important one. Key considerations in choosing a shard key are the following:

  • A shard key must be part of the table's primary key.  This requirement is to prevents scenarios where the data with the same primary key ends up in different nodes because the shard key changed.
  • Tables that are sharded should use the same shard key if they are joined together.  This choice leads to the best performance.
  • Choose a shard key that results in a wide variety of keys.  For example, if you have only two values for the shard key, but 128 partitions, then you will get all of the data into only two partitions and performance will be slowed.  
  • Choose a shard key so that the data is distributed well across the keys.  Do not choose a shard key where the data falls heavily into a couple of shard keys; this results in data in only a few partitions and degrades performance.

Any other recommendations or best practices that you would add?

10replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Bill Back - Details around how create table, create fact table, create dimension table impacts sharding strategy. There is some mention here: ·      https://help.thoughtspot.com/02_Administration/Administrator_Guide_4.2/020/040/030

    (TQL syntax for create fact table, create dimension table does not exist in documentation. We caught it in our DDL script which lead us to inquire. Pankaj also gave us some insight). 

    Reply Like
  • Bill, is there a way to check current Shard Keys?

    Reply Like
  • In TQL, if you enter

    SCRIPT TABLE table_name;

    at the bottom you will see the current sharding.

    Reply Like
  • Thanks Siva!

    Reply Like
  • This process is extremely useful when only considering sharding a single table or a single Fact-Dimension relationship.  Could you elaborate on this process when it gets more complex.  For example, sharding multiple dimension tables to the same fact table (e.g. 3 to 1),  or multiple dimensions to multiple facts (e.g. 2 to 2).  How do you choose shard keys when dimensions don't share similar fields except with the fact table?

    Reply Like
  • Curtis Balusek you should look at co-sharding documentation 

    Reply Like
  • Marc Price Are you talking about the information in the 5.3 documentation here: https://docs.thoughtspot.com/5.3/admin/loading/sharding.html
    Because it does not go into much detail about many dimensions to many facts.  Would you mind sharing the documentation you are referring to?

    Reply Like
  • Curtis Balusek it really depends on how big your dimension tables are 

     

    If there are quite small it will be ok to not shard them at all

    Reply Like
  • Marc Price Let me contextualize:  Let's assume I have an [Orders] table of 10B rows (2Tb), a [Customers] table of 300M rows (5Gb) and a [Products] table of 500M rows (8Gb).  I have a PK/FK relationship on custid for the [Customers]-[Orders] relationship and productid for the [Product]-[Orders] relationship.  How would you recommend co-sharding the tables involved? This is probably the 'simplest' example for explaining multiple co-sharding necessary to enhance overall performance. 

    Documentation discussing best practices around sharding multiple dimensions & facts would be extremely useful as I have yet to find something comprehensive.

    Reply Like 1
  • Hi

    I have also have question about sharding configuration. That all about distribution data cross cluster nodes. Right?

    So why according doc FOR 1 node installation we need 32 shards?

     

    Reply Like
Like3 Follow
  • 3 Likes
  • 4 wk agoLast active
  • 10Replies
  • 1477Views
  • 7 Following