Database Sharding

Any website or application eventually will grow big and scaling is inevitable, it is required to accommodate increase in traffic and space usage. Not only to scale but it is also critical to ensure data security and integrity.

So what is database Sharding?

Sharding involves breaking tables into table rows into multiple table it is related to a database architecture pattern called Horizontal Partitioning. Different partition will have the same column structure but have totally different rows of data. The data will held uniquely and independently in each partition usually have a certain identity, here is a little example of the concept :

img

While the original table contains all the data with any separations, after we shard the database we separate it into 2 kinds. A-G Shard & H-Z Shard Where the alphabet stands for the first letter of the key, we can each shard or database as a Node.

A Sharded will usually implements a shared-nothing-architecture. Which simply says each node in the system does not share any shared computing resource or access, which each unique request will be satisfied by a single node. This done to reduce contention within the system.

Sharding usually implemented in the application level, where the application will determine which database or table that they will request to.

Sharding Pros and Cons

Like any other architecture pattern, sharding have it have it pros and cons.

Pros

  1. Facilitate Horizontal Scaling

    Database sharding can improve system scalability it can help with Horizontal Scaling, in horizontal scaling we can add a new machine to help distributing load the system. Where in vertical scaling we need to scale the current machine RAM / CPU to a higher capabilities.

  2. Improve Performance

    Database Sharding can also improve performance of the system, in contrast when you wanted to query a data that is not sharded yet, you have to scan the whole databases. But when you are requesting data to a sharded databases, you can look the data in a specific datbase.

  3. Disaster Mitigation

    When using non sharded database, when the node is down everything is down. But when you are using a sharded database even tough some part of the system is down and cannot be accessed, the overall impact is still less than a monolithic single database.

Cons

  1. Extra layer of complexity

    Creating a sharded database is tricky, first you have to pick which architecture that are suitable with your system. And then you have to Implement it correctly a fail in the implementation of the architecture could lead to a slow and may even corrupted database. And even after you Implement it correctly it will still impact the workflow of your team.

  2. Unbalanced Shard

    Sharded databases are prone to unbalanced data distribution in example above we separate the data with range of A-G & H-Z. Let’s say that there is more data in the A-G shard and way less in the other shard this introduce a Database hotspot this may lead to application that access the A-G shard will be slower and eventually crashed. And may need to redesign or even reshard to create more balanced data distribution.

  3. It is hard to Un-shard

    After a database is sharded it might be hard to return it to unsharded architecture. Depending on the architecture you use, sharding a database can be very costly and time consuming.

Sharding Architectures

Key Based Sharding

In key based sharding also known as hash based sharding this method used a certain value that taken from a newly written data, such as id, email, username, zip code or maybe other attributes that considered as unique key and plugging the value to a hash function to produce a Hash Value. The Hash Value will be used as a refence to determine which shard will the data stored to.

Although that this architecture seems solid to work with, it could be a problem if you wanted to reshard a sharded databases. You may have re hash every data in the target database and it can subject to a long down time.

Range Based Sharding

Range based sharding example diagram

Range based sharding involves using a certain attribute that have range as a reference and then splitting the shard by range. in example above we split the data by using the price. One to keep in mind is that in range based sharding your database shard is really prone to unbalanced data.

Directory Based Sharding

Directory based sharding example diagram

Directory Sharding or Lookup Sharding. This sharding method involves a lookup table that function as a reference of where a data can be found. In this example the Delivery Zone is a lookup key and next to the attributes we can find the Shard Id which is the database reference.

With this architect we can have more flexibility and customization about where to store the data, we can even check which database have less data and store it there. By this we can also ensure that there is no unbalanced database issue.

Even though this architect is the most flexible among other architect, performance of directory sharding architect could be slower. Because you need to create an extra connection to lookup which database is the data stored.

But before you shard..

Even though sharding base is possible you might wanted to consider some other resolve to your data problem before deciding to shard your database, such :

  1. Implement Caching

    Your read may slow and you think to shard your database, but is it really problem?? Your user might only read certain data but alot of times. In that case it is best to just implement caching rather than Sharding.

  2. Create database replica

    By having multiple database replica you have the freedom to pick which database you wanted to read from. You could balance your read by using nginx to a less busy server. but your write will still only goes to one server and create a trigger so it automatically create replicas in other database. It will be a good solution if you have a read problem not a write one.

  3. Upgrading the server

    This is the most straight forward solution, and it maybe good and also bad at the same time. since vertical scaling have a limit.

Conclusion

Sharding database could be a good solution when you need to distribute your system and improve performance, but at the same sharding can be costly and it will be add extra layers of complexities to your system. You might wanted to consider to just move your database to natively distributed ones like Apache Cassandra.

This is an article that is amateurly written and still lacking alot of things. Kindly give me a feedback and say hi at my twitter! @arifluthfi16.

Sources

  1. https://stackoverflow.com/questions/18302773/what-are-horizontal-and-vertical-partitions-in-database-and-what-is-the-differen
  2. https://medium.com/@varshney.shivam786/database-sharding-system-design-1-63203f494fc0
  3. https://www.digitalocean.com/community/tutorials/understanding-database-sharding