Database Sharding

Posted by : Sushanth Friday, 24 December 2021

Database Sharding:

As the data grows over time, the database will be overloaded and triggers a strong need to scale the data tier.

Database Scaling:

There are 2 ways to scale a database.

1.Vertical scaling

· Add more power (CPU, RAM, DISK, etc.) to an existing machine.

· Drawbacks:

o Hardware limits.

o With large userbase, a single server is not enough.

o Single point of failures.

o Cost of vertical scaling is high. Powerful servers are expensive.

2.Horizontal scaling

Also known as sharding, is the practice of adding more servers.

It separates large databases into smaller, more easily managed parts called shards. Schema corresponding to each shard is same even though the data on each shard is unique.

Consider the tinder application in which the user data is allocated to a database server based on location. Anytime you access data, a hash function is used to find the corresponding shard.

Consider an example, where the database servers are segregated based on user_id. Say user_id % 4 is used as the hash function. If the result equals to 0, shard 0 is used to store and fetch data. If the result equals to 1,shard 1 is used.

Horizontal Vs Vertical Partitions:

Advantages of Sharding

Sharding allows you to scale your database to handle increased load to a nearly unlimited degree by providing increased read/write throughput, storage capacity, and high availability. Let’s look at each of those in a little more detail.

Increased Read/Write Throughput — By distributing the dataset across multiple shards, both read and write operation capacity is increased as long as read and write operations are confined to a single shard.
Increased Storage Capacity — Similarly, by increasing the number of shards, you can also increase overall total storage capacity, allowing near-infinite scalability.
High Availability — Finally, shards provide high availability in two ways. First, since each shard is a replica set, every piece of data is replicated. Second, even if an entire shard becomes unavailable since the data is distributed, the database as a whole still remains partially functional, with part of the schema on different shards.

Important Considerations:

Sharding key: The most important factor to consider when implementing a sharding strategy is the choice of the sharding key. Sharding key (known as a partition key) consists of one or more columns that determine how data is distributed. A sharding key allows you to retrieve and modify data efficiently by routing database queries to the correct database.

Challenges with Sharding:

1.Resharding data:

Resharding data is needed when

1) a single shard could no longer hold more data due to rapid growth.

2) Certain shards might experience shard exhaustion faster than others due to uneven

data distribution. When shard exhaustion happens, it requires updating the sharding

function and moving data around.

2.Celebrity problem: This is also called a hotspot key problem. Excessive access to a specific shard could cause server overload. Imagine data for X, Y and Z users calls end up on the same shard. For social applications, that shard will be overwhelmed with read operations. To solve this problem, we may need to allocate a shard for each celebrity. Each shard might even require further partition.

3.Join and de-normalization: Once a database has been sharded across multiple servers, it is hard to perform join operations across database shards. A common workaround is to denormalize the database so that queries can be performed in a single table.

Alternatives to Sharding?

Sharding adds operational complexity, it is usually performed when dealing with very large data.

Below are the scenarios where it may be beneficial to shard a database:

· The amount of application data grows to exceed the storage capacity of a single database node.

· The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.

· The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.

Following are the alternatives to consider before considering sharding

Setting up a remote database:

If you’re working with a monolithic application in which all of its components reside on the same server, you can improve your database’s performance by moving it over to its own machine. This doesn’t add as much complexity as sharding since the database’s tables remain intact. However, it still allows you to vertically scale your database apart from the rest of your infrastructure.

Implementing caching:

If your application’s read performance is what’s causing you trouble, caching is one strategy that can help to improve it. Caching involves temporarily storing data that has already been requested in memory, allowing you to access it much more quickly later on.

Creating one or more read replicas:

Another strategy that can help to improve read performance, this involves copying the data from one database server (the primary server) over to one or more secondary servers. Following this, every new write goes to the primary before being copied over to the secondaries, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, helping to prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.

Upgrading to a larger server:

In most cases, scaling up one’s database server to a machine with more resources requires less effort than sharding. As with creating read replicas, an upgraded server with more resources will likely cost more money. Accordingly, you should only go through with resizing if it truly ends up being your best option.

Sharding Architectures

1. Key Based Sharding

This technique is also known as hash-based sharding. Here, we take the value of an entity such as customer ID, customer email, IP address of a client, zip code, etc and we use this value as an input of the hash function. This process generates a hash value which is used to determine which shard we need to use to store the data.

Consider an example that you have 3 database servers and each request has an application id which is incremented by 1 every time a new application is registered. To determine which server data should be placed on, we perform a modulo operation on these applications id with the number 3. Then the remainder is used to identify the server to store our data.

The downside of this method is elastic load balancing which means if you will try to add or remove the database servers dynamically it will be a difficult and expensive process.

2. Vertical Sharding

In this method, we split the entire column from the table and we put those columns into new distinct tables. Data is totally independent of one partition to the other ones. Also, each partition holds both distinct rows and columns. Take the example of Twitter features. We can split different features of an entity in different shards on different machines. On Twitter users might have a profile, number of followers, and some tweets posted by his/her own. We can place the user profiles on one shard, followers in the second shard, and tweets on a third shard.

The main drawback of this scheme is that to answer some queries you may have to combine the data from different shards which unnecessarily increases the development and operational complexity of the system.

3. Directory-Based Sharding

In this method, we create and maintain a lookup service or lookup table for the original database. Basically we use a shard key for lookup table and we do mapping for each entity that exists in the database. This way we keep track of which database shards hold which data.

Subscribe to Posts | Subscribe to Comments

Technical Articles

Software Programming articles