- Back to Home »
- Database Sharding
As the data grows over time, the database will be
overloaded and triggers a strong need to scale the data tier.
Database
Scaling:
There are 2 ways to scale a database.
1.Vertical
scaling
·
Add more power (CPU, RAM, DISK, etc.) to an existing
machine.
·
Drawbacks:
o
Hardware limits.
o
With large userbase, a single server is not enough.
o
Single point of failures.
o
Cost of vertical scaling is high. Powerful servers are
expensive.
2.Horizontal
scaling
Also known as sharding, is the practice of adding more
servers.
It separates large databases into smaller, more easily
managed parts called shards. Schema corresponding to each shard is same even
though the data on each shard is unique.
Consider the tinder application in which the user data
is allocated to a database server based on location. Anytime you access data, a hash function is
used to find the corresponding shard.
Consider an example, where the database servers are segregated
based on user_id. Say user_id % 4 is used as the hash function. If the result
equals to 0, shard 0 is used to store and fetch data. If the result equals to
1,shard 1 is used.
Horizontal
Vs Vertical Partitions:
Advantages of Sharding
Sharding allows you to scale your database to handle
increased load to a nearly unlimited degree by providing increased
read/write throughput, storage capacity, and high availability. Let’s
look at each of those in a little more detail.
- Increased
Read/Write Throughput — By distributing the dataset across multiple
shards, both read and write operation capacity is increased as long as
read and write operations are confined to a single shard.
- Increased
Storage Capacity — Similarly, by increasing the number of shards, you
can also increase overall total storage capacity, allowing near-infinite
scalability.
- High
Availability — Finally, shards provide high availability in two ways.
First, since each shard is a replica set, every piece of data is
replicated. Second, even if an entire shard becomes unavailable since the
data is distributed, the database as a whole still remains partially
functional, with part of the schema on different shards.
Important
Considerations:
Sharding key: The
most important factor to consider when implementing a sharding strategy is the
choice of the sharding key. Sharding key (known as a partition key) consists of
one or more columns that determine how data is distributed. A sharding key
allows you to retrieve and modify data efficiently by routing database queries
to the correct database.
Challenges
with Sharding:
1.Resharding
data:
Resharding data is needed when
1) a single shard could no longer hold more data due
to rapid growth.
2) Certain shards might experience shard exhaustion
faster than others due to uneven
data distribution. When shard exhaustion happens, it
requires updating the sharding
function and moving data around.
2.Celebrity problem: This is also called a
hotspot key problem. Excessive access to a specific shard could cause server
overload. Imagine data for X, Y and Z users calls end up on the same shard. For
social applications, that shard will be overwhelmed with read operations. To
solve this problem, we may need to allocate a shard for each celebrity. Each
shard might even require further partition.
3.Join and de-normalization: Once
a database has been sharded across multiple servers, it is hard to perform join
operations across database shards. A common workaround is to denormalize the
database so that queries can be performed in a single table.
Alternatives to Sharding?
Sharding adds operational complexity, it is usually performed when
dealing with very large data.
Below are the scenarios where it may be beneficial to shard a database:
·
The amount of application data grows to exceed the
storage capacity of a single database node.
·
The volume of writes or reads to the database
surpasses what a single node or its read replicas can handle, resulting in
slowed response times or timeouts.
·
The network bandwidth required by the application
outpaces the bandwidth available to a single database node and any read
replicas, resulting in slowed response times or timeouts.
Following are the alternatives to consider before considering sharding
- Setting
up a remote database:
If you’re working with a monolithic
application in which all of its components reside on the same server, you can
improve your database’s performance by moving it over to its own machine. This
doesn’t add as much complexity as sharding since the database’s tables remain
intact. However, it still allows you to vertically scale your database apart
from the rest of your infrastructure.
- Implementing caching:
If your application’s read performance is
what’s causing you trouble, caching is one strategy that can help to improve
it. Caching involves temporarily storing data that has already been requested
in memory, allowing you to access it much more quickly later on.
- Creating
one or more read replicas:
Another strategy that can help to improve read
performance, this involves copying the data from one database server (the primary
server) over to one or more secondary servers. Following this, every new
write goes to the primary before being copied over to the secondaries, while
reads are made exclusively to the secondary servers. Distributing reads and
writes like this keeps any one machine from taking on too much of the load,
helping to prevent slowdowns and crashes. Note that creating read replicas
involves more computing resources and thus costs more money, which could be a
significant constraint for some.
- Upgrading
to a larger server:
In most cases, scaling up one’s database
server to a machine with more resources requires less effort than sharding. As
with creating read replicas, an upgraded server with more resources will likely
cost more money. Accordingly, you should only go through with resizing if it
truly ends up being your best option.
Sharding
Architectures
1. Key Based Sharding
This
technique is also known as hash-based sharding. Here, we take
the value of an entity such as customer ID, customer email, IP address of a
client, zip code, etc and we use this value as an input of the hash
function. This process generates a hash value which is
used to determine which shard we need to use to store the data.
Consider an example that you have 3
database servers and each request has an application id which is incremented by
1 every time a new application is registered. To determine which server data
should be placed on, we perform a modulo operation on these applications id
with the number 3. Then the remainder is used to identify the server to store
our data.
The downside of this
method is elastic load balancing which means if you
will try to add or remove the database servers dynamically it will be a
difficult and expensive process.
2. Vertical Sharding
In this method, we split the entire
column from the table and we put those columns into new distinct tables. Data
is totally independent of one partition to the other ones. Also, each partition
holds both distinct rows and columns. Take the example of Twitter features. We
can split different features of an entity in different shards on different
machines. On Twitter users might have a profile, number of followers, and some
tweets posted by his/her own. We can place the user profiles on one shard,
followers in the second shard, and tweets on a third shard.
The main drawback of this scheme is that to answer some queries you may
have to combine the data from different shards which unnecessarily increases
the development and operational complexity of the system.
3. Directory-Based Sharding
In this
method, we create and maintain a lookup service or lookup
table for the original database. Basically we use a shard key for
lookup table and we do mapping for each entity that exists in
the database. This way we keep track of which database shards hold which
data.