What is Skew, and What are Skew Handling Techniques?

What is Skew in a Database?

In database management and distributed systems, skew happens when data or workload isn’t evenly distributed across the system’s resources (like servers or processors). Imagine you have five people trying to carry a load, but one person is carrying half of it while the other four share the rest. That one person is overloaded, which slows down the whole group. Similarly, in a database, when one server or partition has more data or processing work than others, it’s called skew, and it creates inefficiency because some resources are overloaded while others are underutilized.

Types of Skew:

  1. Data Skew: When data is unevenly distributed, with certain partitions having much more data than others.
  2. Execution/Processing Skew: When processing tasks are not balanced across resources, leading some processors or nodes to finish much later than others.

Why is Skew a Problem?

Skew can make the entire system slower, drive up costs, waste computing resources, and make it harder to get results efficiently. Handling skew is essential in big data systems, parallel processing, and databases that distribute work across multiple nodes.

Skew Handling Techniques

Let’s look at various methods used to handle skew, each with an explanation.

  1. Hash Partitioning
    • Explanation: Hash partitioning assigns data based on the output of a hash function. The function takes a key (like an ID or name) and generates a value that determines which partition the data goes into. This helps evenly distribute data across partitions, assuming the hash function distributes outputs uniformly.
    • How It Helps: By hashing keys, the data is spread more evenly across the system. For example, if data is divided based on UserID % 10, then users with IDs ending in different digits go to different partitions.
    • Limitations: Hash partitioning might still create skew if certain keys or values are especially common, which leads to "hot" partitions. For instance, if one user has significantly more data than others, the partition holding that user’s data could become overloaded.
  2. Range Partitioning with Bucketing
    • Explanation: In range partitioning, data is divided into ranges (for example, users with IDs from 1-1000 in one partition, 1001-2000 in another, etc.). This can lead to skew if certain ranges are very populated. Bucketing solves this by dividing each range into smaller, equally-sized “buckets.”
    • How It Helps: By dividing each range further, bucketing reduces the chance that any one range will overload a single partition. For example, if there are many users in the ID range 1-1000, dividing it into smaller chunks across different servers will balance the load.
    • Limitations: Requires knowledge of data distribution to set appropriate ranges and bucket sizes. If data is very unevenly distributed, you may still experience skew.
  3. Salting
    • Explanation: Salting adds random or semi-random values (salt) to keys before partitioning, spreading similar data across multiple partitions. This technique is often used when a specific key is heavily used.
    • How It Helps: By slightly altering keys with a salt value, the same key doesn’t keep getting sent to the same partition. For instance, adding a random number to a user’s ID can distribute data more evenly across servers.
    • Example: If a database table has a very popular city, like “New York,” which might go to the same partition, adding random numbers to “New York” could make it appear as "New York_1," "New York_2," etc., and spread these entries across multiple partitions.
    • Limitations: Increases complexity when querying, since the original keys are modified with a salt; a read for one logical key must fan out to every salted variant, and aggregations must merge results back together.
  4. Replicating Skewed Data
    • Explanation: For very commonly accessed or “hot” data, replication means creating multiple copies of the same data across different partitions or nodes.
    • How It Helps: When multiple copies of frequently accessed data exist, requests for that data can be distributed across copies, reducing the load on any single partition. For example, if many users are accessing “Product A,” having multiple copies allows these requests to be distributed.
    • Limitations: Requires more storage, which could be expensive, and updates to the data require synchronizing all replicas, adding complexity.
  5. Load Balancing Based on Monitoring
    • Explanation: This approach dynamically monitors system load and redistributes data or tasks based on current usage. When certain partitions or nodes show heavy usage, the system can shift data or tasks to balance the load.
    • How It Helps: By actively monitoring and redistributing load in real-time, this technique ensures the system stays balanced and adjusts to changing workloads or data patterns.
    • Limitations: Real-time monitoring and redistribution require complex infrastructure, which can be computationally intensive and challenging to implement.
  6. Adaptive Query Processing
    • Explanation: In adaptive query processing, the system analyzes the execution of queries and dynamically adjusts the processing strategy based on intermediate results or identified bottlenecks.
    • How It Helps: Adaptive query processing helps reduce processing skew by shifting parts of a query to less-loaded nodes as the query is running, improving overall performance and reducing idle times for certain nodes.
    • Limitations: Requires complex query execution engines that can monitor performance, analyze bottlenecks, and adjust processing paths on-the-fly.
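As a concrete illustration of technique 1, here is a minimal hash-partitioning sketch in Python. The partition count and key format are arbitrary choices for the example, and a stable hash (MD5 here) is used instead of Python's built-in hash(), whose output varies between runs.

```python
import hashlib

NUM_PARTITIONS = 10

def partition_for(key: str) -> int:
    """Map a key to a partition using a stable hash function."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Spread 1,000 synthetic user IDs across the partitions.
counts = [0] * NUM_PARTITIONS
for user_id in range(1000):
    counts[partition_for(f"user-{user_id}")] += 1

print(counts)  # each partition holds roughly 100 keys
```

If many rows share one key (one very heavy user), they all still hash to the same partition, which is exactly the "hot partition" limitation described above.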
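The range-plus-bucketing idea in technique 2 can be sketched as follows; the range size and the bucket count are illustrative assumptions, not fixed rules.

```python
from collections import Counter

RANGE_SIZE = 1000      # IDs 0-999 -> range 0, 1000-1999 -> range 1, ...
BUCKETS_PER_RANGE = 4  # assumption: split each range into 4 buckets

def bucket_for(user_id: int) -> tuple[int, int]:
    """Return (range, bucket) so a crowded range spreads over buckets."""
    range_idx = user_id // RANGE_SIZE
    bucket_idx = user_id % BUCKETS_PER_RANGE
    return (range_idx, bucket_idx)

# A hot range: 900 of the users fall in IDs 0-999.
load = Counter(bucket_for(uid) for uid in range(900))
print(load)  # the hot range is split evenly into 4 buckets of 225
```

Without the bucketing step, all 900 users would land on whichever node owns range 0.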
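Technique 3 (salting) might look like the sketch below; the salt range of 4 and the underscore separator are assumptions made for this example.

```python
import random
from collections import Counter

SALT_RANGE = 4  # assumption: split each hot key across 4 salted variants

def salted_key(key: str) -> str:
    """Append a random salt so one hot key maps to several partitions."""
    return f"{key}_{random.randrange(SALT_RANGE)}"

def unsalt(key: str) -> str:
    """Recover the original key; readers must query all salted variants."""
    return key.rsplit("_", 1)[0]

random.seed(42)  # seeded only so the demo is repeatable
writes = Counter(salted_key("New York") for _ in range(10_000))
print(writes)  # load is now split across New York_0 .. New York_3
```

The unsalt helper shows the cost named in the limitations: every query for "New York" now has to look in all four salted variants.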
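A simple way to spread reads across the replicas described in technique 4 is round-robin routing; the node names below are purely illustrative.

```python
import itertools

# Assumption: three nodes each hold a full replica of hot "Product A" data.
replicas = ["node-1", "node-2", "node-3"]
next_replica = itertools.cycle(replicas)

def read_product_a() -> str:
    """Route each read to the next replica instead of one hot node."""
    return next(next_replica)

hits = [read_product_a() for _ in range(9)]
print(hits)  # each node serves 3 of the 9 reads
```

Writes are where the cost appears: every update to Product A must be applied to all three copies.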
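A toy version of the monitoring-based rebalancing in technique 5: whenever one node's load exceeds the cluster average by a threshold, half the gap is moved to the least-loaded node. Real balancers are far more sophisticated; this is only a sketch with made-up node names.

```python
def rebalance(load: dict[str, int], threshold: float = 1.5) -> dict[str, int]:
    """Shift load from the hottest node to the coldest until no node
    exceeds `threshold` times the average load."""
    load = dict(load)
    avg = sum(load.values()) / len(load)
    while max(load.values()) > threshold * avg:
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        moved = (load[hot] - load[cold]) // 2
        if moved == 0:  # nothing left to move; stop
            break
        load[hot] -= moved
        load[cold] += moved
    return load

print(rebalance({"node-1": 90, "node-2": 10, "node-3": 20}))
```

In a real system the `load` dictionary would come from live metrics, and "moving" load means migrating partitions or tasks, which itself has a cost the balancer must weigh.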
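One small facet of adaptive query processing (technique 6) is re-planning mid-query when an intermediate result turns out to be skewed. The sketch below splits any oversized partition into subtasks before the next stage runs; the partition names and size limit are invented for the example.

```python
def plan_next_stage(intermediate_sizes: dict[str, int],
                    limit: int = 1000) -> list[str]:
    """Split any partition whose intermediate result exceeds `limit`
    into enough subtasks that each stays under the limit."""
    tasks = []
    for part, size in intermediate_sizes.items():
        if size > limit:
            n = -(-size // limit)  # ceiling division
            tasks += [f"{part}.{i}" for i in range(n)]
        else:
            tasks.append(part)
    return tasks

# p1 came back skewed (4,500 rows), so it is split into 5 subtasks.
print(plan_next_stage({"p0": 300, "p1": 4500, "p2": 200}))
```

Production engines (for example, Spark's adaptive query execution) apply the same idea automatically, using runtime statistics gathered between stages.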

Choosing the right mix of these techniques depends on the workload, the system architecture, and the specific types of skew encountered.
