Mastering Database Keys: primary, foreign, partitioning, clustering, sort and secondary keys explained
System design and databases
AI is everywhere! From personalized shopping recommendations to real-time language translation and self-driving cars - it’s hard to escape the buzz. The headlines are filled with breakthroughs in large language models and generative AI. But behind the scenes, away from the spotlight, there’s a quieter force at work that makes all of this possible: data.
AI doesn’t run on magic. It runs on massive volumes of data - clean, well-structured, and fast-access data. And that data needs to live somewhere. Often, that “somewhere” is a database.
While models get most of the attention, the way we store and access data is just as critical. A well-designed database can be the difference between a blazing-fast AI application and one that lags or fails under pressure. And at the heart of that design lies one essential concept: keys.
In this article, we’ll explore the different types of keys used in database systems - from primary and foreign keys in SQL, to partitioning, clustering, sort keys, and secondary indexes in NoSQL. We’ll break them down with real-world examples from platforms like Amazon, Instagram, and modern analytics systems - showing how the right key choice can make your data systems smarter, faster, and AI-ready.
Primary Keys
A primary key is a column (or a set of columns) in a table that uniquely identifies each record.
In other words, no two rows can have the same primary key value.
key | data
----+------
han | solo
A primary key value is also never empty (NULL), because every record needs an identifier. Many databases will enforce uniqueness and automatically index the primary key to allow fast look-ups of records by that key. A table can have only one primary key defined (though that one key may consist of multiple columns, called a composite primary key).
key_part_one | key_part_two | data
--------------+--------------+--------------------
ronaldo | 9 | football player
ronaldo | 10 | ex-football player
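As a minimal sketch in SQL, a composite key like the one above could be declared as follows (the table and column names are illustrative):

-- Neither column is unique on its own; the pair of columns
-- together uniquely identifies each row.
CREATE TABLE players (
    key_part_one VARCHAR(50),
    key_part_two INT,
    data         VARCHAR(100),
    PRIMARY KEY (key_part_one, key_part_two)
);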
Think of a primary key like a unique identifier or an ID card for each row. Just as every citizen has a unique Social Security Number, or every book has a unique ISBN, every row in a table has a unique primary key value. One simple analogy from literature: a textbook’s pages each have a unique page number – that page number is like the primary key for the content on that page, uniquely identifying it.
On Amazon, you might have a Products table where each product has a unique ProductID as its primary key. This ensures that when someone looks up product “B07XYZ…”, there’s exactly one product that matches. Similarly, a Users table could use a UserID as the primary key to uniquely identify each customer. If two customers accidentally had the same ID, the database wouldn’t know who is who – primary keys prevent that confusion by being unique.
How does it affect data & performance?
Primary keys are fundamental for entity integrity – making sure each record is distinct. Because the database often indexes the primary key, queries to find a row by primary key are very efficient.
For instance, retrieving a user by their UserID or an order by its order number is typically a fast operation because of this index.
In some database systems (like MySQL’s InnoDB engine), the table is even physically organized by the primary key (called a clustered index), meaning the data on disk is sorted by primary key for faster retrieval. A primary key is what makes each row special – it’s the main reference point for your data. Always choose a primary key that will not duplicate (e.g. an auto-incremented ID or a universally unique identifier).
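As a sketch in standard SQL (the Products table and its columns are illustrative, not Amazon’s actual schema), declaring the primary key gives you uniqueness, NOT NULL, and an automatic index in most engines:

CREATE TABLE Products (
    ProductID VARCHAR(20) PRIMARY KEY,  -- unique, non-NULL, indexed automatically
    Name      VARCHAR(255) NOT NULL,
    Price     DECIMAL(10, 2)
);

-- A lookup by primary key uses that index, so it is fast:
SELECT Name, Price FROM Products WHERE ProductID = 'B07XYZ1234';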
Foreign Keys
A foreign key is a column (or columns) in one table that refers to the primary key of another table. Foreign keys create a relationship between two tables, enforcing that the data in the two tables remains consistent. This concept is crucial in relational (SQL) databases for maintaining referential integrity – ensuring that references between tables are valid. In simpler terms, a foreign key in a child table points to a primary key in a parent table: a user_id column in an Orders table, for example, references the user_id in a Users table. An order can only use a user_id that exists in the Users table – this rule is enforced by the foreign key.
Think of a foreign key like a reference or a link. If the primary key is an ID card, a foreign key is like keeping someone’s ID on file to refer to their details.
Example: imagine a library checkout system. The checkout record might store a member ID to indicate which member borrowed a book. That member ID must exist in the members list. The checkout record’s member ID is a foreign key pointing to the member list’s primary key (member ID). It ensures you can’t have a checkout for a non-existent member.
Consider an Orders table on Amazon and a Customers table. The Orders table might have a column CustomerID that is a foreign key referencing the Customers.CustomerID primary key. This means each order must be linked to a valid customer. The database will not allow an order to exist with a CustomerID that isn’t in the Customers table.
order_no | customer_id | product_sku
---------+-------------+-------------
93 | 11 | 123
94 | 11 | 789
95 | 789 | 13 # NOT ALLOWED !!!
customer_id | email | name
--------------+---------------------+---------
10 | sadio@example.com | Sadio
11 | mo@example.com | Mohamed
12 | rinsola@example.com | Rinsola
13 | amalie@example.com | Amalie
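In SQL, the constraint behind these two tables might be declared like this (a sketch; the column types are assumptions):

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    email       VARCHAR(255),
    name        VARCHAR(100)
);

CREATE TABLE orders (
    order_no    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    product_sku VARCHAR(20),
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);

-- Rejected by the constraint: no customer 789 exists
INSERT INTO orders (order_no, customer_id, product_sku)
VALUES (95, 789, '13');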
For instance, order #95 cannot claim to belong to customer #789 if no customer #789 exists – the foreign key constraint prevents that. In an Instagram-like social media database, you could have a Comments table where each comment has a PostID foreign key that must match a real Post in the Posts table, and maybe a UserID foreign key that references which user made the comment. This ensures comments aren’t orphaned – every comment is tied to a real post and user.
Why is it important?
Foreign keys ensure data integrity across tables. They prevent orphan records and inconsistent data. Using the earlier example, they make sure you can’t have an order assigned to a customer that doesn’t exist, which keeps the database accurate. Essentially, foreign keys put the relational in relational databases by formally linking tables together in a parent/child fashion.
Effects on performance
Foreign keys mainly serve a data integrity role, but they also hint at how tables will be joined in queries. When you query an order and want to get the customer’s details, the database will use the foreign key relationship (typically via a JOIN on CustomerID). For performance, it’s common to index foreign key columns as well, so that joins are faster.
Unlike primary keys, many SQL databases do not automatically index foreign keys, but adding an index on the foreign key column can significantly speed up join queries that use those keys.
However, foreign keys also add a bit of overhead on write operations: if you insert or update a foreign key, the database must check that the corresponding parent record exists (and when deleting or updating a referenced primary key, it must check or enforce rules on the child records, e.g. cascading deletes or preventing the action). NoSQL databases typically do not enforce foreign keys – they often leave relational consistency to the application layer or use embedding/denormalization – so in a NoSQL context (like pure key-value or document stores), you might not have foreign keys built-in. But in SQL systems, foreign keys are a powerful tool to maintain consistency. Foreign keys tie tables together. They ensure that relationships (like orders belonging to customers, or comments belonging to posts) are consistent and valid, which is essential for accurate, reliable data.
Partitioning Keys
A partitioning key (or partition key) is a column (or set of columns) used to divide a dataset into smaller partitions. This term comes up often in two scenarios: (1) partitioning within a single database table (like splitting a huge table into smaller physical chunks by some criteria), and (2) distributed databases and NoSQL systems, where the partition key decides which server or node stores a given piece of data. In both cases, the idea is to break data into manageable pieces based on the key’s value.
Partitioning is like organizing a very large library into sections. If you have millions of books, you wouldn’t just pile them randomly; you might arrange them by genre or by the author’s last name (A–C in one section, D–F in another, etc). This organizing principle is the “partition key”.
| clustering keys |
partition_key | author | book_id | title
--------------+-------------+---------+---------------------------------
's' | shakespeare | 001 | tempest
'd' | dickens | 002 | oliver twist
'r' | rowling | 003 | harry potter and ...
'd' | dickens | 004 | great expectations
't' | twain | 005 | the adventures of tom sawyer
Example: all books by authors whose last name starts with A might go to one section, B to another, and so on. Now, if you need to find books by Twain, you can go straight to the T section instead of searching the whole library. In a database, a partition key works the same way: it determines which partition of the data you need to look at, so you don’t have to scan everything.
| clustering keys |
partition_key | author | book_id | title
--------------+-------------+---------+---------------------------------
'd' | dickens | 002 | oliver twist
| dickens | 004 | great expectations
--------------+-------------+---------+---------------------------------
'r' | rowling | 003 | harry potter and ...
--------------+-------------+---------+---------------------------------
's' | shakespeare | 001 | tempest
--------------+-------------+---------+---------------------------------
't' | twain | 005 | the adventures of tom sawyer
Imagine a web analytics platform that stores billions of event logs. A common strategy is to partition the logs by date – e.g., all data from January goes into the January partition, February data into another, etc. Here, the Date field is the partitioning key for the logs table. If a query asks for data in a certain date range (say, events in March 2025), the database can skip all partitions that aren’t March and only read the March partition, dramatically speeding up the query. This is called partition pruning – only the relevant partition is scanned.
Another example: an orders table in Amazon could be partitioned by OrderYear – all 2023 orders in one partition, 2024 in another. So, a report on 2024 sales would only touch the 2024 partition, not the entire table.
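As a sketch of the first scenario, date-based partitioning might look like this in PostgreSQL syntax (the table and column names are illustrative):

CREATE TABLE event_logs (
    event_id   BIGINT,
    event_date DATE NOT NULL,
    payload    TEXT
) PARTITION BY RANGE (event_date);

-- One partition per month; queries filtered on event_date
-- only scan the matching partitions (partition pruning).
CREATE TABLE event_logs_2025_03 PARTITION OF event_logs
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

SELECT count(*) FROM event_logs
WHERE event_date >= '2025-03-01' AND event_date < '2025-04-01';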
In a distributed NoSQL database like Apache Cassandra or Amazon DynamoDB, the partition key is used to decide which node or shard will store a given item.
For instance, Instagram could use the UserID as a partition key for a table of posts. This means all posts from User #123 are stored on the same partition (likely on the same server or shard). User #124’s posts might live on a different server. When Instagram wants to fetch all posts for User #123, it knows exactly which partition (and which machine) to query – it doesn’t have to search all servers.
The primary goal of a partition key in such systems is to distribute data evenly across nodes and enable efficient lookups. Each unique UserID would hash to a particular partition. This also allows the system to scale – new users can be spread out across the cluster. It’s important that the partition key has a lot of possible values and is chosen to avoid hotspots.
For example, if Instagram used a low-cardinality value like Country as the partition key, all users from the same country might be clumped on one server, overloading it. Using UserID spreads the load more evenly.
What is it for and why does it matter?
Partition keys are all about scalability and performance.
Queries can target a subset of data. As described, the database can quickly locate the subset of data that pertains to your query and ignore the rest. Returning to our library analogy, it’s faster to go to the specific section than to search the whole library. In database terms, this means faster query response when the partition key is used in the query filter.
Parallel processing. If data is partitioned, different partitions can often be processed in parallel. In a distributed system, if you ask a question that involves many users, each server can handle the users it stores in parallel, rather than one giant server handling everything. If analyzing multiple partitions (say, a year’s worth of monthly partitions), the database might scan each month partition on a different processor or node concurrently, speeding up the analysis.
Maintenance and manageability. Partitioning makes it easier to manage huge datasets. You can, for example, archive or delete old partitions without touching newer data. An analytics platform could automatically drop partitions older than X years to manage data retention. It’s like being able to remove an entire section of old books from the library in one go without disturbing the rest.
Balanced load. In distributed databases, a well-chosen partition key ensures that no single node handles a disproportionate amount of data or traffic. Each partition (node) gets an even share, preventing bottlenecks. For instance, hashing user IDs across servers aims to spread users roughly evenly.
Design considerations. Choosing the right partition key is crucial. Ideally, it’s a field you frequently use to query data (so the benefits of partition pruning are realized), and in distributed systems, it has high cardinality (many distinct values) to spread data out. If we partition our orders by OrderDate, queries by date are fast – but if we never query by date, that partitioning doesn’t help much. Likewise, if 90% of your data shares the same partition key value, you haven’t achieved a balanced distribution. In Cassandra (and similar systems), the partition key is actually part of the table’s primary key definition: it’s the first part of the primary key that decides the data’s location in the cluster. Once data lands in the correct partition, the next question is how it’s organized within that partition – which brings us to clustering keys.
Clustering Keys
A clustering key is a key that determines the order of data within a partition. This term is most commonly used in the context of wide-column NoSQL databases like Apache Cassandra. In Cassandra, each table’s primary key is made of two parts: the partition key (one or more columns that determine which partition the data goes to) and the clustering key (one or more columns that determine how the data is sorted within that partition). Essentially, once all the data with the same partition key ends up on the same node, the clustering key says: “now sort those rows by this second key”.
If partitioning is like splitting a library by genre, then clustering is like sorting the books within each genre section. Suppose the library is partitioned by genre (all science fiction books in one area, all history books in another). Within the science fiction section, you might further arrange books alphabetically by author’s last name – that ordering is analogous to a clustering key. It doesn’t change which section a book belongs to (that’s decided by genre, like a partition key), but it organizes the books within the section so you can easily skim in order. In a database, if your partition key groups related rows together, a clustering key can sort those related rows by date, by name, or some other logical ordering.
Social media example: Let’s return to the Instagram example. If UserID is the partition key for posts (grouping all of a single user’s posts together on one node or partition), we can use Timestamp as the clustering key. This means that within each user’s partition, posts are stored sorted by timestamp (say, newest first or oldest first). Now, retrieving the latest posts for user #123 is very efficient – the database can jump to user #123’s partition (thanks to the partition key) and then read the posts in sorted order by time. If you only want posts from the last week, it can grab the appropriate range since they’re ordered by date. All of user #123’s posts are “clustered” by time. This is incredibly useful for time-series access patterns – which is exactly what you have in feeds or timelines.
Cassandra specifics: In Cassandra’s CQL, if you define a table with PRIMARY KEY ((UserID), Timestamp), UserID is the partition key and Timestamp is the clustering key. All rows with the same UserID value form one partition, and within that partition they are sorted by Timestamp automatically. Clustering keys can be multi-column as well (you might sort first by one thing, then by another). But importantly, clustering keys do NOT decide which node the data resides on – that’s the partition key’s job. Clustering keys only order the data within that node’s partition. If you add more posts for a user, Cassandra will place them in the correct sorted position by timestamp within that user’s partition.
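Putting that together as a CQL sketch (the column names are illustrative):

CREATE TABLE posts (
    user_id BIGINT,
    ts      TIMESTAMP,
    content TEXT,
    PRIMARY KEY ((user_id), ts)        -- partition key: user_id; clustering key: ts
) WITH CLUSTERING ORDER BY (ts DESC);  -- newest posts first within each partition

-- Efficient: targets one partition, reads rows already in time order
SELECT * FROM posts
WHERE user_id = 123 AND ts >= '2025-03-25'
LIMIT 50;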
Why is it useful?
Clustering keys are about data organization and efficient retrieval of related rows. They shine when you have one-to-many relationships or time-series data within a partition. Here are some advantages.
Read performance for sequential access. If data is pre-sorted by the clustering key, reading data in that order is very fast. For example, reading all posts of a user in chronological order doesn’t require an expensive sort at query time – the database simply reads the rows in the stored order. It can also quickly seek to a particular point (e.g., start reading at posts from March 1st) since the data is ordered.
Logical grouping. Clustering keys keep related data next to each other on disk. In an e-commerce context, say we partition an OrderItems table by OrderID (so all items of one order are together) and cluster by ProductCategory. Then within each order’s partition, items might be grouped by category – maybe not a very common use case, but it illustrates that items in the same order can be further organized if needed.
Avoiding multiple tables or indices. Sometimes clustering keys let you retrieve a set of records without having to create a separate index or do a separate query. In the Instagram example, to get posts in reverse chronological order, you don’t need a separate index on time for each user – the clustering key provides that automatically within the user’s partition.
Relational database note. Traditional SQL databases don’t use the term “clustering key” in the same way, but they do have the concept of clustered indexes. A clustered index means the table’s data is stored in the order of that index. Often, the primary key is the clustered index by default (so data is stored in primary key order). You could say that in such systems, the primary key is also the clustering key for the whole table (since the whole table is the partition, in a sense). For example, in MySQL InnoDB, if your primary key is id, the rows are stored ordered by id – much like a single partition sorted by id. Some databases allow clustering the table on a non-primary-key index, but only one such clustering order can exist at a time (unlike Cassandra, where data is kept sorted by the clustering key within every partition). The general idea is similar: clustering = ordering data on disk by some key to optimize certain retrievals.
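A minimal MySQL sketch of that default (names are illustrative):

-- In InnoDB the primary key is also the clustered index,
-- so rows are physically stored in id order.
CREATE TABLE users (
    id   BIGINT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100)
) ENGINE = InnoDB;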
Query requirements. When using partition + clustering keys (like in Cassandra), you typically must specify the partition key when querying, and you can optionally specify conditions or ordering on the clustering key. If you try to query by clustering key without the partition key, the system wouldn’t know which partition to look in. For instance, you can efficiently get “all posts of User123 between two timestamps” (since you provide User123 = partition key, and a time range on the clustering key). But you cannot as efficiently ask “find the first post ever across all users” without scanning all partitions, because each partition is sorted only within itself, not globally.
In summary, clustering keys define how data is ordered within a group. They are crucial in scenarios like time-series data per entity (user, device, etc.) – ensuring that data is stored in a sequence that makes sense for retrieval. By combining a good partition key with a good clustering key, you can achieve both distribution (scaling out across servers) and ordering (fast in-order access within each server’s chunk of data). This combination is a key part of data modeling in Cassandra and DynamoDB (where the term sort key is used similarly, as we’ll see next).
Sort Keys
The term sort key can mean slightly different things depending on context, but it always relates to how data is sorted or ordered. We’ll explain it in two popular contexts: Amazon DynamoDB (a NoSQL database) and Amazon Redshift (a SQL data warehouse), to highlight how sort keys work and differ in each.
Sort Keys in DynamoDB (NoSQL). DynamoDB uses a concept of a composite primary key consisting of a partition key and a sort key. (In earlier sections, we discussed partition keys; here the sort key plays a role similar to a clustering key.) In a DynamoDB table, if you choose to have a sort key, it means items with the same partition key are ordered by the sort key value. The combination of partition key and sort key must be unique for each item. For example, you might have a table of social media posts with partition key UserID and sort key Timestamp – exactly like our Instagram example. In DynamoDB, all posts with UserID = 123 will be stored together and sorted by Timestamp. You can then query that table by specifying UserID = 123 (to target the partition) and provide a range condition on Timestamp (to fetch, say, posts in the last week or the latest posts first). Within the partition, items are automatically kept in sort key order. Another example: an e-commerce site could use a partition key of CustomerID and sort key of OrderDate on an Orders table. This way, each customer’s orders are grouped and sorted by date, making it efficient to retrieve a customer’s most recent orders or orders in a date range. The sort key basically provides an order for related items (all with the same partition key). It’s important to note that you cannot query by sort key alone; you always need to specify the partition key when using the DynamoDB API, but once you’re in the partition, the sort key can be used to filter or sort the results easily.
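DynamoDB also accepts a SQL-compatible dialect (PartiQL); as a sketch against the hypothetical posts table above, a partition-plus-range query might look like this:

-- Maps to an efficient Query: equality on the partition key (UserID)
-- plus a range condition on the sort key (Timestamp).
SELECT * FROM "Posts"
WHERE "UserID" = 123
  AND "Timestamp" BETWEEN '2025-03-25' AND '2025-04-01';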
Sort Keys in Amazon Redshift (SQL Data Warehouse). Amazon Redshift is a massively parallel columnar database (often used for analytics). In Redshift, a sort key is a property of a table that defines how the data is stored on disk in sorted order. When you load data into a Redshift table that has a sort key defined, Redshift will sort the rows according to that key and then store them. Why do this? Because if your data is sorted, certain queries – especially those that filter by a range of the sort key – can be answered much faster. The Redshift query optimizer knows the data is sorted and can skip reading chunks of data that fall outside the query range. For instance, if your Redshift table of website clicks is sorted by EventDate, and you query for events on 2025-04-01, the database can quickly seek to the part of the table for that date and largely ignore data from other dates. Redshift stores data in 1 MB blocks and keeps track of the min and max values of the sort key in each block; if a block’s range doesn’t overlap your query’s date filter, Redshift skips that block entirely. To illustrate the benefit: imagine a table storing five years of data sorted by date. A query asking for one month of data might skip ~98% of the blocks (one month out of 60 is ~1.6% of the data) – meaning it reads far less data than scanning everything. This makes range queries extremely efficient. Aside from range filters, sort keys can also help with joining tables (if both are sorted on the join key, a merge join is very fast). You can define a sort key on one or multiple columns, and Redshift even supports two types of sort keys: compound (multi-column lexicographic order) or interleaved (which balances the sorting priority across columns). Choosing a good sort key in Redshift depends on your query patterns – e.g., if you often query by OrderDate, it’s a good candidate for the sort key; if you often query by customer and date, a compound sort key on (CustomerID, OrderDate) may make sense. If unsure, Redshift has an AUTO setting where it will decide a sort key for you based on usage.
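A hedged Redshift DDL sketch (names and types are illustrative):

CREATE TABLE clicks (
    event_id   BIGINT,
    event_date DATE,
    url        VARCHAR(2048)
)
DISTKEY (event_id)              -- how rows are spread across nodes
COMPOUND SORTKEY (event_date);  -- rows stored on disk in date order

-- Range filters on event_date can now skip whole 1 MB blocks:
SELECT count(*) FROM clicks
WHERE event_date BETWEEN '2025-04-01' AND '2025-04-30';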
Key Differences – DynamoDB vs. Redshift Sort Keys. The common idea in both is ordering data by a certain key, but the scope and purpose differ:
DynamoDB Sort Key (within partition): It’s part of the table’s primary key. It orders items only among those with the same partition key. It helps in application-level queries like “get all records for X sorted by Y”. It also, in combination with partition key, ensures each item is unique (no two items can have same partition key + sort key). Think of it like a secondary identifier for an item that also dictates order within a logical group. You must specify the partition key in queries, and then you can use the sort key to filter or sort results for that partition. Essentially, DynamoDB’s sort key = clustering key per partition.
Redshift Sort Key (table-level): It’s not part of the logical primary key at all – it doesn’t ensure uniqueness or define relationships. It’s purely about how data is laid out on disk for optimization. The sort key applies to the entire table’s storage order. It helps in database-level optimization like skipping over irrelevant data in scans, and speeding up large analytical queries that involve ranges or sorting. It doesn’t require a specific query pattern except that if queries align with the sort key, they benefit. You don’t “request” data by sort key in Redshift the way you do by primary key; instead, the database automatically takes advantage of sorting during query execution to make it faster.
To make this more concrete: imagine a large analytics table of ad impressions with a billion rows. In DynamoDB, you might not even store such data (DynamoDB is geared toward OLTP-style workloads), but if you did, you’d need a partition key to spread it (maybe AdID) and a sort key (perhaps ImpressionTimestamp). You could quickly get all impressions for a particular ad in a time range. In Redshift, you would store all billion rows, perhaps distributing across nodes by a distribution key and sorting by ImpressionTimestamp. A query like “count impressions in January 2025” would run much faster if the table is sorted by date, because Redshift jumps to January 2025’s portion and skips the rest. Without a sort key, it would have to scan much more. In summary, DynamoDB’s sort key is about data modeling for queries in an application, whereas Redshift’s sort key is about physical storage optimization for analytic workloads.
Secondary Indexes
A secondary key (often used interchangeably with secondary index) is any key or index that is not the primary key, but can be used to look up data. In other words, it’s an alternate way to query the table. Secondary indexes don’t change the actual data layout or uniqueness rules (the primary key still uniquely identifies records), but they allow you to query the data by other fields efficiently. If a primary key is like the main filing system (say, files are organized by ID number), a secondary index is like a cross-reference that also lets you find a file by another attribute (like by name or date).
A classic analogy is the index at the back of a book. The book’s main content might be organized by chapter (that’s like the primary key ordering of content). But the back-of-book index lets you find topics by a keyword and tells you which page to go to. That index is separate from the main content; it’s essentially a lookup table that maps a topic to pages. A secondary index in a database is like that back-of-book index. It doesn’t change the fact the book’s pages are the primary sequence, but it gives you an alternate way to jump to the information you need. Another analogy: a phone book traditionally is sorted by last names (that’s the primary way to look up someone’s number). If you wanted to find someone by their phone number instead, you’d need a reverse lookup directory – that’s a secondary index.
Why and when to use it?
You use a secondary index when you have to frequently retrieve data by some field that is NOT the primary key. Without an index on that field, the database would have to scan all rows to find what you need, which is slow for large tables. By indexing the field, the database maintains a sorted (or hashed) structure behind the scenes that it can search quickly to find the matching records. For example, if your Customers table’s primary key is CustomerID but you often need to find a customer by their email address, you’d create a secondary index on the Email column. Then the database can find the customer with email “alice@example.com” almost as quickly as if you had searched by ID.
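In SQL, that index is a single statement (a sketch; names are illustrative):

CREATE INDEX idx_customers_email ON Customers (Email);

-- Can now use the index instead of scanning the whole table:
SELECT CustomerID, Name FROM Customers
WHERE Email = 'alice@example.com';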
E-commerce example: In a products table on Amazon keyed by ProductID, you might add secondary indexes on attributes like Category or ProductName. So if a query needs to retrieve all products in category “Electronics”, it can use the index on Category to directly fetch those product IDs without scanning every product. If customers often search by product name, an index on ProductName speeds that up. Essentially, each such index is a copy of (some parts of) the data sorted by that particular field, maintained by the database. Another example: an Employees table with primary key EmployeeID might have a secondary index on LastName if you frequently look up employees by last name.
NoSQL example: In MongoDB (a document database), you might index fields within the JSON documents to speed up queries by those fields. In DynamoDB, you have the concept of Global Secondary Indexes (GSI) and Local Secondary Indexes (LSI). A GSI lets you create a whole new pair of keys (partition key + sort key) for a duplicate view of the data. For instance, if your main table has partition key UserID and sort key Timestamp, you could create a GSI keyed on an entirely different attribute (say, PostType) to query across all users, or an LSI that is still partitioned by UserID but sorted by PostType, to query a user’s posts by type without scanning all timestamps. The takeaway is that even NoSQL systems provide secondary indexes because they are so useful for additional query flexibility.
Performance implications
Secondary indexes accelerate read queries at the cost of additional storage and slower writes. Every time you insert, delete, or update a record, any secondary indexes on that table may also need to be updated to reflect the change. This overhead means writes can be a bit slower and the system uses more disk space (because it’s storing multiple sorted copies of parts of the data). However, for reads, the benefit is huge if the query’s needs align with the index. Without an index, a query like “find all orders where Amount > 1000” would require scanning every order; with an index on Amount, the database can quickly retrieve just those qualifying orders. As a rule of thumb, you add indexes for fields that are frequently queried to improve performance, but you don’t index everything, because too many indexes would slow down writes and consume a lot of space. Database designers must choose secondary keys judiciously based on access patterns.
Clarity in terminology
In relational database theory, the term secondary key can also mean any candidate key that wasn’t chosen to be the primary key. For instance, if a table has both username and user_id as unique identifiers, and you choose user_id as the primary key, username might be called a secondary key (since it’s still a key, just not the primary one). However, in practice, most people just say “unique key” for that scenario, and use “secondary index” to refer to the indexing mechanism described above. In most modern database discussions, a secondary index is understood as an additional index structure for queries, which may or may not enforce uniqueness. It could be a unique index (like one on username to ensure no duplicates), or a non-unique index (like one on Category, where many products can be in the same category). In both cases, it’s “secondary” because the primary key remains the main unique identifier of the row, and the index is a lookup aid.
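Both flavors use the same statement, with or without the UNIQUE modifier (a sketch with illustrative names):

-- Unique: doubles as a constraint (no two rows share a username)
CREATE UNIQUE INDEX idx_users_username ON Users (username);

-- Non-unique: purely a lookup aid (many products per category)
CREATE INDEX idx_products_category ON Products (Category);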
Going back to our book analogy, imagine a book that’s primarily organized by chapter (that’s analogous to primary key order). Now, the back-of-book index might list topics like “Database Tuning” and say “pages 50–55, 120–123”. If you want to read about that topic, you use the index to jump directly to those pages. A secondary index in a database works similarly – it lets the system jump directly to the pages (rows) you want, rather than reading everything. Another example: a university might primarily organize student records by student ID (primary key). But the admin office might also keep a folder organized by last name – that’s a secondary way to find a student’s file. They’d have to keep that folder updated whenever students change name or new students enroll, but it makes looking up by name much easier than rifling through ID numbers.
In summary, secondary keys/indexes provide flexibility in how you can retrieve data. They improve query performance for non-primary key queries – often making a slow table-scan query run in milliseconds – at the cost of some extra work on writes and additional storage. Proper use of secondary indexes is key to designing databases that can handle a variety of query patterns efficiently. For a beginner, the main takeaway is: if you need to frequently look up data by a certain column that isn’t the primary key, you likely need a secondary index on that column.
Conclusion
Understanding these different types of keys is essential for effective database design and use. To recap:
Primary keys uniquely identify each record and are the main reference points for your data, much like unique IDs.
Foreign keys create relationships between tables, ensuring data in different tables remains consistent and valid (like linking orders to existing users).
Partition keys determine how data is divided across separate sections or nodes, which is crucial for scaling and managing large datasets.
Clustering keys (or sort keys in some NoSQL contexts) order data within a partition, enabling efficient retrieval of related records in a predetermined sequence (especially useful for one-to-many relationships and time-series data).
Sort keys in analytical databases define how data is stored in sorted order on disk, speeding up range queries and large scans by allowing data skipping. In NoSQL like DynamoDB, the sort key is part of the primary key, used to sort items with the same partition key.
Secondary keys/indexes provide alternate ways to lookup data by attributes other than the primary key, improving query flexibility and performance for those access patterns.
Each type of key serves a different purpose – some ensure data integrity (primary/foreign keys), some improve query performance and scalability (partition, clustering, sort keys), and some give you more query options (secondary indexes). In real-world systems like an e-commerce platform or social media app, you often use all of them in combination: primary keys to identify entities (users, orders, products), foreign keys to link related information (orders to users, posts to users), partition and sort/clustering keys to scale out and handle large volumes of data efficiently (sharding data by user or date, ordering it for fast retrieval), and secondary indexes to support searches and queries by various fields (search by name, filter by category, etc.).
The key insight is that “keys” are the hooks and handles by which a database organizes and fetches data quickly. By choosing the right keys for the right purpose, we ensure our databases run smoothly, even as they grow. Each key type affects how data is stored or enforced, and thus impacts performance. With this understanding, you can better design and query databases – ensuring that your Amazon orders, Instagram posts, or analytics reports are all managed efficiently under the hood.