When I began working at Microsoft, I was very much a novice at performance troubleshooting. There was a lot to learn, and hash match joins were pointed out to me multiple times as the potential cause for a given issue. So, for a while I had it in my head, “hash match == bad”. But this really isn’t the case.

Hash matches aren’t inefficient; they are the best way to join large result sets together. The caveat is that you have a large result set, and that itself may not be optimal. Should it be returning this many rows? Have you included all the filters you can? Are you returning columns you don’t need?

If SQL Server is using a hash match operator, it could be a sign that the optimizer is estimating a large result set incorrectly. If the estimates are far off from the actual number of rows, you likely need to update statistics.

Let’s look at how the join operates so we can understand how this differs from nested loops

How Hash Match Joins Operate

Build Input

A hash match join between two tables or result sets starts by creating a hash table. The first input is the build input. As the process reads from the build input, it calculates a hash value for each row in the input and stores them in the correct bucket in the hash table.

Creating the hash table is resource intensive. This is efficient in the long run, but is too much overhead when a small number of rows are involved. In that case, we’re better off with another join, likely nested loops.

If the hash table created is larger than the memory allocation allows, it will “spill” the rest of the table into tempdb. This allows the operation to continue, but isn’t great for performance. We’d rather be reading this out of memory than from tempdb.

The building of the hash table is a blocking operator. This means the normal row mode operation we expect isn’t happening here. We won’t read anything from the second input until we have read all matching rows from the build input and created the hash table. In the query above, our build input is the result of all the operators highlighted in yellow.

Probe Input

Once that is complete, we move on to the second input in the probe phase. Here’s the query I used for the plan above:

USE WideWorldImporters
GO

SELECT *
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
WHERE
	inv.AccountsPersonID = 3002
GO

The build input performed an index seek and key lookup against Sales.Invoices. That’s what the hash table is built on. You can see from the image above that this plan performs a scan against Sales.InvoiceLines. Not great, but let’s look at the details.

There is no predicate or seek predicate, and we are doing a scan. This seems odd if you understand nested loops, because we are joining based on InvoiceID, and there is an index on InvoiceID for this table. But the hash match join operated differently, and doesn’t iterate the rows based on the provided join criteria. The seek\scan against the second table has to happen independently, then we probe the hash table with the data it returns.

If the read against Sales.InvoiceLines table can’t seek based on the join criteria, then we have no filter. We scan the table, reading 490,238 rows. Also unlike a nested loop join, we perform that operation once.

There is a filter operator before the hash match operator. For each row we read of Sales.InvoiceLines, we create a hash value, and check against the hash table for a match. The filter operator reduces our results from 490,238 rows to 751, but doesn’t change the fact that we had to read 490,238 rows to start with.

In the case of this query, I’d want to see if there’s a filter I can apply to the second table. Even if it doesn’t change our join type away from a hash match, if we performed a seek to get the data initially from the second table, it would make a huge difference.

Remember Blocking Operators?

I mentioned the build input turns that branch of our execution plan into a blocking operator. This is something try to call out the normal flow of row mode execution.

With a nested loops join, we would be getting an individual row from the first source, and doing the matching lookup on the second source, and joining those rows before the join operator asked the first source for another row.

Here, our hash match join has to gather all rows from the first source (which here includes the index seek, key lookup, and nested loops join) before we build our hash table. This could significantly affect a query with a TOP clause.

The TOP clause stops the query requesting new rows from the operators underneath it once it has met it’s requirement. This should result in reading less data, but a blocking operator forces us to read all applicable rows first, before we return anything to upstream operators.

So if your TOP query is trying to read a small number of rows but the plan has a hash match in it, you will be likely reading more data that you would with nested loops.

Summary

Actual numbers comparing join types would depend a ton on the examples. Nested loops are better for smaller result sets, but if you are expecting several thousand (maybe ten or more) rows read from a table, hash match may be more efficient. Hash matches are more efficient in CPU usage and logical reads as the data size increases.

I’ll be speaking at some user groups and other events over the next few months, but more posts are coming.

As always, I’m open to suggestions on topics for a blog, given that I blog mainly on performance topics. You can follow me on twitter (@sqljared) and contact me if you have questions. You can also subscribe on the right side of this page to get notified when I post.

Have a good night.

There are a lot of things to know to understand execution plans and how they operate. One of the essentials is understanding joins, so I wanted to post about each of them in turn.

The three basic types of join operators are hatch match, merge, and nested loops. In this post, I’m going to focus on the last, and I will post on the other two shortly.

How Nested Loops Operate

Nested loops joins are the join operator you are likely to see the most often. It tends to operate best on smaller data sets, especially when the first of the two tables being joined has a small data set.

In row mode, the first table returns rows one at a time to the join operator. The join operator then performs a seek\scan against the second table for each row passed in from the first table. It searches that table based on the data provided by the first table, and the columns defined in our ON or WHERE clauses.

  • You can’t search the second table independently; you have to have the data from the first table. This is very different for a merge join.
  • A merge join will independently seek or scan two tables, then joins those results. It is very efficient if the results are in the same order so they can be compared more easily.
  • A hash join will seek or scan the second table however it can, then probes the hash table with values from the second table based on the ON clause.

So, if the first table returns 1,000 rows, we won’t perform an index seek (or scan) against the second; we will perform 1,000 seeks (or scans) against that index, looking for any rows that relate to the row from the first table.

The query above joins several table, using nested loops between each. we can see that the row counts for the first several tables are all 1. We read one SalesPerson record, the related SalesTerritory, an Employee record, and a Person record. When we join that to the SalesOrderHeader table, we find 234 related rows. That comes from 1 index seek, as internal result set only had 1 row thus far. If we join to LineItem next, we would perform that seek 234 times.

Key Lookups

The optimizer always uses nested loops for key lookups, so you’ll see them frequently for this purpose. I was unsure if this was the only way key lookups are implemented, but this post from Erik Darling confirms it.

In the plan above, we return 149 rows from OrderLines table. We use a nested loops join operator so we can include the UnitPrice and Description in our output, since they aren’t in the nonclustered index.

Which means we execute the key lookup 149 times. The cost for this operator is 99% of the total query, but the optimizer overestimated how many rows we would return.

I frequently look at key lookups as an opportunity for tuning. If you look at the output for that operator, you can see which columns are causing the key lookup. You can either add those columns to the index you expect this query to use (as included columns), or you can ask whether those columns really need to be included in the query. Either approach will remove the key lookup and the nested loop.

LOOP JOIN hints

You can direct SQL Server to use nested loops specifically in your query by writing the query with (INNER\LEFT\RIGHT\FULL) LOOP JOIN. This has two effects:

  • Forces the optimizer to use the specified join type. This will only force the join type for that specific join.
  • Sets the join order. This forces the tables to be joined in the order they are written (subqueries from WHERE EXISTS clauses are excluded from this). So this point may affect how the others joins operate.

I’ve blogged about using hints previously, so I won’t go on for long on this subject. I like the phrase “With great power comes great responsibility” when thinking about using hints to get a specific behavior. It can be an effective way to get consistent behavior, but you can make things worse if you don’t test and follow up to confirm it works as intended.

Summary

I’ll discuss the other two join types in another post. In short, hash matches are more efficient for large data sets, but the presence of a large data set should make us ask other questions. Merge joins are very efficient when dealing with data from two sources that are already in the same order, which is unlikely in general.

Thanks to everyone who voted for my session in GroupBy. I enjoyed speaking at the event, and had some interesting discussion and questions. I expect recordings for the May event will be available on GroupBy’s Youtube page, so keep an eye out for that if you missed the event.

As always, I’m open to suggestions on topics for a blog, given that I blog mainly on performance topics. You can follow me on twitter (@sqljared) and contact me if you have questions. You can also subscribe on the right side of this page to get notified when I post.

Hope you are all enjoying a long weekend.

In having a talk reviewed recently, it was suggested I spend more time defining some of the subject I touched on. It occurred if I should go over (or at least introduce) these ideas during a talk for a SQL Saturday audience, some might find a post on the subject useful. Hence my recent post on key lookups.

Another such topic is table variables. I use table variables frequently at my current job, but they came up very infrequently when I worked at CSS in Microsoft. I remember the conversations about them being very simple at the time, as in, “you should just use temp tables instead.” But there is a lot of utility with table variables, and they could be a useful arrow in your quiver.

Statistics

Table variables are much maligned for one issue; they don’t have statistics.

In most ways, table variables function a lot like temp tables. You can define their columns and insert rows into them. Scoping is different. If you declare a table variable inside a procedure, its scope ends when you return from that procedure. If you create a temp table inside a procedure, it remains for the connection, so the procedure that called the procedure creating the temp table can see the temp table and its contents.

Both can have indexes created on them, but the table variable will have no statistics on its indexes. The result is that the SQL Server optimizer doesn’t know how many rows are in the table variable and will assume 1 row when trying to compare possible execution plans. This may lead to worse plan than you would have with a temp table containing the same data.

This is mitigated in a feature of SQL Server 2019’s Intelligent Query Processor, table variable deferred compilation. This provides cardinality estimates based on the actual number of rows in the table variable, and statements are not compiled until their first execution.

“This deferred compilation behavior is identical to the behavior of temporary tables. This change results in the use of actual cardinality instead of the original one-row guess.”

Microsoft

This feature does require you to be using compatability level 150, but this is a significant upgrade.

Alternatively, if you want to change the behavior around a table variable and aren’t using SQL Server 2019 (or have another compatability mode), you could use a query hint. I’ve used an INNER LOOP JOIN hint many times when using a table variable I expect to have a small number of rows as a filter to find specific rows in a permanent table. This gives me the join order and join type I expect to see, but I’m less shy about hints than many as I have commented previously.

Tempdb contention

There’s an ocean of posts about how to address tempdb contention, mostly focusing on creating enough data files in your tempdb or enabling trace flags that change some behaviors. The point I think that gets understated, is that creating fewer objects in tempdb would also address the issue.

One of the ways table variables function like temp tables, is that they store data in tempdb and can drive contention there as well. The cause of that contention is different, but creating table variables across many different sessions can still cause contention.

Table Valued Parameters

TVPs allow us to pass a table variable to a procedure as an input parameter. Within the procedure, they function as read-only table variables. If the read-only part is an issue, you can always define a separate table variable within the procedure, insert the data from the TVP, then manipulate it as you see fit.

The caveats here are the same; the TVP is stored in tempdb (and can cause contention there), you may have issues with cardinality due to the lack of statistics.

But the advantage is that we can make procedure calls that operate on sets of data rather than in a RBAR fashion. Imagine your application needs to insert 50 rows into a given table. You could call a procedure with parameters for each field you are inserting, and then do that 49 more times. If you do it serially, that will take some time, in part because of the round trips from your application to the SQL Server. If you do this in parallel, it will take longer than it should because of contention on the table. I haven’t blogged about the hot page issue, but that might make a good foundation topic.

Or you could make one call to a procedure with a TVP so it can insert the 50 rows without the extra contention or delay. Your call.

Memory Optimized Table Variables

This changes this by addressing one of the above concerns. Memory optimized table variables (let’s say motv for short), store data in memory and don’t contribute to an tempdb contention. The removal of tempdb contention can be a huge matter by itself, but storing data in memory means accessing a motv is very fast. Go here if you’d like to read the official documentation about memory optimized table variables.

A motv has to have at least one index, and you can choose a hashed or nonclustered index. There’s an entire page on the indexes for motv’s, and more details for both index types on this page on index architecture.

I tend to always use nonclustered indexes in large part for the simplicity. Optimal use of hash indexes requires creating the right number of buckets, depending on the number of rows expected to be inserted into the motv. Too many or too few buckets both present performance issues. It seems unlikely to configure this correctly and then not have the code change that later. It’s too micromanage-y to me, so I’d rather use nonclustered indexes here.

We do need to keep in mind that we are trading memory usage for some of these benefits, as Eric Darling points out amusingly and repeatedly here. This shouldn’t matter the majority of the time, but we should keep in mind that it isn’t free and ask questions like:

  • Does that field really need to be nvarchar(max)?
  • I expect this INSERT SELECT statement to insert 100 rows, but could it be substantially more?
  • What if there are 100 simultaneous procedures running, each with an instance of this motv? Each with 100 rows?
  • How much total memory could this nvarchar(200) require if we’re really busy?

I certainly would want to keep my columns limited, and you can make fields nullable.

Memory Optimized Table Valued Parameters

So, we can memory optimize our TVP and get the advantages of both. I use these frequently, and one of the big reasons (and this applies to TVPs in general) is concurrency.

So, let’s compare inserting 50 rows of data into a physical table using a normal procedure versus a procedure with a memory optimized table variable.

Singleton Inserts

To test this, I used a tool provided in the World Wide Importers sample on github. Specifically, I was using the workload driver. This was intended to compare the performance of in-memory, permanent tables with on-disk tables, but the setup will work just fine for my purposes.

The VehicleLocation script enables In-Memory OLTP, creates a few tables, procedures to insert into those tables, and adds half a million rows to them. I’ll be using the OnDisk.VehicleLocations table and the OnDisk.InsertVehicleLocation to insert into it.

I wrote the following wrapper procedure to provide random inputs for that procedure:

USE WideWorldImporters
GO

CREATE OR ALTER PROCEDURE OnDisk.InsertVehicleLocationWrapper
AS
BEGIN
	-- Inserting 1 record into VehicleLocation
	DECLARE @RegistrationNumber nvarchar(20);
	DECLARE @TrackedWhen datetime2(2);
	DECLARE @Longitude decimal(18,4);
	DECLARE @Latitude decimal(18,4);

	SET NOCOUNT ON;

	SET @RegistrationNumber = N'EA' + RIGHT(N'00' + CAST(CAST(RAND() AS INT) % 100 AS nvarchar(10)), 3) + N'-GL';
	SET @TrackedWhen = SYSDATETIME();
	SET @Longitude = RAND() * 100;
	SET @Latitude = RAND() * 100;

	EXEC OnDisk.InsertVehicleLocation @RegistrationNumber, @TrackedWhen, @Longitude, @Latitude;
END;
GO

With this, I basically wrote a loop to call this to individually insert 500 rows. Query Store showed the procedure ran its one statement in 23 µs. I love measuring things in microseconds.

Parallel Inserts

Except that didn’t really test what I wanted. Running it in this fashion in SSMS, it would have run the procedure 500 times serially. I mentioned concurrency as an advantage for memory optimized TVPs, but to see that advantage we need to do singleton inserts into our table in parallel.

A little Powershell later, and I ran this again. I did only 50 inserts, but I did them using 50 parallel threads. Each thread will call the wrapper proc to generate random inserts, then call OnDisk.InsertVehicleLocation.

Since this table has clustered index based on an identity column, we will always be inserting into the last page on the table. But that will require an X lock to write the page, which means our parallel calls will be blocking each other. And the more there are, the worst the waiting will get. This is the hot page issue in a nutshell.

On average each execution took 59 µs. Nearly three times as long. Likely, the first thread completed just as fast as previous, but each subsequent thread has to get in line to insert into the OnDisk.VehicleLocation table. And the average duration increases steadily throughout.

Inserting with a Memory Optimized TVP

Here’s the script I wrote to test using a memory optimized TVP to insert 500 rows.

USE WideWorldImporters
GO

IF NOT EXISTS(
	SELECT 1 
	FROM sys.types st
	WHERE 
		st.name = 'VehicleLocationTVP'
)
BEGIN
	CREATE TYPE OnDisk.VehicleLocationTVP
	AS TABLE(
		RegistrationNumber nvarchar(20),
		TrackedWhen datetime2(2),
		Longitude decimal(18,4),
		Latitude decimal(18,4)
		INDEX IX_OrderQtyTVP NONCLUSTERED (RegistrationNumber)
	)WITH(MEMORY_OPTIMIZED = ON); 
END;
GO

CREATE OR ALTER PROCEDURE OnDisk.InsertVehicleLocationBatched
	@VehicleLocationTVP OnDisk.VehicleLocationTVP READONLY
AS
BEGIN
	SET NOCOUNT ON;
	SET XACT_ABORT ON;

	INSERT OnDisk.VehicleLocations
		(RegistrationNumber, TrackedWhen, Longitude, Latitude)
	SELECT
		tvp.RegistrationNumber, 
		tvp.TrackedWhen, 
		tvp.Longitude, 
		tvp.Latitude
	FROM @VehicleLocationTVP tvp;

	RETURN @@RowCount;
END;
GO

-- So let's insert 500 rows into this on-disk table using a TVP
DECLARE 
	@Target INT = 500,
	@VehicleLocationTVP OnDisk.VehicleLocationTVP;

DECLARE @Counter int = 0;
SET NOCOUNT ON;

WHILE @Counter < @Target
BEGIN
	-- Generating one random row at a time
	INSERT @VehicleLocationTVP
		(RegistrationNumber, TrackedWhen, Longitude, Latitude)
	SELECT
		N'EA' + RIGHT(N'00' + CAST(@Counter % 100 AS nvarchar(10)), 3) + N'-GL',
		SYSDATETIME(),
		RAND() * 100,
		RAND() * 100;

	SET @Counter += 1;
END;
	
EXEC OnDisk.InsertVehicleLocationBatched @VehicleLocationTVP;
GO

This creates the type for our memory optimized table variable and defines the procedure. The last batch generates 500 rows of random data, inserting them into the same type, and calls the new proc with the TVP as input. Executing this once to insert 500 rows took 1981 µs, which is just under 4 µs per row.

Full results for all three tests

Wrapping up

This might have been my wordiest blog, but I hope you learned something from it. I’ve rewritten a number of procedures in recent years to operate on batches, and the results have largely been great.

Again, I will frequently use a join or index hint when joining the TVP to a base table, and that’s a minor change that can prevent an expensive mistake from the lack of statistics.

If you have any topics related to performance in SQL Server you would like to hear more about, please feel free to @ me and make a suggestion.

Please follow me on twitter (@sqljared) or contact me if you have questions.