Introversion and PASS Summit

Posted on 01/01/2023 (updated 01/06/2023) by sqljared

This will be an unusual post for me. I normally just write technical posts about performance issues I’ve been working on. But I did go to PASS Summit this year, for only the second time in person, and the social aspect was a big part of why I went.

I’m writing this largely because of a blog post by Paul Randal. I replied in the comments but wanted to relate more of my own experience with introversion and PASS Summit.

So, networking…

I am an introvert, and I’ve never felt networking was my strong suit. Long conversations with people I don’t know aren’t something I crave, but networking can drastically increase our options career-wise. That’s how I got my current job.

Several years back I was working on a contract that had just been extended. The project I was working on had been completed, and I wasn’t sure I wanted to remain with that company long-term. I had been to meetings with the local user group over the years and spoken a few times, but hadn’t been to a meeting in months. I decided to check back in and be sociable.

So, I went to the next meeting and during the intro, the leader, Kevin Feasel, said something interesting. His company was looking for a DBE, and anyone interested should speak with him afterward. That conversation led to me applying, and one interview later I had an offer.

It’s honestly ridiculous looking back on it. It feels like I put in a minimal amount of effort but got a great opportunity with a very unique company. This was about a year after I left Microsoft CSS (customer support), where I had seen a huge variety of SQL Server issues. ChannelAdvisor has a very busy OLTP environment and had great database engineers already working for it. Working on the performance issues here has been enlightening.

So, when I submitted a talk for PASS Summit this year, I had to choose whether to apply for an in-person or online session. Since I’ve been hiding out from the coronavirus like many people for the last two years, I decided to try for an in-person session at the summit and spend the extra effort to do some networking while I was there.

At the summit

I’ve been to the PASS Summit once before in Seattle. There are a ton of great sessions, and I usually want to be in three different rooms for each session. I went to sessions all day, walked around Seattle a bit each evening, and crashed. It’s surprisingly exhausting to learn all day.

During this trip, I wanted to focus more on networking. I still went to a lot of sessions, but there were some I skipped to have conversations in the exhibitor lounge and elsewhere in the convention center. And if I was in a session that wasn’t what I was looking for, I was more likely to leave and just look for a chance to talk.

I ended up talking with about a dozen well-known people in the community. Some I had already interacted with; others not at all.

And it wasn’t comfortable or easy most of the time. As I said earlier, I’m an introvert. I had to make myself engage with people I had little or no contact with before. Several times I walked past a conversation before I made myself turn around and get involved.

I was mainly interested in talking with engineers doing consulting work about the work they do. There are similarities to the work I did in Microsoft CSS previously, so I’ve been interested in that part of the business.

I had some contact with a few speakers through work. I had replied to a few on a topic or two on Twitter. A few I had spoken with because I presented for a user group they were part of. I emailed Paul Randal several times through the mentorship he offered a year or two ago. Some, like Brent Ozar, I had never had any contact with, but he was more than willing to talk. I also had a few good conversations with consultants in the sponsor’s lounge.

What surprised me was hearing a lot (or all?) of the presenters I talked with mention during their talks that they were introverts. That seems unbelievable coming from someone presenting to a ballroom or giving a keynote with seeming ease.

My perspective

I referenced Paul Randal’s post because part of the feedback he received was from people who found the experience alienating. Some didn’t feel the community was very welcoming or felt like they were on the outside looking in or saw it as being very cliquish.

I wanted to post not to negate any of that; our feelings are always valid. But that doesn’t mean others are trying to be exclusionary.

And I did feel like I was on the outside looking in. And it was difficult and stressful to start up a conversation with people or join in one that was already going on. I’ve spoken at PASS Summit or Virtual PASS three times, but I’m not nearly as well known as many of the other presenters. So I certainly don’t feel I’m part of any clique.

But it did strike me that so many of the presenters had mentioned that they were introverts. So if people see the more well-known speakers gravitate toward people they already know, that’s what I’d expect an introvert to do.

I certainly did the same when I was feeling a bit overwhelmed and saw a friendly face.

In all of the conversations I joined or started, I didn’t have anyone turn me away. Once I engaged, I found everyone to be willing to answer questions, happy to give advice, and generous with their time.

So as much as anything, I’d like to encourage people to be bold and engage. It may not be easy to start these contacts, but the results have been well worth the discomfort.

Stepping stones

One thing that did make this easier was something I didn’t intend. Over the last few years, I’ve presented at PASS Summit, several SQL Saturday events, and several user groups as well. So I’ve had more interactions with people in our community over the last few years as a lead-up to the summit this year.

I didn’t present at those user groups and SQL Saturday events to make networking easier, but it does help when I can say to someone, “Hi, I presented online for your user group a few months ago, and I wanted to ask you a question.”

This is not to say that you have to present to be part of the community, but any way you can engage with the community gives you more connections and more opportunities going forward.

And I should specifically thank Tracy Boggiano because many of those speaking opportunities came from her pinging me when she heard of a user group that needed a speaker. So, thank you, Tracy!

About the community…

Before and after the summit, I heard a lot of people in the community express a desire to leave Twitter. And I get why, but it’s been the main place where #sqlfamily posted. It isn’t clear how much of the community is looking to move, but I thought I’d mention the options I’ve seen people refer to:

CounterSocial: I saw people start talking about this before the summit. I haven’t seen a ton of activity here yet, but there are several names I recognized that have joined. Perhaps the chatter will pick up after the holidays, so try looking for #sqlfamily there.
Mastodon: I heard more about Mastodon at and after the summit. This works a bit differently, not being entirely centralized. People have to set up a server for it, and it seems communities have to find the right server (don’t ask me for the details). But you can search for #sqlfamily there, or go to the server that Daniel Hutmacher set up at https://dataplatform.social/home. Not a lot of activity there yet, but maybe this is where things will reform.
LinkedIn: This wouldn’t have occurred to me, but Brent Ozar pointed to LinkedIn (and TikTok and Instagram) when he said he would stop tweeting. There’s some activity here, and I’ve linked my new blog posts here in the past.

So, if you are looking for people to follow or new events or groups to be part of, these are three good places to start.

My LinkedIn should already be at the top of the page, but I have accounts on CounterSocial (https://counter.social/@sqljared) and Mastodon (https://dataplatform.social/@sqljared), and I’ll link those at the top of the page shortly.

In Summary

I hope I’ve encouraged you to engage with the community in the new year, and I hope you will find it rewarding. I’m going to do the same myself and see if there are some more user groups that need a remote presenter.

Happy New Year!

Merge Joins

Posted on 10/08/2022 by sqljared

I’ve discussed the other two join types, so what is the niche for the third?

Before we get into how it works and what my experience is, I want to mention a response to my last blog post, because it leads into our topic.

Addendum on Hash Match Joins

My last blog post was on hash match joins, and Kevin Feasel had a response on his blog.

“Hash matches aren’t inefficient; they are the best way to join large result sets together. The caveat is that you have a large result set, and that itself may not be optimal. Should it be returning this many rows? Have you included all the filters you can? Are you returning columns you don’t need?”
Jared Poche

“I might throw in one caveat about hash match joins and being the best performers for two really large datasets joining together: merge join can be more efficient so long as both sets are guaranteed to be ordered in the same way without an explicit sort operator. That last clause is usually the kicker.”
Kevin Feasel, Curated SQL

And he is quite correct. Nested loops perform better than hash match with smaller result sets, and hash match performs better on large result sets.

Merge joins should be more efficient than both when the two sources are sorted in the same order. So merge joins can be great, but the caveat is that you will rarely have two sources that are already sorted in the same order. So if you were looking for the TL;DR version of this blog, this paragraph is it.

How Merge Joins Operate

Merge joins traverse both inputs once, advancing a row at a time and comparing the values from each input. Since they are in the same order, this is very efficient. We don’t have to pay the cost to create a hash table, and we don’t have the much larger number of index seeks nested loops would encounter.

The process flows like this:

1. Compare the current values from each data source.
2. If they match, add the joined row to the result set, and get the next value from both sources.
3. If not, get the next row from the data source with the lower sorted value.
4. If either source has no more rows, the operation ends.
5. Otherwise, return to step 1 with the new input.
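The steps above can be sketched in T-SQL with two cursors standing in for the two sorted inputs. This is only an illustration of the comparison logic, not how the engine implements the operator. The table variables, cursor names, and sample values are all made up for the example, and it assumes each key appears at most once per input, as the simplified steps do.

```sql
-- Hypothetical sample inputs; the primary keys keep each input unique.
DECLARE @LeftInput TABLE (id int NOT NULL PRIMARY KEY);
DECLARE @RightInput TABLE (id int NOT NULL PRIMARY KEY);
DECLARE @Result TABLE (id int NOT NULL);

INSERT INTO @LeftInput (id) VALUES (1), (3), (4), (7);
INSERT INTO @RightInput (id) VALUES (2), (3), (7), (9);

DECLARE @l int, @r int, @lStatus int, @rStatus int;

-- Each cursor reads its input once, in join-key order.
DECLARE LeftCursor CURSOR LOCAL FAST_FORWARD FOR
	SELECT id FROM @LeftInput ORDER BY id;
DECLARE RightCursor CURSOR LOCAL FAST_FORWARD FOR
	SELECT id FROM @RightInput ORDER BY id;

OPEN LeftCursor;
OPEN RightCursor;

FETCH NEXT FROM LeftCursor INTO @l;
SET @lStatus = @@FETCH_STATUS;
FETCH NEXT FROM RightCursor INTO @r;
SET @rStatus = @@FETCH_STATUS;

-- Stop as soon as either input runs out of rows.
WHILE @lStatus = 0 AND @rStatus = 0
BEGIN
	IF @l = @r
	BEGIN
		-- Match: emit the row and advance both inputs.
		INSERT INTO @Result (id) VALUES (@l);
		FETCH NEXT FROM LeftCursor INTO @l;
		SET @lStatus = @@FETCH_STATUS;
		FETCH NEXT FROM RightCursor INTO @r;
		SET @rStatus = @@FETCH_STATUS;
	END
	ELSE IF @l < @r
	BEGIN
		-- Left value is lower: advance the left input only.
		FETCH NEXT FROM LeftCursor INTO @l;
		SET @lStatus = @@FETCH_STATUS;
	END
	ELSE
	BEGIN
		-- Right value is lower: advance the right input only.
		FETCH NEXT FROM RightCursor INTO @r;
		SET @rStatus = @@FETCH_STATUS;
	END
END

CLOSE LeftCursor;
DEALLOCATE LeftCursor;
CLOSE RightCursor;
DEALLOCATE RightCursor;

-- Returns the ids present in both inputs: 3 and 7.
SELECT id FROM @Result ORDER BY id;
```

Each input is read exactly once, which is where the efficiency comes from. The real operator also has to handle duplicate join values, which this sketch deliberately ignores.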

At this point, I would create a great visual for this, but one already exists. So let me refer you to a post by Bert Wagner. The video has a great visualization of the process.

Input Independence

I find nested loops is probably the easiest join to understand, so I want to draw a distinction here. Using nested loops, we would get a row from the first source, then seek an index on the second source to get all rows related to that row. So, our ability to seek from the second depends on the first.

A merge join seeks from both independently, taking in rows and comparing them in order. So in addition to the requirement (with an exception) that the sources have to be in the same order, we need a filter we can use for each source. The ON clause does not give us the filter for the second table; we need something else.

Here’s an example query and plan:

USE WideWorldImporters
GO
SELECT 
	inv.InvoiceID,
	invl.InvoiceLineID
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
WHERE
	inv.InvoiceID < 50;
GO

Both Invoices and InvoiceLines have indexes based on InvoiceID, so the data should already be in order. So this should be a good case for a merge (the nested loops below is because of the key lookup on InvoiceLines). But SQL Server’s optimizer still chose nested loops for this query.

I can hint it to get the behavior I expected, and that plan is below.
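I didn't show the hint in the text, so treat the following as one way to do it rather than the exact statement used. It is the same query with a query-level join hint added, which asks the optimizer to use merge joins for the joins in the query.

```sql
USE WideWorldImporters
GO
SELECT
	inv.InvoiceID,
	invl.InvoiceLineID
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
WHERE
	inv.InvoiceID < 50
OPTION (MERGE JOIN);	-- query-level hint; assumed, not confirmed by the original post
GO
```

A join-level hint (INNER MERGE JOIN in the FROM clause) is another option, but it also forces the join order as written, so the query-level form is usually the gentler choice for an experiment like this.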

The estimates are way off for the Invoices table, which is odd considering we are seeking on the primary key’s only column; one would expect that estimate to be more accurate. But this estimate makes the seek against Invoices look more expensive, so the optimizer chose the other plan. It makes sense.

I updated the statistics, and a different plan was chosen. One with a hash match.

???

In that case, the difference in cost was directly the cost of the join operator itself; the cost of the merge join operator was 3x the cost of the hash match operator.

Even if the merge is more efficient, it seems it’s being estimated as being more costly, and specifically for CPU cost. You’re likely to see merge joins much less often than the other two types because of the sort requirement; how it is estimated may also be a factor.

About that sort

The first several times I saw a merge join in an execution plan, the merge was basically the problem with the query. It gave me the impression at the time that merge joins aren’t great in general. But in those cases, the execution plan had a sort after one of the index operations and before the join. Sure, the merge join requires that the two sources be sorted in the same order, but SQL Server could always use a sort operator (expensive as they are) to make that an option.

This seems like an odd choice to make, so let’s consider the following query:

USE WideWorldImporters
GO
SELECT *
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
WHERE
	inv.InvoiceDate < DATEADD(month, -12, getutcdate());
GO

So, this query does a merge join between the two, but there is a sort on the second input. We scan the index, then sort the data to match the other input before we perform the actual join. A sort operator is going to be a large cost to add into our execution plan, so why did the optimizer choose this plan?

This is a bad query, and the optimizer is trying to create a good plan for it. This may explain many other situations where I have seen a sorted merge. The query is joining the two tables on InvoiceID, and the only filter is on Invoices.InvoiceDate. There is no index on Invoices.InvoiceDate, so it’s a given we’ll scan that table.

If this query used nested loops, we could use the InvoiceID for each record from Invoices to seek a useful index against InvoiceLines, but that would mean we perform 151,578 seeks against that table.

A merge join, even if we have to sort the results from the table, would allow us to perform one index operation instead. But a merge join has to seek independently from the other source, and no other filter is available. So we perform an index scan against the second table as well.

This is probably the best among poor options. To really improve this query, you’d need to add an index or change the WHERE clause.
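As a sketch of the index option, something like the following would give the optimizer a way to seek on the date filter. The index name is hypothetical, and whether the optimizer actually uses the index depends on how selective the date filter is and on the columns the query returns.

```sql
USE WideWorldImporters
GO
-- Hypothetical index name; supports the filter on Invoices.InvoiceDate.
CREATE NONCLUSTERED INDEX IX_Sales_Invoices_InvoiceDate
	ON Sales.Invoices (InvoiceDate);
GO
```

With SELECT * and a filter that matches most of the table, a scan may still be the cheaper choice, so narrowing the column list or the date range matters as much as the index itself.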

It took some time for me to realize why I most often saw merge joins in poor execution plans; I wasn’t seeing all the plans using them that perform well. If you are troubleshooting a high CPU situation, when you find the cause you’ll likely be looking at a bad plan. We don’t tend to look for the best performing query on the server, do we?

So, if merge join is more efficient than the other two join types in general, we are less likely to be looking at queries where it is being used effectively.

Summary

Hopefully I’ll be getting back to a more regular schedule for the blog. There have been a number of distractions (an estate sale, mice, etc.), but life has been calmer of late (mercifully).

I spoke at the two PASS Summit virtual events over the last two years, and this year I am happy to be presenting in person at PASS Data Community SUMMIT for the first time. So if you are interested in how you can use memory-optimized table variables to improve performance on your system, look out for that session.

Hash Match Joins

Posted on 07/06/2022 by sqljared

When I began working at Microsoft, I was very much a novice at performance troubleshooting. There was a lot to learn, and hash match joins were pointed out to me multiple times as the potential cause for a given issue. So, for a while I had it in my head, “hash match == bad”. But this really isn’t the case.

Hash matches aren’t inefficient; they are the best way to join large result sets together. The caveat is that you have a large result set, and that itself may not be optimal. Should it be returning this many rows? Have you included all the filters you can? Are you returning columns you don’t need?

If SQL Server is using a hash match operator, it could be a sign that the optimizer is estimating a large result set incorrectly. If the estimates are far off from the actual number of rows, you likely need to update statistics.
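One way to check whether stale statistics are a plausible cause is to look at when they were last updated and how many modifications have happened since. The sketch below uses Sales.Invoices purely as an example table; the FULLSCAN option is an assumption on my part and may be more than you need on a large table.

```sql
USE WideWorldImporters
GO
-- When was each statistic on the table last updated,
-- and how many modifications have accumulated since then?
SELECT
	s.name AS StatisticsName,
	sp.last_updated,
	sp.rows,
	sp.rows_sampled,
	sp.modification_counter
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE
	s.object_id = OBJECT_ID('Sales.Invoices');
GO
-- Refresh all statistics on the table (FULLSCAN reads every row).
UPDATE STATISTICS Sales.Invoices WITH FULLSCAN;
GO
```

After the update, compare the estimated and actual row counts in the actual execution plan again to see whether the estimates improved.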

Let’s look at how the join operates so we can understand how this differs from nested loops.

How Hash Match Joins Operate

Build Input

A hash match join between two tables or result sets starts by creating a hash table. The first input is the build input. As the process reads from the build input, it calculates a hash value for each row in the input and stores them in the correct bucket in the hash table.

Creating the hash table is resource intensive. This is efficient in the long run, but is too much overhead when a small number of rows are involved. In that case, we’re better off with another join, likely nested loops.

If the hash table created is larger than the memory allocation allows, it will “spill” the rest of the table into tempdb. This allows the operation to continue, but isn’t great for performance. We’d rather be reading this out of memory than from tempdb.

The building of the hash table is a blocking operation. This means the normal row mode operation we expect isn’t happening here. We won’t read anything from the second input until we have read all matching rows from the build input and created the hash table. In the plan above, our build input is the result of all the operators highlighted in yellow.

Probe Input

Once that is complete, we move on to the second input in the probe phase. Here’s the query I used for the plan above:

USE WideWorldImporters
GO

SELECT *
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
WHERE
	inv.AccountsPersonID = 3002
GO

The build input performed an index seek and key lookup against Sales.Invoices. That’s what the hash table is built on. You can see from the image above that this plan performs a scan against Sales.InvoiceLines. Not great, but let’s look at the details.

There is no predicate or seek predicate, and we are doing a scan. This seems odd if you understand nested loops, because we are joining based on InvoiceID, and there is an index on InvoiceID for this table. But the hash match join operates differently, and doesn’t iterate the rows based on the provided join criteria. The seek or scan against the second table has to happen independently, then we probe the hash table with the data it returns.

If the read against Sales.InvoiceLines table can’t seek based on the join criteria, then we have no filter. We scan the table, reading 490,238 rows. Also unlike a nested loop join, we perform that operation once.

There is a filter operator before the hash match operator. For each row we read of Sales.InvoiceLines, we create a hash value, and check against the hash table for a match. The filter operator reduces our results from 490,238 rows to 751, but doesn’t change the fact that we had to read 490,238 rows to start with.

In the case of this query, I’d want to see if there’s a filter I can apply to the second table. Even if it doesn’t change our join type away from a hash match, if we performed a seek to get the data initially from the second table, it would make a huge difference.

Remember Blocking Operators?

I mentioned the build input turns that branch of our execution plan into a blocking operator. This is something I try to call out because it interrupts the normal flow of row mode execution.

With a nested loops join, we would be getting an individual row from the first source, and doing the matching lookup on the second source, and joining those rows before the join operator asked the first source for another row.

Here, our hash match join has to gather all rows from the first source (which here includes the index seek, key lookup, and nested loops join) before we build our hash table. This could significantly affect a query with a TOP clause.

The TOP clause stops the query requesting new rows from the operators underneath it once it has met its requirement. This should result in reading less data, but a blocking operator forces us to read all applicable rows first, before we return anything to upstream operators.

So if your TOP query is trying to read a small number of rows but the plan has a hash match in it, you will likely be reading more data than you would with nested loops.
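To see this for yourself, one rough experiment is to run the same TOP query twice with different join hints and compare the logical reads reported in the Messages tab. The hints are only there to force each join type for the comparison; the query itself is a made-up example against the same tables, and the numbers you get will depend on your copy of the database.

```sql
USE WideWorldImporters
GO
SET STATISTICS IO ON;
GO
-- Forced hash match: the build input has to be read in full
-- before the join can return its first row.
SELECT TOP (10)
	inv.InvoiceID,
	invl.InvoiceLineID
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
OPTION (HASH JOIN);
GO
-- Forced nested loops: rows flow one at a time, so the query
-- can stop requesting rows once TOP has what it needs.
SELECT TOP (10)
	inv.InvoiceID,
	invl.InvoiceLineID
FROM Sales.Invoices inv
INNER JOIN Sales.InvoiceLines invl
	ON invl.InvoiceID = inv.InvoiceID
OPTION (LOOP JOIN);
GO
SET STATISTICS IO OFF;
GO
```

Join hints like these are a diagnostic tool here, not a recommendation for production code.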

Summary

Actual numbers comparing join types would depend a ton on the examples. Nested loops are better for smaller result sets, but if you are expecting several thousand (maybe ten thousand or more) rows read from a table, hash match may be more efficient. Hash matches are more efficient in CPU usage and logical reads as the data size increases.

I’ll be speaking at some user groups and other events over the next few months, but more posts are coming.

As always, I’m open to suggestions on topics for a blog, given that I blog mainly on performance topics. You can follow me on Twitter (@sqljared) and contact me if you have questions. You can also subscribe on the right side of this page to get notified when I post.

Have a good night.