So I had this plan: I was going to download the (very unfortunate) Adobe data breach, suck it into SQL Server, do the usual post-import data clean up then try and draw some insightful conclusions from what I saw. Good in theory and something I’ve done many times before after other breaches, problem is it’s abso-freaking-lutely massive. The data itself is only 2.9GB zipped and 9.3GB once extracted, but we’re talking over 153 million records here. This could well be the largest password dump we’ve seen to date.
Anyway, I download it, extract it then imported it into a SQL Server table using the bcp utility. That process alone takes a good hour and a half (just the import), which isn’t too bad as a one off, but what came next made the process unbearable. Even on my six core, hyper threaded, fast SSD, 12GB of RAM desktop, I simply couldn’t start transforming the data without things running for a hell of a long time (an hour plus for a simple query), and I also ending up with the transaction log chewing up every single spare byte of disk. Oh sure, I could have moved that over to a spare mechanical drive with heaps of capacity, but that’s not very creative now, is it?
I needed grunt. Lots of grunt. Short term grunt. Once the transformations were done and the data queried for an upcoming post, I could bring back a local copy of the DB for posterity and ditch the whole VM thing, but I had to get there in the first place. I needed… “The Cloud” (queue dramatic music)
I didn’t want to muck around here so I looked for the chunkiest SQL Server virtual machine I could find. 8 CPUs with 56GB of RAM, thank you very much. That’s the top spec SQL Server Standard VM running at the moment according to the calculator:
Now this isn’t cheap – run it for a year and you’ll pay $28k for it which is a lot of coin just to muck around with a password dump. But this is the cloud era, a time of on-demand commoditised resources available for as little or as long as you like. What’s more, these days Azure bills by the minute so if I only want to spend, say, $10 then I can run this rather chunky machine for just over 3 hours then ditch it and never pay another cent. Frankly, my time is pretty damn valuable to me so I’m very happy to pay 10 bucks to save hours of mucking around trying to analyse a huge volume of data on a home PC.
So let’s jump into it and first up we’ll need a new SQL Server VM. Now I could do this with just a SQL Server DB and not a whole virtual machine (indeed for longer term purposes I really don’t like the idea of managing an OS), but I needed to download the dump and import it using SQL Server’s magic tools so I’m going to create an entire VM:
Wait a minute – this is an “Extra Small” VM and not the full fat A7 scale we saw earlier. You see, I need to download several GB of data first, I only really need the chunky processing power once the transformations start. Doing it this way saves me $2.57 an hour! Ok, the money is incidental (although all this did take several hours), I’m doing it this way simply as an exercise to show how easily scale can be changed to meet compute demand.
You’ll also see I’ve provisioned the VM in Southeast Asia which is as close as I can get to Sydney until the Australia farm comes online sometime next year. When all this is done I want to get gigabytes of database backup down to my local machine so proximity is important.
With the machine now provisioned I can remote into a fully-fledged VM with SQL Server running in the cloud:
And just as a quick sanity check, here’s how she’s specced:
1 socket, 1 virtual processor and 768MB of RAM. Lightweight!
I’ll skip the boring bits of downloading the dump and installing 7Zip. With that done, it’s time to turn it off. Huh? No really, it’s bedtime and I don’t want to play with this again until tomorrow. That’s one of the other key lessons I wanted to instil here: When compute resources aren’t needed, you simply turn them off and no longer pay for them. It’s that simple. Now of course I’m still using storage and I’m paying for that whilst the virtual machine isn’t running, but storage is cheap (and don’t let anyone tell you otherwise). It’s $7 a month for 100GB of locally redundant storage and my virtual machine has a total of 126GB by default. Call it just over one cent an hour for all that.
Nigh nigh virtual machine:
The next day…
Did somebody say “performance”? It’s time to crank it up to 11 so it’s back over to the Azure portal and into the VM settings:
Save that and, expectedly, the machine may actually need a restart:
Of course it’s already shut down so we’re not too worried about that. This all takes about a minute to do:
Let’s fire it back up:
This is significant if you come from the pre-cloud world: One minute – that’s how long it takes to deploy another 7 cores and increase the RAM 75 times over. Let’s see how the specs look inside now:
Yep, that oughta do it! Now that' I’ve got some seriously chunky processing power I can begin the intensive processes of extracting the data from the zip then importing it into SQL Server. I won’t go into the detail of what that entails here, suffice to say it’s both boring and laborious (yes, even in the cloud!)
Right, time for queries. Frankly, remotely managing a VM is painful at the best of times so I’ll be running SSMS locally and orchestrating it all from my end. All the processing is obviously still done within the remote machine and there’s just a little bit of network use to send the query then return the results so there’ll be no noticeable performance hit doing this. There’s a neat little tutorial on Provisioning a SQL Server Virtual Machine on Windows Azure which effectively boils down to this:
- Create a TCP endpoint accessible over port 1433 using the Azure portal
- Remote into the machine and allow incoming TCP connections over 1433 for the public firewall profile
- Enable mixed mode auth on the SQL Server
- Create a local SQL login in the sysadmin role (obviously don’t use these rights for production websites connecting to the DB!)
- Register the remote server using SSMS from my local machine
Now I’ve got the entire dump running in the cloud on a high-spec machine with a remote connection. Time to start the heavy work! I won’t bore you with the nature of the queries (although I will share the results in the coming days), the point I wanted to illustrate here was the ease of provisioning and scaling resources: I got exactly what I wanted, when I wanted it and didn’t pay for anything I didn’t want. Then when I didn’t want it any more I just copied the database backup down to my local machine and deleted the VM. That was it.
So all said and done, what did this cost me? If I’m honest, the billing is a little tricky to read but working on the basis of the costs explained above, we’re looking at about $12 (unless you’re using your free Azure credits with MSDN). That’s phenomenal: For the price of about 3 cappuccinos down at my local, I had immediate access to the latest and greatest licensed SQL Server version running on 8 cores with 56GB of RAM. And I got it all on demand. For all sorts of data processing needs, this can make a fundamental difference to the resources people have at their disposal.
Of course this is also not new – people have been using resources such as Amazon’s EC2 for similar nefarious purposes for some time now. What I like about Azure is the turn-key approach to how easy it was to provision all the resources. Of course this is from a very Microsoft-centric person so naturally I’m going to lean a little that way, but it’s genuinely an awesome, impressive tool that can be had for very little money and very little commitment.