SQL Server 2016 has several new features with SQL Server R Services being one of the most interesting ones. This feature brings data science closer to where most data lives – in the database! It also opens up a world of extensibility to pure database developers by allowing them to write powerful scripts in the R language to complement the T-SQL programming surface area already available to them. In this post, we show you a great example of how you can leverage this awesome feature.
The Shortest Path Problem
Take for example, the classic problem of finding the shortest path between 2 locations. Specifically, let’s say you are a pilot or a flight planner trying to construct a flight plan between 2 airports. And let’s say that all the data about the locations of these airports, and the ‘airways’ (the well-defined paths to follow in the sky) are all stored in the database. Now, how do you find out the shortest path between these two airports? Here are two approaches to do so: the classic T-SQL way, and then the R Services way.
Data Model
But first, let’s look at the data we have. To make this realistic, we imported data from the FAA’s 56-day NASR navigation data product into SQL Server 2016. After importing and some post-processing, we end up with 2 simple tables:
The Node table has details of airports and predefined navigational points. Each such ‘Node’ is identified by the ‘Name’ column. For example, Seattle-Tacoma airport is identified by the name ‘KSEA’ which is the international aviation standard name for this airport. A numeric Id column is used as a key column.
The Edge table contains the well-known paths ('airways' in aviation parlance) between airports and the navigational points defined above. Each such path has a 'Weight' column, which is the distance in meters between the 2 nodes for that path. The Weight column is important because, when computing shortest paths, we want to minimize the total distance.
In such a case, if a developer were to implement Dijkstra's algorithm to compute the shortest path within the database using T-SQL, they could use approaches like the one at Hans Olav's blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans' implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn't very performant, as you can see below.
-- The T-SQL way (from http://www.hansolav.net/sql/graphs.html)
-- The below query executes Hans Olav's implementation of Dijkstra's algorithm with the Node ID values corresponding to Seattle (airport code SEA) and Dallas / Fort-Worth (airport code DFW) respectively.
exec usp_Dijkstra 24561, 22699
The execution of the above completes in a little under a minute on a laptop with an i7 CPU. Later in this post you can review the timings for this route and another route from Anchorage, Alaska (airport code ANC) to Miami, Florida (airport code MIA).
Enter SQL Server R Services
In case you have not used SQL Server R Services previously, our previous blog post will be a great starting point. In that post, Joe Sack provides many ‘getting started’ links, and a comprehensive description of real-world customer scenarios where this feature is being used.
Getting Started: setting up R Packages
Firstly, we need to ensure that we have correctly configured and validated the installation of R Services (in-database). Follow the instructions here to make sure R Services is working correctly within SQL Server 2016.
Now, one of the most powerful things with R is the extensibility it allows in the form of packages. Developers can tap into this extensible set of libraries and algorithms to improve certain cases which T-SQL does not handle very well – one example being the above shortest path algorithm.
It turns out that R has a very powerful graph library – igraph – which also offers an implementation of Dijkstra’s algorithm! Let’s see how we can leverage that to achieve our purpose. So we need to follow the steps here to install the igraph and jsonlite packages. Exactly how these packages help us in solving this shortest path problem is explained in the next section.
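If you want to confirm from T-SQL that the packages are visible to the R runtime used by SQL Server, a quick check such as the sketch below can help (this is just a convenience check; the actual package installation still follows the linked instructions):
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        # List the packages visible to the SQL Server R runtime;
        # igraph and jsonlite should appear here once installed correctly.
        pkgs <- installed.packages()[, c("Package", "Version")]
        OutputDataSet <- data.frame(pkgs, stringsAsFactors = FALSE)'
WITH RESULT SETS ((PackageName nvarchar(255), PackageVersion nvarchar(100)));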
Calling the R Script from T-SQL
Take a minute to review the completed script in the next section. That script accomplishes the same task (finding the shortest flight path between Seattle and Dallas Fort-Worth) but by using SQL Server R Services.
Let’s break down what is in the completed script. Note that the R script itself is stored in the T-SQL string variable called @RScript. Let us further break down what that R script is actually doing:
Notice the use of the R library() function to import the igraph and jsonlite libraries that we previously installed.
Later, we use the jsonlite library's fromJSON function to parse the values in the Nodes and Edges variables, which are supplied from the T-SQL side of this script in JSON format. The reason for using this approach is that SQL Server R Services today only supports one input dataset being supplied to the R script.
We then use the igraph graph.data.frame function to construct the graph from the supplied edges and nodes.
Then, we use the get.shortest.paths function on the graph, specifying the source and destination nodes.
Then we compute the total distance travelled on this shortest path and store it in the TotalDistance variable.
We then compute the actual path in the form of Node IDs (stored in the PathIds variable) and also in the form of human-readable navigation point identifiers (stored in the PathNames variable).
The final part of the R script uses data.frame to build what will eventually be returned as a T-SQL result set with 5 columns, so that it is equivalent to what Hans Olav's stored procedure was returning.
Now that you have understood the R portion of the script, let’s look at how the R script is invoked from the main T-SQL body.
We use the sp_execute_external_script system stored procedure to invoke the R script that we just declared a bit earlier.
As mentioned earlier, the current version of sp_execute_external_script only allows one input dataset. So we have to pass in the Nodes and Edges required for Dijkstra's algorithm as parameters. The new FOR JSON clause in T-SQL allows us to pass this data in an efficient way.
The call to sp_execute_external_script also shows how variables from the R script are mapped to T-SQL variables:
R Variable | T-SQL output parameter
TotalDistance | distOut
PathIds | PathIdsOut
PathNames | PathNamesOut
The last part of the T-SQL script simply converts the distance returned by the script (which is in meters) to miles and to the aviation unit of nautical miles.
Complete Script
Here is the complete script for ready reference:
-- Dijkstra’s algorithm using R (runs in a few seconds)
declare @SourceIdent nvarchar(255) = 'KSEA'
declare @DestIdent nvarchar(255) = 'KDFW'
declare @sourceId int = (select Id from Node where Name = @SourceIdent)
declare @destId int = (select Id from Node where Name = @DestIdent)
DECLARE @RScript nvarchar(max)
SET @RScript = CONCAT(N'
library(igraph)
library(jsonlite)
mynodes <- fromJSON(Nodes)
myedges <- fromJSON(Edges)
destNodeId <- ', @destId,'
destNodeName <- subset(mynodes, Id == destNodeId)
g <- graph.data.frame(myedges, vertices=mynodes, dir = FALSE)
(tmp2 = get.shortest.paths(g, from=''', @sourceId, ''', to=''',@destId , ''', output = "both", weights = E(g)$Weight))
TotalDistance <- sum(E(g)$Weight[tmp2$epath[[1]]])
PathIds <- paste(as.character(tmp2$vpath[[1]]$name), sep="''", collapse=",")
PathNames <- paste(as.character(tmp2$vpath[[1]]$Name), sep="''", collapse=",")
OutputDataSet <- data.frame(Id = destNodeId, Name = destNodeName$Name, Distance = TotalDistance, Path = PathIds, NamePath = PathNames)
')
DECLARE @NodesInput VARCHAR(MAX) = (SELECT * FROM dbo.Node FOR JSON AUTO);
DECLARE @EdgesInput VARCHAR(MAX) = (SELECT * FROM dbo.Edge FOR JSON AUTO);
declare @distOut float
DECLARE @PathIdsOut VARCHAR(MAX)
DECLARE @PathNamesOut VARCHAR(MAX)
EXECUTE sp_execute_external_script
@language = N'R',
@script = @RScript,
@input_data_1 = N'SELECT 1',
@params = N'@Nodes varchar(max), @Edges varchar(max), @TotalDistance float OUTPUT, @PathIds varchar(max) OUTPUT, @PathNames varchar(max) OUTPUT',
@Nodes = @NodesInput, @Edges = @EdgesInput, @TotalDistance = @distOut OUTPUT, @PathIds = @PathIdsOut OUTPUT, @PathNames = @PathNamesOut OUTPUT
WITH RESULT SETS (( Id int, Name varchar(500), Distance float, [Path] varchar(max) , NamePath varchar(max)))
-- here we format the result in different units of distance - miles and nautical miles
SELECT @distOut * 0.00062137 AS DistanceInMiles, @distOut * 0.00053996 AS DistanceInNauticalMiles
Test Results
This script is much quicker and produces similar output to the T-SQL implementation. Figure 2 compares the execution times using the two implementations:
Figure 2: Execution time for the shortest path problem, using T-SQL and R implementations
Conclusion
R Services extends the programming surface area that a Data Engineer has. R Services offers capabilities which nicely complement what T-SQL classically offers. There are some things which R does very well (such as computing shortest paths efficiently) which T-SQL does not do all that well. On the other hand, T-SQL still excels at tasks for which the database engine is optimized (such as aggregation). The two are here to play together and bring Data Science closer to where the Data is!
Reviewed by: Gjorgji Gjeorgjievski, Sunil Agarwal, Vassilis Papadimos, Denzil Ribeiro, Mike Weiner, Mike Ruthruff, Murshed Zaman, Joe Sack
In a previous post we have introduced you to the parallel INSERT operator in SQL Server 2016. In general, the parallel insert functionality has proven to be a really useful tool for ETL / data loading workloads. As an outcome of various SQLCAT engagements with customers, we learnt about some nuances when using this feature. As promised previously, here are those considerations and tips to keep in mind when using parallel INSERT…SELECT in the real world. For convenience we have demonstrated these with simple examples!
Level Set
To start with, our baseline timing for the test query, which used a serial INSERT (see the Appendix for details), is 225 seconds. The query inserts 22,537,877 rows into a heap table, for a total dataset size of 3.35GB. The execution plan in this case is shown below; as you can see, both the FROM portion and the INSERT portion are serial.
As mentioned in our previous post, we currently require that you use a TABLOCK hint on the target of the INSERT (again, this is the same heap table as shown above) to leverage the parallel INSERT behavior. With the hint in place, the difference is dramatic: the query now takes 14 seconds. The execution plan is as below:
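For reference, here is the TABLOCK variant of the test query from the Appendix (the column list is elided, just as it is in the Appendix):
INSERT tempdb.[dbo].[DB1BCoupon_New] WITH (TABLOCK)
    (ItinID, Coupons, ..., Gateway, CouponGeoType)
SELECT ItinID, Coupons, ..., Gateway, CouponGeoType
FROM DB1b.dbo.DB1BCoupon_Rowstore AS R
WHERE Year = 1993
OPTION (MAXDOP 8);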
For row store targets, it is important to note that the presence of a clustered index or any additional non-clustered indexes on the target table will disable the parallel INSERT behavior. For example, here is the query plan on the same table with an additional non-clustered index present. The same query takes 287 seconds without a TABLOCK hint and the execution plan is as follows:
When TABLOCK is specified for the target table, the query completes in 286 seconds and the query plan is as follows (there is still no parallelism for the insert – this is the key thing to remember.)
It is quite common to find IDENTITY columns in the target table of INSERT…SELECT statements. In those cases, the identity column is typically used to provide a surrogate key. However, IDENTITY will disable parallel INSERT, as you can see from the example below. Let's modify the table to have an identity column defined:
… (table definition is truncated for readability). When we run the below INSERT query:
INSERT tempdb.[dbo].[DB1BCoupon_New] WITH (TABLOCK)
(ItinID, Coupons, ... Gateway, CouponGeoType)
SELECT ItinID, Coupons, ... Gateway, CouponGeoType
FROM DB1b.dbo.DB1BCoupon_Rowstore AS R
WHERE Year = 1993
OPTION (MAXDOP 8);
We see that the parallel insert is disabled (query plan below). The query itself completes in 104 seconds, which is a great improvement, but that is primarily because of the minimal logging. As an aside, the highlighted Compute Scalar below is because of the identity value calculation.
It is important to know that if there is an IDENTITY column in the target table or if a SEQUENCE object is referenced in the query, the plan will be serial. To work around this limitation, consider using a ROW_NUMBER() function as shown below. Do note that in this case, you can either leverage IDENTITY_INSERT (which has its own considerations), or declare the column in the table without the IDENTITY property. For this demo, I set IDENTITY_INSERT ON:
SET IDENTITY_INSERT [dbo].[DB1BCoupon_New] ON
Here is the abridged version of this query:
INSERT tempdb.[dbo].[DB1BCoupon_New] with (TABLOCK)
(IdentityKey, ItinID, Coupons, ..., CouponType, TkCarrier,
OpCarrier, FareClass, Gateway, CouponGeoType)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS IdentityKey,
ItinID, Coupons, ..., CouponType, TkCarrier,
OpCarrier, FareClass, Gateway, CouponGeoType
FROM DB1b.dbo.DB1BCoupon_Rowstore AS R
WHERE Year = 1993
OPTION (MAXDOP 8);
It turns out that this re-write of the query actually performs worse than the serial INSERT with the identity being generated. This re-write with the window function took 161 seconds in our testing. And as you can see from the plan below, the main challenge here seems to be that the source data is being read in serial.
This looks disappointing, but there is hope! Read on…
Batch Mode Window Aggregate to the rescue
In the test setup, we also had created a clustered ColumnStore (CCI) version of the source table. When the test query is modified to read from the CCI instead of the rowstore version (with the above re-write for identity value generation), the query runs in 12 seconds! The main difference between this and the previous case is the parallelism in the data read, and the parallel window function which is new to SQL 2016. And the Columnstore scan does run in Batch mode.
Here is the drilldown into the Window Aggregate. As you can see in the ‘Actual Execution Mode’ attribute, it is running in Batch mode. And it uses parallelism with 8 threads. For more information on batch mode Window Aggregate, one of the best references is Itzik Ben-Gan’s two part series: Part 1 and Part 2.
When the target table is a clustered Columnstore index, it is interesting to note the ‘row group quality’ (how ‘full’ are the compressed row groups) after the insert. To test this, I re-created the target table with a clustered Columnstore index (CCI) defined on it. The table started as empty, and the INSERT statement was issued with a TABLOCK hint. The insert took 77 seconds (this is somewhat expected due to the compression required for the CCI) and the query plan is shown below:
The compute scalar operator above is purely because of the partition scheme applied. Now, let’s look at the Columnstore row groups created, by using the DMV query below:
select partition_number, row_group_id, state_desc, transition_to_compressed_state_desc, trim_reason_desc, total_rows, size_in_bytes, created_time
from sys.dm_db_column_store_row_group_physical_stats
order by created_time desc
The important thing to note for this parallel insert case is that multiple row groups are created and inserted into concurrently, each by one of the CCI insert threads. If we compare this to a case where parallel insert is not used, you will see differing timestamps for the various segments, which is an indirect way of telling that the insert was serial in that case. For example, if I repeat this test without TABLOCK on the destination table, then the query takes 418 seconds. Here is the query plan:
Let’s review briefly the row groups created in this case. We will use these results to discuss ‘row group quality’ in the next section. Here is the output from the row group DMV for the serial INSERT case:
The point of the previous two examples is that, in general, the parallel INSERT operation favors throughput over segment (a.k.a. row group) quality. In some cases, if row group quality (having row groups as 'full' as possible, with close to a million rows each) is important, then you may need to carefully adjust the degree of parallelism. For example:
Let’s say we use parallel insert to insert 10 million rows
Let’s also imagine a hypothetical degree of parallelism as 100
In that case, we end up with most row groups with around 100,000 rows each. This may not be ideal for some workloads. For an in-depth discussion on segment / row group quality, please see this article.
Therefore, it is critical to adjust the degree of parallelism to balance throughput and row group quality.
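As an illustration of that adjustment, here is a sketch with hypothetical table names; capping the degree of parallelism on the INSERT keeps each CCI insert thread's row groups closer to the 1,048,576-row maximum:
-- Sketch with hypothetical table names: inserting roughly 10 million rows.
-- At DOP 100, each thread would produce row groups of only ~100,000 rows;
-- capping the statement at MAXDOP 8 lets each thread fill row groups of
-- roughly a million rows, improving row group quality.
INSERT dbo.FactSales_CCI WITH (TABLOCK)
SELECT *
FROM dbo.FactSales_Staging
OPTION (MAXDOP 8);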
Degree of parallelism and INSERT Throughput
Now, back to the heap, let's see the effect of varying the degree of parallelism (DoP). Allocation bottlenecks (primarily the number of data files) and I/O bandwidth are the main constraints when it comes to increasing throughput with parallel INSERT. To overcome these, in our test setup, we have 480 data files for TEMPDB. This may sound excessive, but then we were testing on a 240 processor system! And this configuration was critical for testing parallel insert 'at-scale' as you will see in the next section.
For now, here are the test results with varying the DoP. For all cases, TABLOCK was used on the target table. In each case the INSERT query was the only major query running on the system. The chart and table below show the time taken to insert 22,537,877 rows into the heap, along with the Log I/O generated.
Here’s the raw data in case you prefer to see numbers:
Degree of parallelism | Time taken in seconds | Log I/O KB/sec
1 | 95 | 122
2 | 53 | 225
4 | 27 | 430
8 | 14 | 860
15 | 7 | 1596
16 | 7 | 1597
24 | 6 | 1781
30 | 6 | 2000
32 | 7 | 1200
48 | 10 | 1105
64 | 13 | 798
128 | 26 | 370
240 | 14 | 921
What can we conclude here? The ‘sweet spot’ seems to be the number 15, which (not coincidentally) is the number of cores per physical CPU in the test setup. Once we cross NUMA node boundaries, the costs of cross-node memory latency are steep – more details on this in the next section. An important note here is that your mileage will vary depending on the specific system configuration. Please ensure adequate tests are done in the specific environment before concluding on an optimal value for DoP.
Pushing things to the max: concurrent parallel INSERTs
Next, we decided to stress the system with multiple such parallel INSERT statements running. To do this optimally we used the RML utilities and created a simple SQL script which each connection would run: create a #temp table, parallel insert into it, and then drop the table. The results are impressive: we are able to max out the system on the CPU front in some cases (given that these operations are minimally logged and in TEMPDB, there is no other major bottleneck.)
Here are the test results with various combinations of MAXDOP and concurrent requests into temporary tables. The MAXDOP 15 value seems to be the most efficient in this case because that way, each request lines up nicely with the NUMA node boundaries (each NUMA node in the system has 30 logical CPUs.) Do note that the values of MAXDOP and the number of connections were chosen to keep 240 threads totally active in the system.
MAXDOP | Number of connections | Total rows inserted | End to end test timing (seconds) | Effective rows / second | CPU% | Log I/O KB/sec
5 | 48 | 1,081,818,096 | 50 | 21,636,362 | 100 | 15850
8 | 30 | 676,136,310 | 33 | 20,488,979 | 90 | 15300
15 | 16 | 360,606,032 | 14 | 25,757,573 | 100 | 15200
30 | 8 | 180,303,016 | 11 | 16,391,183 | 75 | 13100
Transaction Logging
When we used the TABLOCK hint in the previous tests on heap tables, we also ended up leveraging another important optimization which has been around for a while now: minimal logging. When monitoring the amount of log space utilized in these cases, you will see a substantially lower amount of log space used in the case where TABLOCK is specified for the (heap) target table. Do note that for Columnstore indexes, minimal logging depends on the size of the insert batch, as described by Sunil Agarwal in his blog post. Here's a chart which compares these cases (note: the graph below has a logarithmic scale for the vertical axis to efficiently accommodate the huge range of values!)
In the case of the CCI insert, the amount of transaction logging is very comparable with and without TABLOCK. However, the insert into a heap still requires TABLOCK for minimal logging, as is clearly evident from the large amount of transaction logging when TABLOCK is not specified.
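If you want to reproduce this comparison in your own environment, one simple approach (a sketch) is to sample the transaction log usage of the current database immediately before and after the INSERT under test and compare the delta:
-- Sketch: run before and after the INSERT statement under test; the difference
-- in used_log_space_in_bytes approximates the log generated by the statement.
SELECT total_log_size_in_bytes,
       used_log_space_in_bytes,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;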
Special case for temporary tables
In the previous post, we mentioned that one of the key requirements for the INSERT operation to be parallel is to have a TABLOCK hint specified on the target table. This requirement is to ensure consistency by blocking any other insert / update operations.
Now when it comes to 'local' temporary tables (the ones which have a single # prefix), it is implicit that the current session has exclusive access to the local temporary table. In turn, this satisfies the condition that the TABLOCK hint would otherwise be needed to achieve. Hence, if the target table for the INSERT is a 'local' temporary table, the optimizer will consider parallelizing the INSERT in case the costs are suitably high. In most cases, this will result in a positive effect on performance, but if you observe PFS resource contention caused by this parallel insert, you can consider one of the following workarounds:
Create an index on the temporary table. The described issue only occurs with temporary table heaps.
Use the MAXDOP 1 query hint for the problematic INSERT…SELECT operations, as shown in the sketch below.
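The sketch below uses hypothetical table and column names; the hint simply forces the problematic statement back to a serial insert:
-- Sketch (illustrative names): force a serial INSERT into the local temporary
-- table to avoid PFS contention from the default parallel insert behavior.
CREATE TABLE #StagingRows (OrderID int, OrderDate date, Amount money);

INSERT #StagingRows (OrderID, OrderDate, Amount)
SELECT OrderID, OrderDate, Amount
FROM dbo.Orders
WHERE OrderDate >= '20160101'
OPTION (MAXDOP 1);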
Fine Print
A few additional points to consider when leveraging this exciting new capability are listed below. We would love your feedback (please use the Comments section below) on whether any of the items below are blocking you in any way in your specific workloads.
Just as it is with SQL Server 2016, in order to utilize the parallel insert in Azure SQL DB, do ensure that your compatibility level is set to 130 (see the sketch after this list). In addition, it is recommended to use a suitable SKU from the Premium service tier to ensure that the I/O and CPU requirements of parallel insert are satisfied.
The usage of any scalar UDFs in the SELECT query will prevent the usage of parallelism. While usage of non-inlined UDFs is in general 'considered harmful', they also end up 'blocking' usage of this new feature.
Presence of triggers on the target table and / or indexed views which reference this table will prevent parallel insert.
If SET ROWCOUNT is enabled with a non-zero value for the session, then we cannot use parallel insert.
If the OUTPUT clause is specified in the INSERT…SELECT statement to return results to the client, then parallel plans are disabled in general, including INSERTs. If the OUTPUT…INTO clause is specified to insert into another table, then parallelism is used for the primary table, and not used for the target of the OUTPUT…INTO clause.
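Here is the sketch referenced in the first item above, for checking and, if required, raising the compatibility level (the database name is illustrative):
-- Sketch: verify the current compatibility level and raise it to 130 if needed
-- (replace MyDatabase with your database name).
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'MyDatabase';

ALTER DATABASE [MyDatabase] SET COMPATIBILITY_LEVEL = 130;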
Summary
Whew! We covered a lot here, so here’s a quick recap:
For row store targets, parallel INSERT is used only when inserting into a heap without any non-clustered indexes. It is also used when inserting into a Columnstore index.
If the target table has an IDENTITY column present then you need to work around appropriately to leverage parallel INSERT.
Choose your degree of parallelism carefully – it impacts throughput. Also, in the case of Columnstore it impacts the quality of the row groups created.
To maximize the impact and benefits of the parallel INSERT operation, the system should be configured appropriately (no I/O bottlenecks, sufficient number of data files).
Be aware of the power and benefit of minimal logging – something you get for free when parallel INSERT is used in databases with the simple recovery model.
Be aware of the fact that large INSERTs into local temporary tables are candidates for parallel insert by default.
We hope you enjoyed this post, and if you did, we’d love to hear your comments! If you have questions as well please do not hesitate to ask!
Appendix: Test Setup
For the tests in this post, we are using the Airline Origin and Destination Survey (DB1B) Coupon dataset. There are a large number of rows in that table (we tested with one slice, for the year 1993), and this being a real-world dataset, it is quite representative of many applications. The destination table schema is identical to the source table schema. The test query is a very simple INSERT…SELECT of the form:
INSERT tempdb.[dbo].[DB1BCoupon_New]
(ItinID, Coupons, ..., Gateway, CouponGeoType)
SELECT ItinID, Coupons, ..., Gateway, CouponGeoType
FROM DB1b.dbo.DB1BCoupon_Rowstore AS R
WHERE Year = 1993
OPTION (MAXDOP 8);
The use of the MAXDOP query hint is so that we can test with differing parallelism levels. The tests were performed on a SQL Server 2016 instance running on Windows Server 2012 R2. The storage used was high-performance local PCIe storage cards. SQL Server was configured to use large pages (-T834) and was set to a maximum of 3.7TB of RAM.
Appendix: Table schemas
Here’s the definition for the partition function:
CREATE PARTITION FUNCTION [pfn_ontime](smallint) AS RANGE RIGHT FOR VALUES (1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)
Here’s the definition for the partition scheme used:
CREATE PARTITION SCHEME [ps_ontime] AS PARTITION [pfn_ontime] ALL TO ([PRIMARY])
Reviewed by Jeff Papiez, Mike Weiner, Troy Moen, Suresh Kandoth
It has been a while now since SQL Server 2016 became generally available. We trust you are excited by the great capabilities that SQL Server 2016 brings, and have either already installed it or will be installing it soon.
Critical Visual C++ Runtime Update
At this time, we want to remind you of a critical Microsoft Visual C++ 2013 runtime pre-requisite update that may be* required on machines where SQL Server 2016 will be, or has been, installed. Installing this, via either of the two methods described below, will update the Microsoft Visual C++ 2013 runtime to avoid a potential stability issue affecting SQL Server 2016 RTM.
* You can determine if an update is required on a machine via one of the two methods below:
Select View Installed Updates in the Control Panel and check for the existence of either KB3164398 or KB3138367. If either is present, you already have the update installed and no further action is necessary.
Check if the version of %SystemRoot%\system32\msvcr120.dll is 12.0.40649.5 or later. If it is, you already have the update installed and no further action is necessary. (To check the file version, open Windows Explorer, locate and then right-click the %SystemRoot%\system32\msvcr120.dll file, click Properties, and then click the Details tab.)
Obtaining the critical update
As described in KB3164398 and in the SQL 2016 Release notes, there are three methods to obtain the fix for the Microsoft Visual C++ 2013 runtime if required:
The quickest and simplest method is to install the update provided by Visual Studio, KB3138367 – Update for Visual C++ 2013 and Visual C++ Redistributable Package. This will mitigate the potential SQL Server 2016 stability issue and negate the need for applying the alternative (and much larger) SQL Server 2016 update described below. Applying KB3138367 can be performed before, or after, SQL Server 2016 has been installed on a machine. KB3138367 is available on the Microsoft Download Center.
The updated Visual C++ 2013 runtime binaries are also included in SQL Server 2016 RTM Cumulative Update #1 (CU1). You can optionally download CU1 rather than KB3164398 and utilize the UPDATESOURCE method described below to receive other valuable product updates also included in CU1 and subsequent CUs.
If you determine the update is required on a machine where SQL Server 2016 will be installed, and choose to apply KB3164398 directly (the second method above), you have the option to download the update and have it applied as part of the installation without internet connectivity present.
This blog post details the steps to integrate KB3164398 when you perform a SQL Server 2016 RTM installation on a computer with no access to the Internet (a.k.a. an offline install).
Step 1: Download, but do not execute, the KB3164398 update package
Download the correct file (SQLServer2016-KB3164398-x64.exe) from the Microsoft Download Center link mentioned in the KB article 3164398.
For example, let's say that you downloaded the update package to C:\temp\SQL2016_GDR.
Step 2: Execute SQL Server 2016 RTM setup.exe from the command line and include the /UPDATESOURCE parameter
This step is where we ‘tell’ SQL 2016 RTM setup.exe to incorporate (slipstream) the now accessible KB3164398 update into the desired installation or upgrade without internet connectivity. To do this, we must use the /UPDATESOURCE parameter to RTM setup.exe from an administrative command prompt:
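The exact command depends on where your installation media and the downloaded update live; with the folder from Step 1, it looks roughly like the following (the media path and action are illustrative):
REM Illustrative example only - adjust the media path, the action, and the update folder to your environment
D:\SQL2016Media\setup.exe /ACTION=Install /UPDATEENABLED=TRUE /UPDATESOURCE="C:\temp\SQL2016_GDR"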
The important thing to note above is the /ACTION parameter. Failure to specify a valid action will cause the /UPDATESOURCE parameter to be ignored. Typical valid values for the /ACTION parameter include the following:
Install (to install a new standalone instance of SQL Server 2016)
Upgrade (to upgrade an existing instance to SQL Server 2016)
InstallFailoverCluster (to install a failover clustered instance of SQL Server 2016)
In subsequent screens, you will see that the 'Extract Setup files' step has an 'In Progress' status. That means that the update package is being extracted and will be installed.
Skipping forward to the last ‘Ready to Install’ screen, you will observe that the ‘Product Update’ section (as highlighted in the below screenshot) has the properties as below.
Step 3: Validate the version of the Visual C++ 2013 runtime loaded by SQL Server 2016
To validate that the correct version of the VC++ runtime has now been installed and loaded, execute the following query using SQL Server Management Studio or SQLCMD:
SELECT name, file_version
FROM sys.dm_os_loaded_modules
WHERE name like '%msvcr120.dll%'
The version should be 12.0.40649.5. If that checks out, then you are good to go! If it does not, you are most likely missing a reboot. Did you skip that reboot when prompted by setup?
Step 4: Validate the update has been applied
You may also validate successful installation of the update in the new instance by executing:
SELECT @@VERSION
Given the various options described above, please note the following:
If you had simply installed KB3138367 (Method 1 described in the ‘Obtaining the critical update’ section), then the version number for SQL Server will remain at 13.0.1601.5.
If you followed Method 2, the output of SELECT @@VERSION will be as shown below. Notice the RTM-GDR keyword, which tells you that the GDR update has been applied:
Microsoft SQL Server 2016 (RTM-GDR) (KB3164398) - 13.0.1708.0 (X64)
If you followed Method 3 and used CU1, the output of SELECT @@VERSION would be as shown below. The RTM-CU1 clearly indicates that the SQL engine has been updated to CU1.
Microsoft SQL Server 2016 (RTM-CU1) (KB3164674) - 13.0.2149.0 (X64)
We hope these steps clarify the method of integrating the critical update for Microsoft Visual C++ runtime with SQL Server 2016 setup. In case of any questions, please leave your Comments below!
Reviewed by: Denzil Ribeiro, Mike Weiner, Arvind Shyamsundar, Sanjay Mishra, Murshed Zaman, Peter Byrne, Purvi Shah
SQL Server 2016 introduces changes to the default behavior of checkpoint. In a recent customer engagement, we found the behavior change to result in higher disk (write) queues on SQL Server 2016 vs. the same workload on SQL Server 2012. In this blog we'll describe the changes, the options available to control them, and the impact they might have on workloads that are upgrading to SQL Server 2016. In this specific case, changing the database to use the new default behavior of checkpoint proved to be very beneficial.
A checkpoint in SQL Server is the process by which the database engine writes modified data pages to data files. Starting with SQL Server 2012, more options have been provided to better control how checkpoint behaves, specifically indirect checkpoint. The default checkpoint behavior in SQL Server prior to 2016 is to run automatic checkpoints when the log records reach the number of records the database engine estimates it can process within the "recovery interval" (a server configuration option). When an automatic checkpoint occurs, the database engine flushes the modified data pages in a burst fashion to disk. Indirect checkpoint provides the ability to set a target recovery time for a database (in seconds). When enabled, indirect checkpoint results in constant background writes of modified data pages vs. periodic flushes of modified pages. The use of indirect checkpoint can result in "smoothing" out the writes and lessening the impact that short periodic bursts of flushes have on other I/O operations.
In addition to configuring indirect checkpoint, SQL Server also exposes the ability to utilize a startup parameter (-k) followed by a decimal value, which will configure the checkpoint speed in MB per second. This is also documented in the checkpoint link above. Keep in mind this is an instance-level setting and will impact all databases which are not configured to use indirect checkpoint.
For further internals around checkpoint reference: “How It Works: Bob Dorr’s SQL Server I/O Presentation”. For the purposes of this blog we’ll focus on what has changed and what this means for workloads that are upgrading to SQL Server 2016.
Key Changes to Checkpoint Behavior in SQL 2016
The following are the primary changes which will impact behavior of checkpoint in SQL Server 2016.
Indirect checkpoint is the default behavior for new databases created in SQL Server 2016. Databases which were upgraded in place or restored from a previous version of SQL Server will use the previous automatic checkpoint behavior unless explicitly altered to use indirect checkpoint.
When performing a checkpoint SQL Server considers the response time of the I/O’s and adjusts the amount of outstanding I/O in response to response times exceeding a certain threshold. In versions prior to SQL Server 2016 this threshold was 20ms. In SQL Server 2016 the threshold is now 50ms. This means that SQL Server 2016 will wait longer before backing off the amount of outstanding I/O it is issuing.
The SQL Server engine will consolidate modified pages into a single physical transfer if the data pages are contiguous at the physical level. In prior versions, the max size for a transfer was 256KB. Starting with SQL Server 2016 the max size of a physical transfer has been increased to 1MB potentially making the physical transfers more efficient. Keep in mind these are based on continuity of the pages and hence workload dependent.
To determine the current checkpoint behavior of a database, query the sys.databases catalog view:
SELECT name, target_recovery_time_in_seconds FROM sys.databases WHERE name = 'TestDB'
A non-zero value for target_recovery_time_in_seconds means that indirect checkpoint is enabled. If the setting has a zero value it indicates that automatic checkpoint is enabled.
This setting is controlled through an ALTER DATABASE command.
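For example, to switch an existing or upgraded database over to indirect checkpoint, set a non-zero target recovery time (the database name and the 60-second target below are illustrative):
-- Sketch: enable indirect checkpoint by setting a non-zero target recovery time.
ALTER DATABASE [TestDB] SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- Setting the value back to 0 reverts the database to automatic checkpoints.
ALTER DATABASE [TestDB] SET TARGET_RECOVERY_TIME = 0 SECONDS;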
Example of Differences in Checkpoint Behavior by Version
Below are some examples of the differences in behavior across versions of SQL Server, and with/without indirect checkpoint enabled. Notice the differences in disk latency (Avg. Disk sec/Write) in each of the examples. Each of the examples below is from an update heavy transactional workload. For each a 30-minute comparable sample has been captured and displayed.
Figure 3 – Checkpoint Pattern on SQL Server 2016 (Using Automatic Checkpoint – Maintains 2012 Behavior on Upgrade)
After moving to SQL Server 2016 notice that the latency and amount of I/O being issued (Checkpoint pages/sec) during the checkpoints increases. This is due to the change in how SQL determines when to back off the outstanding I/O being issued.
Figure 4 – Checkpoint Pattern on SQL 2016 (After Changing to Indirect Checkpoint)
After changing the configuration of the database to utilize indirect checkpoint, the SQL engine issues a constant stream of I/O to flush the modified buffers. This is represented as Background writer pages/sec on the graph above. This change has the effect of smoothing the checkpoint spikes and provides a more consistent response time on the disk.
Table 1 – Checkpoint and I/O Performance Metrics for Different SQL Versions and Checkpoint Configurations
In the above observe the following:
Automatic checkpoint in SQL Server 2012 can issue less outstanding I/O than SQL Server 2016. For this particular hardware configuration, the result is higher disk latency on SQL Server 2016 (and more queued I/O's) than on SQL Server 2012.
Indirect checkpoint in SQL Server 2016 has the effect of "smoothing" out the I/O requests for checkpoint operations and significantly reducing disk latency. So while this results in a more constant stream of I/O to the disks, the impact of the checkpoint on the disk, as well as on any other queries running, is lessened.
The counters which measure the amount of work being performed by checkpoint differ depending on the type of checkpoint enabled. The different counters can be used to quickly expose which type of checkpoint is in use and how much work it is doing on any given system.
Automatic checkpoints are exposed as “Checkpoint Pages/sec”
Indirect checkpoints are exposed as “Background Writer pages/sec”
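Both counters can also be sampled from T-SQL, which is handy when you want a quick look at which checkpoint mechanism is doing the work (a sketch):
-- Sketch: sample the checkpoint-related Buffer Manager counters.
-- These are cumulative 'per second' counters, so take two samples a known
-- interval apart and divide the difference by the interval to get the rate.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Checkpoint pages/sec', N'Background writer pages/sec');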
Summary
There are subtle differences in checkpoint behavior when migrating applications from previous versions of SQL Server to SQL Server 2016, and also differences in the configuration options available to control it. When migrating applications to SQL Server 2016, make sure to understand the difference in behavior between databases newly created on SQL Server 2016 and those created on previous versions, and the configuration options you have available. Indirect checkpoint is the new default, and you should consider changing the configuration of existing databases to use indirect checkpoint. Indirect checkpoint can be a very effective approach to minimizing the impact of the more aggressive automatic checkpoint in SQL Server 2016 for systems with I/O configurations that cannot handle the additional load.
Reviewed by Panagiotis Antonopoulos, Jakub Szymaszek, Raghav Kaushik
Always Encrypted is one of the compelling features in SQL Server 2016 and in Azure SQL DB which provides a unique guarantee that data in the database cannot be viewed, accidentally or intentionally by users who do not have the ‘master key’ required to decrypt that data. If you want to know more about this feature, please review the product documentation at the previous link or watch the Channel 9 video on this topic.
Customer Scenario
In a recent case, we were working with a customer who was trying to use Table Valued Parameters (TVPs) to do a ‘batch import’ of data into the database. The TVP was a parameter into a stored procedure, and the stored procedure was in turn joining the values from the TVP ‘table’ with some other tables and then performing the final insert into a table which had some columns encrypted with Always Encrypted.
Now, most of the ‘magic’ behind Always Encrypted is actually embedded in the client library which is used. Unfortunately, none of the client libraries (.NET, JDBC or ODBC) support encrypted columns passed within TVPs. So, we needed a viable workaround in this case to unblock the customer. In this blog post, we explain this workaround by using a simple example.
Walkthrough: Working with Bulk data in Always Encrypted
We first proceed to create Column Master Key (CMK) and a Column Encryption Key (CEK). For the CMK, we used a certificate from the Current User store for simplicity. For more information on key management in Always Encrypted, please refer to this link.
Then we create the final table, with the encrypted column defined. Note that in the real application, this table already exists, with data in it. We’ve obviously simplified the scenario here!
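To give an idea of what that table looks like, here is a minimal sketch, assuming a column encryption key named CEK1 was created from the CMK in the previous step; note that encrypted character columns must use a BIN2 collation:
-- Minimal sketch of the target table (the CEK name is an assumption).
CREATE TABLE dbo.FinalTable
(
    idCol   INT NOT NULL,
    somePII NVARCHAR(100) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NOT NULL
);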
Reworking the application to use SqlBulkCopy instead of TVPs
With this setup on the database side of things, we proceed to develop our client application to work around the TVP limitation. The key to doing this is to use the SqlBulkCopy class in .NET Framework 4.6 or above. This class ‘understands’ Always Encrypted and should need minimal rework on the developer front. The reason for the minimal rework is that this class actually accepts a DataTable as parameter, which is previously what the TVP was passed as. This is an important point, because it will help minimize the changes to the application.
Let’s get this working! The high-level steps are outlined below; there is a full code listing at the end of this blog post as well.
Populate the DataTable as before with the bulk data
As mentioned before the creation and population of the DataTable does not change. In the sample below, this is done in the MakeTable() method.
Using client side ad-hoc SQL, create a staging table on the server side.
This could also be done using T-SQL inside a stored procedure, but we had to uniquely name the staging table per-session so we chose to create the table from ad-hoc T-SQL in the application. We did this using a SELECT … INTO with a dummy WHERE clause (in the code listing, please refer to the condition ‘1=2’ which allows us to efficiently clone the table definition without having to hard-code the same), so that the column encryption setting is retained on the staging table as well. In the sample below, this step is done in the first part of the DoBulkInsert method.
Use the SqlBulkCopy API to ‘bulk insert’ into staging table
This is the core of the process. The important things to note here are the connection string (in the top of the class in the code listing) has the Column Encryption Setting attribute set to Enabled. When this attribute is set to Enabled, the SqlBulkCopy class interrogates the destination table and determines that a set of columns (in our sample case, it is just one column) needs to be encrypted before passing to server. This step is in the second part of the DoBulkInsert method.
Move data from staging table into final table
In the sample application, this is done by using an ad-hoc T-SQL statement to simply append the new data from the staging table into the final table. In the real application, this would typically be done through some T-SQL logic within a stored procedure or similar.
There is an important consideration here: encrypted column data cannot be transformed on the server side. This means that no expressions (columns being concatenated, calculated or transformed in any other way) are permitted on the encrypted columns on server side. This limitation is true regardless of whether you use TVPs or not, but might become even more important in the case where TVPs are used.
In our sample application we just inserted the data from the staging table into the final table, and then drop the staging table. This code is in the InsertStagingDataIntoMainTable method in the listing below.
Conclusion
While Always Encrypted offers a compelling use case to protect sensitive data on the database side, there are some restrictions it poses to the application. In this blog post we show you how you can work around the restriction with TVPs and bulk data. We hope this helps you move forward with adopting Always Encrypted! Please leave your comments and questions below, we are eager to hear from you!
Appendix: Client Application Code
Here is the client application code used.
namespace TVPAE
{
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
class Program
{
static private string TVPAEConnectionString = "Server=.;Initial Catalog=TVPAE;Integrated Security=true;Column Encryption Setting=enabled;";
static void Main(string[] args)
{
var stgTableName = DoBulkInsert(MakeTable());
InsertStagingDataIntoMainTable(stgTableName);
}
private static DataTable MakeTable()
{
DataTable newData = new DataTable();
// create columns in the DataTable
var idCol = new DataColumn()
{
DataType = System.Type.GetType("System.Int32"),
ColumnName = "idCol",
AutoIncrement = true
};
newData.Columns.Add(idCol);
var somePII = new DataColumn()
{
DataType = System.Type.GetType("System.String"),
ColumnName = "somePII"
};
newData.Columns.Add(somePII);
// create and add some test data
var rand = new Random();
for (var loopCount = 0; loopCount < 10000; loopCount++)
{
var datarowSample = newData.NewRow();
datarowSample["somePII"] = DateTime.Now.ToLongDateString();
newData.Rows.Add(datarowSample);
}
newData.AcceptChanges();
return newData;
}
private static void InsertStagingDataIntoMainTable(string stgTableName)
{
using (var conn = new SqlConnection(TVPAEConnectionString))
{
conn.Open();
using (var cmd = new SqlCommand("BEGIN TRAN; INSERT FinalTable SELECT * FROM [" + stgTableName + "]; DROP TABLE [" + stgTableName + "]; COMMIT", conn))
{
Console.WriteLine("Inserted rowcount: " + cmd.ExecuteNonQuery().ToString());
}
}
}
private static string DoBulkInsert(DataTable stagingData)
{
string stagingTableName = "StagingTable_" + Guid.NewGuid().ToString();
using (var conn = new SqlConnection(TVPAEConnectionString))
{
conn.Open();
// create the staging table - note the use of the dummy WHERE 1 = 2 predicate
using (var cmd = new SqlCommand("SELECT * INTO [" + stagingTableName + "] FROM FinalTable WHERE 1 = 2;", conn))
{
cmd.ExecuteNonQuery();
}
using (var bulkCopy = new SqlBulkCopy(conn))
{
bulkCopy.DestinationTableName = "[" + stagingTableName + "]";
bulkCopy.WriteToServer(stagingData);
}
}
return stagingTableName;
}
}
}
Deleting all rows from a given partition is a very common operation on a partitioned table, especially in a sliding window scenario. In a sliding window scenario, when a new period starts, a new partition is created for the new data corresponding to this period, and the oldest partition is either removed or archived.
To remove or archive the oldest partition, the general practice is to switch the partition out to a temporary staging table. The SWITCH operation for a partition is a simple statement, but it takes a bit of preparation for the SWITCH to work. The staging table needs to follow certain rules:
the staging table must have the same structure as the main partitioned table
the staging table must be empty
the staging table must reside on the same file group as the partition being switched out
the staging table must have all matching clustered and non-clustered indexes created
If the data from the oldest partition needs to be archived and saved somewhere, it makes sense to switch the data out to a staging table and process it for archiving. However, if the goal is simply to delete the data from the partition, then the programming needed for creating the staging table and switching the partition may be cumbersome.
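For contrast, the pre-SQL Server 2016 pattern looks roughly like the sketch below; the staging table name is hypothetical, and the table must already exist, be empty, match the structure and indexes, and reside on the same filegroup as the partition being switched out:
-- Pre-2016 sketch: switch the oldest partition out to an aligned, empty staging
-- table, then drop (or archive) the staging table.
ALTER TABLE dbo.DB1BTicket SWITCH PARTITION 7 TO dbo.DB1BTicket_Staging;
DROP TABLE dbo.DB1BTicket_Staging;
GO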
SQL Server 2016 addresses this by allowing the TRUNCATE operation on individual partitions of a table. SQL Server 2016 introduces a WITH PARTITIONS clause for the TRUNCATE TABLE statement that allows specifying a selected set of partitions (you can specify more than one partition at once). Needless to say, like truncating a table, truncating a partition is also a metadata-only operation.
Example:
TRUNCATE TABLE DB1BTicket WITH (PARTITIONS (7, 8))
GO
DBCC CHECKDB is a common database maintenance task. It can take up a significant amount of system resources, and can impact the performance of the production workload. There are some very good articles on the web on optimizing the performance of DBCC CHECKDB and minimizing its performance impact. SQL Server 2016 (and now backported to SQL Server 2014 SP2) provides another lever to manage the resources consumed by DBCC CHECKDB. Now you can apply a MAXDOP option to the DBCC CHECKDB command (and to the DBCC CHECKTABLE and DBCC CHECKFILEGROUP commands as well).
When MAXDOP is not specified with DBCC CHECKDB, the command uses the instance level “max degree of parallelism” configuration option. If the instance level configuration is 0 (default), DBCC CHECKDB could employ all the processors on the server and consume lots of resources, leaving very little room for the application workload. When a lower MAXDOP is used, less resources are used, but CHECKDB would take longer to finish.
The syntax of specifying MAXDOP to DBCC CHECKDB is pretty simple:
DBCC CHECKDB WITH MAXDOP = 4
Note that this command respects the MAX_DOP value that may be specified for the Resource Governor workload group used for the session running the command. If the MAXDOP value specified in the DBCC CHECKDB command is greater than the one in the Resource Governor configuration, then the latter will be used.
Figure 1 shows the elapsed time and CPU percentage for a DBCC CHECKDB test with and without MAXDOP.
In the above test, the server has default MAXDOP setting of 0. The server is 24-cores and the database size is about 190 GB. This shows that as the MAXDOP for the DBCC CHECKDB command is lowered from 0 (meaning all 24 cores) to 4, the time it takes to run increased from about 400 seconds to about 1100 seconds, while average CPU utilization is reduced from about 70% to about 10%, making the impact of DBCC CHECKDB on the application workload nearly negligible. Your mileage will vary, depending upon your hardware configuration.
Contributors and Reviewers: John Hoang, Sanjay Mishra, Alexei Khalyako, Sourabh Agarwal, Osamu Hirayama, Shiyang Qiu
Overview: Migrate data to Azure SQL Data Warehouse
Azure SQL Data Warehouse is an enterprise-class, distributed database, capable of processing massive volumes of relational and non-relational data. It can deploy, grow, shrink, and pause in seconds. As an Azure service, Azure SQL Data Warehouse automatically takes care of software patching, maintenance, and backups. Azure SQL Data Warehouse uses the Microsoft massive parallel processing (MPP) architecture. MPP was originally designed to run large on-premises enterprise data warehouses. For more information on Azure SQL Data Warehouse, see What is Azure SQL Data Warehouse?
This article focuses on migrating data to Azure SQL Data Warehouse with tips and techniques to help you achieve an efficient migration. Once you understand the steps involved in migration, you can practice them by following a running example of migrating a sample database to Azure SQL Data Warehouse.
Migrating your data to Azure SQL Data Warehouse involves a series of steps. These steps are executed in three logical stages: Preparation, Metadata migration and Data migration.
Figure 1: The three logical stages of data migration
In each stage, tasks to be executed involve the on-premises database system, the on-premises local storage, the network connecting the local system to Azure (either the internet or a dedicated circuit), and Azure SQL Data Warehouse. This results in a physical data movement from the source database to Azure as shown below. (These steps are similar when moving data to Azure SQL Data Warehouse from any other source system hosted in the cloud rather than on-premises.)
Figure 2: Physical data movement from the source database to Azure
Steps in the preparation stage start at the source database, where you choose the entities and attributes to migrate. You allocate local storage for further steps to come, establish a network to Azure, create a storage account and create an instance of Azure SQL Data Warehouse on Azure.
Metadata migration involves compatibility assessment and corrections, exporting the metadata, copying the metadata from the source system to Azure, and importing the metadata onto Azure SQL Data Warehouse.
Data Migration involves making the data-level compatibility changes if any, filtering and extracting the data to migrate, performing format conversions on the extracted data as necessary, compressing the data, copying the data to Azure, loading the transferred data, and doing post-load transformations and optimizations.
These steps are illustrated in the diagram below. The steps result in a logical flow from top to bottom and a physical flow from left to right.
(Arrows indicate a dependency: the latter step depends on the successful completion of former steps)
Figure 3: Data migration process that results in a logical flow from top to bottom and a physical flow from left to right
If the volume of the data to migrate is large, some steps can be time consuming. These steps are rate-determining because they influence the overall migration time. Such steps are shaded in color.
Some migration steps may be optional depending on the size of the data, the nature of the network, and the tools and services used for migration. Optional steps are shown with dotted lines.
Example
To practice and understand the steps, you can follow a running example that migrates a sample database to Azure SQL Data Warehouse. To try out the sample, you’ll need:
On Azure:
An Azure subscription
An Azure storage account
An Azure SQL Data Warehouse database
On a local computer:
The latest SQL Server 2016 build installed: Download Link.
Note: The sample database accompanying this document is a backup created from SQL Server 2016. You need a version of SQL Server 2016 to restore it. The steps described in this document can also be applied to a database created in earlier versions of SQL Server, such as SQL Server 2012 and 2014.
The Azure Storage Explorer Tool installed: Download
Choose a migration approach
How the data migration steps are executed usually affects the performance, maintainability and reliability of the migration. Approaches for migrating data to Azure SQL Data Warehouse can be classified based on where the data migration is orchestrated from, and based on whether the migration operations are individualized or combined.
Source controlled or Azure controlled:
Source Controlled: Here the logic for data export, transfer and import steps runs mostly from the source data system. Source-controlled migrations can reuse existing computer and storage resources at the source system for some of the migration steps. Source-controlled migrations don’t require connectivity to Azure for some of the steps. Source-controlled migrations may use custom scripts and programs or ETL tools like SSIS run from the source database server.
Azure Controlled: Here the logic for the data export, transfer and import steps runs mostly from Azure. Azure-controlled migrations aim to reduce non-Azure assets for greater maintainability, and to do on-demand migrations by spinning up or scheduling Azure resources as needed. Azure- controlled migrations require connectivity from Azure to the source system to export data. Azure-controlled migrations may run migration logic in virtual machines running on Azure with the virtual machines being allocated and deallocated on demand.
Differentiated or integrated:
Differentiated approach: Here the data export, transfer and import are distinctly executed with each reading from or writing to intermediate files. File compression is invariably used to reduce the cost and time in transferring files from the source system to Azure. Compressed files are transferred to Azure-based storage before import. When the connectivity from source to Azure is lower in bandwidth or reliability, this approach may turn out to be more feasible. Differentiated approaches are typically realized by custom programs and scripts working independently, using tools such as bcp.exe, AzCopy, and compression libraries.
Integrated approach: Here the data export, transfer and import are combined, and entities are transferred directly from the source data system to Azure SQL Data Warehouse with no intermediate files created. This approach has fewer moving pieces and tends to be more maintainable. However, it does not compress data in bulk, and can result in slower data transfer to Azure. It needs good connectivity from the source to Azure for repeatable and reliable migrations. Integrated approaches are typically realized by ETL tools like SSIS or with Azure Data Factory with the Data Management Gateway, which is an on-premises installable agent that enables data movement from on-premises to Azure. Refer to the documentation on moving data to Azure with Data Management Gateway for more information.
It’s possible to use a hybrid approach, where operations are partly controlled from source and partly from Azure. With Data Factory and the Data Management Gateway, you can also build data pipelines that do one or more operations in the differentiated approach such as for example, moving data from SQL Server to File system/Blob and moving blobs from blob storage to Azure SQL Data Warehouse.
Often the speed of migration is the overriding concern compared to ease of setup and maintainability, particularly when there’s a large amount of data to move. When optimizing purely for speed, a source-controlled, differentiated approach works best: rely on bcp to export data to files, move the files efficiently to Azure Blob storage, and use the PolyBase engine to import from Blob storage.
Example:
In our running example, we choose the Source controlled and Differentiated approach, as it favors speed and customizability.
Note: You can also migrate the AdventureWorksDW sample Database to Azure SQL Data Warehouse by the other strategies, using SSIS or Azure Data Factory.
Preparation steps
Source data system: preparation
On the source, establish connectivity to the source data system, and choose which data entities and which attributes to migrate to Azure SQL Data Warehouse. It’s best to leave out entities and objects that aren’t going to be processed on Azure SQL Data Warehouse. Examples of these are log or archival tables and temporarily created tables.
Tip: Don’t migrate more objects than you need. Moving unnecessary data to cloud and having to purge data and objects on Azure SQL Data Warehouse can be wasteful. Depending on the sizes of unused objects, the cost and time of the data export, local transformations, and transfer increase.
Example
In our example, we migrate all the tables in the AdventureWorks DW database, since it’s a relatively small database.
Local storage: preparation
If the exported data will be stored locally prior to transfer (the differentiated approach), ensure, at a minimum, that the local storage system has sufficient space to hold all of the exported data and metadata, the locally transformed data, and the compressed files. For better performance, use a storage system with sufficient independent disk resources that allow read/write operations with little contention.
If the data transfer will be directly from the source data system to Azure SQL Data Warehouse (the Integrated approach), skip this step.
Example
In our example, you need about 500 MB of free space on the SQL Server Machine to hold the exported, format converted, and compressed data files for the AdventureWorksDW sample database tables.
Network: preparation
You can establish a connection to Azure via the public internet or using dedicated connectivity. A dedicated connection can provide better bandwidth, reliability, latency, and security compared to the public internet. On Azure, dedicated networking is offered through the ExpressRoute service. Depending on the migration approach, the connectivity established can be used to move data to Azure SQL Data Warehouse directly, or to move intermediate files to Azure storage.
Tip: If the size of the data to transfer is large, or you want to reduce the time it takes to transfer data or improve the reliability in data transfer, try ExpressRoute.
Example
In our example, we transfer the data over the public internet to an Azure Storage location in the same region as the Azure SQL Data Warehouse because the data to transfer is relatively small. This requires no special network establishment step, but make sure that you’re connected to the internet during the following steps:
Azure preparation
Metadata copy to Azure and metadata import
The steps in the Data Migration section involving data transfer and import
Azure preparation
Prepare to receive the data on Azure:
Choose an Azure region where the Azure SQL Data Warehouse is available.
Create the Azure SQL Data Warehouse database.
Create a storage account.
Prepare the Azure SQL Data Warehouse for data Import.
Tip: For speedy data movement, choose the Azure region closest to your data source that also has Azure SQL Data Warehouse, and create a storage account in the same region.
Example
To find the regions where Azure SQL Data Warehouses are located, refer to Azure Services by Region. Choose the region closest to you.
Create a storage account in the same Azure region where you created the Azure SQL Data Warehouse using the steps described in About Azure storage Accounts. A locally redundant storage (LRS) is sufficient for this example.
Create at least one container in the storage account. To continue with this example, you’ll need the following:
Name of the container you created above.
Name of the storage account you created.
Storage access key for the storage account. You can get this by following the steps under “View and copy storage access keys” in About Azure storage Accounts.
Server name, user name, and password for the Azure SQL Data Warehouse.
Prepare the Azure SQL Data Warehouse for data import: For fast and parallel data imports we choose PolyBase within Azure SQL Data Warehouse to load data. To prepare the target database for import, you need the information collected above. First, create a database master key:
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys)
CREATE MASTER KEY
Create a database scoped credential
IF NOT EXISTS (SELECT * FROM sys.database_scoped_credentials WHERE name='AzSqlDW_AzureStorageCredentialPolybase')
CREATE DATABASE SCOPED CREDENTIAL AzSqlDW_AzureStorageCredentialPolybase
WITH IDENTITY = 'AzSqlDW_Identity' , SECRET = '<YourStorageAccountKey>'
Create an external data source
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'AzSqlDW_AzureBlobStorage')
CREATE EXTERNAL DATA SOURCE AzSqlDW_AzureBlobStorage WITH (TYPE = HADOOP ,
LOCATION=
'wasbs://<YourStorageContainerName>@<YourStorageAccountName>.blob.core.windows.net',
CREDENTIAL = AzSqlDW_AzureStorageCredentialPolybase);
Create an external file format
IF NOT EXISTS(SELECT * FROM sys.external_file_formats WHERE name = 'AzSqlDW_TextFileGz')
CREATE EXTERNAL FILE FORMAT AzSqlDW_TextFileGz WITH(FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec' );
In the above T-SQL code, replace YourStorageAccountName, YourStorageAccountKey and YourStorageContainerName with your corresponding values.
Tip: To prepare for a parallelized data import with Polybase, create one folder in the storage container for each source table—the folder name could be the same as the table name. This allows you to split the data from large tables into several files and do a parallel data load into the target table from the multiple blobs in the container. You can also create a subfolder hierarchy based on how the source table data is grouped, which gives you control over the granularity of your load. For example, your subfolder hierarchy could be Data/Year/Quarter/Month/Day/Hour. This is also handy for incremental loads, for example when you want to load a month of new data.
Metadata migration
Compatibility checks and changes
The source objects to migrate need to be compatible with Azure SQL Data Warehouse. Resolve any compatibility issues at the source before starting migration.
Tip: Do compatibility assessment and corrections as the first step in migration.
Note: The Data Warehouse Migration Utility can also help automate the migration itself. Note that the tool does not compress files, move data to Azure storage, or use PolyBase for import, and certain other steps, such as the “Azure preparation” steps and the UTF-8 conversion, are not performed by the tool. The tool generates bcp scripts that move your data first to flat files on your server, and then directly into your Azure SQL Data Warehouse. The tool can be a simple option for small amounts of data.
A list of SQL Server functionality that is not present in Azure SQL Data Warehouse can be found in the migration documentation. In each table, make sure:
There are no incompatible column types.
There are no columns that use user-defined data types.
In addition, when using PolyBase for data loading, the following limitations need to be checked:
The total size of all columns is <= 32767 bytes
There are no varchar(max), nvarchar(max), varbinary(max) columns
The maximum length of individual columns is <= 8000 bytes
Note: Azure SQL Data Warehouse currently supports rows larger than 32K and data types over 8K. Large row support adds support for varchar(max), nvarchar(max) and varbinary(max). In this first iteration of large row support, there are a few limits in place which will be lifted in future updates. Currently, loads for large rows are supported only through Azure Data Factory (with BCP), Azure Stream Analytics, SSIS, BCP, or the .NET SqlBulkCopy class. PolyBase support for large rows will be added in a future release. This article demonstrates data load using PolyBase.
Example
Check the tables in the sample database (except for the total column size) for compatibility using the following query:
SELECT t.[name],c.[name],c.[system_type_id],c.[user_type_id],y.[is_user_defined],y.[name]
FROM sys.tables t
JOIN sys.columns c ON t.[object_id] = c.[object_id]
JOIN sys.types y ON c.[user_type_id] = y.[user_type_id]
WHERE y.[name] IN
('geography','geometry','hierarchyid','image','ntext','numeric','sql_variant'
,'sysname','text','timestamp','uniqueidentifier','xml')
OR (y.[name] IN ( 'varchar','varbinary') AND ((c.[max_length] = -1) or (c.max_length > 8000)))
OR (y.[name] IN ( 'nvarchar') AND ((c.[max_length] = -1) or (c.max_length > 4000)))
OR y.[is_user_defined] = 1;
When you run this query against the sample database, you’ll find that the DatabaseLog table is incompatible. There are no incompatible column types, but the TSQL column is declared as nvarchar(4000), which is 8000 bytes in maximum length.
To resolve the incompatibility, find the actual sizes of this and other variable columns in the DatabaseLog table and their total length using the following TSQL queries:
SELECT MAX(DATALENGTH([DatabaseUser])),MAX(DATALENGTH([Event])),MAX(DATALENGTH([Schema])),MAX(DATALENGTH([Object])),MAX(DATALENGTH([TSQL]))
FROM DatabaseLog
You’ll find that the actual maximum data length of the TSQL column is 3034. The total of the maximum data lengths of the columns is 3162. These are within the maximum allowed column lengths and row lengths in Azure SQL Data Warehouse. No data needs to be truncated to meet the compatibility requirement, and we can instead modify the TSQL column as nvarchar(3034) in the exported schema.
Similarly, the sum of the declared column lengths in the DimProduct table exceeds the maximum allowed total row length. This can be resolved in a similar way.
Metadata export
After you’ve made the necessary changes for Azure SQL Data Warehouse compatibility, export your metadata (schema) so that the same schema can be imported onto Azure SQL Data Warehouse. Script or otherwise automate the metadata export so that it can be repeated without errors. A number of ETL tools can export metadata for popular data sources. Note that some further work is needed after the export. First, when creating tables in Azure SQL Data Warehouse, you need to specify the table distribution type (ROUND_ROBIN or HASH). Second, if you are using PolyBase to import data, you need to create external tables that refer to the locations of the exported files for each table.
Tip: Refer to the SQLCAT guidance for choosing the type of distributed table in Azure SQL Data Warehouse Service.
Note that Azure SQL Data Warehouse does not support a number of common table features, such as primary keys, foreign keys, and unique constraints. For a full list, please refer to Migrate your schema to Azure SQL Data Warehouse.
Example
The table creation statement for the AdventureWorksDWBuildVersion table, compatible with Azure SQL Data Warehouse, is as follows:
IF NOT EXISTS (SELECT * FROM sys.tables WHERE schema_name(schema_id) = 'dbo' AND name='AdventureWorksDWBuildVersion')
CREATE TABLE [dbo].[AdventureWorksDWBuildVersion]([DBVersion] nvarchar(100) NOT NULL,[VersionDate] datetime NOT NULL) WITH(CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN)
A full list of sample table creation commands for the AdventureWorks database can be found here.
The external table creation statement for the AdventureWorksDWBuildVersion table, compatible with Azure SQL DW, is as follows:
IF NOT EXISTS (SELECT * FROM sys.tables WHERE schema_name(schema_id) = 'dbo' AND name='AdventureWorksDWBuildVersion_External')
CREATE EXTERNAL TABLE [dbo].[AdventureWorksDWBuildVersion_External]([DBVersion] nvarchar(100) NOT NULL,[VersionDate] datetime NOT NULL)
WITH(LOCATION = '/dbo.AdventureWorksDWBuildVersion.UTF8.txt.gz', DATA_SOURCE = AzSqlDW_AzureBlobStorage, FILE_FORMAT = AzSqlDW_TextFileGz);
In the part of the statement starting from the WITH keyword, you need to provide values for the parameters: LOCATION, DATA_SOURCE and FILE_FORMAT.
The value of the LOCATION parameter should be the path where the data file for the table will reside on Azure Blob storage.
The value of the DATA_SOURCE parameter should be the name of the data source created in the “Prepare the Azure SQL Data Warehouse for data Import” section of “Azure preparation”.
The value of the FILE_FORMAT parameter should be the name of the file format created in the “Prepare the Azure SQL Data Warehouse for data Import” section of “Azure preparation”.
Note: /dbo.AdventureWorksDWBuildVersion.UTF8.txt.gz refers to a file location relative to the Azure storage container created under “Azure preparation”. This file does not exist yet; it will be created during data export and compression, so you can’t execute the external table creation commands just yet.
A full list of sample external table creation commands for the AdventureWorks database can be found here .
Metadata copy to Azure and metadata import
Since the metadata is usually small in size and the format well known, you don’t need further optimization or format conversions. Use SQL Server Data Tools (SSDT) or SSMS (July 2016 release) to execute the table creation statements against the target Azure SQL Data Warehouse database. To connect to Azure SQL Data Warehouse, specify the server name (of the form YourSQLDWServerName.database.windows.net), user name and database name (not the master database, which is the default) as chosen at the time of creation.
Example
Execute the statements using SSDT or SSMS (July 2016 release) to create the tables on Azure SQL Data Warehouse.
Note: You cannot yet execute the External Table Create statements, as the table data needs to be exported and moved to Azure Blob Storage before you can do this.
Data migration
Data: compatibility changes
In addition to the metadata changes for compatibility, you might need to convert data during extraction for error-free import into Azure SQL Data Warehouse. When importing with PolyBase, dates must be in the following formats when DATE_FORMAT is not specified:
DateTime: ‘yyyy-MM-dd HH:mm:ss’
SmallDateTime: ‘yyyy-MM-dd HH:mm’
Date: ‘yyyy-MM-dd’
DateTime2: ‘yyyy-MM-dd HH:mm:ss’
DateTimeOffset: ‘yyyy-MM-dd HH:mm:ss’
Time: ‘HH:mm:ss’.
Depending on your locale and current date format, you may need to convert date formats during export. Additionally, bcp exports data to field- and row-delimited files, but bcp by itself does not escape delimiters, so choose a delimiter that does not occur in any of the data in the table. Also, if you have used a data type for a column in an Azure SQL Data Warehouse table that is different from the corresponding column in the source table, ensure that during extraction the data is converted to a format compatible with the target.
Tip: Invalid export files can result in data being rejected by Azure SQL Data Warehouse during import. Preventing these errors saves you from file correction or re-extraction and retransfer efforts.
The most common mistakes include:
Malformed data files.
Un-escaped or missing field/row delimiters.
Incompatible date formats and other representations in extracted files.
The order of extracted columns differing from the column order expected during import.
The column names or the number of supplied values not matching the target table definition.
When there are individual rows with errors, you can get error messages like the following which will help determine what went wrong:
“Query aborted– the maximum reject threshold (… rows) was reached while reading from an external source: YYY rows rejected out of total ZZZ rows processed. (…) Column ordinal: .., Expected data type: …Offending value:”
Example
If you use an (un-escaped) comma as a field delimiter, you’ll get import errors for a number of tables in the sample database. A delimiter that does not occur in any of the tables is the pipe character. You can extract dates in a target format using the CONVERT function. An example for one of the tables in the sample database follows:
SELECT REPLACE([DBVersion],'|','||'),CONVERT(varchar(32), [VersionDate], 121)
FROM [AzureSQLDWAdventureWorks].[dbo].[AdventureWorksDWBuildVersion]
For a full list of extraction commands, refer to the “Data: export and format conversion” section.
Data: export and format conversion
When you don’t use an ETL tool like SSIS to integrate the steps of export, transfer, and load, or you’re following the differentiated approach in migration as discussed earlier, choose an extraction tool and optionally specify the extraction query to choose columns and filter rows. Data export can be CPU, memory, and IO intensive. To speed up data export, use bulk /batched extraction, parallelize extraction, and scale compute/memory/IO resources as needed.
You can use the bcp utility, which bulk copies data from an instance of Microsoft SQL Server to a data file in a user-specified format. Note that bcp data files don’t include any schema or format information. An independent schema import is essential before you import data generated by bcp on Azure SQL Data Warehouse. bcp can export data in character format (-c option) or Unicode character format (-w option).
Note: Bcp version 13 (SQL Server 2016) supports code page 65001 (UTF-8 encoding). This article demonstrates UTF-8 conversion as earlier versions of bcp did not have this support.
When importing data into Azure SQL Data Warehouse with PolyBase, non-ASCII characters need to be encoded using UTF-8. Hence, if your tables have data with extended ASCII characters, you need to convert the exported data to UTF-8 before importing. Also, when creating the bcp commands, note the need to escape delimiters, as mentioned in the earlier section.
Tip: If characters in the exported files don’t conform to the expected encoding, data import into Azure SQL Data Warehouse can fail. For example, if you have extended characters in tables, convert the files generated by bcp to UTF-8 programmatically or by using PowerShell commands.
The System.Encoding class in .NET provides support for programmatic conversion between Unicode and UTF-8.
Tip: The speed at which bcp exports data to files depends on a number of factors including command options such as batch_size, packet_size, rows_per_batch, query hints used such as TABLOCK, the extent of parallelism, the number of processing cores and the performance of the IO subsystem. For more information on bcp options, refer to the documentation on the bcp utility.
You can also experiment with parallelizing the process by running bcp in parallel for separate tables, or separate partitions in a single table.
Tip: Export data from large tables into multiple files so that they can be imported in parallel. Decide on a way to filter records based on attributes to implement multi-file export so that batches of records go into different files.
When the connection to Azure has reliability issues, implementing multi-file export for large tables increases the chances of individual file transfers to Azure being successful.
Example
A sample bcp command to export one of the tables in the sample database follows:
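The command itself is not reproduced in this copy of the document; a sketch of what it could look like for the AdventureWorksDWBuildVersion table follows. The server name and output path are placeholders, and the query matches the extraction query shown in the previous section (pipe delimiter, escaped delimiters, converted dates):
bcp "SELECT REPLACE([DBVersion],'|','||'), CONVERT(varchar(32), [VersionDate], 121) FROM [AzureSQLDWAdventureWorks].[dbo].[AdventureWorksDWBuildVersion]" queryout "C:\Temp\dbo.AdventureWorksDWBuildVersion.txt" -c -t "|" -S <YourSqlServerName> -T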
A full list of sample bcp commands for the sample database can be found here.
After bcp execution is complete, there should be 34 files created on disk, ending with .txt, corresponding to the 34 tables in the sample database.
The sample database has a number of tables with extended characters. Importing the bcp-generated files directly into Azure SQL Data Warehouse can fail. Sample code in C# to do the conversion is as follows:
// Requires: using System.IO; using System.Text;
public void ConvertTextFileToUTF8(string sourceFilePath, string destnFilePath)
{
    string strLine;
    // Read using the system ANSI code page (typical for bcp -c output); a BOM, if present, is detected automatically
    using (StreamReader reader = new StreamReader(sourceFilePath, Encoding.Default, true))
    {
        // StreamWriter writes UTF-8 (without a byte order mark) by default
        using (StreamWriter writer = new StreamWriter(destnFilePath))
        {
            while (!reader.EndOfStream)
            {
                strLine = reader.ReadLine();
                writer.WriteLine(strLine);
            }
        }
    }
}
You can also chain the Power Shell get-content and set-content cmdlets with the -encoding parameter option to change the encoding, as follows:
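The exact cmdlet chain is not shown here; a minimal sketch follows, with illustrative file names and the source file assumed to be in the system ANSI code page:
Get-Content -Encoding Default .\AdventureWorksDWBuildVersion.txt | Set-Content -Encoding UTF8 .\AdventureWorksDWBuildVersion.UTF8.txt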
We assume that after implementing one of the above approaches, the files in UTF-8 are named with the convention filename.UTF8.txt. For example, AdventureWorksDWBuildVersion.UTF8.txt.
At the end of this step, there should be 34 UTF-8 encoded files created on disk ending with .UTF8.txt and corresponding to the 34 tables in the sample database.
Data: compression
When transferring large amounts of data to Azure, or when working with networks that are limited in bandwidth or reliability, compression can cut down migration times. Exported files from data sources with text content tend to compress well, resulting in significant reductions in file size and transfer time. Delimited files compressed with the gzip compression format can be imported using PolyBase (DATA_COMPRESSION = ‘org.apache.hadoop.io.compress.GzipCodec’) into Azure SQL Data Warehouse. This way, you don’t need to decompress the files on Azure.
Tip: Note that Polybase supports gzip which is different from the popular Zip format. Choosing an unsupported compression format can result in import failures.
Tip: Create one compressed file for each export file. For easy import logic, avoid putting exported files of multiple tables in the same compressed archive file.
Tip: Split large files—larger than 2 GB—before compression. Each compressed file then has to spend a smaller amount of time on the network. It has a greater chance of getting across without interruption.
A popular tool that supports gzip compression is the 7-Zip compression utility. You can also compress files to the gzip format programmatically. In .NET, support for gzip compression is provided through the GZipStream class in the System.IO.Compression namespace.
Example
Sample code in C# that illustrates how to compress all files in a folder to the gzip format follows:
public static void Compress(string sourceFolderPath, string destnFolderPath)
{
string compressedFileName = null;
string compressedFilePath = null;
DirectoryInfo dirInfo = new DirectoryInfo(sourceFolderPath);
foreach (FileInfo fileToCompress in dirInfo.GetFiles())
{
using (FileStream originalFileStream = fileToCompress.OpenRead())
{
if ((File.GetAttributes(fileToCompress.FullName) &
FileAttributes.Hidden) != FileAttributes.Hidden &
fileToCompress.Extension != ".gz")
{
compressedFileName = Path.GetFileNameWithoutExtension(fileToCompress.FullName) + ".gz";
compressedFilePath = Path.Combine(destnFolderPath, compressedFileName);
using (FileStream compressedFileStream = File.Create(compressedFilePath))
{
using (GZipStream compressionStream = new GZipStream(compressedFileStream, CompressionMode.Compress))
{
originalFileStream.CopyTo(compressionStream);
}
}
}
}
}
}
You can also use 7-Zip or any other compatible compression utility for this purpose.
After completing this step, you can see that the exported files are about 116 MB in size. The compressed files are about 16.5 MB in size—about seven times smaller.
The sample code shown above stores the compressed files with an extension of .gz. For example, dbo.AdventureWorksDWBuildVersion.UTF8.gz.
At the end of this step, there should be 34 compressed files created on disk with the .gz extension, corresponding to the 34 tables in the sample database.
Data: transfer to Azure
Improving data transfer rates is a common problem to solve. Using compression and establishing a dedicated network to Azure using ExpressRoute have already been mentioned.
Other good approaches are to do data copies concurrently, execute the copy asynchronously, maintain a log of completed operations and errors, and build in the ability to resume failed transfers. The AzCopy tool is optimized for large-scale copy scenarios and includes these techniques and many other options. The key features of interest are below:
Concurrency: AzCopy starts concurrent operations at eight times the number of processor cores you have. The /NC option allows you to change the concurrency level.
Resuming: AzCopy uses a journal file to resume the incomplete operation. You can specify a custom journal location with the /Z option.
Logging: AzCopy generates a log file by default. You can provide a custom location with the /V option.
Restarting from the point of failure: AzCopy builds in a restartable mode that allows restart from the point of interruption.
Configurable Source and Destination Type: AzCopy can be used to copy from on-premises to an Azure storage account or from one Azure storage account to another.
Tip: Run one AzCopy instance on one machine. Control the concurrency using the /NC option instead of launching more instances.
Tip: A large number of concurrent operations in a low-bandwidth environment may overwhelm the network connection. Limit concurrent operations based on actual available network bandwidth.
Please read the AzCopy documentation to understand the utility and its parameters.
Tip: For some source locations, the network connectivity may be poor, establishing ExpressRoute connectivity may not be possible, and the size of the data to transfer may be large. In such cases, if data transfers become infeasible to implement—even with compression and AzCopy—explore the Azure Import/Export service. You can transfer data to Azure Blob storage using physical hard drives with this service.
Example
You can execute AzCopy after installation from a command prompt using the following syntax:
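The command is not reproduced in this copy of the document; based on the placeholders described below, it would look roughly like the following sketch (AzCopy 5.x command-line syntax):
"<YourAzCopyPath>" /Source:<YourLocalPathToCompressedFiles> /Dest:https://<YourStorageAccount>.blob.core.windows.net/<YourStorageContainer> /DestKey:<YourStorageAccountKey> /NC:<YourConcurrencyLevel>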
Note the following with respect to the placeholders in the above command:
<YourAzCopyPath>: Provide the AzCopy install path, such as C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe, or modify your PATH variable if you want to avoid specifying the full path.
<YourLocalPathToCompressedFiles>: Provide the path to the folder containing the gzip files.
<YourStorageAccount>, <YourStorageContainer>, <YourStorageAccountKey>: Provide these based on the storage account created in the Azure Preparation step.
/NC: <YourConcurrencyLevel>: Set the value to be the number of cores on the source machine on which AzCopy is executed.
If the parameters are supplied correctly, AzCopy will start copying the files and report running progress on the number of files copied and the transfer rate as follows (your transfer rate can be different):
AzCopy maintains a log file and journal file at %LocalAppData%\Microsoft\Azure\AzCopy.
If the journal file does exist, AzCopy will check whether the command line that you input matches the command line in the journal file. If the two command lines match, AzCopy resumes the incomplete operation. If the two command lines don’t match, you’ll be prompted to overwrite the journal file to start a new operation, or to cancel the current operation with a message like the one below:
Incomplete operation with same command line detected at the journal directory "<YourAzCopyLocation>", do you want to resume the operation? Choose Yes to resume, choose No to overwrite the journal to start a new operation. (Yes/No)
At the end of this step, the 34 compressed files should have been transferred to your Azure storage account. You can use the Azure Storage Explorer GUI tool to check that the files are available in the storage account.
Data Movement Library: Use the Azure Storage Data Movement Library that is based on the core data movement framework that powers AzCopy. Here is sample code in C#.NET that demonstrates how to upload a blob to Azure using the Data Movement Library.
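The sample code itself is not included in this copy of the document. A minimal sketch of such an upload follows; it assumes the Microsoft.Azure.Storage.DataMovement NuGet package (and its blob storage dependency), and the account, key, container, blob, and file names are placeholders:
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;               // CloudStorageAccount
using Microsoft.Azure.Storage.Blob;          // CloudBlobContainer, CloudBlockBlob
using Microsoft.Azure.Storage.DataMovement;  // TransferManager

class BlobUploadSample
{
    static async Task Main()
    {
        // Placeholder connection string; replace with your storage account details
        CloudStorageAccount account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<YourStorageAccount>;AccountKey=<YourStorageAccountKey>");
        CloudBlobContainer container = account.CreateCloudBlobClient()
            .GetContainerReference("<YourStorageContainer>");
        CloudBlockBlob destBlob = container.GetBlockBlobReference("dbo.AdventureWorksDWBuildVersion.UTF8.gz");

        // The Data Movement library parallelizes the upload internally
        await TransferManager.UploadAsync(@"C:\Temp\dbo.AdventureWorksDWBuildVersion.UTF8.gz", destBlob);
        Console.WriteLine("Upload complete.");
    }
}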
Data: import
PolyBase is the fastest mechanism to import data into Azure SQL Data Warehouse. PolyBase parallelizes loads from Azure Blob storage, reads all the files inside a folder and treats them as one table, and supports the gzip compression format and UTF-8 encoding, with Azure Blob storage as the storage mechanism. Loading data with PolyBase allows the import to scale in speed in proportion to the data warehouse units (DWUs) allocated to the Azure SQL Data Warehouse. For a more detailed discussion of data loading strategies and best practices, refer to the Azure CAT guidance on Azure SQL Data Warehouse loading patterns and strategies.
Tip: Choices in the overall migration process contribute to fast loading with Polybase. These are:
Creating folders for each table.
Creating multiple files for each large table.
Converting the exported data to UTF-8.
Creating multiple compressed files.
Compressing the data to the gzip format.
Copying the compressed data to Azure Blob storage.
Doing the PolyBase preparation steps (including creating external tables).
Doing final loading from external tables with Polybase queries.
If you’ve been following the running example, you’ve practiced most of these techniques already!
Tip: The DWUs allocated for the target Azure SQL Data Warehouse make a difference to the load speed. For more information, refer to the “Data Reader, Writers consideration” section in the Azure CAT guidance on Azure SQL Data Warehouse loading patterns and strategies.
Tip: Depending on your specific scenario, there could be a performance advantage in one of two possible techniques, both using Polybase:
Transferring and importing from uncompressed files (slower export and transfer, faster load)
Transferring and importing compressed files (faster export and transfer, slower load)
What matters is the overall process performance. Depending on the network speed and the data sizes, a few tests with both techniques may help determine which works best in your context.
There are two ways to import data with Polybase from blobs in Blob storage:
CREATE TABLE AS: This option creates the target table and loads it in one statement. Use this for first-time loading (see the sketch after this list).
INSERT INTO… SELECT * FROM: This option loads data into an existing target table. Use this for subsequent loads.
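For reference, a minimal CTAS sketch for the first option follows, assuming the external table created earlier in this walkthrough; the target table name and distribution choice are illustrative:
CREATE TABLE dbo.AdventureWorksDWBuildVersion_CTAS
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.AdventureWorksDWBuildVersion_External;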
Example
Before you can execute the load queries, you need to execute external table creation queries that were created in the “Metadata export” step. Since the files referred to by the External Table creation queries have been transferred to Azure Blob Storage, the external table locations are valid. Those queries can be executed at this time. Ensure that the table creation and external table creation steps are successful before attempting to import data.
In our example, we use the INSERT INTO … SELECT * FROM method to import data into Azure SQL Data Warehouse for easy illustration so you can run it multiple times. This requires you to generate an INSERT INTO … SELECT * FROM query for each table in the sample database.
A sample query is as follows:
INSERT INTO dbo.AdventureWorksDWBuildVersion
SELECT * FROM dbo.AdventureWorksDWBuildVersion_External
A full list of sample INSERT…SELECT queries can be found here .
During import, if you receive errors, correct the root cause using the error messages. In the “Data: compatibility changes” section, we mentioned the causes of the most common errors. Note that formats incompatible with PolyBase will be rejected, for example UTF-16 encoding, Zip compression, and JSON format. PolyBase supports:
Encoding: UTF-8
Format: delimited text files, Hadoop file formats RC File, ORC, and Parquet
Compression: gzip, zlib, and Snappy compressed files
Once the import is successful, check the row counts of the source database tables against the row counts of the corresponding Azure SQL Data Warehouse tables.
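For example, a quick check for a single table is a simple count, run once against the source database and once against the Azure SQL Data Warehouse database (a sketch; repeat or script it for every table):
SELECT COUNT(*) AS NumberOfRows FROM dbo.AdventureWorksDWBuildVersion;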
This completes our example.
Data transformation and optimization
Once you have successfully migrated your data into Azure SQL Data Warehouse, the next immediate step is to create statistics on your newly loaded data using the CREATE STATISTICS statement on all columns of all tables.
If you plan to query data using external tables, you need to create statistics on external tables also. After this, you may want to do transformations on the data prior to executing query workloads.
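For example, a statistics creation statement for one column of one of the sample tables might look like the following sketch; the statistics name is illustrative, and you would repeat this (or generate the statements from the catalog views) for all columns of all tables:
CREATE STATISTICS stat_DimProduct_ProductKey ON dbo.DimProduct (ProductKey);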
Tip: Distinguish between conversions done before load for compatibility (such as date format conversion and UTF-8 encoding) and data transformations that can be done on Azure SQL Data Warehouse after loading is complete. These transformations are better done on Azure SQL Data Warehouse instead of on the source, exploiting the full processing power and scale of Azure SQL Data Warehouse. An Extract Load Transform (ELT) pattern rather than an Extract Transform Load (ETL) pattern may work better for you.
Reviewed by: Dimitri Furman, Kun Cheng, Denzil Ribeiro
Database Instant File Initialization helps improve the performance of certain file operations. Prior to SQL Server 2016, enabling instant file initialization was cumbersome (editing the Local Security Policy to add the SQL Server service account to the Perform Volume Maintenance Tasks policy, followed by restarting the SQL Server instance), so some administrators missed out on this performance improvement.
If you want to enable instant file initialization, SQL Server 2016 makes life simpler for DBAs and System Administrators by providing a simple checkbox during the install of SQL Server, as shown in Figure 1.
Figure 1: Option to enable instant file initialization while installing SQL Server 2016
The checkbox “Grant Perform Volume Maintenance Task privilege to SQL Server Database Engine Service” is unchecked by default. To enable instant file initialization, all you need to do is check that box. No need to edit the security policies through the Local Security Policy application any more.
Notably, setup grants the privilege to the per-service SID for the SQL Server instance, e.g. to the NT SERVICE\MSSQL$SQL2016 security principal, for an instance named SQL2016. This is preferable to granting the privilege to the SQL Server engine service account, which is still sometimes done by administrators. The service account is subject to change, and if changed, SQL Server could unexpectedly lose the IFI privilege. But the per-service SID remains the same for the lifetime of the instance, which avoids this risk.
To emphasize the impact of instant file initialization, I installed a SQL Server 2014 instance on a server and restored a database (of size 190 GB). By default, this SQL Server instance doesn’t have instant file initialization enabled. I then installed a SQL Server 2016 instance on the same server (and checked the above-mentioned checkbox during the install), and restored the same database backup. The results are in Figure 2.
Figure 2: Improved restore time with instant file initialization in SQL Server 2016.
How to know if instant file initialization was used while restoring your database? Use the simple techniques described here: https://blogs.msdn.microsoft.com/sql_pfe_blog/2009/12/22/how-and-why-to-enable-instant-file-initialization/, or (another improvement in SQL Server 2016) check the server error log. If IFI is enabled, the following message is logged during server startup: “Database Instant File Initialization: enabled. For security and performance considerations see the topic ‘Database Instant File Initialization’ in SQL Server Books Online. This is an informational message only. No user action is required.”
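For example, one quick way to search the current error log for that message is the xp_readerrorlog extended stored procedure (a sketch; the first parameter is the log file number and the second selects the SQL Server error log):
EXEC sys.xp_readerrorlog 0, 1, N'Instant File Initialization';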
Instant file initialization not only helps improve restore performance, but helps other operations as well, such as creating a database or adding new files to an existing database, extending a file or autogrow operations.
In the old days of Azure SQL Database (prior to V12), SQL Database used what is called a gateway to proxy all connections and communications between clients and user databases. With V12, the gateway is still there, but it helps to establish the initial connection, and then gets out of the way in some cases. In the cases where direct connection can be established, subsequent communication happens directly between client and user database without going through the gateway anymore. This feature is also known as client “redirection”. The benefit of this “redirection” is faster response time for each database call, and better performance.
So how do you know if your application is taking advantage of the “redirection”?
The first restriction is that “redirection” by default is only supported for connections originating within the Azure IP address space, so your application and Azure SQL database must both be deployed in Azure. However, an application outside Azure can also use “redirection” when a server connection policy is explicitly created with connectionType set to “Redirect” against the target Azure SQL Database server. Keep in mind, though, that the latency/performance benefit of redirection is much diminished in the latter scenario, since internet connection latency from outside the Azure data center would be much higher.
Second, your application must be using a SQL Server driver that supports TDS 7.4. Those drivers include (not a comprehensive list):
ADO.Net 4.5 or above
Microsoft SQL Server JDBC 4.2 or above (JDBC 4.0 actually supports TDS 7.4 but does not implement “redirection”)
Microsoft SQL Server ODBC 11 or above
Note: Tedious for Node.js and JDBC 4.0 don’t implement redirection.
A simple way to find out what version of TDS the application is using is by querying:
SELECT session_id, protocol_type, protocol_version = SUBSTRING(CAST(protocol_version AS BINARY(4)),1,1)
FROM sys.dm_exec_connections
Sample output:
session_id protocol_type protocol_version
89 TSQL 0x74
105 TSQL 0x74
If protocol_version is equal to or greater than 0x74 then the connection would support “redirection.”
Third, as documented here, even applications using the right SQL Server drivers aren’t guaranteed to make successful connections via “redirection”. You also need to make sure the following ranges of outbound TCP ports (in addition to 1433) are open on the application instance: 11000-11999, 14000-14999. This is the reason why “redirection” is not enabled by default for connections originating outside of Azure – in some on-premises environments, network administrators may be unwilling to open these additional outbound port ranges, causing connection attempts to fail.
Use Wireshark to look deeper how redirection works
Now let’s use Wireshark (a network tracing tool) to examine the network traffic of a sample application running on an Azure VM that connects to an Azure SQL database, so we can see how it works. (If your application is deployed in a VM or cloud service, you can RDP into your app instance and install 3rd-party tools like Wireshark. Azure App Service doesn’t allow RDP.)
Sample application connection step through:
Open a new connection to an Azure SQL database
Execute command to run Ad-hoc query 1
Execute command to run Ad-hoc query 2
In step #1, when new connection is being established, we can see in Wireshark the TCP connection handshake pre-login as shown below (starting at time 2.702112). 10.5.0.4 is local VM IP address where the application is running. 191.235.193.75 is the gateway IP address, used for inbound traffic on default port 1433.
To finish establishing the connection, a dynamically identified port, in this case 11142, was sent to the application (time 2.790811). The application used that port and connected to the target user database (time 2.791394), with the IP address 191.235.193.77. The application then executed the first command (time 2.792376+).
Let’s proceed with executing the 2nd Ad-hoc query command. Remember that the connection is still open at this point, so when the application sends the command, it doesn’t need to go through the gateway (191.235.193.75) anymore. Instead it uses the “redirection” to communicate with the user database (191.235.193.77) directly (time 8.891064+).
Reviewed by: Kun Cheng, Sanjay Mishra, Denzil Ribeiro, Arvind Shyamsundar, Mike Weiner, and Murshed Zaman
The Problem: A Production Outage
A customer using Azure SQL Database recently brought an interesting problem to our attention. Unexpectedly, their production workload started failing with the following error message: “The database ‘ProdDb’ has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.” The database was in a Premium elastic pool, where the documented maximum size limit for each database is 500 GB. But when they checked the size of the database shown in the Azure Portal, it was only 10 GB, and the portal was showing that all available database space has been used. Naturally, they were wondering why the database was out of space even when they were not near the maximum database size limit for their premium elastic pool.
Explanation
One of the established capacity limits of each Azure SQL DB database is its size. The maximum size limit is determined by the service objective (a.k.a. performance tier, or service tier) of the database, as documented in resource limit documentation. To determine the size limit, or size quota, that is set for a particular database, the following statement can be used, in the context of the target database:
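The statement is not included in this copy of the document; a minimal sketch, using the documented DATABASEPROPERTYEX function (the column alias is illustrative), follows:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'MaxSizeInBytes') AS MaxSizeInBytes;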
When a new database is created, by default its size quota is set to the maximum allowed for the service objective. However, it is possible to set the limit to a lower value, either when creating the database, or later. For example, the following statement limits the size of an existing database named DB1 to 1 GB:
ALTER DATABASE DB1 MODIFY (MAXSIZE = 1 GB);
Customers can use this ability to allow scaling down to a lower service objective, when otherwise scaling down wouldn’t be possible because the database is too large.
While this capability is useful for some customers, the fact that the actual size quota for the database may be different from the maximum size quota for the selected service objective can be unexpected, particularly for customers who are used to working with the traditional SQL Server, where there is no explicit size quota at the database level. Exceeding the unexpectedly low database size quota will prevent new space allocations within the database, which can be a serious problem for many types of applications.
In this context, there is one particular scenario that we would like to call out. Specifically, when a database is scaled up to a higher service objective, its size quota, whether the default for the previous service objective, or an explicitly lowered one, remains unchanged. For an administrator expecting the maximum size quota for the new service objective to be in effect after the scaling operation completes, this may be an unpleasant surprise.
Let’s walk through an example. First, let’s create an S2 database without specifying an explicit database size quota:
CREATE DATABASE DB1 (SERVICE_OBJECTIVE = 'S2');
Once the database is created, we can query its current size quota, and see that it is set to the expected maximum for S2:
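The query and its output are not included in this copy of the document. A sketch of what this step would look like follows, assuming the same DATABASEPROPERTYEX query as above; for an S2 database, the value returned corresponds to the 250 GB Standard-tier maximum. The next sentence implies that the quota was then explicitly lowered to 10 GB, for example:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'MaxSizeInBytes') AS MaxSizeInBytes; -- 250 GB for S2
ALTER DATABASE DB1 MODIFY (MAXSIZE = 10 GB);
SELECT DATABASEPROPERTYEX(DB_NAME(), 'MaxSizeInBytes') AS MaxSizeInBytes; -- now 10 GB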
We see that the quota has been lowered to 10 GB as expected. Now, let’s scale the database up to P1:
ALTER DATABASE DB1 MODIFY (SERVICE_OBJECTIVE = 'P1');
Note that scaling operations are asynchronous, so the ALTER DATABASE command will complete quickly, while the actual change can take much longer. To determine if the scaling operation on the DB1 database has completed, query the sys.dm_operation_status DMV in the context of the master database.
SELECT operation, state_desc, percent_complete, start_time, last_modify_time
FROM sys.dm_operation_status
WHERE resource_type_desc = 'Database'
AND
major_resource_id = 'DB1'
ORDER BY start_time;
/*
operation state_desc percent_complete start_time last_modify_time
--------------- ---------- ---------------- ----------------------- -----------------------
CREATE DATABASE COMPLETED 100 2016-09-02 15:11:28.243 2016-09-02 15:12:09.933
ALTER DATABASE COMPLETED 100 2016-09-02 15:16:49.807 2016-09-02 15:16:50.700
ALTER DATABASE COMPLETED 100 2016-09-02 15:23:26.623 2016-09-02 15:25:24.837
*/
This shows all recent operations for the DB1 database. We see that the last ALTER DATABASE command has completed. Now we can query the size quota again (in the context of the DB1 database):
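Again, a sketch assuming the same query as above:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'MaxSizeInBytes') AS MaxSizeInBytes; -- still 10 GB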
We see that even though the maximum size limit for a P1 database is 500 GB, the quota is still set to 10 GB.
Conclusion
It is important to know that in Azure SQL DB databases, an explicit database size quota always exists. This quota can be lower than the maximum (and default) quota for a given service objective. While for some customers this may be intentional, most would prefer the maximum quota to be in effect, particularly after scaling the database up.
We recommend that customers:
1. Proactively check the current size quota for your databases, to make sure it is set as expected. To do this, use the DATABASEPROPERTYEX statement shown earlier, in the context of the target database.
2. When scaling up to the service objective with a larger maximum size quota, explicitly change the quota to match the maximum by using the ALTER DATABASE … MODIFY (MAXSIZE = …) command as shown above (unless a lower quota is desired to guarantee being able to scale down in the future). The change is executed in an online manner.
This is what the customer we mentioned in the beginning of this article did in order to resolve their application outage, and to proactively prevent a reoccurrence of the same problem.
Reviewed by: Dimitri Furman, Jakub Szymaszek, Sanjay Mishra, Kun Cheng, Mike Ruthruff
Background
A common scenario today involves migrating a web application (based on IIS) and its on-premises SQL Server database to either Azure SQL DB or Azure SQL VM. One of the top concerns for customers embarking on such projects is data security and privacy. The good news is that, for data stored in the relational database, the Always Encrypted feature in Azure SQL Database (and SQL Server) offers a unique end-to-end way to protect sensitive data from hostile or accidental disclosure.
For the purposes of this post, it is assumed that you have some familiarity with how Always Encrypted works. If you are new to this subject, please first read more about the feature at the Always Encrypted page. If you are interested in security as it applies to Azure SQL Database in general, this page is a great place to start as it has links to other key features such as Auditing, Threat Detection etc.
Scenario
Azure App Service is the cloud platform for web applications in Azure. This is a Platform as a Service (PaaS) service, so when using a feature like Always Encrypted in SQL Server some considerations arise from an encryption key management perspective. As a quick reminder, Always Encrypted uses 2 keys:
Column Encryption Key (CEK), which is resident in an encrypted format within the database itself
Column Master Key (CMK), which is only present on authorized computers
The CMK is used by the application to decrypt the encrypted CEK received from the SQL instance. The decrypted CEK is in turn used to decrypt and encrypt actual data. This is the unique value proposition of Always Encrypted: the SQL instance never has access to the plaintext data. While there are options like Azure Key Vault or Hardware Security Modules (HSMs), quite commonly, the CMK is actually a certificate containing a private key.
For on-premises or VM based deployments of an application, it is fairly easy to manage the deployment of such a certificate which contains the CMK. However, in Azure App Service, which is a PaaS service, some simple steps are required to get the web application to ‘find’ the certificate containing the CMK.
Detailed Steps
To get the Azure web application to work correctly with Always Encrypted, here are the steps you need to follow. Note that these steps assume that you have correctly encrypted the data in the column(s), if any, using a tool like SQL Server Management Studio or other methods like PowerShell / BulkCopy. We also assume that you have the certificate containing the CMK installed on your local machine. Finally, we assume that you are using ASP.NET and referencing .NET Framework 4.6 or higher.
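As a reminder of what the application itself needs (separate from the certificate steps below), the client must opt in to Always Encrypted on its connections. A minimal sketch with placeholder connection details, assuming System.Data.SqlClient from .NET Framework 4.6, follows:
using System.Data.SqlClient;

// Enable Always Encrypted for this connection; the driver then transparently
// encrypts/decrypts parameters and results for encrypted columns.
var csb = new SqlConnectionStringBuilder
{
    DataSource = "<YourServer>.database.windows.net",
    InitialCatalog = "<YourDatabase>",
    UserID = "<YourUser>",
    Password = "<YourPassword>",
    ColumnEncryptionSetting = SqlConnectionColumnEncryptionSetting.Enabled
};

using (var conn = new SqlConnection(csb.ConnectionString))
{
    conn.Open();
    // Use parameterized commands against the encrypted columns as usual.
}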
Locate the certificate using MMC
To begin, locate and export the certificate corresponding to your CMK. To do this, you need to use the Certificates MMC add-in. An important assumption here is that the certificate is stored in the ‘Current User’ store. This is because Azure App Service does not expose the equivalent of the ‘Local Machine’ store for web applications, and the default target certificate container is recorded within the certificate when it is exported.
Export the certificate
Next, we can proceed to export the certificate as a PFX file. To do this, right click on the correct certificate (located as per above steps) and click on Export.
Protect it with a secure password, and save it as a .PFX file on your local computer.
Upload and use the certificate
We can now upload this certificate (in the form of a .PFX file) to Azure. A pre-requisite for doing this is that your Web App must be in the Basic or higher App Service plan / tier (Free and Shared tiers do not permit the upload of certificates.)
Azure Portal
The easiest way is to use the current Azure Portal and navigate to the Web App under your App Service. Once you locate it, in the Application Settings for the Web App, you will find an option to define ‘SSL Certificates’ for the application. Here is where you can use the Upload Certificate button as shown below to upload the PFX file that we generated previously. Do note that you will have to supply the password used to protect the certificate:
Once this is done, you also need to add a ‘WEBSITE_LOAD_CERTIFICATES’ setting with the thumbprint of the certificate that you noted previously. This setting is discussed in detail here.
FYI, you can also do this in the ‘classic’ Azure portal as described here. Once in the Azure portal, select the web application and click on the Configure tab. There, you will find the Certificates section, where you can upload the PFX file which we just generated. Here too you have to supply the password which was used to protect the certificate:
Once the certificate has been uploaded, note the ‘Thumbprint’ for the same. This is the key identifier for the web application to later ‘load’ the certificate at runtime. To make the web application ‘load the certificate’ you must scroll down to the ‘app settings’ section and add a ‘WEBSITE_LOAD_CERTIFICATES’ setting with the thumbprint of the certificate. Make sure you click the SAVE button at the bottom of the screen after you do these changes – it’s easy to miss it otherwise!
Once this is done, the web application will be able to load the certificate when the Always Encrypted client driver code internally requests it. That’s it – your web application and data are a lot more secure now!
Handling CMK Rotation
If you already were using Always Encrypted, you probably know that rotating CMKs periodically is a common requirement. The process of CMK rotation is documented here. For example, if you do rotate your keys using SQL Management Studio (SSMS), you must ensure that the certificate corresponding to the new CMK is uploaded to the Azure portal as described above. The overall process would look like this:
From an administrative workstation where SSMS is installed, create a new CMK stored in the local user certificate store
Export the certificate corresponding to that CMK just as described in this article
Follow the steps as shown above to import the new certificate into the portal, and add the new certificate’s thumbprint ID as well to the WEBSITE_LOAD_CERTIFICATES setting
Note: use a comma character to separate the old and new thumbprint values. Do not leave any spaces in between
At this point you would have both certificates uploaded to the Azure Portal and Azure Web App Service
From the administrative workstation use SSMS to perform CMK key rotation; you can also use PowerShell cmdlets to do this
Eventually after the key rotation has completed, use SSMS to perform cleanup of the CEKs associated with the old CMK
Drop the old CMK from SQL DB – you can use T-SQL or the SSMS GUI to easily do this. You can also do this via the PowerShell cmdlet for Always Encrypted – specifically Remove-SqlColumnMasterKey
Using the current Azure portal, delete the certificate containing the old CMK from the Web App
Again in the Azure portal, navigate to App Settings and remove the old CMK’s thumbprint from the WEBSITE_LOAD_CERTIFICATES setting. Ensure that you remove the comma character as well!
In Closing
Always Encrypted is a unique feature which offers declarative encryption with little or no change to applications. Knowing how this feature operates in conjunction with other services, such as Azure Web Apps is very important to successful implementation. We hope you find the above steps useful. Do let us know if you have further questions and / or feedback. You can also reach us on Twitter if you prefer that!
A question that is frequently asked by customers using Azure SQL Database is “How can I determine the size of my database programmatically?” Interestingly, different people may be talking about different things when asking this question. Is it the size of all database files on disk? Is it the size of just the data files? Is it the size of used space in the database? Is it the total size of allocated and empty space in the database? Depending on the context, all these things may be the right answer to the question.
Today, if you do a web search on this topic, the most frequent answer to this question will point you to querying the sys.dm_db_partition_stats DMV, and looking at the reserved_page_count column. Other solutions involve querying sys.allocation_units and sys.resource_stats DMVs, or using sp_spaceused stored procedure.
In the context of Azure SQL Database, the measurement that most customers would be interested in is the size used by the Azure SQL Database service to govern the size of the database, i.e. the 161.29 GB that is shown in Azure Portal in this example:
However, none of the methods mentioned earlier will accurately provide that measurement for V12 databases. sys.dm_db_partition_stats and sys.allocation_units may underestimate the size, because they do not consider empty space in data files. sys.resource_stats averages database size over five minute intervals, and therefore does not consider the most recent changes in space usage. sp_spaceused returns several size values, however the size of data files on disk, which is used by the service, is not one of them.
For V12 databases, the measurement we are interested in is determined using the size column in the sys.database_files catalog view, which returns the number of 8 KB pages in a data file. Only ROWS files are considered; log and XTP files are excluded for the purposes of determining database size.
The following statement is an example of the correct way to determine the size of an Azure SQL Database V12 database programmatically:
SELECT SUM(CAST(size AS bigint) * 8192.) AS DatabaseSizeInBytes,
       SUM(CAST(size AS bigint) * 8192.) / 1024 / 1024 AS DatabaseSizeInMB,
       SUM(CAST(size AS bigint) * 8192.) / 1024 / 1024 / 1024 AS DatabaseSizeInGB
FROM sys.database_files
WHERE type_desc = 'ROWS';
Hi all, we are looking forward, as we are sure you are, to a great Microsoft Ignite 2016! Three members of the SQLCAT team will be in Atlanta (September 26th-30th) and of course we would love to see everyone and talk about SQL Server, Azure SQL DB, and SQL DW.
We have 3 sessions where we will share some great customer experiences and learnings with SQL Server 2016. Mike Weiner (@MikeW_SQLCAT) will co-present with early adopters from Attom Data Solutions and ChannelAdvisor:
BRK2231 Understand how ChannelAdvisor is using SQL Server 2016 to improve their business (Wednesday, September 28th from 2-2:45PM EST)
BRK3223 Understand how Attom Data Solutions is using SQL Server 2016 to accelerate their business (Thursday, September 29th from 9-9:45AM EST)
Then, be sure to keep your Friday open for Arvind Shyamsundar (@arvisam) and Denzil Ribeiro’s (@DenzilRibeiro) presentation:
BRK3094 Accelerate SQL Server 2016 to the max: lessons learned from customer engagements (Friday, September 30th from 12:30-1:45PM EST)
When we are not presenting we’ll primarily be at the Expo in Hall B at the Microsoft Showcase – Data Platform and IoT, eager to talk to you! Look forward to seeing everyone there!
Reviewed by: Dimitri Furman, Denzil Ribeiro, Mike Ruthruff, Mike Weiner, Ryan Stonecipher, Nitish Upreti, Panagiotis Antonopoulos, Mirek Sztajno
Last year, in the SQLCAT lab we were working with an early adopter customer running pre-release SQL Server 2016 on Windows Server 2016 Tech Preview 4. The workload being used was the ASP.NET session state workload: the session state data was stored in a non-durable memory-optimized table and for best performance, we used natively compiled stored procedures.
In the lab tests, we were easily able to exceed the performance of the same workload running on SQL Server 2014 with comparable hardware. However, at some point, we found that CPU was our sole bottleneck, with all the available CPUs maxed out at 100% usage.
Now, on one hand we were glad that we were able to maximize the CPU usage on this system – this was because in SQL Server 2016, we were able to natively compile all the stored procedures (which was not possible earlier in SQL Server 2014, due to some T-SQL syntax not being supported in natively compiled procedures.)
We started our investigation by using the regular tools such as DMVs, performance counters and execution plans:
From performance counter data, we could see that the SQL Server process (SQLServr.exe) was taking up the bulk of CPU time. The OS and device driver CPU usage (as measured by the %Privileged Time counter) was minimal.
From DMVs, we identified the top queries which were consuming CPU time and looked at ways to optimize the indexes on the memory-optimized tables and T-SQL code; those were found to be optimal.
There was no other evident bottleneck, at least from a user-controllable perspective.
At this point it was clear we needed to ‘go deeper’ and look at the lower-level elements within the SQL process contributing to CPU usage on this box to see if we could optimize the workload further.
Going Deeper
Since this was a high CPU problem and we wanted to get right to the ‘guts’ of the issue, we used the Windows Performance Toolkit, a great set of tools provided as part of the Windows SDK. Specifically, we used the command line tool called XPERF to capture a kernel level trace of the activity on the system, with a very basic and relatively lightweight capture initiated by the following command line:
xperf -On Base
This command ‘initializes’ a kernel logger which starts capturing critical traces. To stop the tracing (and we did this within a minute of starting, because these traces get very large very quickly!) we issued the following command:
xperf -d c:\temp\highcpu.etl
For the purposes of this post, we reproduced the ASP.NET session state workload with SQL Server 2016 RTM (the final released version) in the lab, and collected the XPERF trace as our ‘baseline’. Later in this post, we will compare how things look with SQL Server 2016 Cumulative Update 2.
Analyzing the Trace
The output produced by the above command is an ETL file which can then be analyzed using the Windows Performance Analyzer (WPA) tool. WPA is typically run on another system; in our case it was a laptop with access to the Internet. Internet access is important because WPA can use ‘symbol files’ which contain additional debugging information, specifically the DLL and function names associated with the machine code found to be running on the CPU during the trace collection. Once we have function names, it is generally much easier to understand the reason for the high CPU usage.
When we used WPA to analyze the ETL trace collected for the workload running on the RTM installation of SQL Server 2016, we obtained the following result.
The important thing to notice is the function which starts with the words Spinlock<61,16,1>. Notice that the %Weight column for that row is 31.81% which is a sizeable number. To break this down further, we must query the sys.dm_xe_map_values view:
SELECT *
FROM sys.dm_xe_map_values
WHERE map_key = 61
AND name = 'spinlock_types';
This query maps the ‘magic number’ 61, which we obtained from the XPerf trace output, to the associated spinlock. The name of the spinlock with key 61 is CMED_HASH_SET:
For more information on how to troubleshoot spinlocks and why they can on occasion cause high CPU usage, please refer to this guidance paper which is still very relevant even though it was written in the SQL Server 2008 timeframe. One change to note for sure is that the whitepaper refers to the older ‘asynchronous_bucketizer’ targets for identifying the call stack for the spinlocks with the most contention. In later versions of SQL Server, we need to use the ‘histogram’ target instead. For example, we used this script to create the extended event session:
CREATE EVENT SESSION XESpins
ON SERVER
ADD EVENT sqlos.spinlock_backoff
(ACTION
(package0.callstack)
WHERE type = 61
)
ADD TARGET package0.histogram
(SET source_type = 1, source = N'package0.callstack')
WITH
( MAX_MEMORY = 32768KB,
EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY = 5 SECONDS,
MAX_EVENT_SIZE = 0KB,
MEMORY_PARTITION_MODE = PER_CPU,
TRACK_CAUSALITY = OFF,
STARTUP_STATE = OFF
);
We use the extended event session output to validate the ‘call stack’ (the sequence of calls which led to the spinlock acquisition calls.) This information is very useful for the engineering team to understand the root cause of the spinlock contention, which in turn is useful to arrive at possible fixes.
Why is this spinlock a bottleneck?
The spinlock bottleneck described in this blog post is especially prominent in this ASP.NET session state workload for two reasons. First, everything else (natively compiled procedures, memory-optimized tables etc.) is so quick. Second, because these spinlocks are shared data structures, contention on them becomes especially pronounced with a larger number of CPUs.
This CMED_HASH_SET spinlock is used to protect a specific cache of SQL Server metadata, and the cache is looked up for each T-SQL command execution. At the high levels of concurrency (~ 640000 batch requests / second), the overhead of protecting access to this cache via spinlocks was a huge chunk (31.81%) of overall CPU usage.
The other thing to note is that the cache is normally read-intensive, and rarely updated. However, the existing spinlock does not discriminate between read operations (multiple such readers should ideally be allowed to execute concurrently) and write operations (which have to block all other readers and writers to ensure correctness.)
Fixing the issue
The developers then took a long hard look at how to make this more efficient on such large systems. As described before, since operations on the cache are read-intensive, there was a thought to leverage reader-writer primitives to optimize locking. However, any changes to this spinlock had to be validated extensively before releasing publicly as they may have a drastic impact if incorrectly implemented.
The implementation of the reader / writer version of this spinlock was an intricate effort and was done carefully to ensure that we do not accidentally affect any other functionality. We are glad to say that the final outcome, of what started as a late night investigation in the SQLCAT lab, has finally landed as an improvement which you can use! If you download and install Cumulative Update 2 for SQL Server 2016 RTM, you will observe two new spinlocks in the sys.dm_os_spinlock_stats view:
LOCK_RW_CMED_HASH_SET
LOCK_RW_SECURITY_CACHE
These are improved reader/writer versions of the original spinlocks. For example, LOCK_RW_CMED_HASH_SET is basically the replacement for CMED_HASH_SET, the spinlock which was the bottleneck in the above case.
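To check whether these new spinlocks are present on your build, you can query sys.dm_os_spinlock_stats; the following is a simple sketch:
-- Look for the new reader/writer spinlocks introduced in SQL Server 2016 CU2
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN ('LOCK_RW_CMED_HASH_SET', 'LOCK_RW_SECURITY_CACHE');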
CU2 Test Results
Putting this to the test in our lab, we recently ran the same workload with CU2 installed. For good measure, we collected a similar XPERF trace. On analyzing it, we can clearly see that the spinlock is gone from the top consumer list, and that in-memory OLTP code (as identified by the HkProc prefixes) is now at the top of the list. This is a good thing, because our workload comprises natively compiled procedures, and that is where most of the CPU time is now being spent!
Figure 3: Spinlock CPU% and Throughput in SQL Server 2016: RTM vs CU2
                                       RTM        CU2
% CPU                                  85         72
Spinlock % CPU Contribution            30         Not observed
Throughput (Batch Requests / sec)      640000     720000
As you can see, there was an almost 13% improvement in workload throughput, together with a 13 percentage point reduction in overall CPU usage! This is great news for mission critical workloads, because you can do even more with the headroom that is now available.
Conclusion
Hardware is constantly evolving to previously unthinkable levels. Terabytes of RAM and hundreds of cores on a single server are common today. While these ‘big boxes’ eliminate some classic bottlenecks, we may end up unearthing newer ones. Using the right tools (such as XPerf, WPA, XEvents in the above example) is critical to precisely identifying the bottleneck.
Finally, identifying these issues / tuning opportunities during the pre-release phase of the product lifecycle is really useful as it gives adequate time for the appropriate fixes to be thought through, implemented correctly, validated in lab tests and then finally released.
We hope you enjoyed this walkthrough! Do let us know if you have any questions or comments. We are eager to hear from you!
Reviewers: Dimitri Furman, Benjin Dubishar, Raghav Kaushik, Jakub Szymaszek
Always Encrypted is one of the highly acclaimed features in SQL Server 2016. The key value prop in Always Encrypted is that SQL Server itself cannot decrypt the data as it will not have access to the ‘Column Master Key’ (CMK). This also poses a challenge for application developers / administrators as the only way to encrypt existing data is to essentially ‘pump it out’ into an application which has access to the CMK. Typically, this application for DBAs and developers is SQL Server Management Studio (SSMS), and using SSMS is acceptable when encrypting a few columns of data under human supervision. But in an environment with a large number of tables and columns, or when the schema of the database is dynamic, or when the column encryption has to be triggered from a user application, using SSMS to do this manually is not an option.
iCertis is an early adopter of Always Encrypted in their application. The schema of the databases involved in this application is highly customizable by the end customer. Some columns in this schema might need to be encrypted using Always Encrypted and hence there is a need to (programmatically) automate the encryption from the application. This blog post shows how iCertis achieved that requirement.
Introducing the Always Encrypted PowerShell cmdlets
The July 2016 release of SSMS (and later versions) introduced a set of PowerShell cmdlets through a new ‘SqlServer’ module. This page describes the various capabilities that these cmdlets bring to the table. Of most interest to the specific scenario described above is the Set-SqlColumnEncryption cmdlet. In the post below, we will walk through the steps required to use this – first from a PowerShell session to test the capability, and then from a C# application that uses PowerShell automation to invoke the cmdlets.
As a side note it is worth knowing that the cmdlets in the ‘SqlServer’ PowerShell module can also be used for automating key setup and management (and are, in many ways, more powerful than SSMS – they expose more granular tasks, and thus can be used to achieve role separation and to develop a custom key management workflow – but that is likely a topic for a separate post!)
Encrypting data
Here is sample code which uses SMO classes to establish a connection to the database and then invokes the Always Encrypted PowerShell cmdlets to encrypt data in a column.
# Import the SqlServer module
Import-Module "SqlServer"
# Compose a connection string
$serverName = "SQLServerNetworkName\InstanceName"
$databaseName = "AETest"
$connStr = "Server=$serverName; Database=$databaseName; Integrated Security=true;"
# Connect to the database
$connection = New-Object Microsoft.SqlServer.Management.Common.ServerConnection
$connection.ConnectionString = $connStr
$connection.Connect()
# Get an instance of the SMO Database class
$server = New-Object Microsoft.SqlServer.Management.Smo.Server($connection)
$database = $server.Databases[$databaseName]
# Create a class to define the column(s) being encrypted and their CEK name.
# In this sample we are just encrypting one column.
$ces = New-SqlColumnEncryptionSettings -ColumnName "dbo.SampleTable.SampleColumn" -EncryptionType "Deterministic" -EncryptionKey "SampleCEK"
$cesarray = @()
$cesarray += $ces
# The most important step: encrypt the data
$database | Set-SqlColumnEncryption -ColumnEncryptionSettings $cesarray
Invoking the script from C#
In this section we show you how an application developer can invoke the above cmdlets from C# code. For simplicity we demonstrate how to do this with a C# console application. Once the project is opened, add a Nuget package for System.Management.Automation:
Installing Nuget package for System.Management.Automation
You can also use the following command from the Package Manager console in VS.NET:
Install-Package System.Management.Automation.dll
Once the reference to the PowerShell 3.0 library has been added, you can use code such as the sample code below to execute the PowerShell script (which has been added to the project folder as script.txt)
namespace SampleApp
{
    using System;
    using System.IO;
    using System.Management.Automation;
    using System.Management.Automation.Runspaces;

    class Program
    {
        static void Main(string[] args)
        {
            PowerShell ps = PowerShell.Create();

            // Read the PowerShell script (script.txt in the project folder) and run it in a pipeline
            using (var fs = new FileStream(@".\script.txt", FileMode.Open))
            {
                using (var sr = new StreamReader(fs))
                {
                    var cmd = new Command(sr.ReadToEnd(), true);
                    var pipeline = ps.Runspace.CreatePipeline();
                    pipeline.Commands.Add(cmd);

                    try
                    {
                        var results = pipeline.Invoke();
                    }
                    catch (RuntimeException ex)
                    {
                        // Surface any errors reported by the script (for example, an incorrect CEK name)
                        Console.WriteLine("Error executing script: exception details: " + ex.GetType().Name + "; " + ex.Message + "\r\n" + ex.StackTrace);
                    }
                }
            }
        }
    }
}
Note that the exception handling above will be useful in case the PowerShell script reports errors, for example if the CEK name is incorrect and so on. Such diagnostics are critical for production usage.
Lessons Learned
Here are some important learnings and considerations from this exercise:
Currently, the only supported way of getting the SqlServer PowerShell module is to install SSMS (as described above). This is a constraint when deploying the application into an Azure App Service or similar ‘unattended deployment’ environments. This will be addressed in due course by the SQL engineering team.
Set-SqlColumnEncryption can take a very long time for a complex database schema with large number of tables, constraints etc. It will also take time when there is a large amount of data to be encrypted. In such cases, it is better to perform these operations on a background worker thread and not on a UI thread which may be subject to a request timeout setting.
If you are using or planning to use Always Encrypted, we would love to hear from you.
Are you coming to the PASS Summit 2016 in Seattle? SQLCAT will be in full force at the PASS Summit 2016, and we will also bring along our colleagues from the broader AzureCAT team as well.
SQLCAT / AzureCAT Sessions
SQLCAT / AzureCAT sessions are unique. We bring in real customer stories and present their deployments, architectures, challenges and lessons learned. This year at the PASS Summit, we will have 9 sessions – each one filled with rich learnings from real world customer deployments.
8 customers will join us as co-speakers in various sessions to present their workloads, deployment scenarios and lessons learned.
bwin
bwin, part of GVC Holdings PLC, is one of Europe’s leading online betting brands and is synonymous with sports. Having offices situated in various locations across Europe, India and the US, bwin is a leader in a number of markets including Germany, Belgium, France, Italy and Spain. Rick Kutschera, Engineering Manager at bwin, will share how bwin has adopted SQL Server 2016 in the session “SQLCAT: Firsthand Customer Experiences Running SQL Server 2016 for their Most Business Critical Solutions”.
Datacastle is a Microsoft Gold Cloud Platform partner that specializes in protecting enterprises from mobile data loss and data breach with simplified and scalable endpoint backup, archiving and insights. Alex Laskos, VP engineering at Datacastle, will co-present in the session “SQLCAT: Azure SQL Data Warehouse Customer Stories from Early Adopters”.
Snelstart
SnelStart, based in Holland, makes line of business administrative applications for Dutch SMEs and self-employed entrepreneurs. Henry Been is a Software Architect at Snelstart and he will co-present in the session “SQLCAT: Azure SQL Data Warehouse Customer Stories from Early Adopters”. Snelstart is also a prominent user of Azure SQL Database, as described in their recent case study.
M-Files Corporation is a provider of enterprise information management (EIM) solutions that dramatically improve how businesses manage documents and other information. With flexible on-premises, cloud and hybrid deployment options, M-Files has thousands of organizations in over 100 countries using the M-Files EIM system. Antti Nivala is the founder and chief technology officer (CTO) of M-Files and he will co-present in the session “SQLCAT: Lessons Learned from Customers Adopting Azure SQL Database Elastic Pool”.
PROS
PROS is a Revenue and Profit realization company that provides customers with real-time software applications that will help drive pricing and sales effectiveness. Justin Silver is a Scientist at PROS and he will co-present in the session SQLCAT: Early Customer Experiences with SQL Server R Services.
Greenfield Advisors
Greenfield Advisors is a real estate and business consulting firm headquartered in Seattle, Washington. They are internationally recognized in the real estate appraisal profession as the leading authorities on the analysis and valuation of property impacted by environmental factors. Cliff Lipscomb is Vice Chairman and Co-Managing Director at Greenfield Advisors, and he will co-present in the session SQLCAT: Early Customer Experiences with SQL Server R Services.
ATTOM Data Solutions
ATTOM Data Solutions is a leading provider of property data – including tax, deed, mortgage, foreclosure, environmental risk, natural hazard, health hazard, neighborhood characteristics and property characteristics – for more than 150 million U.S. properties. Richard Sawicky is Chief Data Officer at ATTOM Data Solutions, and Eric Nordlander is a Principal Database Platform Architect, also with ATTOM Data Solutions. Both of them have contributed to the session SQLCAT: Early Customer Experiences with SQL Server R Services. Learn more about the ATTOM Data Solutions scenario from this case study.
SQL Clinic
Have a technical question, a troubleshooting challenge, want to have an architecture discussion, or want to find the best way to upgrade your SQL Server? SQL Clinic is the place you want to be. SQL Clinic is the hub of technical experts from SQLCAT, the Tiger team, the SQL Product Group, SQL Customer Support Services (CSS) and others. Whether your SQL Server deployment needs a facelift or open heart surgery, the experts at SQL Clinic will have the right advice for you. Find all your answers in one place!
And More …
That’s not all. SQLCAT will be involved in many more events and customer conversations during the Summit. If you have a suggestion on how we can make your experience at the PASS Summit more effective and more productive, don’t hesitate to leave a note.
Thanks, and see you all at the PASS Summit 2016 in Seattle. You are coming, right?
Reviewed By: Dimitri Furman, Murshed Zaman, Kun Cheng
If you have tried to use BULK INSERT or bcp utilities to load UTF-8 data into a table in SQL Server 2014 or in an earlier release (SQL Server 2008 or later), you have likely received the following error message:
Msg 2775, Level 16, State 13, Line 14
The code page 65001 is not supported by the server.
The requirement to support UTF-8 data for these utilities has been extensively discussed on various forums, most notably on Connect.
This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is a translation of the sentence “I can eat glass and it doesn’t hurt me” into several languages. A few lines of sample data are shown here:
(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)
-- SQL Server 2014 SP1 or earlier
CREATE DATABASE DemoUTF8_2014
GO
USE DemoUTF8_2014
GO
CREATE TABLE Newdata
(
lang VARCHAR(200),
txt NVARCHAR(1000)
)
GO
BULK INSERT Newdata
FROM 'C:\UTF8_Test\i_can_eat_glass.txt'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR='\t', CODEPAGE='65001')
GO
Msg 2775, Level 16, State 13, Line 14
The code page 65001 is not supported by the server.
-- SQL Server 2016 RTM or SQL Server 2014 SP2 or later
CREATE DATABASE DemoUTF8_2016
GO
USE DemoUTF8_2016
GO
CREATE TABLE Newdata
(
lang VARCHAR(200),
txt NVARCHAR(1000)
)
GO
BULK INSERT Newdata
FROM 'C:\UTF8_Test\i_can_eat_glass.txt'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR='\t', CODEPAGE='65001')
GO
(150 row(s) affected)
SELECT * FROM Newdata
GO
You can now use CODEPAGE=’65001′ with BULK INSERT, bcp and OPENROWSET utilities.
Note that this improvement is only scoped to input processing by bulk load utilities. Internally, SQL Server still uses the UCS-2 encoding when storing Unicode strings.
Reviewed by: Sanjay Mishra, Umachandar Jayachandran, Dimitri Furman, Jeannine Nelson-Takaki, Joe Sack, Kun Cheng, Eric Burgess
One of the most exciting features in SQL Server 2016 is R Services (in-database). This feature has been getting a lot of interest and attention, as we described in a past blog post. If you are curious to know more, the Related Viewing section at the end of this post links to some useful videos.
Background
When deploying SQL Server R Services, it is important to note that the setup components for SQL Server do not include the Microsoft R Open and Microsoft R Server components. Those ‘R Components’ (as we will refer to them later in this post) are provided as separate downloadable components. SQL Server will automatically download these when setup is executed on a computer which is connected to the Internet. But in cases where setup is done on a computer without Internet access (quite typical of many SQL Server deployments) we need to handle things differently. There is a documented process for doing this. But even with the documentation, we still had some customers with questions on the process.
Inspired by those customer engagements, this blog post walks through the process of setting up SQL Server R Services in environments without Internet access. We walk through a number of scenarios, right from the very basic scenario to the more complex ones involving unattended and ‘smart setup’.
Scenario 1: Interactive setup of SQL Server 2016 RTM
Let’s begin with the easiest ‘offline setup’ scenario for SQL Server R Services. In this scenario, setup is launched by double-clicking setup.exe from the SQL Server 2016 installation media. After accepting the license agreement, selecting the edition etc. you are prompted for the features to install. We will select just the SQL engine and R Services (in-database) for simplicity:
After accepting the license agreement for R components, if the computer were able to access the Internet, it would automatically download the requisite R components and proceed. But given that we are in an ‘offline’ scenario, setup needs additional information to proceed. This is apparent when you look at the list of steps on the left of the wizard: a new step, ‘Offline installation of Microsoft R Open and Microsoft R Server’, has appeared:
This new screen (figure 5 below) is where you are prompted to direct setup to the requisite copies of the R components. In one-off scenarios, you can look at the links on the screen, and download those files on a computer which has Internet access; then move the 2 files back to the computer where SQL setup is running (this is the computer which does not have Internet access). But to be proactive you can download these CAB files ahead of time based on the Installing R Components without Internet Access page. That webpage (figure 4) lists the URLs to the CAB files containing the correct version of the R components for each version of SQL Server 2016:
Figure 6: After selecting the path to the CAB files for the R components
Note: If the path does not contain the correct files, the ‘Next’ button on the above dialog is disabled.
That’s it! Once the installation succeeds, validate the installation as per the steps in this article. For example, we use the simple R script below to check that the installation functions correctly.
exec sp_execute_external_script @language =N'R',
@script=N'OutputDataSet<-InputDataSet',
@input_data_1 =N'select 1 as hello'
with result sets (([hello] int not null));
Scenario 2: Interactive patching with SQL Server 2016 Cumulative Update
Now that we have an RTM instance installed as described above, let’s patch it with Cumulative Update 1. CU1 also has a different version of the R components, so the process described above (downloading the correct CAB files and moving them to the ‘offline’ computer) is still applicable. Let’s see how this goes – first, let’s launch CU1 setup as below:
Figure 7: Features being patched by the SQL Server Cumulative Update
CU1 setup recognizes that we are ‘offline’ and will prompt us to provide the location of the correct version of the CAB files. We do that in the screen below:
Figure 8: Specifying the folder location for the R components
That’s it! We will then have CU1 run to completion.
Scenario 3: Interactive, ‘slip-streamed’ setup of SQL Server 2016 RTM
‘Smart setup’ in SQL Server has been around since SQL 2012, and even prior to that we had the ability to ‘slip-stream’ a hotfix / Cumulative Update / Service Pack into SQL Server.
In SQL Server 2016, to invoke smart setup on a computer without Internet access, you start SETUP.EXE from the command line and use the /UPDATESOURCE switch to specify the location of an ‘offline’ / local copy of SQL Server updates (Cumulative Update and / or Service Pack). The advantage of ‘smart setup’ is that SQL Server is patched right at the time of setup. This way, you avoid a separate restart / reboot to install CU1 / CU2 later on. Also importantly, setup executes with an updated (CU1 / CU2 in this case) version. That helps to proactively fix any issues that the setup engine itself may have (such as this issue.)
So in our case, let’s say we have downloaded a copy of the CU2 installer (SQLServer2016-KB3182270-x64.exe) to C:\InstallMedia\CU2 on our ‘offline’ computer. Then to invoke an ‘offline smart setup’ (with interactive setup screens) you start setup from an administrator command prompt:
Figure 10: Specifying the location of the CU2 version of the R Components
The rest of the steps are just as per any other normal installation. At this stage, you will get an instance of SQL Server which is already at CU2 patch level!
Scenario 4: Unattended setup of SQL Server 2016 RTM
The above 3 scenarios were installations done interactively – they had a GUI popping up and prompting for input, and finally displaying progress graphically. While that works great for doing a few installations of SQL Server, when you have to deploy SQL Server on hundreds of instances, or if you need to provision SQL Server automatically, you need to use unattended setup for SQL Server.
Unattended setup is well-known and well understood. But when deploying SQL Server R Services you do need to handle the R component dependencies which we described previously. To do this, we have two command line switches:
IACCEPTROPENLICENSETERMS – this switch is the equivalent of pressing the ‘Accept’ button in the GUI when prompted to accept the R Open licensing terms. Note that this is in addition to the IACCEPTSQLSERVERLICENSETERMS required for any unattended SQL Server setup.
MRCACHEDIRECTORY – this switch is critical, because it is the way we tell setup to look for the R components in a specific folder.
Putting together these switches with a sample command line shown below, we will install a named instance of SQL Server (called SQL_RTM) with just the 2 features (SQL Database Engine and R Services in-database) selected. The command line below also relies on defaults for service accounts etc. So it does need customization, but if you are reading this, you probably know how to add more parameters! If you are not sure, the unattended setup help page will be a great starting point on what other parameters you can customize the setup with. OK, let’s see the command line now:
That’s it! The above command line setup will install without any human intervention, and will correctly provision the required R components as well.
Scenario 5: Unattended patching of SQL Server 2016 Cumulative Update
To add on to the previous scenario, let’s imagine that you wanted to automate the rollout of CU1 to an existing SQL instance which already has R Services installed, and to do this completely unattended. You can write a script which does the following:
Firstly, it will extract a copy of CU1 to a folder on the ‘offline’ computer as shown below. The /X: switch is a standard parameter for any servicing update packages (Cumulative Updates, Service Packs) for SQL Server:
Secondly, the script will copy the CAB files for the R Components (previously downloaded from the Internet) to a folder on the ‘offline’ computer. In our case, this location is C:\RComponentsOffline\CU1.
Finally, the script that you have created would proceed to run the extracted copy of CU1 setup, additionally passing in the 2 additional R component related switches described previously (IACCEPTROPENLICENSETERMS and MRCACHEDIRECTORY):
When the above script completes, SQL Server would have been patched to CU1 and our R Services installation would have also been updated correctly!
Scenario 6: Unattended, ‘slip-streamed’ setup of SQL Server 2016 RTM
This ‘mother of all scenarios’ is common in large enterprises where there is a high degree of automation and equal emphasis on patching out-of-box. In such cases, administrators do not want to deploy an RTM installation as-is if there is already a patch available. Basically, this is Scenario 3, but done in an unattended way!
As in scenario 5, imagine that you wrote a script to do the following on the ‘offline’ computer:
Copy the RTM bits to a folder called C:\InstallMedia\RTM
Extract CU2 bits to the ‘offline’ computer to a folder called C:\InstallMedia\CU2
Copy the CU2 version of the R components to a folder called C:\RComponentsOffline\CU2
Then the script would have a command line such as the below. The below command installs a named instance of SQL Server 2016 called ‘SQL_CU2’. By now, you would easily recognize the critical switches IACCEPTROPENLICENSETERMS and MRCACHEDIRECTORY:
This ‘offline, unattended smart setup’ results in a clean deployment of a CU2 version of SQL Server 2016, with R Services in working order!
You might also need this…
If you are reading this post, you will most likely also run into another common issue – how do you obtain and install additional R packages (those which are not included in the standard Microsoft R distribution) on that ‘offline’ computer? There is again a well documented way to do this. Start with the official documentation here and then a related blog post here.
Conclusion
We think SQL Server R Services is the best thing since sliced bread! But seriously, a feature with such compelling value is really exciting. We want to ensure that there are no obstacles in your evaluation and adoption of this feature. Hopefully this blog post will complement the official documentation to make things much easier for you. If you still have questions or comments on this topic, do not hesitate to let us know!
Related Viewing
There are 2 videos from the ‘Data Driven’ series which are a must-watch. Here’s the first one and here’s the second.
Bob Ward has a great talk about SQL Server R Services. View the recording here.
Bill Jacobs and Sumit Kumar talked about SQL Server R Services at the recent Ignite conference. View the video here.
Our customer – PROS, was an early adopter of SQL Server 2016 R Services and they share their experiences here.
For a broader overview of Microsoft’s various offerings in the ‘Intelligence’ space, view Rafal Lukawiecki’s presentation at Ignite 2016.
We trust you found this blog post interesting. Please leave your feedback and questions in the comments section below. Till next time, ciao!
Reviewed by: Steven Green, Peng Song, Xiaochen Wu, Kun Cheng, Sanjay Mishra
Introduction
Database migration from SQL Server to Azure SQL Database is a process that many organizations must implement as they move to the Azure public cloud. This article is a guide that describes one specific implementation path for the migration process, one that has been commonly used by Microsoft customers. To aid others in the same task, we present lessons learned, recommendations, examples, caveats, potential issues, solutions, and workarounds. A sample PowerShell script that automates some migration steps is available as well.
This article is long, compared to a typical blog post. We intentionally include a lot of conceptual and background material to help the reader understand the entire flow of the migration process. Optimally, this article should be read during migration planning, rather than during implementation or troubleshooting. The target audience is Architects, Database Engineers, DBAs, and Developers who are considering or implementing database migration from SQL Server to Azure SQL Database.
As a prerequisite, we recommend reviewing Azure SQL Database documentation on the topic of migration. That provides a broad overview of multiple migration paths and tools that are available. In the current article, rather than trying to encompass the breadth of possible migration approaches discussed in documentation, we concentrate on an in-depth review of one of the most common ways to migrate, namely on using bacpac files and the DacFx framework.
In practice, a migration project will likely include multiple SQL Server databases. For simplicity, we will consider a single database migration in this guide. Unless noted differently, all migration steps are equally applicable to migrations of multiple databases.
At a high level, the migration process consists of the following activities:
1. Ensure the application is compatible with Azure SQL Database.
2. Validate migration process functionality, and minimize required application downtime.
3. Ensure adequate database performance, once migrated to Azure SQL Database.
4. Operationalize the migrated database. Establish standard operating procedures for monitoring, troubleshooting, change management, support, etc.
These activities are not in a strict order; they overlap and blend with each other. They also tend to be iterative. For example, to minimize application downtime during actual migration, it will likely be necessary to repeat step 2 multiple times, to tune the process optimally for the specific application and database being migrated.
In this article, we will concentrate on the first two activities, i.e. on ensuring compatibility, and on the actual migration steps. Future articles are planned on the topics of performance and operations for the migrated database.
Figure 1 is a diagram describing the primary phases of the migration process.
The rest of this article describes the first four phases in detail.
Database Compatibility Considerations
There are three main areas to consider to ensure application and database compatibility with Azure SQL Database:
1. Feature compatibility. If the application uses any SQL Server features that are not available in Azure SQL Database, then this is an area that will likely require the most investment in terms of application re-engineering, which would be a prerequisite for migration to Azure SQL Database.
2. T-SQL language compatibility. In Azure SQL Database V12, this is usually not a major obstacle. Most of the T-SQL language constructs supported in SQL Server are also supported in Azure SQL Database. For details, see Azure SQL Database Transact-SQL differences. Nevertheless, this still requires attention; the specifics of validating language compatibility are described later in this article.
3. Behavior compatibility. While built on a common codebase, Azure SQL Database behaves differently from SQL Server in several important areas, i.e. high availability, backup, connectivity, security, resource governance, default concurrency model, etc. Most importantly, the application must implement robust retry logic to work reliably once migrated to the cloud.
The only fully reliable way to guarantee that the migration process is sound, and the database workload is compatible with Azure SQL Database, is comprehensive and thorough testing of both the migration process and the application in a test environment. That is a standing recommendation; however, many compatibility issues can and should be addressed ahead of time, before any migrations occur.
The preferred way to determine the extent to which a given SQL Server database is compatible with Azure SQL Database, in terms of T-SQL language and features, is to use SQL Server Data Tools (SSDT). In a nutshell, the approach is to create a Visual Studio database project for the database to be migrated, set the target platform for the project to “Microsoft Azure SQL Database V12”, and make changes to database schema and code to fix build errors until the project builds successfully. A successful build would indicate that the project is compatible with the target platform in terms of language and features (though not necessarily in terms of behavior and performance). For a detailed walkthrough, see Migrate a SQL Server Database to Azure SQL Database Using SQL Server Data Tools for Visual Studio. It is important to install the latest version of SSDT, to validate the project against the current surface area of Azure SQL Database, which expands frequently.
Once the database project is created, make sure to set the database compatibility level option (under project Properties, Project Settings, Database Settings, Miscellaneous) to whichever compatibility level the migrated database will have once it is migrated. Supported features and language constructs depend on both the platform (Azure SQL Database V12), and the selected compatibility level. In general, the latest compatibility level (130 as of this writing) is recommended to enable many recent improvements in the database engine. However, if the source database currently runs under a lower compatibility level, then a possibility of query performance regressions under a higher compatibility level exists. For additional details and guidance on choosing the right compatibility level for your database, see Improved Query Performance with Compatibility Level 130 in Azure SQL Database.
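For example, after migration the compatibility level of the database itself can be set and verified with statements along these lines (a simple sketch, executed while connected to the migrated database):
-- Set the migrated database to the latest compatibility level (130 as of this writing)
ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 130;
-- Verify the current setting
SELECT name, compatibility_level FROM sys.databases WHERE name = DB_NAME();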
Another database level option that should be considered is Read-Committed Snapshot Isolation (RCSI). In Azure SQL Database, this is the default for new databases, while in SQL Server, the default transaction isolation level continues to be Read-Committed. Most applications experience improved concurrency with RCSI, however it comes at a cost of increased tempdb utilization; also, there are some application patterns, e.g. certain queue implementations, that require Read-Committed transaction isolation (at least at the statement level with the READCOMMITTEDLOCK table hint). RCSI can be enabled and disabled at the database level as needed, and can also be set in the SSDT project settings (under project Properties, Project Settings, Database Settings, Operational, Read committed snapshot). In the context of migration, we generally recommend using the default RCSI enabled option for migrated databases; however, if there is any application functionality dependent on Read-Committed, then code may have to be modified to use the READCOMMITTEDLOCK hint, or, if such modification is not feasible, the RCSI option would have to be disabled at the database level.
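As a sketch, and again executed while connected to the migrated database, RCSI can be checked and, if necessary, disabled as follows:
-- Check whether RCSI is currently enabled
SELECT name, is_read_committed_snapshot_on FROM sys.databases WHERE name = DB_NAME();
-- Disable RCSI only if the application depends on Read-Committed locking behavior
ALTER DATABASE CURRENT SET READ_COMMITTED_SNAPSHOT OFF;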
In addition to verifying database compatibility with Azure SQL Database, SSDT provides another advantage – it ensures the validity of object references in database schema and code. For example, if the database to be migrated has any objects that reference non-existing objects, e.g. views that reference dropped tables, or any three-part references (which are not supported in Azure SQL Database, unless referring to tempdb), then they will be identified via build errors. This is important, because bacpac creation later in the migration process may fail if any broken references exist. Since SSDT database projects and bacpacs use the same DacFx framework internally, this is a good way to discover potential problems in advance. While not directly related to migration, an important side benefit of using SSDT is detecting and removing any invalid objects and code in the source database.
Once the changes to make the database compatible with Azure SQL Database have been made in the SSDT project, they will also have to be made in the actual SQL Server database to be migrated. This can be done by using the “publish” functionality in SSDT, or using any other approach that may be more optimal for a given change, in the context of a particular application and database. For example, if the SSDT publish script rebuilds a large table to implement a column data type change, then a more optimal approach may be to use the online alter column functionality available in SQL Server 2016.
It should be noted that in the recent versions of DacFx, the SqlPackage tool that creates the bacpac became smarter in its handling of Azure SQL Database import/export scenarios. For example, it will no longer block bacpac export if filegroups other than PRIMARY are present in the source database. Since Azure SQL Database does not support any database filegroups other than PRIMARY, all references to such filegroups present in the bacpac will be automatically replaced with references to the PRIMARY filegroup when SqlPackage imports a bacpac. Nevertheless, filegroups other than PRIMARY are still treated as build errors in an SSDT project, when the target platform is set to “Microsoft Azure SQL Database V12”.
Therefore, even though getting the database project to build in SSDT guarantees that bacpac export will not be blocked, it may, in some cases, entail more work than what is actually needed to successfully migrate the database. If you find that the effort to get the SSDT project to build successfully is overly large, then you may attempt to migrate the database as is. However, if export fails due to compatibility issues, then they would have to be addressed in the source database prior to attempting the actual migration.
SSDT will discover SQL Server features and T-SQL language constructs in the database schema and code that are incompatible with Azure SQL Database. It will not, however, discover them in queries that are embedded or generated in the application code, because they are not a part of the database project. Also, SSDT will not do anything to address the behavior compatibility of the database, e.g. the extent to which the differences between SQL Server and Azure SQL Database in the connectivity, high availability, and resource governance behavior could affect the application. For these two reasons in particular, comprehensive testing against the migrated database is recommended to address all compatibility concerns, prior to production migration.
Database Size Considerations
Currently, the maximum database size in Azure SQL Database is limited to 1 TB. However, for the purposes of database migration, several other size limits must be considered.
1. Disk space requirements on the machine where the bacpac is created need to be considered when dealing with large databases. During bacpac export, DacFx temporarily writes table contents to the %TEMP% directory, as well as the directory used by .Net Isolated Storage. These typically both reside on the C drive. Free disk space that will be required on this drive is roughly equal to double the size of the largest table in the source database. If free space on C is insufficient for these temporary allocations and for the bacpac itself, then the bacpac can be created on a different drive with sufficient space.
2. The same size considerations exist for import operations as well, if the import operation is done using SqlPackage on an Azure VM, as opposed to using the Import/Export Service. Using an Azure VM may be needed for network security reasons, as described in the Import the Bacpac section later in the article.
3. Size limits also exist in the Import/Export Service. Currently, the size of the bacpac file, plus double the size of the largest table, should not exceed 400 GB.
4. The maximum size of a block blob in Azure Blob Storage is limited to slightly more than 195 GB. Since Import/Export Service only supports block blobs, this limits bacpac size that can be imported by the service. For larger bacpac files, a workaround would be to use an Azure VM to import the bacpac.
We should note that the data in the bacpac file is compressed, therefore the actual source database size may be larger than some of these limits. For example, if a 500 GB source database compresses to a 190 GB bacpac, then the block blob limit in the list above (#4) will not prevent the use of Import/Export Service.
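To plan for these free space and size limits, you can estimate the size of the largest table in the source database with a query along these lines (a simple sketch based on reserved pages, each 8 KB):
-- Approximate reserved size of the largest table in the source database, in GB
SELECT TOP (1)
       OBJECT_SCHEMA_NAME(ps.object_id) AS SchemaName,
       OBJECT_NAME(ps.object_id) AS TableName,
       SUM(ps.reserved_page_count) * 8 / 1024. / 1024. AS ReservedGB
FROM sys.dm_db_partition_stats AS ps
GROUP BY ps.object_id
ORDER BY ReservedGB DESC;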
It is worth mentioning explicitly that the amount of downtime required to create, upload, and import a bacpac of a large database may be prohibitively large in the context of an actual production application migration. In those cases, the migration approach using transactional replication, which significantly reduces the necessary downtime at the cost of added complexity, may be a feasible solution.
Service Tier Considerations
Each database in Azure SQL Database is assigned a service tier, or service objective (sometimes also known as service level objective, or SLO). Examples are S0, P2, etc. This defines the amount of resources allocated to the database, the features that are available, and, no less importantly, the cost of service. It is not surprising that customers migrating to Azure SQL Database want to know the service tier that is optimal for the database being migrated, i.e. the one that maximizes performance and available features, and minimizes cost.
However, our experience shows that answering this question definitively before the workload is tested against the migrated database is often challenging, and may not be feasible at all. A tool that is often mentioned in this context is Azure SQL Database DTU Calculator. The tool works by using a Performance Monitor log for a SQL Server machine, collected while the application workload is running. The tool analyzes several performance counters, and predicts DTU consumption for the migrated database. However, while the tool is helpful (if used correctly), we consider the results it provides only an initial approximation of the optimal service tier at best, for the following reasons:
1. The database engines being compared are usually different, e.g. an older version of SQL Server running on-premises, and Azure SQL Database. Resource utilization by the same workload on different database engines may be different.
2. Resource governance models are different. Resource consumption of a typical SQL Server workload is not constrained by anything other than the platform (hardware) limits; Azure SQL Database, on the other hand, explicitly limits available resources in each service tier using its own implementation of the resource governance model. This may create unexpected resource bottlenecks for workload running against the migrated database.
3. The infrastructure platforms are usually different. This includes the hardware, the hypervisors, and the operating systems. CPU core speeds, cache sizes, generations, etc. may differ, storage type is often different (spinning media vs. SSD), and hypervisor type and configuration are likely different as well.
4. Database internal structures get modified as a by-product of migration, e.g. indexes get rebuilt and statistics get updated. This is an additional factor affecting performance and resource consumption post-migration.
5. For many real-life migrated applications, workload patterns in the cloud are different from what they were on-premises. The application may get utilized more or less, and application features may change as a part of migration. In that case, resource consumption will be different as well.
Therefore, accurately predicting required Azure SQL Database service tier in advance, based on SQL Server performance data, tends to be a futile undertaking. Instead, we recommend first selecting the Edition (Basic, Standard, Premium) based on service features that are required, taking an initial guess at the service tier, and then iteratively testing a representative application workload against Azure SQL Database. As long as the workload is indeed representative, the database can be scaled up or down from the initially selected service tier, to right-size it according to the workload, all prior to the production migration.
If this kind of iterative testing and scaling approach is not feasible for any reason, e.g. if the representative workload is not available, then a reasonable approach is to select a relatively high service tier (or a relatively high elastic pool size, if migrating multiple databases directly into an elastic pool) at first, monitor resource consumption and application performance, and then gradually scale down until the service tier or elastic pool size is appropriate for the workload (taking into consideration possible workload spikes), and the cost is acceptable.
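When following this iterative approach, resource consumption relative to the limits of the current service tier can be monitored with a query such as the following sketch (sys.dm_db_resource_stats retains roughly the last hour of data at 15-second granularity):
-- Recent resource utilization as a percentage of the current service tier limits
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;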
Finally, we should note that Azure SQL Database alleviates these concerns by providing the ability to easily and relatively quickly change database service tier over the lifetime of the application, so that the importance of selecting the “exactly right” service tier at the outset is greatly reduced.
Database Migration
Once the database is deemed compatible with Azure SQL Database, migration testing can start. Note that here we intentionally focus on the database migration only; the scope of the overall application migration to Azure is much larger, and is mostly beyond the scope of this article.
At a high level, migration steps are as follows:
1. Provision Azure resources; create logical server(s) and server level objects, and configure Azure SQL Database features. These preliminary steps should be done well in advance of the actual migration.
2. Declare the start of application downtime.
3. Export a bacpac of the source database. This bacpac contains all database scoped objects and data to be migrated.
4. Upload the bacpac to Azure Blob Storage, or to an Azure VM.
5. Start and monitor a bacpac import operation.
6. Once the import operation completes successfully, verify objects and data in the migrated database.
7. Grant database access.
8. Verify application functionality, as well as database management and monitoring functionality.
9. Declare the end of application downtime.
If there are multiple databases to be migrated, then steps 3-8 may be executed concurrently for all databases, to minimize application downtime.
Let’s discuss each step in detail.
Provision and Configure Azure Resources
The activities to provision and configure Azure resources can and should be done well in advance of the actual migration, to ensure proper configuration, and to avoid spending time on these tasks during application downtime. A large part of these activities is configuring application resources and deploying application code, which is a core task in the overall application migration process. Here we concentrate on provisioning and configuration of database resources specifically.
Develop a Naming Convention
The naming convention will govern the names given to resource groups, logical servers, logins, firewall rules, etc. Keep in mind that for some resource types, i.e. for Azure SQL Database servers and Azure Storage accounts, names must be globally unique. While many adequate naming conventions are possible, one that is commonly used is where separate name segments identify the organization (i.e. the Azure tenant), the Azure region, the application, the environment, etc., and include a numeric resource identifier for disambiguation. For example, for a fictitious organization named Wide World Importers, migrating an application named Import All, one of the production Azure SQL Database server names may be wwi-ia-pd-041. Many other conventions can be used, as long as they provide name clarity, consistency, and uniqueness.
For different Azure resource types, limits on the length of the name, its case sensitivity, allowed characters, etc. may be significantly different. For example, the name of an Azure SQL Database server is limited to 63 characters, while the name of an Azure IaaS Virtual Machine is limited to 15 characters. To simplify the naming convention, it may be tempting to adopt the most restrictive set of rules for all Azure resource types. However, that would negatively affect clarity and readability, and in extreme cases could force you into violating the naming convention just to make a name sufficiently descriptive. Therefore, our recommendation is to prioritize name clarity and descriptiveness over using the same rigid naming scheme for different resource types. Instead, consider using customized naming conventions for different resource types, to accommodate Azure name limits while ensuring that names are sufficiently descriptive.
Develop a Resource Grouping Strategy
Each Azure resource belongs to a resource group. A resource group is treated as a unit for provisioning and de-provisioning purposes. A good rule of thumb for grouping resources is to ask the question: “Would all resources in this group be created or deleted at the same time and as a part of the same task?” If yes, then the grouping strategy is probably sound. The extremes to be avoided are using a single group for all resources, or a separate group for each resource.
Provision and Configure Azure SQL Database Logical Servers
Azure SQL Database servers act as logical containers for databases. Unlike a traditional SQL server, a logical Azure SQL Database server is not associated with a pool of resources (CPU, memory, storage, network) that is shared among all databases on the server. Resources are allocated to each individual database, regardless of its placement on any specific logical server.
That said, some limits at the logical server level do exist: one is a quota on the total number of Database Transaction Units (DTUs) for all databases hosted on the same logical server (45,000 DTUs); another is the hard limit on the total number of databases per logical server (5,000 databases). As of this writing, the recommended maximum on the number of databases per server is in the 1000-2000 range. As with many limits, this is subject to change over time, and is highly dependent on the workload.
The reason for considering limits at the server level is that operations such as login processing, server firewall rule processing, and querying of server-level DMVs use resources allocated to the master database. Just like resources allocated to user databases, master database resources are subject to limits and governance. For highly intensive workloads, particularly those with high rate of connections, these limits could reduce scalability. Recently, an improvement has been made to mitigate this problem by caching logins and firewall rules in user databases, so this became less of a concern. For the uncommon workloads where this still causes a scalability issue, the dependency on the master database can be reduced further by eliminating server firewall rules and logins altogether, and instead using database-level firewall rules and database authentication (i.e. users with password and/or external provider users). That also reduces the need for configuration at the server level, making databases more portable.
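As a sketch (with placeholder names), a contained database user that authenticates at the database level, without any server login, can be created as follows:
-- In the user database: a contained user with its own password (no server login required)
CREATE USER AppUser WITH PASSWORD = '<strong password here>';
-- Grant the desired level of access in this database
ALTER ROLE db_datareader ADD MEMBER AppUser;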
Since Azure SQL Database servers are logical, and limits at the server level that we described above are not a concern for the vast majority of customer workloads, it may be advantageous to host multiple databases on the same logical server, in order to simplify environment configuration and improve manageability. In other words, just because two databases have been hosted on two different SQL Server instances, does not necessarily mean that they must be migrated to two different logical servers in Azure SQL Database. If the databases are related, i.e. used by the same application, consider hosting them on the same logical server.
Once a logical server is provisioned, some of the server level settings such as Firewall, Azure Active Directory admin (if used, see the section on server logins later in the article), and Auditing & Threat Detection can be configured. Note that Firewall and Auditing & Threat Detection can also be configured for each individual database later, if required. If Auditing & Threat Detection is configured at the server level, this configuration will be used for all databases on this server.
Configure Access Control and Network Security
This includes Azure Role-Based Access Control (RBAC), network security (firewall), and authentication (SQL authentication and/or Azure Active Directory authentication).
The principle of least privilege should be the guiding rule. To the extent possible, Azure SQL Database firewall rules should restrict inbound traffic to well-known sources, e.g. the application, management, and monitoring endpoints only. If any exceptions become necessary, e.g. for ad-hoc troubleshooting, they should be made just-in-time, and removed promptly once no longer needed.
If multiple databases are hosted on the same logical server, consider whether the firewall rules should be the same for all of them. If not, create firewall rules at the database level, rather than at the server level. As noted earlier, this also makes the database more portable, and reduces its configuration and performance dependencies on the logical server.
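As an illustration, the sketch below creates a server-level firewall rule for a well-known application address range with the AzureRm module, and an equivalent database-level rule using sp_set_database_firewall_rule executed in the context of the target database. The server, database, login, and IP address values are placeholders:
# Server-level rule, scoped to a known application address range (placeholder values)
New-AzureRmSqlServerFirewallRule -ResourceGroupName "AppRG" `
    -ServerName "app-sql-server-01" `
    -FirewallRuleName "AppServers" `
    -StartIpAddress "203.0.113.10" `
    -EndIpAddress "203.0.113.20"

# Database-level rule, created in the context of the target database
Invoke-Sqlcmd -ServerInstance "app-sql-server-01.database.windows.net" `
    -Database "AppDb" -Username "ServerAdmin" -Password "<password>" `
    -Query "EXEC sp_set_database_firewall_rule N'AppServers', '203.0.113.10', '203.0.113.20';"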
If using RBAC, grant Azure Active Directory users and groups membership in built-in or custom RBAC roles, scoped to the logical server. This restricts the resource management operations these principals can perform at the Azure Resource Manager API level, i.e. in the Azure Portal and Azure PowerShell.
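For example, a DBA could be granted the built-in SQL Server Contributor role scoped to a single logical server. This is a sketch only; the sign-in name, subscription ID, resource group, and server name are placeholders:
# Grant an Azure AD user management permissions on one logical server only (placeholder values)
New-AzureRmRoleAssignment -SignInName "dba@contoso.com" `
    -RoleDefinitionName "SQL Server Contributor" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/AppRG/providers/Microsoft.Sql/servers/app-sql-server-01"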
Note that RBAC controls Azure resource management operations only, and not authorization within the database. In Azure SQL Database, that is controlled via database role membership, and GRANT/DENY statements, just like in SQL Server.
Create Server Logins
A single SQL authentication login acting as a server administrator is always created as a part of server provisioning. That login is conceptually similar to the sa login in SQL Server. As a security best practice, we recommend restricting the use of this login as much as possible. Avoid using it for routine administration work, let alone for application connections. Instead, use dedicated logins for each functional role, e.g. application, DBA, monitoring tool, reporting tool, data reader, etc. Depending on the purpose of each login, it may require access to the master database, which is granted by creating a user in the master database for the login, and making that user a member of built-in roles in master (e.g. dbmanager, loginmanager) as needed.
We should mention that the logins mentioned so far are SQL authentication logins, and as such, are subject to known limitations and drawbacks of SQL authentication. Among them are the overhead of managing a separate set of credentials specifically for database authentication, the need to store passwords in application configuration assets, an increased likelihood of unintentional sharing and disclosure of credentials, etc.
From this perspective, Azure Active Directory (AAD) authentication is a better authentication mechanism, addressing these concerns for Azure SQL Database similar to the way Windows Authentication does it for SQL Server. With AAD authentication, a single Azure Active Directory group can be designated as a server administrator, to grant all members of this group the same privileges that are held by SQL authentication server administrator. Furthermore, Azure Active Directory users and groups can be associated with users in individual databases, to grant all AAD group members required access in a specific database.
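As a sketch, designating an AAD group as the server administrator can be done with the AzureRm module; the resource group, server name, and group display name below are placeholders:
# Designate an Azure AD group as the AAD administrator of the logical server (placeholder names)
Set-AzureRmSqlServerActiveDirectoryAdministrator -ResourceGroupName "AppRG" `
    -ServerName "app-sql-server-01" `
    -DisplayName "SQL DBA Team"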
There is an important consideration for SQL authentication logins used in the context of Geo-replication. If server-level SQL authentication is used, then the same logins must be created on all logical servers that may host geo-replicas of a given database. “The same” here means that the name, the security identifier (SID), and the password of the logins must be identical, so that failover from a primary geo-replica to a secondary geo-replica is possible without any other database or application changes. If the login SID on the server hosting the secondary geo-replica is different, then database access for that login will be lost after failover.
Making logins the same on multiple servers requires explicitly specifying the SID parameter when logins are created. To determine the SID value to use, first create a login without specifying the SID parameter, as in the following example:
CREATE LOGIN AppLogin1 WITH PASSWORD = 'SecretPwd1-';
Then, query the sys.sql_logins catalog view in the context of the master database, restricting the result set by login name:
SELECT sid
FROM sys.sql_logins
WHERE name = 'AppLogin1';
Now, to create the same login on any other server, use the following statement:
CREATE LOGIN AppLogin1 WITH PASSWORD = 'SecretPwd1-', SID = 0x01060000000000640000000000000000C6D0091C76F7144F98172D155AD531D3;
If not explicitly specified, a SID value is generated during login creation. For any new login, first create the login without specifying the SID, as in the previous example, to obtain a SID value generated by Azure SQL Database, and then use that value explicitly in the CREATE LOGIN statements on the other servers. This is useful when creating similar servers in multiple environments, because it makes the login creation process identical for all servers. Note that SID values must be unique per server, and that they are structured values that must be generated by Azure SQL Database; using an arbitrary 32-byte value as the SID is unlikely to work.
Create an Empty Database
This step is optional, because the import operation later in the process will create the database, if a database with the same name does not exist on the target logical server. Nevertheless, it may be advantageous to do this now, in order to configure database level settings in advance of data migration, and thus reduce application downtime.
A key point here is that the pre-created database must remain empty until the start of the import operation. If the import operation detects any user objects in the target database, it will fail; this behavior protects against accidental data loss in the target database.
The settings that can be enabled in advance include Geo-replication, Auditing & Threat Detection, and Transparent Data Encryption (TDE). If business continuity and/or compliance requirements mandate the use of these features, then this approach ensures that they are enabled before any data is migrated to the cloud. A tradeoff between compliance and performance is that enabling TDE in advance of the import operation will slow down the import to some extent.
The service objective of the empty database can be set relatively low to reduce cost in the interim period prior to migration. The service objective can then be scaled up as a part of the import operation.
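For instance, the empty database could be pre-created at the S0 service objective; the sketch below uses placeholder resource group, server, and database names:
# Pre-create an empty database at a low service objective to reduce cost until migration (placeholder names)
New-AzureRmSqlDatabase -ResourceGroupName "AppRG" `
    -ServerName "app-sql-server-01" `
    -DatabaseName "AppDb" `
    -Edition "Standard" `
    -RequestedServiceObjectiveName "S0"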
Create a Storage Account for Bacpacs
This Azure Blob Storage account will be used to upload and store the bacpac(s) to be imported. Create the account, and create a container for bacpac blobs, making sure to use private access type for the container.
For most migrations, a Standard Storage account is sufficient. For very large bacpacs and tight migration downtime windows, increased IOPS and bandwidth of a Premium Storage account could make import operations faster. Regardless of the storage account type, make sure to create the account in the same region where the migrated databases will be hosted, to avoid slower cross-region data transfer during import.
If migrating multiple databases concurrently, use a locally-redundant storage account to maximize ingress bandwidth, and thus avoid a bottleneck during concurrent upload of many bacpacs. Multiple storage accounts may be used if the ingress bandwidth of a single account is exceeded while uploading many bacpacs concurrently.
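The following sketch creates a locally-redundant Standard storage account in the target region and a private container for bacpacs. The account, container, resource group, and region names are placeholders, and parameter names can vary slightly between AzureRm.Storage versions:
# Create a locally-redundant Standard storage account in the target region (placeholder names)
New-AzureRmStorageAccount -ResourceGroupName "AppRG" `
    -Name "appmigrationstorage" `
    -Location "West US 2" `
    -SkuName "Standard_LRS"

# Create a private container for the bacpac blobs
$storageKey = (Get-AzureRmStorageAccountKey -ResourceGroupName "AppRG" -Name "appmigrationstorage")[0].Value
$storageContext = New-AzureStorageContext -StorageAccountName "appmigrationstorage" -StorageAccountKey $storageKey
New-AzureStorageContainer -Name "bacpac" -Permission Off -Context $storageContext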
Migrate the Database
Once the preliminary activities described in the previous sections are complete, we are ready to declare application downtime, and start actual database migration. Figure 2 outlines the core activities to be performed.
Export the Database to a Bacpac
In this step, we will create a bacpac file for the database to be migrated. SQL Server Management Studio (SSMS) provides a UI for creating bacpacs (under the Database context menu, Tasks, Export Data-tier Application). However, using the SqlPackage tool from the command line, or as a command in a script, provides additional flexibility, and is preferred in most migrations.
The SqlPackage tool is a part of the DacFx framework, which needs to be installed on the machine where the bacpac will be created. Note that it is not required to create the bacpac on the same machine where the source database is hosted, though connectivity to the SQL Server instance hosting the source database is required. It is recommended to use the most current release of the DacFx framework. New releases of DacFx are announced on the SSDT team blog.
The DacFx framework is available in x86 and x64 versions. It is recommended to install both. SSMS, being a 32-bit application, will use the x86 version, which imposes a limitation on the amount of memory the bacpac export process can use. This is one of the reasons to use the command prompt (either cmd or PowerShell), where the 64-bit version of SqlPackage can be used.
By default, the current version of the DacFx framework is installed in C:\Program Files\Microsoft SQL Server\130\DAC\bin, assuming the x64 version. The x86 version is in a similar location under C:\Program Files (x86). These paths are not added to the system PATH environment variable; therefore, we need to either execute SqlPackage commands from one of these directories, or use a fully qualified path.
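For reference, a typical export command looks like the sketch below, run from a 64-bit PowerShell prompt; the server, database, and file names are placeholders:
# Export the source database to a bacpac using the x64 SqlPackage (placeholder server, database, and path)
& "C:\Program Files\Microsoft SQL Server\130\DAC\bin\SqlPackage.exe" /Action:Export `
    /SourceServerName:"SourceSqlServer" `
    /SourceDatabaseName:"AppDb" `
    /TargetFile:"F:\Temp\AppDb.bacpac"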
A very important point to make is that unlike a native SQL Server backup, a bacpac is not guaranteed to contain transactionally consistent data. If the source database is modified while the export process is running, then data in the bacpac can be transactionally and referentially inconsistent, resulting in failed bacpac import and potential data loss later in the migration process. To prevent this, and to ensure transactional consistency and referential integrity of the migrated database, all write access to the source database must be stopped before the export operation starts, e.g. by making the database read-only.
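For example, on the source SQL Server instance, the database can be made read-only just before the export starts (the database name is a placeholder):
# Stop all write activity by making the source database read-only (placeholder database name)
Invoke-Sqlcmd -ServerInstance "SourceSqlServer" -Database "master" `
    -Query "ALTER DATABASE [AppDb] SET READ_ONLY WITH ROLLBACK IMMEDIATE;"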
An important note from SqlPackage documentation, with emphasis added: “Validation for the Export action ensures Windows Azure SQL Database compatibility for the complete targeted database even if a subset of tables is specified for the export.” In other words, this means that the database model of the entire database must be consistent (no broken references) and compatible with Azure SQL Database, in order to create the bacpac. Using an SSDT database project to model the database, as described in an earlier section, helps ensure this.
As a practical matter for many actual database migrations, objects in the source database may need to be modified, sometimes significantly, to enable successful bacpac export and import. For example, unused objects may need to be dropped, broken references fixed, code incompatible with Azure SQL Database rewritten, etc. Some of these changes may be trivial, while others may be quite involved.
Recall that at this point in the migration process, we are in the middle of application downtime. We need to minimize the time taken to create the bacpac. Internally, a bacpac is a ZIP archive that contains schema metadata (represented in XML), and data (in native BCP format). During bacpac creation, all this data is extracted from the source database, and compressed in a single bacpac file. Compression is a CPU intensive operation. To make it fast, it is important to use a machine with plenty of CPU capacity. For this reason, it is often best to use a machine other than the source SQL server for bacpac creation, to avoid CPU contention between data extraction threads in SQL Server and data compression threads in SqlPackage. We recommend monitoring CPU utilization during bacpac export. If it is close to 100%, then using a machine with more CPU capacity could speed up the export. The assumption here is that network bandwidth between the two machines is sufficiently high (e.g. they are on the same LAN), and therefore we are not creating a different (network) bottleneck for data extraction by using separate machines.
One rather obvious, but at the same time often overlooked way to shorten the duration of bacpac creation, as well as the duration of the following phases of the migration process (upload, import, verification), is to remove any data in the database that does not need to be migrated. This includes data that is past its required retention period, temporary copies of tables that are no longer needed, data no longer used by the application, etc.
Upload the Bacpac
Once the bacpac is created successfully, we need to upload it to Azure Blob Storage.
One obvious prerequisite is that connectivity to Azure Blob Storage must be available from the on-premises network. Today, unintentional or malicious data exfiltration is a major concern for many organizations. Therefore, multiple barriers for outbound data transfers may exist, by design. But in our scenario, intentional data exfiltration (with respect to the on-premises network) is actually a required step in the migration process. In many environments, this is a challenge that will require advance cooperation and coordination within the IT organization to resolve.
For similar reasons, accessing outside network resources directly from database servers is not allowed in many environments. An acceptable compromise may be to allow temporary outbound access from a different dedicated machine on the internal network, which is also allowed to connect to the database server to extract data. This is yet another reason, besides performance considerations, to create and upload the bacpac on a machine other than the source database server.
Assuming that a reliable connection to Azure Blob Storage is available, our goal is to make the bacpac upload as fast as possible, given available network bandwidth, in order to minimize application downtime. A solution that has worked well for many customers is AzCopy, a tool for working with Azure Blob Storage that supports multi-threaded network transfers. Multi-threading provides a major increase in transfer speed. In fact, on networks with limited outbound bandwidth, it may be necessary to reduce the number of threads, to avoid saturating the network link and starving other on-premises systems of network bandwidth during bacpac upload.
Here is an example of an AzCopy command that uploads all bacpac files found in the F:\Temp directory to a container in the specified storage account (the storage account name, container name, and account key below are placeholders):
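# Upload all bacpacs from F:\Temp; /NC controls the number of concurrent transfers (placeholder account, container, and key)
& "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe" `
    /Source:"F:\Temp" `
    /Dest:"https://appmigrationstorage.blob.core.windows.net/bacpac" `
    /DestKey:"<storage-account-key>" `
    /Pattern:"*.bacpac" `
    /NC:8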
By default, AzCopy is installed in C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy. This path is not added to the system PATH environment variable; therefore, we need to either execute the AzCopy command from this directory, or use a fully qualified path.
Import the Bacpac
Once the bacpac is uploaded to Azure Blob Storage, we can start the import operation. This can be done interactively from Azure Portal, or using the New-AzureRmSqlDatabaseImport PowerShell cmdlet (using the older Start-AzureSqlDatabaseImport cmdlet is not recommended). With either option, we need to specify the service objective of the target database. If you pre-created an empty database and used a lower service objective at that time, but specified a higher service objective when starting the import, then the import operation will automatically scale up the database as a part of import.
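The following sketch starts an import with the AzureRm module, temporarily targeting a P2 service objective to speed up the bulk load (see the discussion of transaction log throttling below). All names, the storage key, and the password are placeholders:
# Start the import; the returned object carries the OperationStatusLink used later to check progress
$importRequest = New-AzureRmSqlDatabaseImport -ResourceGroupName "AppRG" `
    -ServerName "app-sql-server-01" `
    -DatabaseName "AppDb" `
    -StorageKeyType "StorageAccessKey" `
    -StorageKey "<storage-account-key>" `
    -StorageUri "https://appmigrationstorage.blob.core.windows.net/bacpac/AppDb.bacpac" `
    -AdministratorLogin "ServerAdmin" `
    -AdministratorLoginPassword (ConvertTo-SecureString "<password>" -AsPlainText -Force) `
    -Edition "Premium" `
    -ServiceObjectiveName "P2" `
    -DatabaseMaxSizeBytes 268435456000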
The import operation is typically performed by the Azure SQL Database Import/Export Service. Similar to the bacpac export done earlier, the service also uses the DacFx framework to import the uploaded bacpac into the target database.
The IP address space used for outbound connections from the Import/Export Service infrastructure to the target logical server is not documented, and is subject to change at any time. Therefore, given that connections to the target Azure SQL Database server are gated by the server firewall, the only fully reliable way to ensure that the Import/Export Service will be able to connect is to enable the firewall rule that allows access from all Azure services (or, equivalently, from the 0.0.0.0 IP address). Obviously, opening the firewall to a large IP address space is a network security risk. Security-conscious organizations will want to mitigate this risk by disabling this firewall rule as soon as the import operation completes successfully. For our migration scenario, there is an additional mitigating factor: during the import, the database is only accessible via the server administrator login and the AAD administrator group, if the latter is provisioned.
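A sketch of temporarily enabling, and later removing, the rule that allows access from Azure services (expressed as the 0.0.0.0 rule mentioned above); the resource group and server names are placeholders:
# Temporarily allow connections from Azure services for the duration of the import (placeholder names)
New-AzureRmSqlServerFirewallRule -ResourceGroupName "AppRG" -ServerName "app-sql-server-01" `
    -FirewallRuleName "AllowAllAzureServices" -StartIpAddress "0.0.0.0" -EndIpAddress "0.0.0.0"

# ... start and monitor the import ...

# Remove the rule as soon as the import completes successfully
Remove-AzureRmSqlServerFirewallRule -ResourceGroupName "AppRG" -ServerName "app-sql-server-01" `
    -FirewallRuleName "AllowAllAzureServices"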
For organizations finding the security risk of temporarily opening the firewall to the entire Azure IP address space unacceptably high, a workaround would be to provision an Azure IaaS VM in the same region where the target database will reside, install DacFx framework on that machine, and execute the import process using SqlPackage from that machine. In this case, the firewall only needs to be opened to the public IP address of this machine.
Similar to previous steps and for the same reasons, we need to minimize the amount of time the import operation takes. During bacpac import, data is bulk loaded (using ADO.NET SqlBulkCopy class) into the target database. Usually, multiple tables are loaded concurrently. Recall that for each Azure SQL Database service objective, the amount of available resources is limited. One of these limitations is on the volume of transaction log writes. Since all bulk load threads write to the transaction log of the target database, log write throttling may be encountered during the import operation, and is in fact the primary factor gating import speed for most migrations. Therefore, a common way to speed up the import operation is to temporarily increase the service objective to the point where transaction log writes are no longer throttled. For example, if the target service objective for the database is S2, and there is a requirement to minimize import time, then it will probably make sense to import into a P2 or higher database, and then scale down to S2 once import completes.
A simple way to see if transaction log writes, or any other governed resource, is being throttled is to examine the output of sys.dm_db_resource_stats DMV during the import operation. Here is an example of a query that does this. The query needs to be executed in the context of the database being imported:
SELECT *
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
If transaction log writes are currently being throttled, then the values in the avg_log_write_percent column for the top rows in the result set will be close to or equal to 100%. This is an indication that a higher service objective can reduce import time. Note that similar resource utilization data is also available in the Azure Portal and in the sys.resource_stats view in the master database. However, that data is averaged over five-minute intervals, as opposed to the 15-second intervals in sys.dm_db_resource_stats, so spikes in resource utilization that may indicate throttling will be less noticeable in the portal.
For larger databases, the import operation can take a long time, and should be monitored. If using Azure Portal, use the Import/Export history tile under the Operations section on the SQL server Overview blade, and examine the operations in progress. Completion percentage and current operation status are displayed.
If using PowerShell, the Get-AzureRmSqlDatabaseImportExportStatus cmdlet will return an object that describes the status of the import operation, including completion percentage. This cmdlet requires an OperationStatusLink parameter value (i.e. the identifier for the import operation), which is a field of the object returned by the New-AzureRmSqlDatabaseImport cmdlet when it starts a new import operation.
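For example, a simple polling loop could look like the sketch below; $importRequest is the object returned by New-AzureRmSqlDatabaseImport, and the Status and StatusMessage property names are assumptions that may differ between module versions:
# Poll the import status every 30 seconds until it is no longer in progress
do {
    Start-Sleep -Seconds 30
    $importStatus = Get-AzureRmSqlDatabaseImportExportStatus -OperationStatusLink $importRequest.OperationStatusLink
    Write-Output ("{0}: {1}" -f $importStatus.Status, $importStatus.StatusMessage)
} while ($importStatus.Status -eq "InProgress")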
Once the import operation completes successfully, the server firewall rule that allowed access from all Azure services can be disabled, and bacpacs uploaded to the storage account can be deleted, to reduce data exposure and exfiltration concerns.
Verify Migrated Data
As with any database migration, before allowing the application to use the database, we should verify, to the extent it is practical and reasonable, that all objects and data in the source database have been successfully migrated. As noted earlier, this is particularly important for a bacpac migration, which does not natively guarantee that included data is transactionally and referentially consistent.
A simple verification mechanism for database metadata is to compare the sets of all user objects in the source and target databases, ensuring that they are the same. Similarly, for each user table, a basic data verification approach is to compare row counts between the source and target databases.
Here is a sample query to return a set of all user objects in a database:
SELECT s.name AS [schema_name],
o.name AS [object_name],
o.type_desc AS [object_type_desc]
FROM sys.objects AS o
INNER JOIN sys.schemas AS s
ON o.schema_id = s.schema_id
WHERE s.name <> 'sys'
AND
o.is_ms_shipped = 0
AND
o.type NOT IN ('IT','S')
;
Here is a sample query to return the row count for each user table in a database:
SELECT s.name AS [schema_name],
t.name AS table_name,
SUM(p.rows) AS row_count
FROM sys.partitions AS p
INNER JOIN sys.indexes AS i
ON p.object_id = i.object_id
AND
p.index_id = i.index_id
INNER JOIN sys.tables AS t
ON p.object_id = t.object_id
INNER JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE s.name <> 'sys'
AND
t.is_ms_shipped = 0
AND
t.type NOT IN ('IT','S')
AND
i.type_desc IN ('HEAP','CLUSTERED','CLUSTERED COLUMNSTORE')
GROUP BY s.name, t.name
;
These are quick, but by no means comprehensive verification methods, particularly for verifying full fidelity of migrated data. Either a data compare tool (e.g. SSDT), or the tablediff utility, may be used for a more thorough, but also more time-consuming data comparison between the source and target databases.
Grant Database Access
Once the migrated database is successfully verified, we can grant database access to any server logins created previously, or create database authentication principals (users with passwords, or external provider users if using AAD authentication). This step is best implemented as an idempotent script that creates the required database users (for server logins, with passwords, or from an external provider), and grants them database role membership and explicit object permissions as needed.
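A minimal sketch of such an idempotent script is shown below, granting access to the AppLogin1 login created earlier; the role memberships, server, database, and credentials are placeholders:
# Idempotent grant-access batch: create the database user for the server login (if missing) and add role memberships
$grantAccessSql = @"
IF NOT EXISTS (SELECT 1 FROM sys.database_principals WHERE name = N'AppLogin1')
    CREATE USER AppLogin1 FOR LOGIN AppLogin1;
IF IS_ROLEMEMBER('db_datareader', 'AppLogin1') = 0
    ALTER ROLE db_datareader ADD MEMBER AppLogin1;
IF IS_ROLEMEMBER('db_datawriter', 'AppLogin1') = 0
    ALTER ROLE db_datawriter ADD MEMBER AppLogin1;
"@

Invoke-Sqlcmd -ServerInstance "app-sql-server-01.database.windows.net" `
    -Database "AppDb" -Username "ServerAdmin" -Password "<password>" `
    -Query $grantAccessSql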
Verify Application Functionality
At this point, the database is successfully migrated. However, we should recall that database migration is just one part of the overall application migration to Azure. We still need to confirm that the application functions as expected, that the monitoring tools work, and that the database can be managed successfully. Only once these steps are completed successfully can we declare the application migration a success, and the migration downtime over.
Conclusion
In this article, we walked through an implementation of SQL Server to Azure SQL Database migration that many customers have used as a part of moving their applications to the Azure public cloud. We described the preliminary steps needed to prepare for the migration, and then focused on the specific implementation details of the actual migration. This article should be helpful to organizations considering, or actively working on, a database migration to Azure SQL Database.
Appendix A. Sample Migration Script
A sample PowerShell script that implements bacpac upload and import steps is available. The script (optionally) uploads all bacpac files found in specified directory to an Azure Storage account using AzCopy, and then starts multiple concurrent import operations for each bacpac found in the specified Azure Blob Storage container. The script continues to run for as long as any of the started import operations are in progress, and periodically outputs the status of all in-progress or failed operations.
Prior to completion, the script throws an exception if any import operations have failed, or if there are any source bacpacs without a matching database on the target logical server.
The latest version of the script can be found in this GitHub repo.