Monday, November 2, 2015

Notes from PASS Summit 2015

I thought it would be fitting for my first post to this blog to be about some of the things I learned while at the PASS Summit this year. It was my first time attending, and it was an absolute blast. I learned a lot in a small amount of time, and was surprised at how large (and friendly) this community is; it really is a #SQLFamily.


Wednesday, Oct 28th

Started off the day wandering into a breakfast that had a demo of EMC's XtremIO for SQL Server. It was a really cool technology... that I'll never get to play with. All SSD, and it used diffs when copying databases to different environments to save space. It had some really cool compression technology built in as well.

The Keynote

The Keynote was the real winner that morning. I was able to get a pretty sweet front row spot, as seen in the panorama below.

They started the show with a video predicting that soon we will be generating 40 ZB of data a year, or some stat like that. Sounds great for job security, but I'd be interested to see where that number came from. The Summit had 5,500 registrations from 58 countries and 2,000 companies!

The keynote speaker showed a video about ImagineCare, which pulled data from healthcare devices (heart rate monitors, glucose monitors, etc.; a great example of IoT) into the cloud and ran some ML algorithms to try and predict whether someone needed to see a doctor or was likely to visit an ER. Then they showcased a company called DocuSign, which has doubled its data footprint every year. Glad to see that they can keep up.

The rest of the time they demoed SQL Server 2016 and it looked great! Although just about anything would on an HP Superdome machine (240 cores (480 with hyper-threading) and 12 TB of RAM). A lot of cool things are built into 2016: Mobile BI, R integration, In-Memory OLTP updates, PolyBase, JSON support, Real-Time Operational Analytics, and Always Encrypted. They demoed most of these and it was very cool.

Eating the Elephant

This was an intro session to Hadoop for DBAs. I was really hoping for a demo here, but here are my takeaways:
SQL Server releases every 2-4 years
Hadoop has an "official" release about every 6 months

The 5 V's of "Big Data" (Big Data is a poorly chosen moniker by the way; we had big data problems back when files were too big for floppy disks (you know, the save icon)):
Volume
Velocity
Variety
Variability
Value

A Data Lake is like the junk drawer of data.
There's something called a Lambda Architecture. Not really sure what this is, but it sounds really cool.

Word of the day is polyschematic, which sounds better than unstructured, because everything has some structure to it.

You can use the Hive ODBC driver with SSIS to import/export data.

Paul Randal's DBA Mythbusters

This one was really cool, mostly because it was Paul Randal. He had around 25 or so myths that he busted; no demos, but he was able to explain what happens under the hood. He's got a bunch of blog posts about this: http://www.sqlskills.com/blogs/paul/myths-and-misconceptions-60-page-pdf-of-blog-posts/.

Database Design Throwdown

This was a pretty fun session. The two speakers would take a side of a topic (even if it was ridiculous) and try to get the audience to share their views on the issue.

Hive & Hadoop High Performance Data Warehousing

This was a really cool session. Mostly talked about how Hadoop can be a way to supplement the Data Warehouse and how Hive is a great way to get started since we already know SQL.

Hadoop is beneficial when we have large data sets (terabytes in size), "unstructured" data, data stored for auditing and archiving, log and sensor data, the Internet of Things (IoT), data mining, machine learning, and pattern recognition.

Stored data in ORC format, a columnstore format for Hive

Used Sqoop to stage data into Hadoop and HiveQL to transform the data into dimensions and facts.
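
Roughly what that pipeline might look like in HiveQL; this is just a sketch with made-up table and column names, not the presenter's actual code:

-- Land Sqoop-staged text data into an ORC table
CREATE TABLE sales_orc (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  STRING,
    amount      DECIMAL(18,2)
)
STORED AS ORC;

-- sales_staging is a text-format table loaded by Sqoop
INSERT INTO TABLE sales_orc
SELECT order_id, customer_id, order_date, amount
FROM sales_staging;

-- Build a simple dimension from staged data
CREATE TABLE dim_customer STORED AS ORC AS
SELECT DISTINCT customer_id, customer_name, customer_city
FROM customer_staging;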

Thursday, Oct 29th

Morning Keynote

I didn't get an awesome seat like I did the day before; in fact I almost overslept the keynote because I forgot to set an alarm. (BTW, if you create an alarm in Android, it's a one-time alarm by default.) This keynote lacked the demos that we saw the day before, but was still very interesting. Some people from Microsoft came in and talked about the Internet of Things and some of the challenges it creates. They had some crazy stats, like: there are 12.5 billion connected devices today, and we should see that double to 25 billion by 2020.

The main idea here was dubbed "Fog Computing," or edge computing, and was primarily concerned with extracting data from these sensors and pushing it to the cloud. The work would be done by an aggregator that takes in a stream of data from a device and, based on the results of queries on that stream, sends it up to the cloud infrastructure. Once in the cloud, the data could be used in applications, analytic solutions, etc.

Analyzing Performance & Scalability for OLTP/DW Solutions

Testing was all based on the HP Integrity Superdome X platform
8 blades, each w/ 30 cores and 1 TB of RAM

3PAR 7440c SAN
224 SSDs, 480 GB each... you can do the math

Eight 500 GB test databases, spread across the storage array in RAID 1+0

Reference KB article 2964518 for recommendations for scaling up SQL Server 2014.
May want to consider for our environment.

Test included 22 queries, running individually. 
May be good to devise a benchmark for our system. 

Raising Code Quality w/ Automated Testing

The cost of fixing code after it has been released into production is > 150x the cost of doing it during development.

tSQLt was the framework used for testing

Outline for Tests
Assemble - create the test cases and load the appropriate tables
Act - execute the code we want to test.
Assert - check that something is true from the test, if not the test has failed.

Shouldn't need 100K+ rows of data
Just enough to cover specific scenarios (test cases)

Create a separate schema or database and have a stored procedure that loads the data
Script out the data and keep the script in source control

Write a test for every new piece of code or bug fix
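
Here's a rough sketch of what one of these tests might look like in tSQLt, following the Assemble / Act / Assert outline; the table, function, and test names are made up:

EXEC tSQLt.NewTestClass 'OrderTests';
GO
CREATE PROCEDURE OrderTests.[test order total sums line amounts]
AS
BEGIN
    -- Assemble: fake the table and load only the rows this scenario needs
    EXEC tSQLt.FakeTable 'dbo.OrderLines';
    INSERT INTO dbo.OrderLines (OrderId, Amount)
    VALUES (1, 10.00), (1, 15.00);

    -- Act: run the code under test (a hypothetical scalar function)
    DECLARE @total money = dbo.CalculateOrderTotal(1);

    -- Assert: fail the test if the result isn't what we expect
    EXEC tSQLt.AssertEquals @Expected = 25.00, @Actual = @total;
END;
GO
EXEC tSQLt.Run 'OrderTests';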

Wait, Latches, and Locks

Use Extended Events; the system_health session is a good one
ASYNC_NETWORK_IO usually means the application isn't consuming results fast enough. It's very rarely actually the network

SOS_Scheduler_Yield
SQL Server uses cooperative scheduling, so tasks give up the processor to other tasks after their quantum (4 ms) expires.
The SOS_SCHEDULER_YIELD wait happens when a task uses up its quantum without "naturally" waiting on anything.
A high count with high waits could mean the workload is CPU bound

THREADPOOL
Happens when a login can't get a worker thread from the thread pool; too many "people" on the machine. You'll only see it in the stats and tasks DMVs, and you may need to use the DAC to see it live.

IO_COMPLETION 
Happens when you spill into TempDB for sorting.

Memory waits 
Resource_Semaphore - limited memory
Resource_Semaphore_query_compile
CMEMTHREAD - Trace Flag 8048 may help

Preemptive waits
Happen when a task calls external code (e.g. OS calls or linked server queries) and leaves cooperative scheduling
The task will stay in RUNNING status while it waits

CXPacket
Used to synchronize parallel query workers (it just means we have parallel queries)
Tune queries as needed
There will be one thread without a CXPACKET wait; whatever that thread is waiting on could be the real problem
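
For reference, a quick way to eyeball the top waits on a server is the wait-stats DMV; a minimal sketch (the filter list is just an example of benign waits to trim):

-- Top waits by total wait time, with signal (CPU) wait split out
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;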

Understanding ColumnStore Indexing

Architecture
Split data into row groups of up to 1,048,576 rows
A deleted bitmap is a structure storing information about rows that have been deleted in a row group
Updates are Delete + Insert

Row groups w/Status: Open, Closed, Hidden, Tombstone
Hidden is seen during bulk load
Tombstone is a compressed Delta-Store (you can ignore it)

Segment elimination
Only the columns needed are read
Min and max values are used as heuristics to determine which row groups are needed
Only works for certain data types

Predicate Pushdown
The WHERE clause is passed into the storage engine so fewer rows are read off disk.
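
You can see the per-segment metadata that drives segment elimination in sys.column_store_segments; a rough sketch (the fact table name is hypothetical):

-- min_data_id / max_data_id are encoded values, not raw column values,
-- but they are what lets segments (row groups) be skipped
SELECT p.object_id,
       s.column_id,
       s.segment_id,
       s.row_count,
       s.min_data_id,
       s.max_data_id
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
    ON p.partition_id = s.partition_id
WHERE p.object_id = OBJECT_ID('dbo.FactSales'); -- hypothetical fact table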

Phrase of the Day: "blazingly slow"

When rebuilding the CCI, use MAXDOP = 1 after building a clustered (rowstore) index keyed on the most frequently used predicate; parallel will not help w/ segment alignment unless the table is partitioned.

Need to use partitioning w/ columnstore 

He used Resource Governor to starve the rebuild of memory to create smaller row groups. Having many row groups can help w/ parallel queries, and means less data in a single group.
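
A sketch of the alignment trick as I understood it (table and column names are made up): order the data with a rowstore clustered index on the common predicate, then convert it to a clustered columnstore single-threaded so the ordering survives.

-- Step 1: order the data by the most frequently filtered column
CREATE CLUSTERED INDEX CCI_FactSales
    ON dbo.FactSales (DateKey);

-- Step 2: convert it to a clustered columnstore index with MAXDOP = 1,
-- so the segments stay aligned with the DateKey order
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
    ON dbo.FactSales
    WITH (DROP_EXISTING = ON, MAXDOP = 1);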

Extended Events for Memory Monitoring
Clustered_columnstore_index_rebuild
column_store_index_low_memory
column_store_index_throttle
column_store_index_build_process_segment

Row group trimming
Alter index reorganize
Memory Pressure
Dictionary Pressure

Dictionaries
Global and Local
Global is for all row groups
Local is for only one row group

Dictionary pressure can/will occur with strings in the row group
Avoid strings in the fact tables

Friday, Oct 30th

Spark SQL

Performance: in-memory (Spark is an in-memory technology that can run on YARN)

Spark is the 2014 Sort Benchmark winner
100 TB in 23 minutes on a 206-node system

Azure Data Lake has an offering for Spark Platform

Zeppelin - Spark SQL interpreter

Spark can work w/ Tableau


Fast and general cluster computing system, interoperable w/ Hadoop
Improves efficiency through in-memory computing primitives and general computation graphs
Improves usability through APIs in R, Scala, Python, and Java, and a unified interface to read/write data
Resilient Distributed Datasets - collections of objects that can be stored in memory or on disk across a cluster, with parallel functionality

DataFrames are collections of rows and columns
 
The demo had crime data w/ 5.5 million rows, flight data with 744 million rows, and a larger set of flight data at 1.5 TB / 11 billion rows. All queries ran in ~1 sec on a 32-node cluster with 8 cores and 55 GB of RAM per node.
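
In Zeppelin, that kind of query is just a paragraph run against the Spark SQL interpreter; a made-up example along the lines of the crime demo (the table name is assumed to have been registered already):

%sql
-- assumes the crime data was registered as a table named crimes
SELECT primary_type, COUNT(*) AS incidents
FROM crimes
GROUP BY primary_type
ORDER BY incidents DESC
LIMIT 10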

Date Dimension

Use an INT for the DateKey; it's much faster with joins.
Very descriptive columns
First day of month, last day of month, last day of last month, etc.

Use extended properties to set column descriptions

Separate Time dimension
Uses the time data type
AM/PM description and flags
Just one day of time
PK is an INT based on HHMMSS, like the DateKey

Consider including an offset from UTC
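
A stripped-down sketch of what the date dimension might look like; the column list here is just illustrative:

CREATE TABLE dbo.DimDate (
    DateKey            int         NOT NULL PRIMARY KEY, -- e.g. 20151030
    FullDate           date        NOT NULL,
    DayOfWeekName      varchar(10) NOT NULL,
    MonthName          varchar(10) NOT NULL,
    CalendarYear       smallint    NOT NULL,
    FirstDayOfMonth    date        NOT NULL,
    LastDayOfMonth     date        NOT NULL,
    LastDayOfLastMonth date        NOT NULL,
    UtcOffsetMinutes   smallint    NULL -- optional offset from UTC
);

-- Fact joins use the INT key, e.g.:
-- SELECT ... FROM dbo.FactSales f JOIN dbo.DimDate d ON d.DateKey = f.DateKey;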

Data Loading Performance

Loading w/ loops
First run: 18 secs
Second run: 1 sec - this one was done inside a single transaction.

DELAYED_DURABILITY database setting = FORCED
Commits write only to the log buffer
The write to disk happens asynchronously
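
A sketch of the setting (the database name is hypothetical); keep in mind this trades a small window of potential data loss for faster commits:

-- Force delayed durability for every transaction in the database
ALTER DATABASE StageDB SET DELAYED_DURABILITY = FORCED;

-- Or leave the database at ALLOWED and opt in per transaction:
-- COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);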

Narrow index key
Static
Ever-increasing - prevents page splits (but creates latching problems when everyone is trying to insert into the same last page)

SQLQueryStress

Could use In-Memory OLTP

A non-sequential key avoids the latching problem but causes fragmentation (rebuild the index after loading)

Hash Partitioning

Minimal logging
Only allocations are logged (SIMPLE or BULK_LOGGED recovery only)

Bulk Load
A continuous stream of rows, working over TDS
Saves a lot of network activity

BCP, BULK INSERT, SSIS
INSERT... SELECT
BULK API

Methods
Load to a HEAP, build the index later
Use TABLOCK when doing this
Generally the fastest way to load data, but building the index takes time
Use a bigger network packet size (32 KB)
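
A minimal sketch of loading a heap this way; the table, file path, and terminators are made up:

-- Load into a heap with TABLOCK (minimal logging also needs SIMPLE or
-- BULK_LOGGED recovery), then build the index afterwards
BULK INSERT dbo.StageSales
FROM 'D:\loads\sales_01.dat'
WITH (
    TABLOCK,
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n'
);

CREATE CLUSTERED INDEX CX_StageSales ON dbo.StageSales (SaleDate);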

Load into a clustered table
Do NOT use TABLOCK in this scenario
Use trace flag 610
Without TF 610, the data has to be sorted
Pass an ORDER hint to bcp

BatchSize
Each batch is a separate transaction
Prevents tempdb spills when loading an indexed table
Prevents lock escalation
Allows reading from the table while it loads
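
The session mentioned bcp's ORDER hint; BULK INSERT has an equivalent, so here's a sketch with made-up names (TF 610 behavior varies by version, so test it first):

-- TF 610: minimal logging for inserts into indexed tables (2008-2014 era)
DBCC TRACEON (610);

-- dbo.FactSales already has a clustered index on SaleDate.
-- The ORDER hint says the file is pre-sorted, so no extra sort is needed,
-- and BATCHSIZE makes each batch its own transaction.
BULK INSERT dbo.FactSales
FROM 'D:\loads\sales_02.dat'
WITH (
    ORDER (SaleDate ASC),
    BATCHSIZE       = 100000,
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n'
);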

Loading into a Partitioned table
Switch out, bulk load, switch in
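
A sketch of the switch pattern; the table names and partition number are hypothetical:

-- Bulk load new rows into an empty staging table with the same schema,
-- indexes, and filegroup as the target partition, then switch it in
-- (a metadata-only operation)
ALTER TABLE dbo.FactSales_Stage
    SWITCH TO dbo.FactSales PARTITION 5;

-- Switching out works the same way in reverse:
-- ALTER TABLE dbo.FactSales SWITCH PARTITION 5 TO dbo.FactSales_Stage;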

Large Updates and Deletes
Delete (partitioned)
- Bulk load the rows you want to keep into an outer table
- Switch out the current partition
- Switch in the outer table

Update (partitioned)
- Bulk load the updated rows into an outer table
- Switch out all rows to the outer table
- Merge into a new table
- Switch in the new table

CPU Utilization
Generally one loading file per core
Watch out for SOS_SCHEDULER_YIELD

PFS Contention
PFS - Page Free Space (a bitmap that tracks where free space is in the database)
All processes have to go through it; just add more files.

Jumbo Frames - bigger TCP/IP packets
Larger network packets
Disk layout - pre-size data and log files, enable instant file initialization (IFI), put files on different LUNs
Modern hardware and RAID (generally RAID 1+0, or RAID 1 for logs and RAID 5 for data if RAID 1+0 isn't available)
Make sure the interface to the SAN is fast enough.

Compression

Types of Compression
Row - Reduces metadata overhead per row
PAGE
COLUMNSTORE

Row
Stores fixed-length data types in a variable-length format, so only the bytes actually used are stored

PAGE compression
Row compresses the data first
Prefix compression - compresses each column based on a common prefix of its values
Dictionary compression - compresses repeated prefix-compressed values across the page

What to compress
Find candidates w/ sys.dm_db_index_operational_stats
Estimate savings w/ sp_estimated_data_compression_savings
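
The estimation proc is handy; a quick sketch against a hypothetical table:

-- Estimate how much PAGE compression would save on a table
EXEC sp_estimated_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- If it looks worthwhile:
-- ALTER INDEX ALL ON dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);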

Deduplication is essentially what PAGE compression does.

The CPU will have to decompress the data, so there is a trade-off here: storage for CPU.

Data Integration w/ MDS

Master Data Management
Methodology for managing Master Data for the org
Manage the data across many different systems

Architectures
Registries
Consolidation
Coexistence

Model is a container for the information
Entity is a set of records within a model (like a table in a database)
Members are records of an entity

MDS Staging
Staging tables are created for each entity
Import types define the operation for a staging table
Processing is done w/ batch tags and stored procs
SSIS is used to stage data.
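
If I understood the staging flow right, loading an entity looks roughly like this; the entity name, staging table, procedure name, and batch tag below are all illustrative and depend on how the entity was created in MDS:

-- Stage a leaf member for a hypothetical Customer entity
INSERT INTO stg.Customer_Leaf (ImportType, ImportStatus_ID, BatchTag, Code, Name)
VALUES (0, 0, 'Load_20151030', 'CUST-001', 'Contoso Ltd');

-- Process the batch for a model version with the entity's staging procedure
EXEC stg.udp_Customer_Leaf
    @VersionName = 'VERSION_1',
    @LogFlag     = 1,
    @BatchTag    = 'Load_20151030';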