Int. Data Engineer's (DE) office #

Chris (CB) enters. It looks like he wants to ask a question.

CB
Hey, I'm trying to run some sanity checks on our data, but I can't find it in Azure SQL.
DE
We migrated everything to the Data Lake.
DE
It's mostly in a data format called Parquet.
DE
You'll have to rewrite your query in Python.
DE
It only works if you run it from a notebook in the browser IDE.
DE
You can't do anything until you spin up some compute. That should take 5-10 minutes.

Why the #%$@ is all my data in Parquet? #

(A deep dive into the data lake)

What is a data lake? #

Lake Conroe dragon

Databricks

Snowflake

Cloudera
  • OLAP (index-optimized reads)
  • Format agnostic
  • Data lifecycle management

Delta Lake #

(e.g. Databricks)

/delta_lake
  /_delta_log               # all recent transactions
    00.json                 # plus schema info
    01.json
    ...
    10.checkpoint.parquet   # log periodically checkpointed to Parquet

  file1.parquet             # table data files
  file2.parquet
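For a sense of how thin this layer is, here is a minimal sketch of reading such a table from Python with the open source deltalake package (the path is the hypothetical table above; Spark and Databricks expose the same table through their own APIs):

  # Read a Delta table without Spark, via the deltalake package.
  from deltalake import DeltaTable

  dt = DeltaTable("/delta_lake")   # resolves current state from _delta_log
  print(dt.version())              # latest committed transaction
  df = dt.to_pandas()              # reads the underlying Parquet data files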

Int. CB's office #

6 months ago
DE
Azure Data Factory doesn't have great support for Python. Can we move part of our ETL to Databricks?
4 months ago
DE
We could save hours if we streamed updates to a Delta table instead of doing bulk loads.
2 months ago
DE
Since moving to the data lake, we're using 80% fewer DTUs than we did with traditional SQL.
CB
Why don't we just move everything to Databricks?

How to speak Parquet #

Parquet readers #
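Most ecosystems ship at least one. In Python, for instance, pyarrow (or pandas on top of it) reads a file in a couple of lines; the file name here is just a placeholder:

  import pandas as pd
  import pyarrow.parquet as pq

  table = pq.read_table("data.parquet")  # Arrow Table
  df = table.to_pandas()
  df = pd.read_parquet("data.parquet")   # same thing; pandas delegates to pyarrow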

Parquet is open source! #

Parquet Logo
Apache Logo

Questions #

  • Is Parquet space efficient (storage & IO)?
  • Can Parquet support complex data structures?
  • How do you query Parquet files?

Parquet features #

OLAP (NOT OLTP) #

  • Columnar storage
  • Schema-on-read (vs. schema-on-write)
  • Encodings
  • Compression
  • Index (?!)

Parquet Structure #

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group1
    columns 1
    par["PAR1"]
    c11["Column 1 (Chunk 1)"]
    c21["Column 2 (Chunk 1)"]
    space
    c21 --> cn1
    cn1["Column n (Chunk 1)"]
    blockarrowId6<["  "]>(down)
    c1m["Column 1 (Chunk m)"]
    space
    c1m --> cnm
    cnm["Column n (Chunk m)"]
    meta["File Metadata"]
    par2["PAR1"]
  end

Parquet Structure #

Column Chunks #

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 3
  block:group1
    columns 1
    par["PAR1"]
    c11["Column 1 (Chunk 1)"]
    c21["Column 2 (Chunk 1)"]
    space
    c21 --> cn1
    cn1["Column n (Chunk 1)"]
    blockarrowId6<["  "]>(down)
    c1m["Column 1 (Chunk m)"]
    space
    c1m --> cnm
    cnm["Column n (Chunk m)"]
    meta["File Metadata"]
    par2["PAR1"]
  end
  space
  block:group2
    columns 1
    header0["Header"]
    page0["Page 0"]
    blockarrowId2<["  "]>(down)
    headerk["Header"]
    pagek["Page k"]
  end
  group2 --> c11

Parquet Structure #

Metadata #

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 3
  block:group2
    columns 1
    version
    schema
    c11m["Column 1 (Chunk 1) Metadata"]
    c21m["Column 2 (Chunk 1) Metadata"]
    blockarrowId4<["  "]>(down)
    len["Footer Length"]
  end
  space
  block:group1
    columns 1
    par["PAR1"]
    c11["Column 1 (Chunk 1)"]
    c21["Column 2 (Chunk 1)"]
    space
    c21 --> cn1
    cn1["Column n (Chunk 1)"]
    blockarrowId6<["  "]>(down)
    c1m["Column 1 (Chunk m)"]
    space
    c1m --> cnm
    cnm["Column n (Chunk m)"]
    meta["File Metadata"]
    par2["PAR1"]
  end
  group2 --> meta
  c11m --o c11
  c21m --o c21
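The footer is cheap to read on its own. A quick pyarrow sketch (file name is a placeholder) that pulls the schema and per-chunk statistics without touching any data pages:

  import pyarrow.parquet as pq

  pf = pq.ParquetFile("data.parquet")
  print(pf.schema_arrow)                     # schema, straight from the footer
  meta = pf.metadata
  print(meta.num_row_groups, meta.num_rows)
  col = meta.row_group(0).column(0)          # "Column 1 (Chunk 1) Metadata"
  print(col.statistics)                      # min/max etc., if the writer recorded them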

Encodings #

Base types #

  - BOOLEAN: 1-bit boolean
  - INT32: 32-bit signed ints
  - INT64: 64-bit signed ints
  - INT96: 96-bit signed ints (deprecated; was used for timestamps)
  - FLOAT: IEEE 32-bit floating point values
  - DOUBLE: IEEE 64-bit floating point values
  - BYTE_ARRAY: arbitrarily long byte arrays
  - FIXED_LEN_BYTE_ARRAY: fixed-length byte arrays

Encodings #

Logical types #

  - STRING: UTF-8 encoded BYTE_ARRAY
  - DECIMAL: INT32, INT64, FIXED_LEN_BYTE_ARRAY, or BYTE_ARRAY,
    plus precision (INT32) and scale (INT32)
  - DATE: INT32
  - JSON: UTF-8 encoded BYTE_ARRAY
  - LIST (see Nested Types)
  - MAP
  - RECORD
  - etc.
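You rarely spell these out by hand; the writer maps them from an in-memory schema. A pyarrow sketch covering several of the logical types above (column names are made up):

  import pyarrow as pa

  schema = pa.schema([
      ("name",  pa.string()),            # STRING  -> UTF-8 encoded BYTE_ARRAY
      ("price", pa.decimal128(10, 2)),   # DECIMAL -> precision 10, scale 2
      ("day",   pa.date32()),            # DATE    -> INT32
      ("tags",  pa.list_(pa.string())),  # LIST    -> nested type
  ])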

Nested Types #

Nested Lists

Column
- [[1],[2],[3]]
- [[4,5]]
- [[6,7],[8]]

Repetition Level

VALUES:
- 1,2,3,4,5,6,7,8
REPETITION_LEVELS:
- 0,1,1,0,2,0,2,1

Encoded (values, then repetition levels)

12345678 01102021
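A rough Python sketch of the shredding step (a simplified take on Dremel record shredding for one column of depth-2 lists, ignoring definition levels) that reproduces the numbers above:

  def shred(column):
      values, rep_levels = [], []
      for record in column:
          for i, inner in enumerate(record):
              for j, v in enumerate(inner):
                  if i == 0 and j == 0:
                      level = 0        # first value of a new record
                  elif j == 0:
                      level = 1        # new inner list within the record
                  else:
                      level = 2        # continuing the same inner list
                  values.append(v)
                  rep_levels.append(level)
      return values, rep_levels

  col = [[[1], [2], [3]], [[4, 5]], [[6, 7], [8]]]
  print(shred(col))  # ([1,2,3,4,5,6,7,8], [0,1,1,0,2,0,2,1])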

Encodings #

   Data: 100, 100, 100, 101, 101, 102, 103, 103

Run Length Encoding
- 3, 100, 2, 101, 1, 102, 2, 103

Dictionary Encoding
- DICTIONARY: 100, 101, 102, 103
- DATA: 0, 0, 0, 1, 1, 2, 3, 3

Delta Encoding
- format: [count] [first_value] [minimum_delta] [deltas]
- 8, 100, 0, 0,0,1,0,1,1,0
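A toy sketch of all three encodings applied to the data above; Parquet's real encodings are bit-packed, but the arithmetic is the same:

  data = [100, 100, 100, 101, 101, 102, 103, 103]

  def rle(xs):                        # run length: (count, value) pairs
      out = []
      for x in xs:
          if out and out[-1][1] == x:
              out[-1][0] += 1
          else:
              out.append([1, x])
      return out

  def dictionary(xs):                 # unique values + per-row indices
      values = sorted(set(xs))
      index = {v: i for i, v in enumerate(values)}
      return values, [index[x] for x in xs]

  def delta(xs):                      # count, first value, min delta, deltas
      deltas = [b - a for a, b in zip(xs, xs[1:])]
      m = min(deltas)
      return [len(xs), xs[0], m] + [d - m for d in deltas]

  print(rle(data))         # [[3, 100], [2, 101], [1, 102], [2, 103]]
  print(dictionary(data))  # ([100, 101, 102, 103], [0, 0, 0, 1, 1, 2, 3, 3])
  print(delta(data))       # [8, 100, 0, 0, 0, 1, 0, 1, 1, 0]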

Compression #

  • SNAPPY
  • GZIP
  • LZO
  • BROTLI
  • ZSTD
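The codec is picked per column chunk at write time. A pyarrow example (the table and codec choice are illustrative):

  import pyarrow as pa
  import pyarrow.parquet as pq

  table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})
  pq.write_table(table, "data.parquet", compression="zstd")  # or "snappy", "gzip", ...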

Indexing #

How do you query Parquet efficiently?

"Predicate Pushdown" (e.g. LINQ to SQL)
  • Min/Max statistics
  • Bloom Filter
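A small pyarrow example of pushdown in action (path and column are placeholders): row groups whose footer statistics cannot satisfy the filter are skipped before any data pages are read.

  import pyarrow.dataset as ds

  dataset = ds.dataset("lake/table/", format="parquet")
  table = dataset.to_table(filter=ds.field("x") > 2)  # stats-based skipping happens here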

Indexing #

Z-Order #

[
{"x":1,"y":1},{"x":1,"y":2},{"x":1,"y":3},{"x":1,"y":4},
{"x":2,"y":1},{"x":2,"y":2},{"x":2,"y":3},{"x":2,"y":4},
{"x":3,"y":1},{"x":3,"y":2},{"x":3,"y":3},{"x":3,"y":4},
{"x":4,"y":1},{"x":4,"y":2},{"x":4,"y":3},{"x":4,"y":4}
]

How do you sort by multiple columns?

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group2
    columns 7
    c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"]
    c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"]
    space space space space space space space
    c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"]
    c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
  end

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group2
    columns 7
    c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"]
    c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"]
    space space space space space space space
    c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"]
    c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
    c11 --> c12
    c12 --> c21
    c21 --> c22
  end

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group2
    columns 7
    c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"]
    c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"]
    space space space space space space space
    c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"]
    c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
    c11 --> c12
    c12 --> c21
    c21 --> c22
    c22 --> c13
    c13 --> c14
    c14 --> c23
    c23 --> c24
    c24 --> c31
    c31 --> c32
    c32 --> c41
    c41 --> c42
    c42 --> c33
    c33 --> c34
    c34 --> c43
    c43 --> c44
  end
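The trick is bit interleaving (a Morton code): weave the bits of the sort columns together so points close in both x and y land close in the sort order. A minimal sketch over the 4x4 grid above, shifted to 0-based coordinates (Databricks does this for you via OPTIMIZE ... ZORDER BY):

  def morton_key(x: int, y: int, bits: int = 16) -> int:
      """Interleave the bits of x and y into one Z-order key."""
      key = 0
      for i in range(bits):
          key |= ((y >> i) & 1) << (2 * i)      # y bits -> even positions
          key |= ((x >> i) & 1) << (2 * i + 1)  # x bits -> odd positions
      return key

  points = [(x, y) for x in range(1, 5) for y in range(1, 5)]
  points.sort(key=lambda p: morton_key(p[0] - 1, p[1] - 1))
  print(points)  # traces the Z-shaped path in the diagram above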

Indexing #

Bloom Filter #

A Bloom Filter tests membership in a set

  • No false negatives
  • False positives possible
  • Serialized as a bit array
  • One per column chunk
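A toy sketch of the idea (Parquet's actual filter is a "split block" Bloom filter with a different layout; the sizes and hash count here are arbitrary):

  import hashlib

  class BloomFilter:
      def __init__(self, size_bits: int = 1024, k: int = 4):
          self.size = size_bits
          self.k = k
          self.bits = bytearray(size_bits // 8)

      def _positions(self, value: str):
          for i in range(self.k):                 # k independent hash positions
              h = hashlib.sha256(f"{i}:{value}".encode()).digest()
              yield int.from_bytes(h[:8], "big") % self.size

      def add(self, value: str):
          for p in self._positions(value):
              self.bits[p // 8] |= 1 << (p % 8)

      def might_contain(self, value: str) -> bool:
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

  bf = BloomFilter()
  bf.add("alice")
  assert bf.might_contain("alice")     # no false negatives
  print(bf.might_contain("mallory"))   # usually False, but could be a false positive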

Summary #

Your org WILL store data in Parquet

Good news: #

  • Parquet is super efficient (I/O, storage)
  • Predicate pushdown
  • Open source

Bad news: #

  • Vendor lock-in
  • Tooling needs improvement

Thank you! #

Headshot of Chris Bremer
LinkedIn QR Code

Post Credits #

No Bloom Filter 4 U!