Int. Data Engineer's (DE) office #

Chris (CB) enters. It looks like he wants to ask a question.

CB
Hey, I'm trying to run some sanity checks on our data, but I can't find it in Azure SQL.
DE
We migrated everything to the Data Lake.
DE
It's mostly in a data format called Parquet.
DE
You'll have to rewrite your query in Python.
DE
It only works if you run it from a notebook in the browser IDE.
DE
You can't do anything until you spin up some compute. That should take 5-10 minutes.

Why the #%$@ is all my data in Parquet? #

(A deep dive into the data lake)

What is a data lake? #

Lake Conroe dragon

Databricks

Snowflake

Cloudera
  • OLAP (index-optimized reads)
  • Format agnostic
  • Data lifecycle management

Delta Lake #

(e.g. Databricks)

/delta_lake
  /_delta_log               # all recent transactions
    00.json                 # plus schema info
    01.json
    ...
    10.checkpoint.parquet   # log periodically checkpointed to Parquet

  file1.parquet             # table data files
  file2.parquet
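For a sense of how thin this layer is, here is a minimal sketch of reading such a table from Python with the open source deltalake package (the path is the hypothetical table above; Spark and Databricks expose the same table through their own APIs):

  # Read a Delta table without Spark, via the deltalake package.
  from deltalake import DeltaTable

  dt = DeltaTable("/delta_lake")   # resolves current state from _delta_log
  print(dt.version())              # latest committed transaction
  df = dt.to_pandas()              # reads the underlying Parquet data files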

Int. CB's office #

6 months ago
DE
Azure Data Factory doesn't have great support for Python. Can we move part of our ETL to Databricks?
4 months ago
DE
We could save hours if we streamed updates to a Delta table instead of doing bulk loads.
2 months ago
DE
Since moving to the data lake, we're using 80% fewer DTUs than we did with traditional SQL.
CB
Why don't we just move everything to Databricks?

How to speak Parquet #

Parquet readers #
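Most ecosystems ship at least one. In Python, for instance, pyarrow (or pandas on top of it) reads a file in a couple of lines; the file name here is just a placeholder:

  import pandas as pd
  import pyarrow.parquet as pq

  table = pq.read_table("data.parquet")  # Arrow Table
  df = table.to_pandas()
  df = pd.read_parquet("data.parquet")   # same thing; pandas delegates to pyarrow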

Parquet is open source! #

Parquet Logo
Apache Logo

Questions #

  • Is Parquet space efficient (storage & IO)?
  • Can Parquet support complex data structures?
  • How do you query Parquet files?

Parquet features #

OLAP (NOT OLTP) #

  • Columnar storage
  • Schema-on-read (vs. schema-on-write)
  • Encodings
  • Compression
  • Index (?!)

Parquet Structure #

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group1
    columns 1
    par["PAR1"]
    c11["Column 1 (Chunk 1)"]
    c21["Column 2 (Chunk 1)"]
    space
    c21 --> cn1
    cn1["Column n (Chunk 1)"]
    blockarrowId6<["  "]>(down)
    c1m["Column 1 (Chunk m)"]
    space
    c1m --> cnm
    cnm["Column n (Chunk m)"]
    meta["File Metadata"]
    par2["PAR1"]
  end

Parquet Structure #

Column Chunks #

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 3
  block:group1
    columns 1
    par["PAR1"]
    c11["Column 1 (Chunk 1)"]
    c21["Column 2 (Chunk 1)"]
    space
    c21 --> cn1
    cn1["Column n (Chunk 1)"]
    blockarrowId6<["  "]>(down)
    c1m["Column 1 (Chunk m)"]
    space
    c1m --> cnm
    cnm["Column n (Chunk m)"]
    meta["File Metadata"]
    par2["PAR1"]
  end
  space
  block:group2
    columns 1
    header0["Header"]
    page0["Page 0"]
    blockarrowId2<["  "]>(down)
    headerk["Header"]
    pagek["Page k"]
  end
  group2 --> c11

Parquet Structure #

Metadata #

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 3
  block:group2
    columns 1
    version
    schema
    c11m["Column 1 (Chunk 1) Metadata"]
    c21m["Column 2 (Chunk 1) Metadata"]
    blockarrowId4<["  "]>(down)
    len["Footer Length"]
  end
  space
  block:group1
    columns 1
    par["PAR1"]
    c11["Column 1 (Chunk 1)"]
    c21["Column 2 (Chunk 1)"]
    space
    c21 --> cn1
    cn1["Column n (Chunk 1)"]
    blockarrowId6<["  "]>(down)
    c1m["Column 1 (Chunk m)"]
    space
    c1m --> cnm
    cnm["Column n (Chunk m)"]
    meta["File Metadata"]
    par2["PAR1"]
  end
  group2 --> meta
  c11m --o c11
  c21m --o c21
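The footer is cheap to read on its own. A quick pyarrow sketch (file name is a placeholder) that pulls the schema and per-chunk statistics without touching any data pages:

  import pyarrow.parquet as pq

  pf = pq.ParquetFile("data.parquet")
  print(pf.schema_arrow)                     # schema, straight from the footer
  meta = pf.metadata
  print(meta.num_row_groups, meta.num_rows)
  col = meta.row_group(0).column(0)          # "Column 1 (Chunk 1) Metadata"
  print(col.statistics)                      # min/max etc., if the writer recorded them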

Encodings #

Base types #

  - BOOLEAN: 1-bit boolean
  - INT32: 32-bit signed ints
  - INT64: 64-bit signed ints
  - INT96: 96-bit signed ints (deprecated; was used for timestamps)
  - FLOAT: IEEE 32-bit floating point values
  - DOUBLE: IEEE 64-bit floating point values
  - BYTE_ARRAY: arbitrarily long byte arrays
  - FIXED_LEN_BYTE_ARRAY: fixed-length byte arrays

Encodings #

Logical types #

  - STRING: UTF-8 encoded BYTE_ARRAY
  - DECIMAL: INT32, INT64, FIXED_LEN_BYTE_ARRAY, or BYTE_ARRAY,
    plus precision (INT32) and scale (INT32)
  - DATE: INT32
  - JSON: UTF-8 encoded BYTE_ARRAY
  - LIST (see Nested Types)
  - MAP
  - RECORD
  - etc.
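You rarely spell these out by hand; the writer maps them from an in-memory schema. A pyarrow sketch covering several of the logical types above (column names are made up):

  import pyarrow as pa

  schema = pa.schema([
      ("name",  pa.string()),            # STRING  -> UTF-8 encoded BYTE_ARRAY
      ("price", pa.decimal128(10, 2)),   # DECIMAL -> precision 10, scale 2
      ("day",   pa.date32()),            # DATE    -> INT32
      ("tags",  pa.list_(pa.string())),  # LIST    -> nested type
  ])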

Nested Types #

Nested Lists

Column
- [[1],[2],[3]]
- [[4,5]]
- [[6,7],[8]]

Repetition Level

VALUES:
- 1,2,3,4,5,6,7,8
REPETITION_LEVELS:
- 0,1,1,0,2,0,2,1

Encoded (values, then repetition levels)

12345678 01102021
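A rough Python sketch of the shredding step (a simplified take on Dremel record shredding for one column of depth-2 lists, ignoring definition levels) that reproduces the numbers above:

  def shred(column):
      values, rep_levels = [], []
      for record in column:
          for i, inner in enumerate(record):
              for j, v in enumerate(inner):
                  if i == 0 and j == 0:
                      level = 0        # first value of a new record
                  elif j == 0:
                      level = 1        # new inner list within the record
                  else:
                      level = 2        # continuing the same inner list
                  values.append(v)
                  rep_levels.append(level)
      return values, rep_levels

  col = [[[1], [2], [3]], [[4, 5]], [[6, 7], [8]]]
  print(shred(col))  # ([1,2,3,4,5,6,7,8], [0,1,1,0,2,0,2,1])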

Encodings #

   Data: 100, 100, 100, 101, 101, 102, 103, 103

Run Length Encoding
- 3, 100, 2, 101, 1, 102, 2, 103

Dictionary Encoding
- DICTIONARY: 100, 101, 102, 103
- DATA: 0, 0, 0, 1, 1, 2, 3, 3

Delta Encoding
- format: [count] [first_value] [minimum_delta] [deltas]
- 8, 100, 0, 0,0,1,0,1,1,0
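A toy sketch of all three encodings applied to the data above; Parquet's real encodings are bit-packed, but the arithmetic is the same:

  data = [100, 100, 100, 101, 101, 102, 103, 103]

  def rle(xs):                        # run length: (count, value) pairs
      out = []
      for x in xs:
          if out and out[-1][1] == x:
              out[-1][0] += 1
          else:
              out.append([1, x])
      return out

  def dictionary(xs):                 # unique values + per-row indices
      values = sorted(set(xs))
      index = {v: i for i, v in enumerate(values)}
      return values, [index[x] for x in xs]

  def delta(xs):                      # count, first value, min delta, deltas
      deltas = [b - a for a, b in zip(xs, xs[1:])]
      m = min(deltas)
      return [len(xs), xs[0], m] + [d - m for d in deltas]

  print(rle(data))         # [[3, 100], [2, 101], [1, 102], [2, 103]]
  print(dictionary(data))  # ([100, 101, 102, 103], [0, 0, 0, 1, 1, 2, 3, 3])
  print(delta(data))       # [8, 100, 0, 0, 0, 1, 0, 1, 1, 0]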

Compression #

  • SNAPPY
  • GZIP
  • LZO
  • BROTLI
  • ZSTD
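The codec is picked per column chunk at write time. A pyarrow example (the table and codec choice are illustrative):

  import pyarrow as pa
  import pyarrow.parquet as pq

  table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})
  pq.write_table(table, "data.parquet", compression="zstd")  # or "snappy", "gzip", ...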

Indexing #

How do you query Parquet efficiently?

"Predicate Pushdown" (e.g. LINQ to SQL)
  • Min/Max statistics
  • Bloom Filter
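A small pyarrow example of pushdown in action (path and column are placeholders): row groups whose footer statistics cannot satisfy the filter are skipped before any data pages are read.

  import pyarrow.dataset as ds

  dataset = ds.dataset("lake/table/", format="parquet")
  table = dataset.to_table(filter=ds.field("x") > 2)  # stats-based skipping happens here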

Indexing #

Z-Order #

[
{"x":1,"y":1},{"x":1,"y":2},{"x":1,"y":3},{"x":1,"y":4},
{"x":2,"y":1},{"x":2,"y":2},{"x":2,"y":3},{"x":2,"y":4},
{"x":3,"y":1},{"x":3,"y":2},{"x":3,"y":3},{"x":3,"y":4},
{"x":4,"y":1},{"x":4,"y":2},{"x":4,"y":3},{"x":4,"y":4}
]

How do you sort by multiple columns?

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group2
    columns 7
    c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"]
    c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"]
    space space space space space space space
    c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"]
    c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
  end

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group2
    columns 7
    c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"]
    c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"]
    space space space space space space space
    c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"]
    c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
    c11 --> c12
    c12 --> c21
    c21 --> c22
  end

  %%{init: {'theme':'dark'}}%%
  block-beta
  columns 1
  block:group2
    columns 7
    c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"]
    c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"]
    space space space space space space space
    c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"]
    c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
    c11 --> c12
    c12 --> c21
    c21 --> c22
    c22 --> c13
    c13 --> c14
    c14 --> c23
    c23 --> c24
    c24 --> c31
    c31 --> c32
    c32 --> c41
    c41 --> c42
    c42 --> c33
    c33 --> c34
    c34 --> c43
    c43 --> c44
  end
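The trick is bit interleaving (a Morton code): weave the bits of the sort columns together so points close in both x and y land close in the sort order. A minimal sketch over the 4x4 grid above, shifted to 0-based coordinates (Databricks does this for you via OPTIMIZE ... ZORDER BY):

  def morton_key(x: int, y: int, bits: int = 16) -> int:
      """Interleave the bits of x and y into one Z-order key."""
      key = 0
      for i in range(bits):
          key |= ((y >> i) & 1) << (2 * i)      # y bits -> even positions
          key |= ((x >> i) & 1) << (2 * i + 1)  # x bits -> odd positions
      return key

  points = [(x, y) for x in range(1, 5) for y in range(1, 5)]
  points.sort(key=lambda p: morton_key(p[0] - 1, p[1] - 1))
  print(points)  # traces the Z-shaped path in the diagram above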

Indexing #

Bloom Filter #

A Bloom Filter tests membership in a set

  • No false negatives
  • False positives possible
  • Serialized as a bit array
  • One per column chunk
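A toy sketch of the idea (Parquet's actual filter is a "split block" Bloom filter with a different layout; the sizes and hash count here are arbitrary):

  import hashlib

  class BloomFilter:
      def __init__(self, size_bits: int = 1024, k: int = 4):
          self.size = size_bits
          self.k = k
          self.bits = bytearray(size_bits // 8)

      def _positions(self, value: str):
          for i in range(self.k):                 # k independent hash positions
              h = hashlib.sha256(f"{i}:{value}".encode()).digest()
              yield int.from_bytes(h[:8], "big") % self.size

      def add(self, value: str):
          for p in self._positions(value):
              self.bits[p // 8] |= 1 << (p % 8)

      def might_contain(self, value: str) -> bool:
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

  bf = BloomFilter()
  bf.add("alice")
  assert bf.might_contain("alice")     # no false negatives
  print(bf.might_contain("mallory"))   # usually False, but could be a false positive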

Summary #

Your org WILL store data in Parquet

Good news: #

  • Parquet is super efficient (I/O, storage)
  • Predicate pushdown
  • Open source

Bad news: #

  • Vendor lock-in
  • Tooling needs improvement

Thank you! #

Headshot of Chris Bremer
LinkedIn QR Code

Post Credits #

No Bloom Filter 4 U!