Chris (CB) enters. It looks like he wants to ask a question.

Databricks

Snowflake

Cloudera

(e.g. Databricks)
/delta_lake
/_delta_log # all recent transactions
00.json # plus schema info
01.json
...
file1.parquet # transaction log periodically
file2.parquet # written to parquet


%%{init: {'theme':'dark'}}%%
block-beta
columns 1
block:group1
columns 1
par["PAR1"]
c11["Column 1 (Chunk 1)"]
c21["Column 2 (Chunk 1)"]
space
c21 --> cn1
cn1["Column n (Chunk 1)"]
blockarrowId6<[" "]>(down)
c1m["Column 1 (Chunk m)"]
space
c1m --> cnm
cnm["Column n (Chunk m)"]
meta["File Metadata"]
par2["PAR1"]
end
%%{init: {'theme':'dark'}}%%
block-beta
columns 3
block:group1
columns 1
par["PAR1"]
c11["Column 1 (Chunk 1)"]
c21["Column 2 (Chunk 1)"]
space
c21 --> cn1
cn1["Column n (Chunk 1)"]
blockarrowId6<[" "]>(down)
c1m["Column 1 (Chunk m)"]
space
c1m --> cnm
cnm["Column n (Chunk m)"]
meta["File Metadata"]
par2["PAR1"]
end
space
block:group2
columns 1
header0["Header"]
page0["Page 0"]
blockarrowId2<[" "]>(down)
headerk["Header"]
pagek["Page k"]
end
group2 --> c11
%%{init: {'theme':'dark'}}%%
block-beta
columns 3
block:group2
columns 1
version
schema
c11m["Column 1 (Chunk 1) Metadata"]
c21m["Column 2 (Chunk 1) Metadata"]
blockarrowId4<[" "]>(down)
len["Footer Length"]
end
space
block:group1
columns 1
par["PAR1"]
c11["Column 1 (Chunk 1)"]
c21["Column 2 (Chunk 1)"]
space
c21 --> cn1
cn1["Column n (Chunk 1)"]
blockarrowId6<[" "]>(down)
c1m["Column 1 (Chunk m)"]
space
c1m --> cnm
cnm["Column n (Chunk m)"]
meta["File Metadata"]
par2["PAR1"]
end
group2 --> meta
c11m --o c11
c21m --o c21
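The layout in the diagrams (magic bytes, column chunks, file metadata, footer length, magic bytes) means a reader can find the metadata by looking at the end of the file first. Below is a minimal Python sketch of that lookup on a toy byte string. The names `locate_footer`, `fake_chunks`, and `fake_meta` are hypothetical, and the fake metadata stands in for the Thrift-encoded structure a real file would hold.

```python
import struct

MAGIC = b"PAR1"

def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the file metadata inside `data`."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic bytes")
    # The 4 bytes just before the trailing magic hold the footer
    # length as a little-endian unsigned integer.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len

# Toy stand-in for a real file (assumption: the payload bytes are made up).
fake_chunks = b"column-chunk-bytes"
fake_meta = b"thrift-encoded-metadata"
toy_file = MAGIC + fake_chunks + fake_meta + struct.pack("<I", len(fake_meta)) + MAGIC

start, length = locate_footer(toy_file)
print(toy_file[start:start + length])  # b'thrift-encoded-metadata'
```

Reading from the tail is what lets a query engine fetch the schema and column chunk offsets without scanning the data pages.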
- BOOLEAN: 1 bit boolean
- INT32: 32 bit signed ints
- INT64: 64 bit signed ints
- INT96: 96 bit signed ints
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays
- FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
- STRING: UTF8 ENCODED BYTE_ARRAY
- DECIMAL:
INT32 or INT64 or FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY
& PRECISION INT32 & SCALE INT32
- DATE: INT32
- JSON: UTF8 ENCODED BYTE_ARRAY
- LIST (SEE NESTED TYPES)
- MAP
- RECORD
- ETC.
Nested Lists
Column
- [[1],[2],[3]]
- [[4,5]]
- [[6,7],[8]]
Repetition Level
VALUES:
- 1,2,3,4,5,6,7,8
REPETITION_LEVELS:
- 0,1,1,0,2,0,2,1
Encoded
1234567801102021
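The repetition levels above can be derived mechanically: 0 starts a new row, 1 starts a new inner list within the same row, and 2 continues the current inner list. Here is a small Python sketch of that rule. The function name `shred` is hypothetical, and the sketch assumes no nulls and no empty lists, so definition levels are left out.

```python
def shred(rows):
    """Flatten a list-of-lists column into values plus repetition levels."""
    values, rep_levels = [], []
    for row in rows:
        first_in_row = True
        for inner in row:
            first_in_inner = True
            for v in inner:
                if first_in_row:
                    level = 0   # first value of a new row
                elif first_in_inner:
                    level = 1   # new inner list, same row
                else:
                    level = 2   # same inner list continues
                values.append(v)
                rep_levels.append(level)
                first_in_row = False
                first_in_inner = False
    return values, rep_levels

column = [[[1], [2], [3]], [[4, 5]], [[6, 7], [8]]]
values, rep_levels = shred(column)
print(values)      # [1, 2, 3, 4, 5, 6, 7, 8]
print(rep_levels)  # [0, 1, 1, 0, 2, 0, 2, 1]
```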
Data: 100, 100, 100, 101, 101, 102, 103, 103
Run Length Encoding
3, 100, 2, 101, 1, 102, 2, 103
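As a rough illustration, the (count, value) pairs above can be produced with a few lines of Python. This is a simplified sketch with hypothetical function names; the hybrid run-length and bit-packing encoding that Parquet actually writes is more involved.

```python
from itertools import groupby

def rle_encode(data):
    """Emit [count, value, count, value, ...] for consecutive runs."""
    out = []
    for value, run in groupby(data):
        out.extend([len(list(run)), value])
    return out

def rle_decode(encoded):
    """Expand [count, value, ...] pairs back into the original sequence."""
    out = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        out.extend([value] * count)
    return out

data = [100, 100, 100, 101, 101, 102, 103, 103]
encoded = rle_encode(data)
print(encoded)  # [3, 100, 2, 101, 1, 102, 2, 103]
```

The scheme only pays off when values repeat back to back, which is one reason sort order matters for file size.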
Dictionary Encoding
- DICTIONARY: 100,101,102,103
- DATA: 0, 0, 0, 1, 1, 2, 3, 3
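A dictionary encoder keeps each distinct value once and replaces the data with small integer indexes into that list, one index per input value. The sketch below shows the idea in Python; `dict_encode` and `dict_decode` are hypothetical names, and the dictionary here is built in order of first appearance as an assumption.

```python
def dict_encode(data):
    """Return (dictionary, codes) with one code per input value."""
    dictionary, index, codes = [], {}, []
    for v in data:
        if v not in index:
            index[v] = len(dictionary)  # next free slot in the dictionary
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Look every code back up in the dictionary."""
    return [dictionary[c] for c in codes]

data = [100, 100, 100, 101, 101, 102, 103, 103]
dictionary, codes = dict_encode(data)
print(dictionary)  # [100, 101, 102, 103]
print(codes)       # [0, 0, 0, 1, 1, 2, 3, 3]
```

The codes are small and repetitive, so they compress well when run-length encoded afterwards.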
Delta Encoding
- format: [count] [first_value] [minimum_delta] [values]
- 8, 100, 0, 0,0,1,0,1,1,0
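The delta format can be sketched in Python as follows. This version stores one delta per adjacent pair of values (so count minus one deltas), each reduced by the minimum delta. It is a simplification with hypothetical function names: it assumes non-empty input and skips the block and bit-packing structure a real Parquet writer uses.

```python
def delta_encode(data):
    """Return [count, first_value, minimum_delta, *adjusted_deltas]."""
    deltas = [b - a for a, b in zip(data, data[1:])]
    min_delta = min(deltas) if deltas else 0
    # Subtracting the minimum keeps every stored delta non-negative.
    return [len(data), data[0], min_delta] + [d - min_delta for d in deltas]

def delta_decode(encoded):
    """Rebuild the original sequence from the encoded form."""
    count, first, min_delta, *adjusted = encoded
    out = [first]
    for a in adjusted:
        out.append(out[-1] + a + min_delta)
    return out

data = [100, 100, 100, 101, 101, 102, 103, 103]
encoded = delta_encode(data)
print(encoded)  # [8, 100, 0, 0, 0, 1, 0, 1, 1, 0]
```

The stored deltas are tiny compared with the raw values, which suits slowly changing columns such as timestamps or sequence numbers.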
How do you query Parquet efficiently?
[
{"x":1,"y":1},{"x":1,"y":2},{"x":1,"y":3},{"x":1,"y":4},
{"x":2,"y":1},{"x":2,"y":2},{"x":2,"y":3},{"x":2,"y":4},
{"x":3,"y":1},{"x":3,"y":2},{"x":3,"y":3},{"x":3,"y":4},
{"x":4,"y":1},{"x":4,"y":2},{"x":4,"y":3},{"x":4,"y":4}
]
How do you sort multiple columns?
%%{init: {'theme':'dark'}}%%
block-beta
columns 1
block:group2
columns 7
c11["{x:1,y:1}"] space
c12["{x:1,y:2}"] space
c13["{x:1,y:3}"] space
c14["{x:1,y:4}"]
c21["{x:2,y:1}"] space
c22["{x:2,y:2}"] space
c23["{x:2,y:3}"] space
c24["{x:2,y:4}"]
space space space space space space space
c31["{x:3,y:1}"] space
c32["{x:3,y:2}"] space
c33["{x:3,y:3}"] space
c34["{x:3,y:4}"]
c41["{x:4,y:1}"] space
c42["{x:4,y:2}"] space
c43["{x:4,y:3}"] space
c44["{x:4,y:4}"]
end
%%{init: {'theme':'dark'}}%%
block-beta
columns 1
block:group2
columns 7
c11["{x:1,y:1}"] space
c12["{x:1,y:2}"] space
c13["{x:1,y:3}"] space
c14["{x:1,y:4}"]
c21["{x:2,y:1}"] space
c22["{x:2,y:2}"] space
c23["{x:2,y:3}"] space
c24["{x:2,y:4}"]
space space space space space space space
c31["{x:3,y:1}"] space
c32["{x:3,y:2}"] space
c33["{x:3,y:3}"] space
c34["{x:3,y:4}"]
c41["{x:4,y:1}"] space
c42["{x:4,y:2}"] space
c43["{x:4,y:3}"] space
c44["{x:4,y:4}"]
c11 --> c12
c12 --> c21
c21 --> c22
end
%%{init: {'theme':'dark'}}%%
block-beta
columns 1
block:group2
columns 7
c11["{x:1,y:1}"] space
c12["{x:1,y:2}"] space
c13["{x:1,y:3}"] space
c14["{x:1,y:4}"]
c21["{x:2,y:1}"] space
c22["{x:2,y:2}"] space
c23["{x:2,y:3}"] space
c24["{x:2,y:4}"]
space space space space space space space
c31["{x:3,y:1}"] space
c32["{x:3,y:2}"] space
c33["{x:3,y:3}"] space
c34["{x:3,y:4}"]
c41["{x:4,y:1}"] space
c42["{x:4,y:2}"] space
c43["{x:4,y:3}"] space
c44["{x:4,y:4}"]
c11 --> c12
c12 --> c21
c21 --> c22
c22 --> c13
c13 --> c14
c14 --> c23
c23 --> c24
c24 --> c31
c31 --> c32
c32 --> c41
c41 --> c42
c42 --> c33
c33 --> c34
c34 --> c43
c43 --> c44
end
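The path traced by the arrows is a Z-order (Morton) curve: interleave the bits of x and y into one number and sort by it. A minimal Python sketch follows, assuming 2-bit coordinates and a hypothetical helper `z_value`; it reproduces the arrow order for the 4 x 4 grid of points.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y into a single sort key."""
    z = 0
    for i in range(bits):
        z |= ((y >> i) & 1) << (2 * i)      # y bits go to even positions
        z |= ((x >> i) & 1) << (2 * i + 1)  # x bits go to odd positions
    return z

points = [{"x": x, "y": y} for x in range(1, 5) for y in range(1, 5)]
# Shift to 0-based coordinates before interleaving.
ordered = sorted(points, key=lambda p: z_value(p["x"] - 1, p["y"] - 1))
print([(p["x"], p["y"]) for p in ordered[:4]])  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Points that are close in both x and y end up close together in the sorted order, so per-chunk min/max ranges stay narrow for both columns rather than only the first sort key.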
A Bloom Filter tests membership in a set
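To make the idea concrete, here is a generic Bloom filter sketch in Python. It can answer "definitely not present" or "possibly present", never a false "not present". The class name, sizes, and use of SHA-256 are illustrative assumptions; the filter layout and hash function that Parquet itself stores on disk are different.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means the item was never added; True means "maybe".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for key in (b"alice", b"bob", b"carol"):
    bf.add(key)
print(bf.might_contain(b"alice"))  # True
```

A reader can consult such a filter before touching a chunk and skip it entirely when the answer is "definitely not present".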
Your org WILL store data in Parquet


