Chris (CB) enters. It looks like he wants to ask a question.
Databricks
Snowflake
Cloudera
(e.g. Databricks)
/delta_lake
/_delta_log # all recent transactions
00.json # plus schema info
01.json
...
file1.parquet # transaction log periodically
file2.parquet # written to parquet
%%{init: {'theme':'dark'}}%% block-beta columns 1 block:group1 columns 1 par["PAR1"] c11["Column 1 (Chunk 1)"] c21["Column 2 (Chunk 1)"] space c21 --> cn1 cn1["Column n (Chunk 1)"] blockarrowId6<[" "]>(down) c1m["Column 1 (Chunk m)"] space c1m --> cnm cnm["Column n (Chunk m)"] meta["File Metadata"] par2["PAR1"] end
%%{init: {'theme':'dark'}}%% block-beta columns 3 block:group1 columns 1 par["PAR1"] c11["Column 1 (Chunk 1)"] c21["Column 2 (Chunk 1)"] space c21 --> cn1 cn1["Column n (Chunk 1)"] blockarrowId6<[" "]>(down) c1m["Column 1 (Chunk m)"] space c1m --> cnm cnm["Column n (Chunk m)"] meta["File Metadata"] par2["PAR1"] end space block:group2 columns 1 header0["Header"] page0["Page 0"] blockarrowId2<[" "]>(down) headerk["Header"] pagek["Page k"] end group2 --> c11
%%{init: {'theme':'dark'}}%% block-beta columns 3 block:group2 columns 1 version schema c11m["Column 1 (Chunk 1) Metadata"] c21m["Column 2 (Chunk 1) Metadata"] blockarrowId4<[" "]>(down) len["Footer Length"] end space block:group1 columns 1 par["PAR1"] c11["Column 1 (Chunk 1)"] c21["Column 2 (Chunk 1)"] space c21 --> cn1 cn1["Column n (Chunk 1)"] blockarrowId6<[" "]>(down) c1m["Column 1 (Chunk m)"] space c1m --> cnm cnm["Column n (Chunk m)"] meta["File Metadata"] par2["PAR1"] end group2 --> meta c11m --o c11 c21m --o c21
- BOOLEAN: 1 bit boolean
- INT32: 32 bit signed ints
- INT64: 64 bit signed ints
- INT96: 96 bit signed ints
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays
- FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
- STRING: UTF8 ENCODED BYTE_ARRAY
- DECIMAL: INT32, INT64, FIXED_LEN_BYTE_ARRAY, or BYTE_ARRAY, with PRECISION (INT32) and SCALE (INT32)
- DATE: INT32
- JSON: UTF8 ENCODED BYTE_ARRAY
- LIST (SEE NESTED TYPES)
- MAP
- RECORD
- ETC.
Nested Lists
Column
- [[1],[2],[3]]
- [[4,5]]
- [[6,7],[8]]
Repetition Level
VALUES:
- 1,2,3,4,5,6,7,8
REPETITION_LEVELS:
- 0,1,1,0,2,0,2,1
Encoded
1234567801102021
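A minimal sketch of how the repetition levels for this column could be derived, assuming exactly two levels of nesting and no nulls or empty lists (those would also need definition levels, which are not covered here). The function name `repetition_levels` is hypothetical, not a Parquet API.

```python
def repetition_levels(rows):
    """Flatten two-level nested lists into (values, repetition_levels)."""
    values, levels = [], []
    for row in rows:
        for i, inner in enumerate(row):
            for j, value in enumerate(inner):
                if i == 0 and j == 0:
                    level = 0  # first value of a new row
                elif j == 0:
                    level = 1  # starts a new inner list in the same row
                else:
                    level = 2  # continues the current inner list
                values.append(value)
                levels.append(level)
    return values, levels


column = [[[1], [2], [3]], [[4, 5]], [[6, 7], [8]]]
values, levels = repetition_levels(column)
print(values)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(levels)  # [0, 1, 1, 0, 2, 0, 2, 1]
```

A level of 0 marks a new row, 1 a new inner list, and 2 another value in the same inner list, which is enough to rebuild the nesting from the flat values.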
Data: 100, 100, 100, 101, 101, 102, 103, 103
Run Length Encoding
3, 100, 2, 101, 1, 102, 2, 103
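The run-length example above can be reproduced with a short sketch. It emits plain (count, value) pairs as on the slide; Parquet's real encoding is a hybrid of run-length and bit-packing, so treat this as an illustration of the idea only. The names `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby


def rle_encode(data):
    """Collapse consecutive repeats into a flat [count, value, ...] list."""
    encoded = []
    for value, run in groupby(data):
        encoded.extend([len(list(run)), value])
    return encoded


def rle_decode(encoded):
    """Expand a flat [count, value, ...] list back into the original data."""
    data = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        data.extend([value] * count)
    return data


data = [100, 100, 100, 101, 101, 102, 103, 103]
print(rle_encode(data))  # [3, 100, 2, 101, 1, 102, 2, 103]
```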
- DICTIONARY: 100,101,102,103
- DATA: 0, 0, 0, 1, 1, 2, 3, 3
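A minimal sketch of dictionary encoding, assuming the dictionary is built in order of first appearance. The function name `dictionary_encode` is hypothetical; it returns one index per input value.

```python
def dictionary_encode(data):
    """Return (dictionary, indices), with one index per input value."""
    dictionary = []  # distinct values, in order of first appearance
    index_of = {}    # value -> position in the dictionary
    indices = []
    for value in data:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


dictionary, indices = dictionary_encode([100, 100, 100, 101, 101, 102, 103, 103])
print(dictionary)  # [100, 101, 102, 103]
print(indices)     # [0, 0, 0, 1, 1, 2, 3, 3]
```

The indices are small integers with long runs, so they compress well with run-length encoding or bit-packing afterwards.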
- format: [count] [first_value] [minimum_delta] [values]
- 8, 100, 0, 0,0,1,0,1,1,0
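A rough sketch of delta encoding in the simplified layout above. It assumes one stored delta per value after the first, each with the minimum delta subtracted; Parquet's actual delta encoding also splits values into blocks and bit-packs them, which is omitted here. The names `delta_encode` and `delta_decode` are hypothetical.

```python
def delta_encode(data):
    """Encode as [count, first_value, minimum_delta, *adjusted_deltas]."""
    if not data:
        raise ValueError("nothing to encode")
    deltas = [b - a for a, b in zip(data, data[1:])]
    min_delta = min(deltas) if deltas else 0
    # Subtracting the minimum keeps every stored delta non-negative.
    return [len(data), data[0], min_delta] + [d - min_delta for d in deltas]


def delta_decode(encoded):
    """Rebuild the original values from the layout used by delta_encode."""
    count, first, min_delta, *adjusted = encoded
    data = [first]
    for d in adjusted:
        data.append(data[-1] + d + min_delta)
    assert len(data) == count
    return data


data = [100, 100, 100, 101, 101, 102, 103, 103]
assert delta_decode(delta_encode(data)) == data
```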
How do you query Parquet efficiently?
[
{"x":1,"y":1},{"x":1,"y":2},{"x":1,"y":3},{"x":1,"y":4},
{"x":2,"y":1},{"x":2,"y":2},{"x":2,"y":3},{"x":2,"y":4},
{"x":3,"y":1},{"x":3,"y":2},{"x":3,"y":3},{"x":3,"y":4},
{"x":4,"y":1},{"x":4,"y":2},{"x":4,"y":3},{"x":4,"y":4}
]
How do you sort multiple columns?
%%{init: {'theme':'dark'}}%%block-beta columns 1 block:group2 columns 7 c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"] c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"] space space space space space space space c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"] c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"]
end
%%{init: {'theme':'dark'}}%%block-beta columns 1 block:group2 columns 7 c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"] c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"] space space space space space space space c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"] c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"] c11 --> c12 c12 --> c21 c21 --> c22 end
%%{init: {'theme':'dark'}}%%block-beta columns 1 block:group2 columns 7 c11["{x:1,y:1}"] space c12["{x:1,y:2}"] space c13["{x:1,y:3}"] space c14["{x:1,y:4}"] c21["{x:2,y:1}"] space c22["{x:2,y:2}"] space c23["{x:2,y:3}"] space c24["{x:2,y:4}"] space space space space space space space c31["{x:3,y:1}"] space c32["{x:3,y:2}"] space c33["{x:3,y:3}"] space c34["{x:3,y:4}"] c41["{x:4,y:1}"] space c42["{x:4,y:2}"] space c43["{x:4,y:3}"] space c44["{x:4,y:4}"] c11 --> c12 c12 --> c21 c21 --> c22 c22 --> c13 c13 --> c14 c14 --> c23 c23 --> c24 c24 --> c31 c31 --> c32 c32 --> c41 c41 --> c42 c42 --> c33 c33 --> c34 c34 --> c43 c43 --> c44 end
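The path traced by the arrows looks like a Z-order (Morton) curve. A minimal sketch, assuming that is the intended ordering: interleave the bits of x and y into one sort key, with y in the lower bit of each pair. The helper `z_value` is hypothetical, and coordinates are shifted to start at 0 before interleaving.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y (y in the lower bit of each pair)."""
    code = 0
    for i in range(bits):
        code |= ((y >> i) & 1) << (2 * i)      # y bits at even positions
        code |= ((x >> i) & 1) << (2 * i + 1)  # x bits at odd positions
    return code


points = [{"x": x, "y": y} for x in range(1, 5) for y in range(1, 5)]
ordered = sorted(points, key=lambda p: z_value(p["x"] - 1, p["y"] - 1))
path = [(p["x"], p["y"]) for p in ordered]
print(path[:8])
# [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
```

Sorting on the interleaved key keeps rows that are close in both x and y close together in the file, so neither column is favored the way it would be in a plain sort on (x, y).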
A Bloom Filter tests membership in a set
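A toy Bloom filter, to make the membership test concrete. This is a sketch under simple assumptions (salted SHA-256 for the hash functions, a Python int as the bit array) and is not the layout or hash that Parquet itself uses. The class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, item):
        data = repr(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter()
for value in [100, 101, 102, 103]:
    bf.add(value)
print(bf.might_contain(101))  # True
```

A "no" is always correct, so a reader can skip data whose filter rejects the value. A "yes" can be a false positive, so the data still has to be read and checked.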
Your org WILL store data in Parquet