Every network conversation (NetFlow)
Full fidelity packet data (PCAP)
Searchable detailed protocol data for:
How hard could that possibly be?
¯\_(ツ)_/¯
Spark seems good for that
S3 is really cheap and fairly well supported by Spark
Spark seems good for that too
├── cid=X
|   ├── year=2015
|   └── year=2016
|       └── month=0
|           └── day=0
|               └── hour=0
└── cid=Y
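As a rough illustration of the layout above, the sketch below derives the partition directory for a single event. The helper name `partition_path`, the use of UTC, and the calendar-style month and day numbering are assumptions made for the example; the slides only show the `cid/year/month/day/hour` nesting.

```python
from datetime import datetime, timezone


def partition_path(cid: str, ts: datetime) -> str:
    """Build a Hive-style partition directory for one event (hypothetical helper)."""
    ts = ts.astimezone(timezone.utc)  # assumption: partitions are keyed on UTC time
    return (
        f"cid={cid}/year={ts.year}/month={ts.month}"
        f"/day={ts.day}/hour={ts.hour}"
    )


example = partition_path("X", datetime(2016, 1, 2, 3, 4, tzinfo=timezone.utc))
print(example)
```

Writing every event under a path like this is what lets a query engine skip whole directories when a query filters on `cid` or on time.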
"Over the last 6 months, how many times did I see IP X using protocol Y?"
"When did IP X not use port 80 for HTTP?"
"Who keeps scanning server Z for open SSH ports?"
select count(*) from events where ip = '192.168.0.1' and cid = 1 and year = 2016
├── cid=X
|   ├── year=2015
|   └── year=2016
|       ├── month=0
|       |   └── day=0
|       |       └── hour=0
|       |           └── 192.168.0.1_was_NOT_here.parquet
|       └── month=1
|           └── day=0
|               └── hour=0
|                   └── 192.168.0.1_WAS_HERE.parquet
└── cid=Y
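The point of the tree above can be sketched in a few lines: pruning on partition columns narrows the directories, but says nothing about which surviving file holds a given IP. The `prune` and `parse_partitions` helpers and the two `part-0.parquet` paths are hypothetical, added only for illustration; this is not how Spark implements pruning internally.

```python
def parse_partitions(path: str) -> dict:
    """Extract key=value directory segments from a file path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths, **filters):
    """Keep only files whose partition directories match every filter."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == str(v) for k, v in filters.items())
    ]


files = [
    "cid=X/year=2015/month=0/day=0/hour=0/part-0.parquet",  # hypothetical file
    "cid=X/year=2016/month=0/day=0/hour=0/192.168.0.1_was_NOT_here.parquet",
    "cid=X/year=2016/month=1/day=0/hour=0/192.168.0.1_WAS_HERE.parquet",
    "cid=Y/year=2016/month=0/day=0/hour=0/part-0.parquet",  # hypothetical file
]

# Filters on cid and year drop two files, but both 2016 files for cid=X
# survive and must still be opened to find out which one contains the IP.
survivors = prune(files, cid="X", year=2016)
print(len(survivors))
```

In this toy listing, half the files are skipped for free, yet the IP predicate itself gets no help from the directory structure.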
select count(*) from events where month = 6
Run mapPartitions on the SOLR RDD and turn the results into Parquet RDDs
As an optimization for small file sets, we pull the SOLR rows to the driver side
Source | Scan/Filter Time |
---|---|
SOLR | < 100 milliseconds |
Hive | > 5 seconds |
S3 directory listing | > 5 minutes!!! |
Field | Cardinality | Result |
---|---|---|
Protocol | Medium (9000) | ❌ |
Port | High (65535) | ❌❌ |
IP Addresses | Astronomically High (3.4 undecillion) | ❌❌❌ |
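To put rough numbers on the table above, the sketch below assumes each field is being rated as a candidate extra partition level under `hour`, and computes an upper bound on directories per customer per year. That reading of the table and the worst-case assumption (every value appears in every hour) are mine, not stated in the slides.

```python
HOURS_PER_YEAR = 24 * 365

cardinality = {
    "protocol": 9_000,   # from the table
    "port": 65_535,      # from the table
    "ip": 2 ** 128,      # size of the IPv6 address space, about 3.4e38
}

for field, n in cardinality.items():
    # Worst case: every distinct value shows up in every hourly partition.
    dirs = n * HOURS_PER_YEAR
    print(f"{field}: up to {dirs:.3g} directories per customer per year")
```

Even the medium-cardinality case lands in the tens of millions of directories, which is painful given how slow S3 directory listing already is.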
Term | Doc IDs |
---|---|
192.168.0.1 | 1,2,3,5,8,13... |
10.0.0.1 | 2,4,6,8... |
8.8.8.8 | 1,2,3,4,5,6 |
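The table above is an inverted index: each term maps to the list of documents that contain it. A minimal sketch of building one follows; the `docs` contents are hypothetical, chosen so the resulting posting lists begin the same way as the rows in the table.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents: doc ID -> IPs seen in that document.
docs = {
    1: ["192.168.0.1", "8.8.8.8"],
    2: ["192.168.0.1", "10.0.0.1", "8.8.8.8"],
    3: ["192.168.0.1", "8.8.8.8"],
    4: ["10.0.0.1", "8.8.8.8"],
}

index = build_inverted_index(docs)
print(index["10.0.0.1"])
```

A lookup by IP is then a single dictionary access, but the index needs one entry per distinct term, which is the problem for a field as large as IP addresses.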
What if our terms were the offsets of the Bloom Filter values?
Term | Doc IDs |
---|---|
0 | 1,2,3,5,8,13... |
1 | 2,4,6,8... |
2 | 1,2,3,4,5,6 |
3 | 1,2,3 |
... | ... |
N | 1,2,3,4,5... |
Term | Doc IDs |
---|---|
0 | 0,2 |
1 | 1,2 |
2 | 1 |
3 | 0 |
4 | 1,2 |
5 | 0 |
Field | Value | Indexed Values | Doc ID |
---|---|---|---|
ip | 192.168.0.1 | {0, 3, 5} | 0 |
ip | 10.0.0.1 | {1, 2, 4} | 1 |
ip | 8.8.8.8 | {0, 1, 4} | 2 |
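One way the "Indexed Values" column could be produced is by hashing each value to a handful of bit positions, as a Bloom filter does. The sketch below is an assumption-laden illustration: the function name `bloom_offsets`, the SHA-256 based hashing, and the parameters `num_bits` and `num_hashes` are all hypothetical, and its output will not match the toy offsets shown in the table.

```python
import hashlib


def bloom_offsets(value: str, num_bits: int = 64, num_hashes: int = 3) -> set:
    """Return the bit positions a Bloom filter would set for `value`."""
    offsets = set()
    for seed in range(num_hashes):
        # Derive one independent-looking hash per seed by salting the input.
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        offsets.add(int.from_bytes(digest[:8], "big") % num_bits)
    return offsets


offsets = bloom_offsets("192.168.0.1")
print(sorted(offsets))
```

The number of distinct terms in the index is now bounded by `num_bits` rather than by the cardinality of the field, which is the appeal of indexing offsets instead of raw values.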
Field | Query String | Actual Query |
---|---|---|
ip | ip:192.168.0.1 | ip_bits:0 AND 3 AND 5 |
ip | ip:10.0.0.1 | ip_bits:1 AND 2 AND 4 |
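The query rewrite can be pictured as intersecting posting lists, one per bit offset. The sketch below uses the indexed values from the earlier Field/Value table; the `search` helper is hypothetical and only models the AND semantics, not SOLR's actual query execution.

```python
from functools import reduce

# Indexed bit offsets per doc ID, copied from the indexed-values table.
doc_bits = {0: {0, 3, 5}, 1: {1, 2, 4}, 2: {0, 1, 4}}

# Inverted index: bit offset -> set of doc IDs that set that bit.
postings = {}
for doc_id, bits in doc_bits.items():
    for bit in bits:
        postings.setdefault(bit, set()).add(doc_id)


def search(bits):
    """AND together the posting lists of every bit offset in the query."""
    lists = [postings.get(b, set()) for b in bits]
    return reduce(set.intersection, lists) if lists else set()


print(search({0, 3, 5}))  # offsets indexed for 192.168.0.1
print(search({1, 4}))     # a sparser query matches more than one doc
```

Because different values can share bit offsets, a match is only a candidate: as with any Bloom filter, false positives are possible and the stored value still has to be checked, while false negatives are not.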
What partition would it choose?
The partition would have to be encoded in the key?! 🤔
Time sharded C* clusters with SOLR
Cheap speedy Cold storage based on S3 and Spark
A mechanism for archiving data to S3
Simultaneously querying heterogeneous data stores
Stitching together time series data from multiple stores
Managing sharding:
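As a loose sketch of the stitching item listed above, a time-range query could be split at an archive cutoff, sent to each store, and the results merged back into one series. Everything here is an assumption for illustration: the `split_range` and `stitch` names, the "cold" and "hot" labels, the single cutoff, and rows shaped as `(timestamp, value)` tuples.

```python
from datetime import datetime


def split_range(start, end, cutoff):
    """Split [start, end) into a cold part before `cutoff` and a hot part after."""
    plan = []
    if start < cutoff:
        plan.append(("cold", start, min(end, cutoff)))  # e.g. S3 + Spark
    if end > cutoff:
        plan.append(("hot", max(start, cutoff), end))   # e.g. C* with SOLR
    return plan


def stitch(cold_rows, hot_rows):
    """Merge per-store results into one time-ordered series."""
    return sorted(cold_rows + hot_rows, key=lambda row: row[0])


plan = split_range(
    datetime(2016, 1, 1), datetime(2016, 3, 1), cutoff=datetime(2016, 2, 1)
)
for store, lo, hi in plan:
    print(store, lo.date(), hi.date())
```

Keeping the split half-open at the cutoff means no event is counted by both stores when the two result sets are stitched together.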