Information Security Blog

Searching Insanely Large Datasets

Searching Insanely Large Datasets

Searching-Insanely-Large-DatasetsWe were in a bad place. I was tasked with finding some data in a 15TB database table.  It was partitioned, but not indexed on the field we needed.  I worked with the DBAs to see if we could query the data, but they stated it would take a week to return.  However, we needed it faster than that, and we also needed additional information for the same time period.

We kicked it around for a bit and decided to dump the data to a flat file for the dates specified.  I reasoned that as long as the data was dumped as one record to one line in the flat file, we should be able to use grep to find the records we needed.  I did some mental calculations regarding record size and number of records and figured the dump file should be around 350GB.  Dumping and grepping through it should be very fast, maybe 48 hours tops.

I got an email stating they were splitting the output into three files and the first was finished.  I logged in and saw the first 800GB file.  I emailed back and explain my concern with the difference between my calculation and the actual file size.  It turned out that one record spanned 12 to 16 lines in the file.  Grep wouldn’t work.  The file looked a bit like this:

ase3de5g-j4dj-8jf8-1jdj4h2dje9   |   DATE                          |
LOGIN            |                            FROM  |                   |               |
|          |
|      TO|        DATA |  MORE
DATA| |     |…

We had a short discussion and decided to work with the files provided.  I put together a short perl script that did two things.

First it stripped all of the extra white space and new lines, and then it looped through that data searching for our target data.  Let me preface this by saying, yes there are more elegant ways to do this, but this technique was also used to show someone else how to do it…a “Teach him how to fish” moment.

To strip out the white space, we used:
s/\s+\|/\|/g;
s/\|\s+/\|/g;

The regex does the following:

The structure   s/  /  /g means ‘Let’s substitute’.  The g on the end means ‘everywhere on the line’.

The \s means it should look for whitespace and the + that follows means ‘one or more’, so \s+ looks for one or more whitespace characters.  The \| is just an escaped pipe character.  It must be escaped, otherwise it would be interpreted as an ‘or’ metacharacter.  We did the same with new line characters with this:
s/\n+//g;

This is slightly different; it effectively deletes all the new line characters in the file.  Wait, I know, now you’re thinking, “How do you know where one record starts and one record ends now?”.  Each record starts with an identifier, which we can match with a pattern.  Let’s look at it:

s/(?=[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12})/\n/g;

I know…  looks a bit crazy compared to the previous examples.  Let’s walk through it.

You see the same substitution s/ / /g.  Our goal is to look for the identifier, then insert a new line character before it.  We use a look ahead feature which looks like this “(?= )”.  We’re going to attempt to find our identifier, which isn’t the same for each record, but does have a pattern.  These records have an identifier made up of 8 alpha-numerics, followed by 3 groups of 4, then a group of 12.  So if we walk through our regex the first group:

[A-Z0-9]{8}

This translates to “Look for any 8 capital letters or numbers”, so if we put it all together:
[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}

It looks for alpha numeric groups like so:
8-4-4-4-12

Once our regex finds this in the look ahead, it inserts a new line character.

Overall, the result is that there is no excess whitespace, no extra line breaks and one record per row.  This alone reduced the 800GB file down to 62GB.  Grepping through the file now took minutes to find the data we were looking for, and this tool is now available for future searches.

Since looking for that proverbial “data-needle” is often not a fun task, I hope this formula helps someone else down the road!



Free Whitepaper: Five Best Practices for SIEM

siem-whitepaper

The promise of SIEM is the consolidation of all relevant Security Event Logs from disparate sources into a single unified and normalized data store.

Free Download: ISO 27001 Implementation Roadmap

ISO 27001 RoadmapHave no fear – our “roadmap” will guide you, step by step, through the entire ISO 27001 process.

Getting to ISO 27001 certification is a process made up of things you already know – and things you may already be doing!

Free Download: A Best Practices Guide to Database Security

database security roadmap

Because data is only as secure as the systems & processes it relies on – a holistic approach to data security is essential. This roadmap is not meant to be exhaustive but rather to stimulate the necessary thought process to put you on the path to good data security.

Free Whitepaper: Stop Wasting Money on Penetration Testing

penetration-testing-whitepaper

Penetration Testing is most frequently performed to:

  • Substantiate the net effectiveness of a mature control environment
  • Prove to a third party that an environment is secure/trustworthy
  • Quickly assess the security of a less mature control environment (in a sense a technical risk assessment)
  • To validate that significant changes did not have unanticipated results

Download: Information Security Attestation Guide

Information Security GuideA Best-Practices Guide to Information Security Attestation

Download our proven Information Security Guide to simplify the process of protecting your data, proving you’re secure and growing your business.

Best Practices for Firing A Network Security Administrator

Firing A Network Security AdministratorWant to know how to fire a Network Admin? Need to know what precautions to take? Firing any employee can be a stressful event. Firing one who has significant knowledge of and privileged access to your Information Technology/Security infrastructure is even more stressful, as the risks are so notable.

Is ISO 27001 Right for (Y)our Organization?

iso-27001-webinar

Thinking about ISO 27001 Certification? View our free On-Demand ISO 27001 Webinar

  • How to deal with increasing threats
  • How to manage multiple regulatory requirements
  • How to handle client requests for attestation
  • To validate that significant changes did not have unanticipated results

About the Author:

MIke Gargiullo - Senior Security Consultant

Add a Comment