Searching Insanely Large Datasets

July 26, 2010

Last Updated on July 26, 2010

We were in a bad place. I was tasked with finding some data in a 15TB database table. It was partitioned, but not indexed on the field we needed. I worked with the DBAs to see if we could query the data, but they stated it would take a week to return. However, we needed it faster than that, and we also needed additional information for the same time period.
We kicked it around for a bit and decided to dump the data to a flat file for the dates specified. I reasoned that as long as the data was dumped as one record to one line in the flat file, we should be able to use grep to find the records we needed. I did some mental calculations regarding record size and number of records and figured the dump file should be around 350GB. Dumping and grepping through it should be very fast, maybe 48 hours tops.
I got an email stating they were splitting the output into three files and the first was finished. I logged in and saw the first 800GB file. I emailed back and explain my concern with the difference between my calculation and the actual file size. It turned out that one record spanned 12 to 16 lines in the file. Grep wouldn’t work. The file looked a bit like this:
ase3de5g-j4dj-8jf8-1jdj4h2dje9   |   DATE                          |
LOGIN            |                            FROM |                   |               |
|          |
|      TO|        DATA | MORE
DATA| |     |…
We had a short discussion and decided to work with the files provided. I put together a short perl script that did two things.
First it stripped all of the extra white space and new lines, and then it looped through that data searching for our target data. Let me preface this by saying, yes there are more elegant ways to do this, but this technique was also used to show someone else how to do it…a “Teach him how to fish” moment.
To strip out the white space, we used:
s/s+|/|/g;
s/|s+/|/g;
The regex does the following:
The structure   s/ / /g means ‘Let’s substitute’. The g on the end means ‘everywhere on the line’.
The s means it should look for whitespace and the + that follows means ‘one or more’, so s+ looks for one or more whitespace characters. The | is just an escaped pipe character. It must be escaped, otherwise it would be interpreted as an ‘or’ metacharacter. We did the same with new line characters with this:
s/n+//g;
This is slightly different; it effectively deletes all the new line characters in the file. Wait, I know, now you’re thinking, “How do you know where one record starts and one record ends now?”. Each record starts with an identifier, which we can match with a pattern. Let’s look at it:
s/(?=[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12})/n/g;
I know… looks a bit crazy compared to the previous examples. Let’s walk through it.
You see the same substitution s/ / /g. Our goal is to look for the identifier, then insert a new line character before it. We use a look ahead feature which looks like this “(?= )”. We’re going to attempt to find our identifier, which isn’t the same for each record, but does have a pattern. These records have an identifier made up of 8 alpha-numerics, followed by 3 groups of 4, then a group of 12. So if we walk through our regex the first group:
[A-Z0-9]{8}
This translates to “Look for any 8 capital letters or numbers”, so if we put it all together:
[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}
It looks for alpha numeric groups like so:
8-4-4-4-12
Once our regex finds this in the look ahead, it inserts a new line character.
Overall, the result is that there is no excess whitespace, no extra line breaks and one record per row. This alone reduced the 800GB file down to 62GB. Grepping through the file now took minutes to find the data we were looking for, and this tool is now available for future searches.
Since looking for that proverbial “data-needle” is often not a fun task, I hope this formula helps someone else down the road!

What's Next?

To hear this practical, best-practice oriented show with Temi Adebambo

Click Here

Pivot Point is now part of CBIZ. Click Here for more information.

Blog

Searching Insanely Large Datasets

What's Next?

What do you think? Is Digital Business Risk Management the Future of Attack Surface Management?

Search our Blogs

What are the New CMMC 2.0 Flowdown Requirements to Manage Defense Supply Chain Cyber Risk?

What is Swarm AI and How Can It Advance Cybersecurity?

How CMMC Enhances Defense Supply Chain Security

Is Decentralized Cybersecurity Mesh the Future of Cybersecurity?

What is a Post-Quantum Strategy and Does Our Business Need One?

What is Kubernetes Security Posture Management (KSPM) and Why Should We (as Cloud-Native Developers) Care?

Registered Practitioners Versus Certified CMMC Professionals: What’s the Difference for DIB Orgs Seeking CMMC Compliance?

What is a Cloud Native Application Protection Platform (CNAPP) and What Can It Do for My Business?

What is Cloud Infrastructure Entitlement Management (CIEM) and Why Is It Becoming So Important?

What is the CMMC Assessment Process (CAP) Handbook and Why Should DIB Orgs Care?

ISO 27001 vs NIST 800-53: All You Need to Know

ISO 27001 vs NIST Cybersecurity Framework: What’s the Difference?

The Primary Importance of CUI Scoping for CMMC Certification

Know the Difference between ISO 27001 vs 27002 vs 27003

What is Content Disarm and Reconstruction and Why Should I (as a Recipient of Digital Documents) Care?

The Role of Leadership in ISO 27001 Compliance

Why File-Based Malware Dominates Cyberattacks

Data Detection and Response for Privacy and Compliance

DIB SMBs Rate Their Cybersecurity as Much Better than It Actually Is – Why?

Top 5 Insights from Radicl’s DIB Cybersecurity Maturity Report 2024

How can we help you?

Services

Compliance

Insights

Pivot Point Security