Searching Insanely Large Datasets

July 26, 2010

Last Updated on February 19, 2025

We were in a bad place. I was tasked with finding some data in a 15TB database table. It was partitioned, but not indexed on the field we needed. I worked with the DBAs to see if we could query the data, but they stated it would take a week to return. However, we needed it faster than that, and we also needed additional information for the same time period.
We kicked it around for a bit and decided to dump the data to a flat file for the dates specified. I reasoned that as long as the data was dumped as one record to one line in the flat file, we should be able to use grep to find the records we needed. I did some mental calculations regarding record size and number of records and figured the dump file should be around 350GB. Dumping and grepping through it should be very fast, maybe 48 hours tops.
I got an email stating they were splitting the output into three files and the first was finished. I logged in and saw the first 800GB file. I emailed back and explain my concern with the difference between my calculation and the actual file size. It turned out that one record spanned 12 to 16 lines in the file. Grep wouldn’t work. The file looked a bit like this:
ase3de5g-j4dj-8jf8-1jdj4h2dje9   |   DATE                          |
LOGIN            |                            FROM |                   |               |
|          |
|      TO|        DATA | MORE
DATA| |     |…
We had a short discussion and decided to work with the files provided. I put together a short perl script that did two things.
First it stripped all of the extra white space and new lines, and then it looped through that data searching for our target data. Let me preface this by saying, yes there are more elegant ways to do this, but this technique was also used to show someone else how to do it…a “Teach him how to fish” moment.
To strip out the white space, we used:
s/s+|/|/g;
s/|s+/|/g;
The regex does the following:
The structure   s/ / /g means ‘Let’s substitute’. The g on the end means ‘everywhere on the line’.
The s means it should look for whitespace and the + that follows means ‘one or more’, so s+ looks for one or more whitespace characters. The | is just an escaped pipe character. It must be escaped, otherwise it would be interpreted as an ‘or’ metacharacter. We did the same with new line characters with this:
s/n+//g;
This is slightly different; it effectively deletes all the new line characters in the file. Wait, I know, now you’re thinking, “How do you know where one record starts and one record ends now?”. Each record starts with an identifier, which we can match with a pattern. Let’s look at it:
s/(?=[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12})/n/g;
I know… looks a bit crazy compared to the previous examples. Let’s walk through it.
You see the same substitution s/ / /g. Our goal is to look for the identifier, then insert a new line character before it. We use a look ahead feature which looks like this “(?= )”. We’re going to attempt to find our identifier, which isn’t the same for each record, but does have a pattern. These records have an identifier made up of 8 alpha-numerics, followed by 3 groups of 4, then a group of 12. So if we walk through our regex the first group:
[A-Z0-9]{8}
This translates to “Look for any 8 capital letters or numbers”, so if we put it all together:
[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}
It looks for alpha numeric groups like so:
8-4-4-4-12
Once our regex finds this in the look ahead, it inserts a new line character.
Overall, the result is that there is no excess whitespace, no extra line breaks and one record per row. This alone reduced the 800GB file down to 62GB. Grepping through the file now took minutes to find the data we were looking for, and this tool is now available for future searches.
Since looking for that proverbial “data-needle” is often not a fun task, I hope this formula helps someone else down the road!