Perl and ‘uniq’ – de-duping arrays and other tricks

There comes a time when everybody needs to de-dupe an array, or just simulate the UNIX uniq command. Here at $dayjob, for example, we run quite a lot of UNIX shell commands to look up the number of times a user has connected to a server via POP3 or IMAP. We may use a command like:

 grep 'pop3(' /var/log/maillog | awk '{print $6}' | sort | uniq -c | sort | tail -10

which gives us the top 10 ‘poppers’ (sort -n would be a more robust final sort, as it compares the counts numerically). The output looks like this (names changed!):

1300 pop3(user1):
1308 pop3(user2):
1310 pop3(user3):
1311 pop3(user4):
1312 pop3(user5):
1314 pop3(user6):
1315 pop3(user7):
1320 pop3(user8):
1471 pop3(user9):
2454 pop3(user10):

So how can we do this in Perl? Hashes are our friends when we want to de-dupe an array, so let’s start by getting the data into a hash. We’ll do it all in Perl and read the log file off the command line for ease and speed. The first block takes care of the grep 'pop3(' /var/log/maillog | awk '{print $6}' | sort | uniq -c part of the UNIX line.

#! /usr/bin/perl -w

use strict;

my %output_hash;

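while (<>){
        # Only look at POP3 login lines (the grep 'pop3(' step)
        if (/pop3\(/){
                # Split on whitespace, as awk does by default
                my @temp_array = split;
                # awk's $6 is index 5 in Perl; count each occurrence
                $output_hash{$temp_array[5]} ++;
        }
}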

We define the hash initially (we are using strict) then read in the file specified on the command line. We’re finding lines that match 'pop3(', then splitting the line into a temporary array so we can extract field 6. The clever bit is leveraging the fact that a hash has to have unique keys: the sixth element of the temporary array (index 5, as Perl counts from zero) becomes the key. If the key doesn’t exist yet, ++ creates it with a value of 1; if it already exists, the value is incremented by one. When the code has read through the file we have a hash of usernames and counts, but as yet unordered.
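The same unique-keys trick covers the plain de-dupe case from the title. Here is a minimal sketch, assuming a small made-up list of usernames; the @logins data and the sub names dedupe and count_occurrences are illustrative and not part of the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the unique elements of a list, keeping first-seen order.
# !$seen{$_}++ is true only the first time a value is encountered.
sub dedupe {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Count how often each element occurs, like "sort | uniq -c".
sub count_occurrences {
    my %count;
    $count{$_}++ for @_;
    return %count;
}

# Hypothetical sample data, not taken from a real log.
my @logins = qw(user1 user2 user1 user3 user2 user1);

my @unique = dedupe(@logins);
print "@unique\n";    # prints: user1 user2 user3

my %count = count_occurrences(@logins);
print "$count{$_} $_\n" for sort keys %count;
```

Unlike uniq, neither sub needs its input sorted first, because the hash remembers every value it has seen rather than only the previous line.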

The next block of code sorts and displays the results:

my $counter = 0;
# Sort usernames by count, highest first
for my $username (sort{ $output_hash{$b} <=> $output_hash{$a}} keys %output_hash){
        print "$output_hash{$username}\t$username\n";
        $counter ++;
        # Stop after the tenth line of output
        if ($counter > 9){last}
}

We need a counter to limit the list to 10. There may be a nicer way to do the job, but as with all things Perl, TMTOWTDI (there’s more than one way to do it). The for line cycles through each hash element, sorting by value and outputting to the console. The sort by value is relatively complicated compared to sorting by keys, but I’ll write a new post on hash sorting so I won’t go into it here. When the counter reaches 10, the loop stops using the last statement. The output is identical except the biggest user is now at the top of the list.
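As one possible nicer way, the counter can be dropped by sorting first and then taking a slice of the result. This is a sketch rather than a drop-in replacement: the sub name top_n and the sample counts are made up for illustration, and ties are broken by name so the order is repeatable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return up to $n keys of the hash, highest count first.
# Ties are broken alphabetically so the result is deterministic.
sub top_n {
    my ($n, $counts) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                 keys %$counts;
    # Guard against hashes with fewer than $n entries.
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Hypothetical counts standing in for %output_hash.
my %output_hash = (
    'pop3(user1):' => 12,
    'pop3(user2):' => 40,
    'pop3(user3):' => 7,
    'pop3(user4):' => 40,
);

for my $username (top_n(3, \%output_hash)) {
    print "$output_hash{$username}\t$username\n";
}
```

The trade-off is that the slice version builds the whole sorted list before trimming it, just as the loop does, so it is tidier rather than cheaper.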

How does the Perl do, time-wise, compared to the shell command? SHOCKING!!!! It takes about 3 seconds against roughly half a second for the pipeline:

# time ./blog1.pl /var/log/maillog
1644	pop3(user10):
1512	pop3(user9):
1359	pop3(user8):
1353	pop3(user7):
1351	pop3(user6):
1351	pop3(user5):
1349	pop3(user4):
1348	pop3(user3):
1346	pop3(user2):
1336	pop3(user1):

real	0m3.008s
user	0m2.844s
sys	0m0.156s

# time grep 'pop3(' /var/log/maillog | awk '{print $6}' | sort | uniq -c | sort | tail -10
1336 pop3(user1):
1346 pop3(user2):
1349 pop3(user3):
1349 pop3(user4):
1351 pop3(user5):
1351 pop3(user6):
1353 pop3(user7):
1359 pop3(user8):
1512 pop3(user9):
1644 pop3(user10):

real	0m0.523s
user	0m0.564s
sys	0m0.159s
# 

# wc -l /var/log/maillog
  3352179 /var/log/maillog

# grep 'pop3(' /var/log/maillog | wc -l
  98062

I’ve added some extra data so you can see we are dealing with a log file of 3.3 million lines, circa 100k of which contain pop3( in the text.
