Some scripting help needed

Hi folks, I have tonnes of CSV files whose sample content is as shown below: Choice|*123456780*|20|1|*YES*|Y|2010-02-26 10:42:51.239772| Choice|*123456781*|10|1|NO|Y|2010-02-26 12:42:51.243603| Choice|*123456782*|20|1|NA|Y|2010-02-26 14:42:51.247464| Choice|*123456783*|10|1|*YES*|Y|2010-02-26 14:42:51.251320| Choice|*123456780*|20|1|OK|Y|2010-02-26 15:42:51.255169| My aim is to extract the contents of the 2nd field alongside the count of similar choice result in 5th field. Desired output: UserId Occurences (*YES*) ====== =============== 123456780 465 123456781 25 I managed to achieve this on a linux box running mysql using the steps below (which can of course be wrapped into a script): 1. grep YES *raw_files* | awk 'BEGIN{FS=OFS="|"}{print $2}' > /tmp/in_file 2. Log in to mysql and perform the steps: 3. mysql> create table nums(num varchar(16)); 4. mysql> load data infile '/tmp/in_file' into table nums(num); 5. mysql> select num,count(*) thecnt from nums group by num; I'm still exploring how to achieve the same using some (simple?) script, possibly not involving the database bit. Any takers? NB: The only bit we know before-hand is desired choice result (YES/NO/NA). Thanks. Tony.

assuming bash users=`cat in.txt | cut -f2 -d'|'` for i in $users do userrecs=`cat in.txt | grep $i` yesc=`echo $userrecs | cut -f5 -d'|' | grep -c YES` noc=`echo $userrecs | cut -f5 -d'|' | grep -c NO` nac=`echo $userrecs | cut -f5 -d'|' | grep -c NA` echo "$i $yesc $noc $nac" done On Sat, Feb 27, 2010 at 6:09 PM, Tony Likhanga <tlikhanga@gmail.com> wrote:
Hi folks,
I have tonnes of CSV files whose sample content is as shown below:
Choice|*123456780*|20|1|*YES*|Y|2010-02-26 10:42:51.239772| Choice|*123456781*|10|1|NO|Y|2010-02-26 12:42:51.243603| Choice|*123456782*|20|1|NA|Y|2010-02-26 14:42:51.247464| Choice|*123456783*|10|1|*YES*|Y|2010-02-26 14:42:51.251320| Choice|*123456780*|20|1|OK|Y|2010-02-26 15:42:51.255169|
My aim is to extract the contents of the 2nd field alongside the count of similar choice result in 5th field.
Desired output:
UserId Occurences (*YES*) ====== =============== 123456780 465 123456781 25
I managed to achieve this on a linux box running mysql using the steps below (which can of course be wrapped into a script):
1. grep YES *raw_files* | awk 'BEGIN{FS=OFS="|"}{print $2}' > /tmp/in_file 2. Log in to mysql and perform the steps: 3. mysql> create table nums(num varchar(16)); 4. mysql> load data infile '/tmp/in_file' into table nums(num); 5. mysql> select num,count(*) thecnt from nums group by num;
I'm still exploring how to achieve the same using some (simple?) script, possibly not involving the database bit. Any takers? NB: The only bit we know before-hand is desired choice result (YES/NO/NA).
Thanks. Tony.
_______________________________________________ Skunkworks mailing list Skunkworks@lists.my.co.ke http://lists.my.co.ke/cgi-bin/mailman/listinfo/skunkworks ------------ Skunkworks Server donations spreadsheet
http://spreadsheets.google.com/ccc?key=0AopdHkqSqKL-dHlQVTMxU1VBdU1BSWJxdy1f... ------------ Skunkworks Rules http://my.co.ke/phpbb/viewtopic.php?f=24&t=94 ------------ Other services @ http://my.co.ke Other lists ------------- Announce: http://lists.my.co.ke/cgi-bin/mailman/listinfo/skunkworks-announce Science: http://lists.my.co.ke/cgi-bin/mailman/listinfo/science kazi: http://lists.my.co.ke/cgi-bin/mailman/admin/kazi/general

Hi Tony, This is a perfect job for awk - as you kind of already noticed. The following awk script will do it: ------------------cut------------------- #!/bin/awk -f BEGIN { FS="|"; print "UserId\t\tOccurences (YES)"; print "======\t\t==============="; } { if ($5 == "YES") { hits[$2]++; } } END { for (i in hits) { print i "\t" hits[i]; } } ------------------cut------------------- Save the script above as "glue.awk" and run it with the following command: awk -f glue.awk raw_files Example: $ awk -f glue.awk raw_files UserId Occurences (YES) ====== =============== 123456780 1 123456783 3 Regards Michael Pedersen PLUSPEOPLE Tony Likhanga wrote:
Hi folks,
I have tonnes of CSV files whose sample content is as shown below:
Choice|*123456780*|20|1|*YES*|Y|2010-02-26 10:42:51.239772| Choice|*123456781*|10|1|NO|Y|2010-02-26 12:42:51.243603| Choice|*123456782*|20|1|NA|Y|2010-02-26 14:42:51.247464| Choice|*123456783*|10|1|*YES*|Y|2010-02-26 14:42:51.251320| Choice|*123456780*|20|1|OK|Y|2010-02-26 15:42:51.255169|
My aim is to extract the contents of the 2nd field alongside the count of similar choice result in 5th field.
Desired output:
UserId Occurences (*YES*) ====== =============== 123456780 465 123456781 25
I managed to achieve this on a linux box running mysql using the steps below (which can of course be wrapped into a script):
1. grep YES /raw_files/ | awk 'BEGIN{FS=OFS="|"}{print $2}' > /tmp/in_file 2. Log in to mysql and perform the steps: 3. mysql> create table nums(num varchar(16)); 4. mysql> load data infile '/tmp/in_file' into table nums(num); 5. mysql> select num,count(*) thecnt from nums group by num;
I'm still exploring how to achieve the same using some (simple?) script, possibly not involving the database bit. Any takers? NB: The only bit we know before-hand is desired choice result (YES/NO/NA).
Thanks. Tony.

Thanks a bunch Rsohan :) and Micheal. Michael, the awk script did it in a breeze. Rsohan, there's some glitch in the bash script. It gives an output in the form: *123456780 0 123456781 1 123456782 1 123456780 1 123456781 0 123456782 1* A quick grep & word count does not yield a count similar to what the other two approaches gave. Also, the script takes a tad too long to execute :). I believe some fine tuning should hack it. Thanks, Tony.

Sorry, obviously didn't read your question properly. Correct script is: users=`cat in.txt | cut -f2 -d'|'|sort -n |uniq` for i in $users do userrecs=`cat in.txt | grep $i` yesc=`echo $userrecs | grep -o YES|wc -l ` noc=`echo $userrecs | grep -o NO|wc -l ` nac=`echo $userrecs | grep -o NA|wc -l ` echo "$i $yesc $noc $nac" done On Sun, Feb 28, 2010 at 12:21 AM, Tony Likhanga <tlikhanga@gmail.com> wrote:
Thanks a bunch Rsohan :) and Micheal.
Michael, the awk script did it in a breeze. Rsohan, there's some glitch in the bash script. It gives an output in the form:
*123456780 0 123456781 1 123456782 1 123456780 1 123456781 0 123456782 1*
A quick grep & word count does not yield a count similar to what the other two approaches gave. Also, the script takes a tad too long to execute :). I believe some fine tuning should hack it.
Thanks, Tony.
_______________________________________________ Skunkworks mailing list Skunkworks@lists.my.co.ke http://lists.my.co.ke/cgi-bin/mailman/listinfo/skunkworks ------------ Skunkworks Server donations spreadsheet
http://spreadsheets.google.com/ccc?key=0AopdHkqSqKL-dHlQVTMxU1VBdU1BSWJxdy1f... ------------ Skunkworks Rules http://my.co.ke/phpbb/viewtopic.php?f=24&t=94 ------------ Other services @ http://my.co.ke Other lists ------------- Announce: http://lists.my.co.ke/cgi-bin/mailman/listinfo/skunkworks-announce Science: http://lists.my.co.ke/cgi-bin/mailman/listinfo/science kazi: http://lists.my.co.ke/cgi-bin/mailman/admin/kazi/general

Bingo! Many thanks Rsohan, that's now perfect! Now I've got more stones than there are birds :) Merci, Tony.
participants (3)
-
Michael Pedersen
-
rsohan@gmail.com
-
Tony Likhanga