File Manipulation Query

Hi guys,

I have a file with over 30 million lines (Txn IDs) spanning a period of 11 years. I want to delete from that file (it's a flat file, not a db table) the lines that don't begin with certain characters, or alternatively create another file containing only the lines that do begin with certain characters.

The format of the ID is DATE:<SOME.SEQ.NOs> (of course, there are multiple txns per day). So for instance it will have 20080101XXXXXXXX for transactions done on 1st Jan 2008, 20130613XXXXXX for today's txns, etc.

My requirement is to get a list of transactions done, say, from Jun 2010 to Jun 2011. Therefore, I should somehow either:
a) Create a new file by piping the output of grep ^2010 and grep ^2011 (the challenge is how to get only 2010[06, 07, 08, 09, 10, 11 and 12] and not the entire 2010; same for 2011), or
b) Delete the lines that don't match a) above. I don't want to go this route.

Tools available: Ubuntu 12.04, HP-UX 11.31

PS: I don't know if the subject matches my requirement but ...
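A minimal sketch of option a) using a single extended regular expression, assuming every line starts with the yyyymmdd prefix (the file names txns.txt and jun2010_jun2011.txt are placeholders):

# keep June-December 2010 plus January-June 2011, based on the leading yyyymm
grep -E '^(2010(0[6-9]|1[0-2])|20110[1-6])' txns.txt > jun2010_jun2011.txt

The same command should work on both the Ubuntu and HP-UX boxes, since -E is the POSIX spelling of egrep's extended-regex mode.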

That's one way. I'd do it this way:
1. Install MySQL server and configure innodb_buffer_pool_size to be 90% of system memory (I might use someone's server for this) :)
2. Create a table with four fields: id (autoinc), unclean (varchar(256)), trndate (datetime) and seqno - an InnoDB table, to make use of 1 above.
3. Index id as primary in one index, trndate in another.
4. Import into the database, throwing everything into the unclean field.
5. Wait :)
6. Run a query to split it, for example: update mytable set trndate = left(unclean,8), seqno = substr(unclean,9,255). I've noticed the format is yyyymmdd, which makes it easier; you'd have to confirm my statement since I just wrote it from memory.
7. Wait again.
8. Then do selects and export to text files; the index on trndate should make it super fast.

Don't forget to put the indexes in, otherwise it will take forever to do the queries. Just a suggestion... and in the end you'll have a nice neat table in case you want to do more operations on it. You can even back it up and zip it for storage!

On 13 June 2013 12:54, Bwana Lawi <mail2lawi@gmail.com> wrote:
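A rough sketch of how those steps might be driven from the shell; the database name txns, the table name mytable and the file names are assumptions, and the date split follows the UPDATE quoted from memory above, so it needs checking against the real data:

# steps 2/3: InnoDB table with an auto-increment primary key and an index on trndate
# (LOCAL INFILE must also be allowed on the server side)
mysql --local-infile=1 txns <<'SQL'
CREATE TABLE mytable (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    unclean VARCHAR(256),
    trndate DATE,
    seqno   VARCHAR(248),
    KEY idx_trndate (trndate)
) ENGINE=InnoDB;

-- step 4: bulk-load the flat file, one Txn ID per row, into the unclean column
LOAD DATA LOCAL INFILE 'txns.txt' INTO TABLE mytable (unclean);

-- step 6: split the leading yyyymmdd date from the sequence number
UPDATE mytable
   SET trndate = STR_TO_DATE(LEFT(unclean, 8), '%Y%m%d'),
       seqno   = SUBSTR(unclean, 9);
SQL

# step 8: export the June 2010 - June 2011 window to a text file
mysql -N -e "SELECT unclean FROM mytable
             WHERE trndate >= '2010-06-01' AND trndate < '2011-07-01'" txns > jun2010_jun2011.txt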

On a Linux box, grep -A piped into grep -B with the output redirected into a new file always sorts me out. (Probably egrep works the same on other platforms.) The trick is:
1. Identify the exact start of the records you want to filter out.
2. Identify the exact end of the records you want to filter out.
3. Make an intelligent estimation of the number of lines contained between the start and end entries.
Otherwise, sed may have some pretty neat solution if you dig deeper.
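If the big file happens to be ordered by date, the sed route could be as simple as an address range that prints from the first June 2010 line up to the first July 2011 line (file names are placeholders; note that the single boundary line matching ^201107 is also printed and would need trimming):

sed -n '/^201006/,/^201107/p' txns.txt > jun2010_jun2011.txt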

My take is that bash tools along with AWK will be much better placed to do the job. You can sort and redirect the output to a file, thus ensuring that the primary source remains intact.
On Thu, Jun 13, 2013 at 1:04 PM, Tony Likhanga <tlikhanga@gmail.com> wrote:
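A hedged sketch of that awk route, comparing the yyyymm prefix as a string so the whole June 2010 - June 2011 window is handled in one pass (file names are placeholders, and the source file is left untouched):

awk 'substr($0, 1, 6) >= "201006" && substr($0, 1, 6) <= "201106"' txns.txt > jun2010_jun2011.txt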

OK. Anyone with something similar already in place, as I look at awk/grep/sed options?
On Thu, Jun 13, 2013 at 1:12 PM, Anthony Lenya <tlensya@gmail.com> wrote:

Surely a regexp-based grep is all you need? E.g. for months 1, 2 and 11: egrep '^2010(01|02|11)'
On Thu, Jun 13, 2013 at 11:44 AM, Bwana Lawi <mail2lawi@gmail.com> wrote:

A simple cat and grep should do it. Something along the lines of:
cat file1 | egrep '^2008(06|07|08|09)' > file2
On Thu, Jun 13, 2013 at 1:12 PM, Anthony Lenya <tlensya@gmail.com> wrote:

Does the regex support an OR? I.e. IDs starting with 2010 or 2011, because the period spans two different years. I think I will run the regex for 2010 and then append the IDs for 2011 to the created file. Thanx guys.
On Thu, Jun 13, 2013 at 2:00 PM, Simon Mburu <sgatonye@gmail.com> wrote:

2013/6/13 Simon Mburu <sgatonye@gmail.com>:
A simple cat and grep should do it. Something along the lines of: cat file1 | egrep '^2008(06|07|08|09)' > file2

Without being picky... isn't this some form of 'cat abuse'? I think the cat command is superfluous, unless there's some not-so-obvious advantage it adds to that syntax.
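For comparison, the same filter with grep reading the file directly, which drops the extra cat process (same placeholder file names as in the earlier example):

egrep '^2008(06|07|08|09)' file1 > file2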

Just to caution you... the simple regex approach will only be useful if your records are on single lines like the time-stamps. Otherwise grep -A/-B will come in handy if the 'other target data' is strewn through the file, not necessarily on the same line as the timestamp, a la a system log file.
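As an illustration of that multi-line case, GNU grep's context options can pull in the neighbouring lines; the record layout and file name here are hypothetical:

# print each June 2010 timestamp line plus the 3 detail lines that follow it
# (GNU grep separates non-adjacent groups with a -- line)
grep -A 3 '^201006' txns_with_details.txt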

In Linux this is the simplest exercise ever :) grep or awk and tee will sort you out. Send a sample file and I'll give you the exact commands. You could even join the files created by tee.
Wilson./
On 13 June 2013 14:08, Tony Likhanga <tlikhanga@gmail.com> wrote:
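One way to read the tee suggestion: keep a per-year file while the same matches also land in a single combined file (all file names are made up):

grep -E '^2010(0[6-9]|1[0-2])' txns.txt | tee jun_dec_2010.txt  > jun2010_jun2011.txt
grep -E '^20110[1-6]' txns.txt | tee jan_jun_2011.txt >> jun2010_jun2011.txt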

True. Just grep would do the same thing without the cat! I am beginning to think grep is another feline :-)
On 13 June 2013 14:08, Tony Likhanga <tlikhanga@gmail.com> wrote:
-- Best regards, Odhiambo WASHINGTON, Nairobi,KE +254733744121/+254722743223 "I can't hear you -- I'm using the scrambler."

You can do it with a one-liner awk script.
On 6/13/13 12:54 PM, Bwana Lawi wrote:

Copy the file to a Linux box (kind of obvious). Issue this command:
more name_of_file | grep "string_to_filter_with"
...sit back and enjoy! :) Oh, you can also redirect the output to a new file, like so:
more name_of_file | grep "string_to_filter_with" > name_of_new_file.txt
On 13 June 2013 12:54, Bwana Lawi <mail2lawi@gmail.com> wrote:
-- Kind Regards, Moses M.G.
participants (10)
- Anthony Lenya
- Bwana Lawi
- Jangita Nyagudi
- Michael Pedersen
- Moses M.G
- Odhiambo Washington
- rsohan@gmail.com
- Simon Mburu
- Thuo Wilson
- Tony Likhanga