Friday, July 31, 2009

Can you help me with a "UNIX for Dummies" type question regarding a RegExp query?

All I want to do is search for an EXACT PHRASE in zipped files in a UNIX database, and have the query return all instances of the exact phrase. (UNIX instruction manuals are too confusing!)





So, if I wanted to search a bunch of zipped files in a UNIX database for files containing the phrase "the more details you provide", case insensitive - what exactly would I enter? I've used the grep commands before, but don't really know what I'm doing. I'll really appreciate your help!

What you are asking for, regex searching in blobs, is tough enough, but the fact that the files in the blobs are zipped really throws a wrench into the works.





Ideally, you would index the files before they are zipped. Then, through an OLAP cube, you could pull out the data that you need. If this isn't feasible, then the solution depends on your database. For example, in Oracle you can use Java to extract the files, unzip them in memory, then run regex queries on each file. Expect your queries to be mighty slow and processor intensive.





Worst case scenario, you select a single blob at a time, unzip it, run grep -e or another regex utility on it, store the results, then move on to the next file. Something like this will take quite a while to run.
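
The one-file-at-a-time approach could look roughly like the sketch below. It assumes the blobs have already been exported from the database into a directory as gzip-compressed files; the function name search_one_by_one, the directory, and the results file are made up for illustration. For real multi-file .zip archives you would swap gzip -dc for unzip -p.

```shell
#!/bin/sh
# Sketch only: assumes each blob was already exported to "$dir" as a .gz file.
# search_one_by_one is a hypothetical helper, not part of any database tool.
search_one_by_one() {
    dir=$1; phrase=$2; results=$3
    : > "$results"                      # start with an empty results file
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue         # skip if the glob matched nothing
        # Decompress into a pipe (nothing is written to disk) and count the
        # lines containing the phrase. -i ignores case, -F means literal text.
        count=$(gzip -dc "$f" | grep -i -F -c -- "$phrase" || true)
        # Store the result, then move on to the next file.
        if [ "$count" -gt 0 ]; then
            printf '%s\t%s\n' "$f" "$count" >> "$results"
        fi
    done
    return 0
}
```

Called as search_one_by_one /path/to/export "the more details you provide" results.tsv, this leaves one line per matching file in results.tsv: the file name, a tab, and the number of matching lines.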





My suggestion is to work out a way to index the files based on what you expect the regular expressions to be. Then run your queries against the indexed data for quick results.
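
As a very rough illustration of the indexing idea, the sketch below runs a list of expected patterns against each compressed file once and records which file matched which pattern; later queries only read the small index file. Everything here is an assumption for the example: the names build_index and lookup, the tab-separated index format, and the use of gzip-compressed files rather than database blobs.

```shell
#!/bin/sh
# Sketch only. build_index PATTERNS_FILE INDEX_FILE FILE...
# PATTERNS_FILE holds one expected regular expression per line.
build_index() {
    patterns=$1; index=$2; shift 2
    : > "$index"                        # start with an empty index
    for file in "$@"; do
        while IFS= read -r pat; do
            [ -n "$pat" ] || continue   # ignore empty lines
            # The file is decompressed once per pattern: slow, but done
            # only at index time. -E: extended regex, -i: ignore case.
            if gzip -dc "$file" | grep -i -q -E -- "$pat"; then
                printf '%s\t%s\n' "$pat" "$file" >> "$index"
            fi
        done < "$patterns"
    done
}

# lookup INDEX_FILE PATTERN: print the files recorded for that exact pattern.
lookup() {
    P=$2 awk -F '\t' '$1 == ENVIRON["P"] { print $2 }' "$1"
}
```

The slow pass happens once in build_index; lookup never touches the compressed files, which is the point of indexing ahead of time. It only answers for patterns that were in the list when the index was built.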





I'm going to hate myself for saying this, but check out Kimball's book on data warehousing. A bit more info than you'll need, but it'll provide a decent start for you.
Reply: First, if they're zipped up and stored in the database, you're out of luck, because you're going to have to extract all of them, unzip them, and then search them.





Assuming that you in fact have big blob columns with unzipped data files in them, you don't need regex for an exact match. Try:





SELECT something FROM table WHERE LOWER(column) LIKE '%the more details you provide%';





Good luck.
Reply: Here is what you need to do:





1. Simple example: Assume that you need to search for "the more details you provide" in a file a.zip. Run the command:





# zcat a.zip | grep -i "the more details you provide"





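Here is the explanation of this command: zcat decompresses the file and writes its contents to standard output, leaving the file itself untouched. (zcat reads gzip files, and zip archives only when they contain a single file; for a .zip holding several files, use unzip -p instead.) grep then looks for the text string in that output. The "-i" switch tells grep to ignore case. Add "-n" to grep if you also want line numbers.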





2. So if you know the list of files, you can do something like this:





# export FILELIST="a.zip b.zip c.zip d.zip"


# for file in $FILELIST; do zcat $file | grep -i "the more details you provide" | sed "s/^/$file: /"; done
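
A slightly more defensive variant of the loop, as a sketch: it takes the files as arguments (so names with spaces survive), uses grep -F so the phrase is treated as literal text rather than a regular expression, and prefixes each hit with the file name and line number. The function name grep_compressed is made up, and the files are assumed to be gzip-compressed; for multi-file .zip archives, unzip -p would take the place of gzip -dc.

```shell
#!/bin/sh
# Sketch: print every matching line as "file:line-number:text".
# grep_compressed is a hypothetical name; input files are assumed to be gzip.
grep_compressed() {
    phrase=$1; shift
    for file in "$@"; do
        # -i: ignore case, -n: show line numbers, -F: literal phrase, no regex
        gzip -dc "$file" | grep -i -n -F -- "$phrase" |
        while IFS= read -r line; do
            printf '%s:%s\n' "$file" "$line"
        done
    done
}
```

For example: grep_compressed "the more details you provide" a.gz b.gz c.gz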





Do let me know if you face any issues.
