Garayed.com  

01-09-2007, 04:22 AM
Anonymous
 
Re: looking for sorting/comparing utility

"r" == rowe22@commawait6.net <rowe22@commawait6.net>:
r> say i have three files, each contains hundreds of names.
r> now i want to know those names that are in at least two of the files, or
r> find out those in all three files, and put these names in a new file. how
r> do i do that?

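Compute the MD5 hash of each line, sort on the hash so that identical
lines end up next to each other, and count how often each hash occurs.
Note that this can get really slow for large files, because the script
below runs md5sum once per input line. In that case use perl and/or a
simpler, faster checksum such as crc32.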

Assuming that your data files are named file1, file2, file3, ..., here
is a simple shell script (let's call it select_common_lines) that does
what you need:

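#!/bin/sh
# Usage: select_common_lines N file...
# Prints every line that occurs at least N times in the given files.
# A line repeated inside one file counts once per occurrence, so
# run each file through "sort -u" first if that matters.
MUST_BE_IN_SO_MANY_FILES="$1"; shift
for f in "$@"
do
    # Prefix each line with its MD5 hash: "<hash> <line>"
    while IFS= read -r line; do
        hash=`printf '%s\n' "$line" | md5sum`
        printf '%s %s\n' "${hash%% *}" "$line"
    done < "$f"
done |
sort -k 1,1 |
awk -v need="$MUST_BE_IN_SO_MANY_FILES" '
    # Equal lines are now adjacent; count the length of each run.
    { hash = $1 }
    hash == prev_hash { found_same++ }
    hash != prev_hash { found_same = 1 }
    # Print the line itself (without the hash) when the count is reached.
    found_same == need { sub(/^[^ ]+ /, ""); print }
    { prev_hash = hash }
'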


Sample usage:
$ echo 1 4 5 9 13 17 21 25 29 | tr ' ' '\n' >file1
$ echo 11 13 4 15 17 19 21 | tr ' ' '\n' >file2
$ echo 20 21 4 22 23 24 | tr ' ' '\n' >file3
$ ./select_common_lines 2 file*
4
17
13
21
$ ./select_common_lines 3 file*
4
21
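
For large inputs, where starting md5sum once per line becomes the bottleneck, one possible alternative is to skip the explicit hashing and let awk's associative arrays do the bookkeeping. This is a different technique from the MD5 approach above, not a faster version of it. The function name common_lines is made up for this sketch, and it assumes every distinct line fits in memory. Unlike the script above, it counts a line at most once per file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): print every line
# that appears in at least N of the given files.
common_lines() {
    need="$1"; shift
    awk -v need="$need" '
        # seen[FILENAME, $0] is zero the first time this file shows this
        # line, so duplicates inside one file are counted only once.
        !seen[FILENAME, $0]++ {
            # count[$0] is the number of distinct files containing the line;
            # print it at the moment the threshold is reached.
            if (++count[$0] == need) print
        }
    ' "$@"
}
```

With the three sample files, "common_lines 2 file1 file2 file3" prints 13, 4, 17, 21 and "common_lines 3 file1 file2 file3" prints 21, 4. The lines come out in the order in which they reach the threshold, not in hash order.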

All times are GMT.