"r" == rowe22@commawait6.net <rowe22@commawait6.net>:

r> say i have three files, each contains hundreds of names.
r> now i want to know those names that are in at least two of the files, or
r> find out those in all three files, and put these names in a new file. how
r> do i do that?

Compute the MD5 hash of each line and sort/select using that. Note that this gets really slow for large files, because the script below starts one md5sum process per input line. In such a case use perl and/or some simpler/faster hash like crc32.

Assuming that your data files are named file1, file2, file3, ..., here is a simple shell script (let's call it select_common_lines) that does what you need:

    #!/bin/sh
    # Usage: select_common_lines N file...
    # Prints every line that occurs at least N times across the given files.
    # A line repeated inside one file is counted once per occurrence.
    MUST_BE_IN_SO_MANY_FILES="$1"; shift
    for f in ${1+"$@"}
    do
        while :; do
            IFS= read -r line || break
            # Emit "<md5> - <line>" so that equal lines sort next to each other.
            echo `echo "$line" | md5sum` "$line"
        done < "$f"
    done |
    sort -k 1,1 |
    awk -v need="$MUST_BE_IN_SO_MANY_FILES" '
        ($1 == prev_line) && (NR != 1) {
            found_same++
            # The original line starts after the 32-character hash and " - ".
            if (found_same == need) print substr($0, 36)
        }
        ($1 != prev_line) { found_same = 1 }
        { prev_line = $1 }
    '

Sample usage:

    $ echo 1 4 5 9 13 17 21 25 29 | tr ' ' '\n' >file1
    $ echo 11 13 4 15 17 19 21 | tr ' ' '\n' >file2
    $ echo 20 21 4 22 23 24 | tr ' ' '\n' >file3
    $ ./select_common_lines 2 file*
    4
    17
    13
    21
    $ ./select_common_lines 3 file*
    4
    21
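
For larger inputs, one way to avoid the per-line md5sum cost is to skip the external hash entirely and let awk's associative arrays do the counting. This is only a sketch, and it swaps in a different technique from the perl or crc32 route suggested above. The function name common_lines and the variable names are made up for illustration. It also counts each file at most once per line, which is an assumption about what "in at least two of the files" should mean.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print every line that
# occurs in at least N of the given files, counting each file at most once.
# Usage: common_lines N file...
common_lines() {
    need="$1"; shift
    awk -v need="$need" '
        # First line of each file: forget what the previous file contributed.
        FNR == 1 { split("", seen_here) }
        # First time this line shows up in the current file: bump its file count
        # and print it the moment the count reaches the threshold.
        !($0 in seen_here) {
            seen_here[$0] = 1
            if (++count[$0] == need) print
        }
    ' "$@"
}
```

With the sample files from the post, `common_lines 2 file1 file2 file3` prints 13, 4, 17, 21 and `common_lines 3 file1 file2 file3` prints 21, 4, one per line. The names come out in the order in which they reach the threshold rather than in hash order, so pipe the result through sort if a stable order matters.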