Comparing the contents of two files

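I wrote a small script that compares the contents of two similar files. A tool such as the diff command is perfectly adequate for this, but I suspect some people have trouble learning to read diff output.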

Here we prepare the following two files.

$ cat test_org.txt
abcde
fghij
klmno
pqrst
uvw
xyz
123345
67890

$ cat test_new.txt
acbde
fghij
klmno
pqrst
uvwxyz
12345
67890
ABCDE

Of course, the differences can be found with the diff command.

$ diff -u test_org.txt test_new.txt
--- test_org.txt	2009-05-25 12:24:19.000000000 +0900
+++ test_new.txt	2009-05-25 13:18:25.000000000 +0900
@@ -1,8 +1,8 @@
-abcde
+acbde
 fghij
 klmno
 pqrst
-uvw
-xyz
-123345
+uvwxyz
+12345
 67890
+ABCDE

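The script presented here checks which lines of the original file are missing from the new file, and then checks the reverse. In addition, for each missing line it offers a suggestion in the style of Google's "Did you mean?", chosen by Levenshtein distance: a line in the other file counts as a candidate when its distance from the missing line is at most limit (2 in the script).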
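The first half of that idea (lines present in one file but absent from the other) can be tried on its own before reading the full script. The following is only a minimal sketch, not part of check_words.awk: the function name only_in and the variable names are hypothetical, and it assumes a POSIX shell and awk.

```shell
# only_in FILE_A FILE_B: print the lines of FILE_A that do not appear in FILE_B.
# (only_in is a hypothetical helper name, not part of check_words.awk.)
only_in() {
    awk -v other="$2" '
        BEGIN {
            # Load every line of FILE_B as a key of the array "seen".
            while ((getline line < other) > 0) {
                seen[line] = 1
            }
            close(other)
        }
        # For each line of FILE_A, print it only if it is not a key of "seen".
        !($0 in seen)
    ' "$1"
}
```

With the two sample files, only_in test_org.txt test_new.txt prints abcde, uvw, xyz and 123345, and swapping the arguments gives the reverse check.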

#! /usr/bin/gawk -f
# check_words.awk

BEGIN {
    org_file = ARGV[1];
    new_file = ARGV[2];
    limit = 2;

    # Read the original file
    while ((getline < org_file) > 0) {
        org_word[i++] = $0;
    }
    close(org_file);

    # Read the modified file
    while ((getline < new_file) > 0) {
        new_word[j++] = $0;
    }
    close(new_file);

    for (i in org_word) {
        count = 0;
        for (j in new_word) {
            if (org_word[i] == new_word[j]) {
                count = 1;
                break;
            } else {
                leven_dist = levenshtein_distance(org_word[i], new_word[j]);
                if (leven_dist <= limit) {
                    did_you_mean[org_word[i]] = new_word[j];
                }
            }
        }
        if (did_you_mean[org_word[i]] == "" && count == 0) {
            print org_word[i] " is not fount in new file.";
        } else if (did_you_mean[org_word[i]] != "" && count == 0) {
            print org_word[i] " is not fount in new file. "\
                  "But did you mean " did_you_mean[org_word[i]] "?";
            delete did_you_mean;
        } else {
            print org_word[i] " is found in new file.";
        }
    }

    print "---------------------------------------------";

    for (i in new_word) {
        count = 0;
        for (j in org_word) {
            if (new_word[i] == org_word[j]) {
                count = 1;
                break;
            } else {
                leven_dist = levenshtein_distance(new_word[i], org_word[j]);
                if (leven_dist <= limit) {
                    did_you_mean[new_word[i]] = org_word[j];
                }
            }
        }
        if (did_you_mean[new_word[i]] == "" && count == 0) {
            print new_word[i] " is not fount in org file.";
        } else if (did_you_mean[new_word[i]] != "" && count == 0) {
            print new_word[i] " is not fount in org file. "\
                  "But did you mean " did_you_mean[new_word[i]] "?";
            delete did_you_mean;
        } else {
            print new_word[i] " is found in org file.";
        }
    }
}

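# Return the Levenshtein distance between two strings
# (the names after the extra spaces in the parameter list are local variables)
function levenshtein_distance(s1, s2,   a_str1, a_str2, i, j, len_s1, len_s2, distance, cost) {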
    len_s1 = length(s1);
    len_s2 = length(s2);
    split(s1, a_str1, "");
    split(s2, a_str2, "");
    for (i = 0; i <= len_s1; i++) {
        distance[i, 0] = i;
    }
    for (j = 0; j <= len_s2; j++) {
        distance[0, j] = j;
    }
    for (i = 1; i <= len_s1; i++) {
        for (j = 1; j <= len_s2; j++) {
            if (a_str1[i] == a_str2[j]) {
                cost = 0;
            } else {
                cost = 1;
            }
            distance[i, j] = min3(distance[i - 1, j    ] + 1,\
                                  distance[i    , j - 1] + 1,\
                                  distance[i - 1, j - 1] + cost);
        }
    }
    return distance[len_s1, len_s2];
}

# Return the minimum of three values
function min3(a, b, c,    min) {
    min = a;
    if (b <= min) {
        min = b;
    }
    if (c <= min) {
        min = c;
    }
    return min;
}
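To get a feel for how the limit of 2 behaves, it can help to compute a few distances by hand. The sketch below wraps the same dynamic-programming table in a shell function. The name lev_dist is hypothetical, and it uses substr() instead of split(s, a, "") on the assumption that the latter may not be available in every awk.

```shell
# lev_dist S1 S2: print the Levenshtein distance between two strings.
# (lev_dist is a hypothetical wrapper name used only for this sketch.)
lev_dist() {
    awk -v s1="$1" -v s2="$2" '
        function min3(a, b, c,    m) {
            m = a
            if (b < m) m = b
            if (c < m) m = c
            return m
        }
        BEGIN {
            n1 = length(s1)
            n2 = length(s2)
            # d[i, j] is the distance between the first i characters
            # of s1 and the first j characters of s2.
            for (i = 0; i <= n1; i++) d[i, 0] = i
            for (j = 0; j <= n2; j++) d[0, j] = j
            for (i = 1; i <= n1; i++) {
                for (j = 1; j <= n2; j++) {
                    cost = (substr(s1, i, 1) == substr(s2, j, 1)) ? 0 : 1
                    d[i, j] = min3(d[i - 1, j] + 1,
                                   d[i, j - 1] + 1,
                                   d[i - 1, j - 1] + cost)
                }
            }
            print d[n1, n2]
        }
    '
}
```

For the sample data, lev_dist abcde acbde prints 2 and lev_dist 123345 12345 prints 1, so both pairs fall within the limit. lev_dist uvw uvwxyz prints 3, which is why uvw gets no suggestion.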

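Running it shows what the output looks like. Note that the lines do not come out in file order, because the order in which awk's for (i in array) visits the elements is unspecified.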

$ nawk -f check_words.awk test_org.txt test_new.txt
klmno is found in new file.
pqrst is found in new file.
uvw is not fount in new file.
xyz is not fount in new file.
123345 is not fount in new file. But did you mean 12345?
67890 is found in new file.
abcde is not fount in new file. But did you mean acbde?
fghij is found in new file.
---------------------------------------------
klmno is found in org file.
pqrst is found in org file.
uvwxyz is not fount in org file.
12345 is not fount in org file. But did you mean 123345?
67890 is found in org file.
ABCDE is not fount in org file.
acbde is not fount in org file. But did you mean abcde?
fghij is found in org file.
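The report above starts with klmno rather than abcde because a for (i in array) loop does not promise any particular order. If file order matters, one option is a counting loop over the numeric indices. This is only a sketch under that assumption: the name in_order and the shortened messages are hypothetical, and the "did you mean" part is left out.

```shell
# in_order FILE_A FILE_B: report, in FILE_A's line order, whether each
# line of FILE_A also appears in FILE_B. (in_order is a hypothetical name.)
in_order() {
    awk -v other="$2" '
        BEGIN {
            # Remember every line of FILE_B.
            while ((getline line < other) > 0) {
                seen[line] = 1
            }
            close(other)
        }
        # Store the lines of FILE_A under the indices 0, 1, 2, ...
        { word[n++] = $0 }
        END {
            # A counting loop visits the indices in ascending order,
            # unlike "for (k in word)", whose order is unspecified.
            for (k = 0; k < n; k++) {
                if (word[k] in seen) {
                    print word[k] " is found"
                } else {
                    print word[k] " is not found"
                }
            }
        }
    ' "$1"
}
```

Applied to check_words.awk itself, the same change would mean replacing for (i in org_word) with a loop that counts from 0 up to the number of lines read.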

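Every line is reported, whether or not it exists in the other file. With many lines this gets tedious, so piping the output to grep prints only the differences.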

$ nawk -f check_words.awk test_org.txt test_new.txt | grep not
uvw is not fount in new file.
xyz is not fount in new file.
123345 is not fount in new file. But did you mean 12345?
abcde is not fount in new file. But did you mean acbde?
uvwxyz is not fount in org file.
12345 is not fount in org file. But did you mean 123345?
ABCDE is not fount in org file.
acbde is not fount in org file. But did you mean abcde?
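One caveat with grep not: it also keeps any report whose compared line happens to contain the letters "not", such as a hypothetical line "notes". A slightly stricter filter matches the fixed phrase the script prints. The helper name only_missing below is an assumption for illustration.

```shell
# only_missing: read the output of check_words.awk on stdin and keep
# only the "not fount" reports. (only_missing is a hypothetical name.)
only_missing() {
    # Match the whole phrase the script prints (including its "fount"
    # spelling) instead of the bare word "not", so that a compared line
    # which itself contains "not" is not kept by mistake.
    grep -E ' is not fount in (new|org) file\.'
}
```

It is used the same way as before: nawk -f check_words.awk test_org.txt test_new.txt | only_missing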

I wrote this when I had occasion to compare two similar lists.
