Keyword Search in Autopsy 4.4.0 (Standard Tokenizer)

How email addresses and domain names are split

Autopsy's Solr schema file, schema.xml, declares solr.StandardTokenizerFactory. The behavior of the Apache Solr Standard Tokenizer is described on the following page.

https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer

Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

Dots and the @ character therefore need care. Email addresses and URLs in particular are affected by these rules, so the rules should be kept in mind when choosing a search string.

For English text, the page above gives the following example of how a string is split.

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
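The splitting rule can be tried out with a small Python sketch. This is only an approximation written for this example, not the Lucene implementation: the regular expression and the name approximate_tokens are assumptions, and the pattern handles ASCII letters and digits only. For the sample sentence it yields the same list as the Out: line quoted above.

```python
import re

# Approximation only (not Lucene's StandardTokenizer): a token is a run of
# ASCII letters/digits, optionally continued by ".<letters/digits>".
# Every other character ("@", "-", ",", ":", whitespace) separates tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def approximate_tokens(text):
    """Return the tokens found by the approximate pattern, in order."""
    return TOKEN_PATTERN.findall(text)


sample = "Please, email john.doe@foo.com by 03-09, re: m37-xq."
print(approximate_tokens(sample))
```

A dot followed by a letter or digit stays inside the token, while the trailing dot at the end of the sentence is dropped because nothing follows it.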

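When the example sentence above is searched with Autopsy's keyword search, an exact match search for john, for instance, produces no hit. The token is john.doe, with the period included, so a substring match search is needed.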

The same applies to URL strings. With www.example.co.jp, for instance, finding example requires either a substring match search for example or an exact match search for the whole string www.example.co.jp.
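The difference between the two search types can be modeled roughly as follows. This is a simplified sketch, not Autopsy's actual query logic: the function names exact_match and substring_match are hypothetical, and the case-insensitive comparison is an assumption made for illustration.

```python
def exact_match(term, tokens):
    """Simplified model: the term must equal a whole token."""
    wanted = term.lower()  # case-insensitive comparison is an assumption
    return any(wanted == token.lower() for token in tokens)


def substring_match(term, tokens):
    """Simplified model: the term may appear anywhere inside a token."""
    wanted = term.lower()
    return any(wanted in token.lower() for token in tokens)


# Tokens as they would look after the splitting described above.
tokens = ["john.doe", "foo.com", "www.example.co.jp"]

for term in ("john", "john.doe", "example", "www.example.co.jp"):
    print(term, exact_match(term, tokens), substring_match(term, tokens))
```

In this model john and example are found only by the substring search, while john.doe and www.example.co.jp are found by both.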

In the sentence below, for instance, the part "No. 221B" is hit by an exact match search for either No. 221B or 221B.

WE MET next day as he had arranged, and inspected the rooms at No.  221B, Baker Street, of which he had spoken at our meeting.

Note that the word No is listed in stopwords_en.txt, so it is handled as a stop word.

When several patterns are possible, such as 03-09, 03 09, and 03+09, a regular expression can be used to search for all of them.
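One pattern that covers the three variants is 03[-+ ]09. The sketch below checks it with Python's re module. Treat it as an illustration only: the regular expression syntax accepted by Autopsy's keyword search may differ from Python's, and the name DATE_LIKE is made up for this example.

```python
import re

# A character class listing the three separators seen in the examples.
# The hyphen comes first so that it is read literally, not as a range.
DATE_LIKE = re.compile(r"03[-+ ]09")

candidates = ["03-09", "03 09", "03+09", "0309"]
for candidate in candidates:
    print(repr(candidate), bool(DATE_LIKE.search(candidate)))
```

The first three strings match. The last one, which has no separator, does not.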

Note that the keyword search in Autopsy 4.4.0 supports neither Boolean search with AND or OR nor Proximity search.

Handling of hyphens

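In the example sentence above, "m37-xq" becomes "m37", "xq", so the hyphen appears to be treated as a word boundary. When a string contains a hyphen, the word is split at that position.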

UAX #29, section 4.1.1 Word Boundary Rules, has the following note about hyphens.

UAX #29: Unicode Text Segmentation

The correct interpretation of hyphens in the context of word boundaries is challenging. It is quite common for separate words to be connected with a hyphen: “out-of-the-box,” “under-the-table,” “Italian-American,” and so on. A significant number are hyphenated names, such as “Smith-Hawkins.” When doing a Whole Word Search or query, users expect to find the word within those hyphens. While there are some cases where they are separate words (usually to resolve some ambiguity such as “re-sort” as opposed to “resort”), it is better overall to keep the hyphen out of the default definition. Hyphens include U+002D HYPHEN-MINUS, U+2010 HYPHEN, possibly also U+058A ARMENIAN HYPHEN, and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN.
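To predict which terms to search for when a string contains one of these hyphens, the split can be sketched in Python. This only illustrates splitting at the four code points named in the note. The m37-xq example shows the split for U+002D, but whether Autopsy's tokenizer behaves the same way for the other three is not confirmed here. The names HYPHENS and split_on_hyphens are hypothetical.

```python
import re

# The four hyphen code points named in the UAX #29 note:
# HYPHEN-MINUS, HYPHEN, ARMENIAN HYPHEN, KATAKANA-HIRAGANA DOUBLE HYPHEN.
HYPHENS = "\u002d\u2010\u058a\u30a0"

# Character class built from the escaped code points.
HYPHEN_PATTERN = re.compile("[" + re.escape(HYPHENS) + "]")


def split_on_hyphens(text):
    """Split text at any listed hyphen and drop empty pieces."""
    return [part for part in HYPHEN_PATTERN.split(text) if part]


for word in ("m37-xq", "out-of-the-box", "Smith\u2010Hawkins"):
    print(word, split_on_hyphens(word))
```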