People often want to locate a text document that is similar to, but not identical to, another document. A common example is a Word document saved in PDF format. The binary content of the Word and PDF versions is completely different, because they are different file types opened by different applications. Even the text content differs: the PDF format treats page numbering as part of the text, and Word does not.
If you want to find the Word document from which a PDF document was produced, and the names of the two files are quite different, then Windows offers little help out of the box. Windows Search does not index the text content of PDF files unless a PDF text extractor is added to the extractors that it uses, through the IFilter interface, to create a searchable text index, and the index is then rebuilt.
This means that you can’t search for some long string that you know is included in both the Word and PDF versions of the file. Another situation arises when slightly different documents (perhaps differing only in the salutation) are emailed as attachments to a number of recipients, perhaps with different file names. You might want to identify all of the email messages to which the same basic document was attached.
When documents are generated in a multi-author environment, accessing the latest version is often crucial. If all authors use a document management system, then version control can solve this problem. However, the fact that version control is available does not mean that it will be used. Authors often like to add their initials, and sometimes a date, to file names. The result is a proliferation of documents that makes it difficult to determine which one contains the most recent changes.
FindAlike is designed to identify documents with similar, but not necessarily identical, text. It builds on two things: the capability of Windows Search to index new or modified text documents, and a novel method of rapid, approximate matching of text content. FindAlike can be used as a standalone desktop application or as an add-in for the Microsoft Office products Word, Outlook, PowerPoint, and Excel.
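To make the idea of approximate text matching concrete, here is a minimal sketch in Python. It is not FindAlike's method, which is not described here. It uses a common stand-in technique: Jaccard similarity over word shingles (overlapping runs of k words). The function names, the shingle length of 3, and the sample texts are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets: 1.0 for identical text, 0.0 for no overlap."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts: the same letter with a different salutation and
# a trailing page number (as extracted PDF text might contain), plus an unrelated text.
word_text = ("Dear Alice, please find the quarterly report attached "
             "for your review before the meeting on Friday.")
pdf_text = ("Dear Bob, please find the quarterly report attached "
            "for your review before the meeting on Friday. Page 1")
unrelated = "The weather in the mountains was cold and windy all week."

print(jaccard_similarity(word_text, pdf_text))   # high score: near-duplicates
print(jaccard_similarity(word_text, unrelated))  # no shared shingles
```

A score near 1 suggests that two documents share most of their text even when salutations or page numbers differ. Comparing every pair of documents this way is slow for large collections, which is presumably why a rapid method backed by an index matters in practice.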