Duplication of files was a subject of direct concern when electronic documents were stored on file shares and storage costs per gigabyte were significant. Now that storage costs have plummeted (for both domestic and organisational users), the actual cost of storage media is seldom a concern. If storage media are full, it is usually cheaper to buy bigger ones than to pay for the time and effort required to clean up existing storage. When backups took significant time to run, the fact that they could not be completed within the available time window due to storage bloat sometimes prompted a cleanup.
Cloud storage has re-introduced storage cost as a factor. Domestic cloud storage providers usually offer a free quota, and paid additions to this sometimes attract attention, as does the bandwidth consumed by backups and synchronisation.
Exact duplicate detection for files is very straightforward and is usually achieved by generating a checksum of the byte content of the item to be analysed, which may be a file or a disk storage block. Duplication often occurs when the same file is copied to a different folder. De-duplication is often built into backup software, as it can reduce the size of backup files. Because exact duplicate detection works at the byte level, the nature of the files being de-duplicated does not matter and it works on any file type. This allows server operating systems to build de-duplication in at the block level, so that exact copies of files increase the space used on the server by only a very small, constant amount, making dedicated de-duplication software redundant on most server operating systems.
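The checksum approach can be sketched in a few lines. This is an illustrative example, not the implementation of any particular tool; the paths passed in are whatever files are to be analysed:

```python
# Sketch of exact duplicate detection by content checksum.
# Files with identical byte content produce identical digests.
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's byte content."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(paths):
    """Group paths whose byte content is identical."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

Because only the bytes are hashed, a copy of a file in a different folder or under a different name is detected just as readily as one in the same folder.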
Near-duplication is much more difficult to define, as it has a perceptual component. Remediation is similarly difficult – differences between two text documents may be small but highly significant, requiring that two very similar documents be treated in different ways.
Near duplicate images are commonly generated when people take more than one digital photograph of a particular scene. Any movement of the subject or the camera will result in the images having different pixels, and even if this does not occur, the inclusion of time data in embedded file metadata will result in images not being identical. Two image scans of the same document will not produce identical files, as the document is likely to have moved by at least one pixel between scans.
There are many applications for de-duplicating digital photos, some of which are reviewed here. Most are coy about the algorithm which they use, but a common process is simply to sub-sample the images to a small number of pixels and then use the proportion of matching pixels as the degree of similarity. Pixels may be matched either by colour or by intensity after conversion to grey scale. This process is fast and gives results which are generally similar to those which would be obtained by human inspection. Users are presented with clusters of images meeting a selected similarity threshold and can select which to keep and which to delete.
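The sub-sample-and-compare process can be sketched as follows. This is a minimal illustration, not any reviewed product's algorithm; images are modelled here as 2-D lists of grey-scale values, whereas a real tool would first decode the image files:

```python
# Sketch of near-duplicate image comparison by sub-sampling.
# Images are 2-D lists of grey-scale values (0-255).

def downsample(img, size=8):
    """Shrink an image grid to size x size by block averaging."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [img[y][x]
                     for y in range(r * h // size, (r + 1) * h // size)
                     for x in range(c * w // size, (c + 1) * w // size)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

def similarity(img_a, img_b, size=8, tolerance=16):
    """Proportion of sub-sampled pixels that match within a tolerance."""
    a, b = downsample(img_a, size), downsample(img_b, size)
    matches = sum(abs(pa - pb) <= tolerance
                  for ra, rb in zip(a, b)
                  for pa, pb in zip(ra, rb))
    return matches / (size * size)
```

The tolerance absorbs the small pixel-level differences that camera or subject movement introduces, which is why two photos of the same scene can still score close to 1.0.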
The process of near-matching for text documents is much more complex. Part of the complexity comes from defining what constitutes similar text content. Issues to consider include:
- Does the similarity metric attempt to extract meaning from the text?
- How much text is being analysed?
- What is the similarity metric?
This complexity encourages academic investigation and there are many research papers on text similarity. Different approaches were reviewed in 2013 by Gomaa and Fahmy. Modern neural-network techniques are being applied both to extract meaning from text and to estimate similarity. Google have a natural interest in this area for search result ranking and have a blog post on the subject.
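One simple and widely studied metric – offered here as an illustration, not as what any particular search engine or product actually uses – is Jaccard similarity over word "shingles" (overlapping word n-grams):

```python
# Sketch of a simple text similarity metric: Jaccard similarity
# over word n-grams ("shingles"). No meaning is extracted; only
# surface word sequences are compared.

def shingles(text, n=3):
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(text_a, text_b, n=3):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A metric like this answers the three questions above in one particular way: it extracts no meaning, it analyses whole texts, and its score is the overlap of shingle sets. Other choices give quite different notions of "similar".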
Despite the fact that search engines such as Google most likely use some sort of text similarity measure in ranking search results, very little software is available to deal with the many situations where near-matching of text would be useful. These situations include:
- Finding the parent Word document of a PDF file.
PDF versions of Word documents are commonly generated, and are often sent as email attachments, as they are more difficult to edit and historically have been readable on a wider range of platforms than Word documents. Small variations to documents are often made for different recipients. Where the content of such a document needs to be changed, finding the original Word document for editing may prove difficult if the file name has been changed.
- Finding the email recipients of documents
Emails are frequently sent to multiple recipients at different times with the same or slightly different text document attachments but slightly different body and subject content. Identifying all these recipients on the basis of body and subject content can require time-consuming inspection of large numbers of sent emails.
- Finding the latest version of a document
Version control is a feature of document management systems (such as Microsoft SharePoint), but this feature may not always be used in a high-pressure collaborative authoring environment, such as the preparation of tender responses. Deciding whether a document is a new version of an earlier one or a completely new one is often a matter of opinion. Naming conventions, such as including the date and the author's initials in a file name, may be successful but require a degree of discipline among authors which may not always be present. As later document versions commonly differ only slightly from earlier versions, detecting them as near-duplicates provides a method of identifying all versions of a document and using timestamp metadata to identify the most recent.
- Finding the correct record-keeping container for documents
Misfiling of documents in a record-keeping system is a problem which has always been present but which has become much worse as responsibility for filing has been delegated to document creators rather than specialist staff. (See an article here on this topic.) Whilst the documents in a given record-keeping container may not always have any similarity, they are frequently variants of a single document (such as a filled-in form), and detecting similarity between a document to be filed and documents already in a container may guide the document creator in correctly filing the document.
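The latest-version scenario above can be sketched as a small routine. This is a hypothetical illustration: documents are modelled as (name, timestamp, text) tuples and compared with a simple word-overlap score, whereas a real tool would extract text and timestamps from the files themselves:

```python
# Sketch: cluster near-duplicate documents, then pick the newest of
# each cluster by timestamp. Documents are (name, timestamp, text).

def word_overlap(a, b):
    """Jaccard similarity of the two texts' word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def latest_versions(docs, threshold=0.8):
    """Group near-duplicate docs; return the newest of each group."""
    clusters = []
    for doc in docs:
        for cluster in clusters:
            # Compare against the first member of each cluster.
            if word_overlap(doc[2], cluster[0][2]) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    # The most recent timestamp in each cluster is the latest version.
    return [max(c, key=lambda d: d[1]) for c in clusters]
```

The threshold embodies the "matter of opinion" noted above: set too low, unrelated documents merge; set too high, heavily revised versions appear as new documents.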
Applications for Finding Near-Duplicate Text Documents
Legal discovery applications offer this facility routinely, as their task is to locate documents which may be relevant to particular legal issues. Their results are usually filtered by a human before being incorporated into the legal process. Because of the high cost of legal operations, legal discovery applications are far more expensive than general consumer applications. They seldom provide any detail of the algorithms used for similarity matching and tend to err on the side of generating false positives, which are then removed by human review.
General purpose near-duplicate detection software is reviewed here.