Records management is an unglamorous but essential part of the operation of any organization. The computerization of the workplace brought about some paradoxical changes (see Computers and the Death of Recordkeeping), and the rise to prominence of Artificial Intelligence has resulted in renewed interest in creating automated systems to replace the historical role of the filing clerk. The spectacular success of Web search engines such as Google in retrieving desired information naturally makes people wish they could replicate that success within the organizations they work for. Google realized this too, and the history of the Google Search Appliance (which reaches its official end of life in 2019, after 17 years) illustrates its ultimately unsuccessful attempt to meet that wish. No other search vendor has done any better, but a number of specialist Records Management companies are tackling this problem with automatic classification – assigning documents to file plan categories using automated methods. As paper records management gives way to digital, the problems of effective management of organizational records become more obvious, and vendors are jumping at the opportunity. One problem at least is becoming less serious – as born-digital documents extend further into the past, the task of extracting machine-readable text from scanned document images is diminishing, and technical improvements in optical character recognition mean that text extracted from scanned documents is more likely than in the past to represent what was written. However, the task of grouping and classifying documents in the manner in which filing clerks used to operate has not seen a similar improvement.
There are two main methods of automatic classification: rule-based and training-set-based. Rule-based classification has a long history (back to the 1970s) and is simple to apply and understand. The occurrence of a word or phrase in a document, or of words or phrases in proximity to each other, is taken to indicate membership of a particular class. In the simplest case, a single occurrence of a word or phrase is sufficient to place a document into a particular class. This approach is used by products intended for tagging rather than records management. Variants of this approach use statistical rather than binary classification, and may perform deterministic processing of text (such as word stemming) to improve performance. The task of defining a set of rules may be purely manual, or may use a training set of already classified documents to identify words and phrases occurring more commonly in a particular class and so generate rules semi-automatically.
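The simplest form of rule-based classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category names, the phrases, and the proximity window are all hypothetical and are not taken from any particular product or file plan.

```python
import re

# Hypothetical keyword rules: file plan category -> phrases that trigger it.
KEYWORD_RULES = {
    "Finance/Invoices": ["invoice", "purchase order"],
    "HR/Recruitment": ["job offer", "interview"],
}

# Hypothetical proximity rules: (word_a, word_b, max distance in words, category).
PROXIMITY_RULES = [
    ("database", "upgrade", 5, "IT/Change Management"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """Return True if the phrase occurs as a contiguous run of tokens."""
    words = tokenize(phrase)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def within_proximity(tokens, word_a, word_b, max_distance):
    """Return True if word_a and word_b occur within max_distance tokens."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def classify(text):
    """Binary rule matching: one hit is enough to assign a category."""
    tokens = tokenize(text)
    categories = set()
    for category, phrases in KEYWORD_RULES.items():
        if any(contains_phrase(tokens, p) for p in phrases):
            categories.add(category)
    for word_a, word_b, distance, category in PROXIMITY_RULES:
        if within_proximity(tokens, word_a, word_b, distance):
            categories.add(category)
    return sorted(categories)


print(classify("Please find attached the invoice for the database server upgrade."))
```

A document can land in several categories at once, which suits tagging better than filing. A statistical variant would replace the single-hit test with a score per category, and stemming would be applied inside `tokenize`.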
Modern Artificial Intelligence classification may use neural nets to perform classification, treating words and phrases simply as tokens in the same way as features extracted from images. Such methods require large training sets, and the rules they generate are unknowable and cannot be edited. The token-based analysis means that if the words in the text content are randomly re-ordered, turning the document into gibberish, the classification result remains unchanged, except for effects caused by the disruption of phrases. As more documents are classified, and the classifications approved, the number and quality of rules may increase, allowing vendors to claim that their systems learn from experience. However, the lack of transparency of the rules used by the networks means that content of a document unrelated to its meaning (such as an organization’s address) may end up forming the basis of a classification rule.
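The point about re-ordered words can be illustrated with a small Python sketch. It uses a plain bag-of-words count as a stand-in for the token features, not a neural net, and the sample sentence is invented for the illustration. Single-word counts survive shuffling intact, while two-word phrase counts generally do not.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Count single tokens, ignoring their order."""
    return Counter(tokenize(text))


def bag_of_bigrams(text):
    """Count adjacent token pairs, a simple proxy for phrases."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def shuffled(text, seed=0):
    """Return the text with its words randomly re-ordered."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


# Invented sample sentence, for illustration only.
original = "The database upgrade failed and the rollback plan was not approved"
scrambled = shuffled(original)

# Word counts are identical, so any classifier built only on them
# cannot tell the original from the gibberish version.
same_unigrams = bag_of_words(original) == bag_of_words(scrambled)

# Phrase counts usually change, which is the "disruption of phrases" effect.
same_bigrams = bag_of_bigrams(original) == bag_of_bigrams(scrambled)

print(scrambled)
print("single-word features unchanged:", same_unigrams)
print("two-word phrase features unchanged:", same_bigrams)
```

Real systems use richer features than raw counts, so this shows the principle rather than the behaviour of any specific product.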
Another problem with content-based classification is that the most important classification features of documents created for a small audience are often not explicitly mentioned, as the creator assumed that anyone viewing the document would already know what it was about. A specific example from one company was a document relating to problems encountered with an Oracle database upgrade. The content never mentioned the word “Oracle” and referred only once to the version number of the database. The document name and provenance are often a much better guide to classification than the text content.
The increasing power of computers, and particularly the availability of massive cloud resources, masks the fact that automatic classification (and machine translation between languages) is based on statistical analysis of words and phrases, not on their meaning, as extracted by any human reader. The nuances of language are such that even the detection of negation in a text is a PhD topic, like deciding whether the sentence “Time flies like an arrow” is a statement about a type of fly or about human experience. The spectacular success of Artificial Intelligence in playing rule-based board games such as chess and Go does not involve any automated understanding of the complexities of language use, and its success in quiz games is based more on good database design than anything else. Reputedly, when IBM’s Watson was asked “Who was the first woman in space?” its answer was “Wonder Woman”, as it had no concept of the distinction between fiction and non-fiction.
Notwithstanding the limitations of statistical classification, any type of classification is usually better than none, even with a high error rate, but expectations should be tempered. The problem of language understanding may well be cracked in the future and automated filing clerks will become available, but we are certainly not at this stage yet.