Libraries have faced the problem of duplicate records for more than 20 years; as they developed computer systems for managing library catalogs and, in particular, for sharing bibliographic records in a networked environment, duplicate records were identified as a problem and techniques for eliminating them were developed. Traditional de-duplication, however, is significantly different from identifying all the citations of a particular work; in many ways, identifying works is easier.
This chapter describes in detail the algorithm for identifying author-title clusters and the characteristics of the source records that make this problem hard. It presents a novel algorithm that uses randomized full-text searches to detect potential duplicates. The algorithm is analyzed and two kinds of failures are examined: including a record in a cluster when it cites a different work (a false match) and creating two different clusters whose members describe the same work (a missed match). The chapter also surveys other approaches to identifying related records: two early systems used for de-duplication in library catalogs and a recent multidatabase integration system.
The record comparison test addresses three sources of variation between records that describe the same document: entry errors, which are primarily typographic; differences in cataloging standards; and abbreviations.
Entry errors are unintentional errors made during the creation and transcription of bibliographic records. The most common entry errors are simple misspellings and typographical errors, but improperly formatted records and records that omit required fields are also common. Typical format errors include placing information in the wrong field, such as putting the month and year in the year field, or not using the required syntax for a field, e.g. entering the year as a two-digit number instead of a four-digit number.
Misspellings and typographical errors are common to most text databases, and the literature about them is substantial. O'Neill and Vizine-Goetz's overview of quality control in text databases provides a thorough, though somewhat dated, discussion of the sources of errors and some techniques for correcting them; Kukich surveys techniques for automatically detecting and correcting spelling errors.
While entry errors are an unintentional source of variability, differing cataloging standards are often intentional. Many bibliographies apply a consistent set of cataloging rules, but the standards are quite different from one bibliography to another.
Cataloging standards affect what kind of information is recorded in a citation and where it is recorded. If a citation describes an extended abstract, for example, some bibliographies record that fact in the title field and others record it in the note or annote field.
Cataloging standards also determine what document type the citation is assigned or whether a new citation is created for each instance of publication. When an article is published more than once, some databases include a record that describes one of the publications fully and includes information about the other publication in one of the comment fields. Similarly, many conference proceedings are published as issues of a journal; sometime these articles are cataloged as Bibtex InProceedings citations and other times as Article.
Authors' names are a particular source of trouble. The same name can be cataloged many ways: last name only, first initial and last name, all initials and last name, first name and last name, etc. In a list of authors, some bibliographies include only the first author (only sometimes indicating that there are other authors) and others include all the names.
Unusual names frequently cause problems. In a sample of 1,000 records, 14 uses of ``Jr.'' were found, but none followed the specified format. The correct placement of ``Jr.'' within the author string is: ``Steele, Jr., Guy L.''
Abbreviations are another source of problems caused by differences in cataloging. Journal and conference names and publishers are abbreviated in many non-standard ways, making comparisons using these fields difficult. (Bibtex allows users to define abbreviations within a Bibtex file; the abbreviations are expanded when a bibliography is processed. These are not the abbreviations discussed here; rather, the problem lies with abbreviations in the actual text of the field.)
The problems posed by abbreviations are not addressed here. Abbreviations are seldom used in the author or title fields, so they do not present a problem for identifying related records. The problem is discussed in somewhat more detail in Chapter 6 in the section on authority control.
The algorithm for identifying clusters of related records considers each record in the collection and determines whether it has (approximately) the same author and title field as any other record. There are two primary concerns for the algorithm's design. First, the algorithm should avoid comparing each record against every other record, an O(n^2) proposition. Second, the algorithm, and in particular the test for similarity, must cope with the kinds of entry errors described above.
The algorithm uses two rounds of comparisons to identify author-title clusters. The first round creates a pool of potentially matching records using a full-text search of the entire collection; the pool consists of the results of three queries, with words randomly selected from the author and title of the source record. (The index, however, includes words that appear anywhere in a record, not just in the author and title fields.) In the second round, each potential match is compared to the source record, and an author-title cluster is created for the matching records.
Three different queries are used to construct the potential match pool. Each query includes one of the authors' last names and two words from the title field. Title words that are only one or two letters long, or that appear in the stoplist (the 50 most common words in the index), are not used. When there are not enough words to construct three full queries, the queries use fewer words, and in some extreme cases only one or two queries are performed.
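As a concrete illustration, the first-round query construction can be sketched as follows. The stoplist contents, function names, and fallback behavior for short titles are illustrative assumptions, not the thesis code.

```python
import random

# STOPLIST stands in for the 50 most common words in the real index.
STOPLIST = {"the", "and", "for", "with", "from"}

def candidate_title_words(title):
    """Title words usable in queries: longer than two letters and not stopwords."""
    return [w for w in title.lower().split()
            if len(w) > 2 and w not in STOPLIST]

def build_queries(author_last_names, title, n_queries=3):
    """Build up to three queries, each pairing one author's last name
    with two randomly chosen title words."""
    words = candidate_title_words(title)
    queries = []
    for i in range(n_queries):
        if not author_last_names or not words:
            break  # cannot build a full query; perform fewer queries
        name = author_last_names[i % len(author_last_names)].lower()
        k = min(2, len(words))  # fall back to one title word if needed
        queries.append([name] + random.sample(words, k))
    return queries
```

The union of the results of these queries forms the potential match pool for the source record.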
The second phase of testing compares the author and title field of each record in the potential match pool against the source record. It uses an n-gram-based approximate string matching algorithm to compare the two fields. The string matching algorithm is explained in detail below.
The algorithm is applied to every source record, regardless of whether it has been placed in a cluster already; thus, a cluster with five records will be checked by five different passes. The algorithm enforces transitivity between records: If record A matches record B and record B matches C, all three are placed in the same cluster, whether or not A matches C.
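The transitivity rule can be implemented with a standard union-find (disjoint-set) structure; this sketch is my own illustration of the bookkeeping, not the original implementation.

```python
# parent maps each record id to its cluster representative.
parent = {}

def find(r):
    """Return the representative of r's cluster, creating it on first use."""
    parent.setdefault(r, r)
    while parent[r] != r:
        parent[r] = parent[parent[r]]  # path halving keeps trees shallow
        r = parent[r]
    return r

def union(a, b):
    """Merge the clusters containing a and b."""
    parent[find(a)] = find(b)

# Record A matched B in one pass, and B matched C in another pass:
union("A", "B")
union("B", "C")
assert find("A") == find("C")  # all three end up in one cluster
```

Each successful second-round comparison triggers a union, so clusters grow regardless of the order in which source records are processed.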
The algorithm is tolerant of format errors and typographic errors. Given a pair of records that cite documents with the same author and title, the potential match pool for one record will contain the other as long as one of the three queries contains words that match the second record; a small number of errors is unlikely to cause a problem. The approximate string match will tolerate variations as large as transposed or missing words in a long title string.
Comparing two entries' author fields is more complex than comparing title fields, because there is much wider variation in how names are entered than in how titles are entered. Instead of comparing author strings directly, the algorithm performs a more complicated comparison.
To compare two author entries, each author string is separated into a list of individual authors and each name is compared; the name comparison considers each part of the name (first name, last name, etc.) separately. If the names and the order of the names both match, then the two author fields are considered to be the same. When one author list contains matching names, in the same order as the first list, but omits names from the end of the list, the two strings are also considered the same; testing suggested it was likely that either there was a cataloging error or the same work had been published with slightly different author lists. Finally, an author list that includes a single name and the notation ``and others'' will match any list of two or more names that has a matching first author. (The Bibtex format specifies the use of ``and others,'' so variants like ``et al.'' are not handled.)
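These author-list rules can be sketched roughly as follows, using plain string equality as a stand-in for the part-by-part name comparison described below; the function name and list representation are assumptions.

```python
def same_author_lists(a, b):
    """Compare two author lists (each a list of normalized name strings).

    Plain equality stands in for the part-by-part name comparison.
    The special final entry "and others" may end a list.
    """
    # A single name plus "and others" matches any list of two or more
    # names with a matching first author.
    if len(a) == 2 and a[-1] == "and others":
        return len(b) >= 2 and a[0] == b[0]
    if len(b) == 2 and b[-1] == "and others":
        return len(a) >= 2 and a[0] == b[0]
    # Otherwise the shorter list must match the other, name for name and
    # in order, with names possibly omitted from the end.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return bool(short) and short == long_[:len(short)]
```

The prefix test covers both the exact-match case (equal-length lists) and the case where one catalog truncated the author list.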
Individual author names are compared by separating the two name strings into four parts: first name, middle name, last name, and suffix (e.g. Jr.). Two parts of a name match if any of three conditions is met:
An n-gram is a vector representation that includes all the n-letter combinations in a string. The n-gram vector has a component for every possible n-letter combination. (For an alphabet containing the letters 'a' to 'z' and the digits '0' to '9', the vector space is 36^n-dimensional.) The n-gram representation of a string has a non-zero component for each n-letter substring it contains, where the magnitude is equal to the number of times the substring occurs. The string comparison algorithm uses 3-grams, or trigrams, so I will limit further discussion to trigrams.
Formally, the trigram vector T_s for a string s is defined componentwise: for every possible trigram g, the component T_s[g] is the number of times g occurs as a substring of s.
The trigram vector for the string ``record'' contains four components: ``rec,'' ``eco,'' ``cor,'' and ``ord.'' If the trigram ``eco'' appeared twice in the string, then the ``eco'' component would have magnitude 2.
The string comparison algorithm forms trigram vectors for the two input strings and subtracts one vector from the other. The magnitude of the resulting vector difference is compared to a threshold value; if the magnitude of the difference is less than the threshold, the two strings are declared to be the same. If each trigram appears only once in an input string, the difference vector is the same as the list of trigrams that appear in one string but not the other. In general, the magnitude of the difference vector for two inputs a and b, with trigram vectors T_a and T_b, is the Euclidean norm sqrt(sum over g of (T_a[g] - T_b[g])^2).
The threshold value was determined experimentally, using several dozen pairs of strings that had been labelled the same or different based on how the algorithm should compare them. The threshold that worked best varied linearly with the total number of distinct trigrams in the two input strings; accordingly, the threshold used for comparing two strings is a linear function of the total number n of trigrams they contain.
The variable threshold allowed the test to tolerate several errors in a long title string, such as missing words or multiple misspellings, and still work well with short title strings. A higher, fixed threshold would have admitted many short strings that had completely different words.
The threshold was chosen using a collection of about 50 pairs of strings, each of which was labelled a priori as the same or different. The computed threshold minimized the number of pairs that were incorrectly identified. The sample strings, however, were not chosen with great care, and the effects of small variations in the threshold were not studied.
Some formulations of n-gram matching pad the input string with blanks at each end, so that every letter appears in n trigrams. The implementation used here, however, begins the first trigram with the first letter and ends with the trigram that includes the last three letters; this test is slightly more tolerant of errors at the beginning and end of strings.
Figure: N-gram comparison of two strings with a letter-pair transposition
Consider the example comparison in Figure 3-1 between two strings ``Machine Vision'' and ``Machien Vision.'' The difference vector contains eight trigrams, each appearing once, so the magnitude of the difference is 2.828. The threshold, based on a count of 15 trigrams in the input, is 2.861. The strings are declared to be the same, because the magnitude of the difference is just less than the threshold.
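The trigram comparison can be reproduced in a few lines. This sketch follows the unpadded-trigram definition used above and checks the worked example; the threshold computation is omitted because the exact linear coefficients are not given here.

```python
from collections import Counter

def trigrams(s):
    """Multiset of three-character substrings; no padding, so the first
    trigram starts at the first letter and the last ends at the last."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def difference_magnitude(a, b):
    """Euclidean magnitude of the difference of the two trigram vectors."""
    ta, tb = trigrams(a), trigrams(b)
    return sum((ta[g] - tb[g]) ** 2 for g in set(ta) | set(tb)) ** 0.5

# The worked example: eight trigrams differ, each appearing once,
# so the magnitude is sqrt(8) = 2.828.
assert round(difference_magnitude("Machine Vision", "Machien Vision"), 3) == 2.828
```

A Counter makes the repeated-trigram case (components with magnitude greater than one) come out correctly for free.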
The trigram string comparator worked well enough to create author-title clusters that tolerated small errors without including fields that a human cataloger would consider distinct. There were some problems, however, and it seems likely that the comparator could be improved. Many similarity functions for string and document vectors are discussed in the information retrieval literature; Salton offers a brief overview.
The first round, which identifies a pool of potential matches, greatly reduces the number of comparisons that must be made to create author-title clusters. The brute-force approach, comparing each record to every other record, requires n(n-1)/2 comparisons for an n-record collection.
The pool of potential matches tends to be very small; the average pool contained fewer than 30 records when the entire 240,000-record collection was processed. The savings over pairwise comparison is substantial: 7.5 million comparisons instead of 31.2 billion.
The number of comparisons that must be performed grows with the average pool size, but it is difficult to judge how the collection size affects the average pool size. The pools are the results of a full-text search that includes one of the authors' last names (unless no authors are cited); although the number of occurrences of any word tends to grow linearly with collection size, names behave differently. The number of times ``Szolovits'' appears, for example, is a function of two things: how many papers Szolovits has written and how many duplicate citations for these papers are in the collection.
A closer examination of the potential pools that were produced while processing the full collection helps to map out the possibilities for collections up to hundreds of thousands of records, but does not provide a definitive answer.
I recorded the size of the potential match pools for the first 5,800 records, based on collections of 50,000, 125,000, and 240,000 records. (Only a single set of pools was created, but records outside the first 50,000 were ignored for the first sample and records outside the first 125,000 for the second.) The distribution of pool sizes is roughly exponential. The majority of the pools contained fewer than 13 records in each sample. The tail of the exponential grows longer as the collection size increases; the largest pool grows from 14,294 for the 50,000-record sample to 45,750 for the 240,000-record sample.
Table: Statistics for size of potential match pools
Most of the pools are quite small, but occasionally large pools are returned; the latter are often caused by books with some missing information (e.g. no author), very common title words (e.g. ``introduction to algorithms''), or, worst of all, both. A reasonable optimization might be to ignore records that produce extremely large potential pools and treat them as failures for which related records cannot be recognized.
Several papers describe specific systems for identifying duplicate records. Two early duplicate detection projects are surveyed below, the IUCS scheme and the original OCLC On-Line Union Catalog; O'Neill, et al. present some characteristics of duplicate records in the OCLC catalog. Toney describes a more recent effort and provides a good overview of the design space for library duplicate detection systems.
The OCLC and IUCS projects were completed in the late 70s, when the library community had just begun sharing bibliographic records among institutions and building large online union catalogs. Subsequent work reflects the general approach mapped out in these two seminal studies.
The database merge/purge problem, described by Hernández and Stolfo, is very similar to the duplicate detection problem. In the merge/purge problem, several different databases with similar records must be merged and the duplicates purged; an example is joining several mailing lists, which may contain inconsistent or incorrect addresses.
The recent information retrieval literature contains several reports from Stanford on duplicate detection in a Usenet awareness service and several schemes for copy detection in a digital library. Duplicate detection in information retrieval is a substantially different problem, because the actual content of two documents is compared; the reported work discusses several different approaches to identifying overlapping content at the granularity of words and sentences.
The key difference between the OCLC and IUCS algorithms and my algorithm is that the former identify duplicate records. Records are duplicates when they describe exactly the same document and not just the same work. Although the precise rules for defining a duplicate vary somewhat between the algorithms, they both compare many fields. The OCLC rules require that the following fields match for a pair of records to be considered duplicates: author, title, publisher, copyright date, date of impression (if text differs), edition, medium, series, page or number of volumes, illustrator, and translator.
Most techniques for duplicate identification share the same gross structure: they use two rounds of comparisons, identifying possible duplicates in the first round and making a more careful examination in the second. The first round uses a fixed-length key, created from one or more field values; records with the same key are considered potential duplicates. A longer key, incorporating several fields, is used for the second round. (The second-round keys seem to be used primarily to avoid loading entire records into memory for comparison, a concern more important in 1979 than today.)
Duplicate detection keys summarize the information contained in the record; they consist of a series of fixed-length encodings of field values. In the OCLC and IUCS schemes, different keys are used for round one and round two of duplicate identification. Fields are typically encoded in one of three ways:
The IUCS scheme uses a short, fixed-length title-date key for its first round, taking eight characters from the beginning and end of titles plus some date information. (MacLaury describes the choice of title characters.) Records that have the same title-date key are compared using three more keys: the first five characters of the author field, a 72-bit Harrison key of the title field, and the highest number taken from the various page fields.
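The title-date blocking idea can be sketched as follows; the exact character positions and date handling in the real IUCS key differ, so treat the key function here as an illustrative stand-in.

```python
from collections import defaultdict

def title_date_key(rec):
    """Illustrative first-round key: characters from the start and end of
    the normalized title, plus the date."""
    t = "".join(rec["title"].lower().split())
    return (t[:4] + t[-4:], rec.get("year", ""))

def potential_duplicates(records):
    """Group records by key; groups with more than one member go to round two."""
    pools = defaultdict(list)
    for r in records:
        pools[title_date_key(r)].append(r)
    return [g for g in pools.values() if len(g) > 1]
```

Because the key is exact, any typographical error inside the sampled characters sends a record to a different block, which is the fragility discussed later in this section.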
The OCLC scheme is similar. The first-round key looks for exact matches of four fields: the date, record type, reproduction code, and the 8-character title selection. The second round compares the entire key but defines 16 different conditions under which the keys can be declared a match. These conditions examine different parts of the key, looking for either an exact match or a partial match. Partial matches are defined for only a few fields of the key; a partial match of Harrison keys occurs when the bits set in one key field are a subset of those set in the other key. One of the conditions for matching illustrates the approach: two records are duplicates if there is an exact match of the Government Document Number and page fields, and partial matches of the LCCN, ISBN, edition, and series fields; partial matches occur when one record has no value in a particular field.
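Representing a Harrison key as an integer bit vector, the subset-based partial match reduces to a single bitwise test; the key values below are invented for illustration.

```python
def partial_match(k1, k2):
    """True when the bits set in one key are a subset of those in the other."""
    return (k1 & k2) == k1 or (k1 & k2) == k2

assert partial_match(0b1010, 0b1110)      # 1010's bits all appear in 1110
assert not partial_match(0b1001, 0b0110)  # disjoint bits; neither is a subset
```

This is why hashed keys can tolerate some title variation: a shorter or slightly different title may simply set fewer bits.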
The QUALCAT expert system is an interesting alternative to key-based comparisons. A team of catalogers developed a set of rules that describe whether a certain combination of field values makes it more or less likely that two records are duplicates. Each record pair is described by the possibility that it is a duplicate (poss) and the certainty that it is a duplicate (cert). Initially, cert is 0 and poss is 100. Each rule increases cert or lowers poss. Once all applicable rules have been applied, a pair of records with sufficiently high cert and poss is automatically labelled a duplicate, records with low values are not, and records in between are referred to a human cataloger for review.
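A minimal sketch of this scoring loop follows; the rules, weights, and thresholds here are invented for illustration and are not taken from the actual QUALCAT rule base.

```python
def classify(pair, rules, cert_hi=60, poss_lo=50):
    """Apply each rule; a rule may raise cert or lower poss."""
    cert, poss = 0, 100
    for rule in rules:
        d_cert, d_poss = rule(pair)
        cert += d_cert
        poss -= d_poss
    if cert >= cert_hi and poss >= poss_lo:
        return "duplicate"
    if cert < cert_hi and poss < poss_lo:
        return "distinct"
    return "refer to cataloger"  # ambiguous cases go to a human

# Invented example rules: a matching ISBN raises certainty;
# mismatched pagination lowers possibility.
rules = [
    lambda p: (70, 0) if p["isbn_match"] else (0, 0),
    lambda p: (0, 60) if not p["pages_match"] else (0, 0),
]
assert classify({"isbn_match": True, "pages_match": True}, rules) == "duplicate"
```

The three-way outcome is the distinctive feature: unlike a fixed key comparison, the rule scores carve out a middle band that is explicitly routed to human review.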
The expert system was used to compare records during the second round of QUALCAT testing. The first round was a fairly typical key-based filter using the Universal Standard Bibliographic Code (USBC). The USBC is similar to the OCLC and IUCS first-round keys, but Toney notes that it is more dependent on clean data. Quantitative results for the QUALCAT system were not provided.
In the OCLC study, Hickey and Rypka observe three major causes of failure to match duplicates. A difference in the first 34 characters of the title string, from which the initial date-title key is drawn, causes 12 percent of the missed matches. The two other causes are differences in place of publication and pagination.
One question that remains to be answered is whether the probabilistic first round proposed here offers any improvement over the key-based approach. Because neither system was available for testing against the other, we can at best hypothesize that the randomized text-search approach is more tolerant of errors. Word-order errors, differences in cataloging, and typographical errors would all cause problems for the date-title key. The randomized text-search approach can tolerate these errors as long as one of the three queries uses words that do not contain typographical errors; it is not affected by word order or by cataloging variations that affect which words are included in the title.
The approach to the merge/purge problem used by Hernández and Stolfo is similar to the one used for duplicate detection. Their algorithm uses keys to partition the data into sets small enough to apply computationally intensive comparisons. The source data varies widely, so any particular key is likely to miss many records; to overcome this problem, the algorithm makes several independent runs over the data with different keys and computes the transitive closure of the results. The accuracy of their test jumped from 70 percent with a single pass to 90 percent with three passes.
The algorithm for creating author-title clusters can fail in two ways. It can create clusters that include citations for more than one work (a false merge) or it can create two separate clusters that both contain citations for the same document (a missed match). The author-title cluster algorithm keeps these failures to a reasonable minimum; analysis of a sample set indicates that less than 1 percent of the clusters contained false merges and that missed matches occurred about 5 percent of the time.
The clusters were analyzed using a 937-record sample. The sample was selected so that each record shares at least three title words with one or more other records; this characteristic was intended to increase the possibility of false merges. The clustering algorithm was run on the sample, and its results were examined by hand. The algorithm determined that the sample contained 554 unique clusters.
After the algorithm was run, I checked each cluster to verify that all of its records described the same document. I checked for missed matches by looking for records in other clusters that contained similar title fields: I identified potential missed matches of a record by doing independent searches for each word in its title. (The searches used agrep, an approximate string searching program that does not use an n-gram approach.) If a different record was returned by at least two-thirds of the agrep searches, it was compared by hand with the source record.
According to my manual check of the records, the sample actually contained 554 unique works, 250 of which were cited by more than one record. The author-title clusters contained four false merges and 30 missed matches.
Figure: Errors in author-title clusters for 937-record sample
The four false merges all involved closely related bibliographic records. In three of the four clusters, the two different works were part of a series, where each paper was labelled part one or part two. Figure 3-2 shows an example. The papers had the same authors, and the approximate string match reported that the titles were the same because they differed only in the final trigram. The final false merge involved two different papers with very similar titles, ``Towards Dataflow Analysis of Communicating Finite State Machines'' and ``Dataflow Analysis of Communicating Finite State Machines,'' and the same authors.
Figure: Two falsely matched records
The sample set is small enough that it is difficult to extrapolate the results to a large collection with great precision, but it appears that the failure rate would be acceptably low. Measured in percentages, false merges occurred in 0.7 percent of the clusters and 5.7 percent of the clusters were missed matches that should have been merged with another cluster.
I also used the results of two other algorithms to gauge the relative success of the author-title clustering algorithm presented here. The bibmerge program creates title-date clusters: if two records have the same title and date, they are placed in the same cluster. Bibmerge has only limited utility as a benchmark, because it uses a very simple detection scheme that is not tolerant of formatting and typographical errors. The program's source contains the comment: ``This script is written in the most unprofessional manner available to me, the only reason why I have not scrapped it is that is works.'' Nonetheless, it was the only other duplicate detection system available for testing on the same sample set.
Bibmerge found 650 unique clusters; it missed 126 matches and made one false merge. The title-date clusters identified 10 record pairs that were missed by the author-title clusters, because the title and date fields were the same but the author was different. In seven cases, the author fields were improperly formatted. In two cases, the authors were listed in a different order in each record. The final case was bibmerge's lone false match: an issue of Computing Surveys contained two responses to an earlier article, both of which were titled ``The File Assignment Problem.''
A second check on the quality of the algorithm presented here is the set of results reported for the OCLC duplicate detection scheme described earlier, although the OCLC clusters were created under substantially different rules. The analysis of the OCLC algorithm is the most substantial reported in the literature. Using a set of 184 pairs of records that had been identified in advance as duplicates, the OCLC system successfully identified 127 pairs, a 69 percent success rate, compared to a 94 percent success rate for my author-title algorithm. The false match rate was measured by selecting 1,000 record pairs and applying the duplicate test to each pair; only some of the pairs were duplicates. The error rate on this sample was 1.3 percent (13 false matches), which is comparable to the 4 false merges out of 554 clusters (0.7 percent) in the author-title clusters.