This chapter describes the creation of a composite record that summarizes the duplicate records in an author-title cluster. It introduces two terms, information dossier and union record, to describe two different ways of grouping related bibliographic information.
A union record is a composite record created by merging several bibliographic records from distinct sources. A different union record is created for each different type of document included in the cluster; thus, a particular cluster may have union records for a journal article, a technical report, and a conference paper.
An information dossier is a collection of information objects, e.g. bibliographic records, related in some way to one another. Specifically, I use the term to describe the source records that form an author-title cluster and the union records generated for the cluster. Although the current system does not include other objects, the next section presents a system for automatically linking records to electronic copies available on the Internet. The dossier would contain links to, or perhaps local copies of, relevant items; in this way, a dossier is distinct from the records in a cluster.
Union records can be an imprecise summary of the source records, because the quality of the source records is variable and because there are sometimes too few records to resolve conflicts between them. When there are conflicts, a single representative value is chosen for the union record rather than omitting the field or creating a hybrid value.
Because of this decision, I also calculate statistics that describe the quality of the composite. When the source records vary significantly from the union record, the user may wish to examine the source records. These statistics are described in Chapter 5.
There are three primary goals for the merger process that creates union records and dossiers:
Duplicate listings in the results of a query limit the ease with which the results can be used. A long list takes longer to transmit, process, and read than a short list, but the real difficulty is that identifying the duplicate and related records in a long list is tedious and error prone for a person.
When the union record is created, there is an opportunity to eliminate errors and to create a record that is more complete than any of the sources. An error in one or a few source records can be purged, if there are enough records with the correct information; the correct value for the field is chosen by voting. Fields that are missing in one record can be drawn from another record. Unfortunately, these kinds of quality control have limited utility, because in the common case there are only a few records in a cluster.
Finally, the dossier brings together information about each different instance of publication, which allows a user to choose the physical document that is easiest to retrieve. When a work is published several times (in different journals, proceedings, or books), an exhaustive list of these publications makes it easier to find the work when the local library does not hold all of the publications. Because many universities publish their technical reports in digital form, knowing that a work was issued as a technical report means the work is more likely to be found online.
A secondary goal for the dossier is to minimize the amount of information lost to the user when false merges occur. The dossier should include information about how closely individual records match the union record and how much variation there is in the value of a field across the source records. When a particular record differs substantially from the union record, or when one field has a different value in each record, there may be cause for the user to suspect a false merge. The interface described in the next chapter provides access to all of the source records and a measure of the truthfulness of the union record.
The general strategy for merging records is based on the different entry types allowed by Bibtex. The source records are grouped by entry type and a union record is produced for each type. For most fields, the field values in the union record are assigned by counting the occurrences of each value and choosing the most commonly occurring value. Some fields are treated differently. The author and title fields will be the same regardless of type, so the counting strategy can be applied across all types. The author field is also different because the counting strategy is applied to the component names rather than the full author list.
The merger process described here is very simple, and only copes with a few kinds of errors in the source records. However, a number of refinements are suggested for dealing with a wider range of errors; these refinements deal with specific kinds of errors but use the same basic strategy. (A few of these refinements have been implemented, but most have not.)
Records are merged a field at a time. For each field, the number of occurrences of each different value is counted and the most frequently occurring value is chosen. Sometimes, two or more values will occur with equal frequency, and a tie-breaker is needed; the longest value is chosen. The tie-breaker is arbitrary, although it is hoped that longer values will be more likely to contain information that helps the user.
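The field-at-a-time counting scheme can be sketched as follows. This is a minimal illustration, not the system's actual implementation; the record representation (a dictionary of field names to values) is an assumption made for the example.

```python
from collections import Counter

def merge_field(values):
    """Choose the most frequently occurring value; break ties
    by preferring the longest value."""
    counts = Counter(values)
    top = max(counts.values())
    candidates = [v for v, c in counts.items() if c == top]
    return max(candidates, key=len)

def merge_records(records):
    """Build a union record from source records, merging one
    field at a time.  Records are dicts of field name -> value."""
    union = {}
    fields = {f for r in records for f in r}
    for f in fields:
        union[f] = merge_field([r[f] for r in records if f in r])
    return union
```

For example, given source records with years ``1991'', ``1991'', and ``199?'', the sketch chooses ``1991'' because it occurs most often; with a tie, the longer value wins.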
There are three exceptions to this rule for merging fields; they are described in the discussion below.
There are three rough categories of fields, each of which will be affected somewhat differently by merging. The categories differ in how likely a field is to be the same in two different records for the same document.
The first category includes fields like title, date, or pages, which describe fixed, objective characteristics of the document. These fields are most common and the merger process is tailored to them.
The second category includes fields like note, annote, and keywords, which will vary widely from source record to source record (if they appear at all). There are no specific guidelines for the use of these fields, so each source may describe some different characteristic of the document. Each occurrence of one of these fields in a source record is included in the union record.
The abstract field is hard to classify. Although there should be a single abstract for each document, there is a lot of variation in what is actually recorded as the abstract. Currently, the longest abstract is chosen using the standard process.
The last category consists of fields that are used to manage bibliographic records or serve some other purpose specific to the record's creator. Standard fields like key and many non-standard fields, like bibdate or location, will appear in the union record, chosen by the standard counting scheme. However, it is unlikely that any of the field values are related, and the value selected has little significance. The interface presented in the next chapter ignores these fields in the standard display.
The counting approach does not work very well when there are only a few records of a particular type. Typographic and formatting errors also cause problems.
In the six records for the article ``Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism'' by Anderson et al., three different values appear in the pages field: ``53'' and a second variant each appear once, and a third value appears four times. If the majority value had appeared only once, it would be impossible to distinguish between the correct value and the incorrect ones.
Figure: CS-TR records for one TR from two publishers
Another potential problem is the policy of creating a union record for each different document type in a cluster. The policy assumes that there will be only a single document of a particular type in a cluster, i.e. that we will not find two different articles with the same author and title. This assumption does not hold in some circumstances, resulting in misleading union records.
One example of a failure is a technical report written by authors from different institutions and issued independently by each institution. (See Figure 4-1.) The system will create a single union record for these reports, which correctly represents most information (author, title, abstract) but will obscure or confuse the issuing organizations and the report's number or identifier. The problem is serious, because the organization and report number are important for locating a copy of the document.
Figure: Bibtex records exhibiting the conference-journal problem
Another example of a failure is caused by confusion about how to catalog the papers in a conference proceedings that is published as an issue of a journal, e.g. the SOSP proceedings printed in Operating Systems Review. The proceedings is an issue of the journal, so it would be quite reasonable to catalog a conference paper as a journal article. But if a paper is cataloged as an article and is also published in a journal proper (say the Transactions on Computer Systems), then the record for the SOSP paper and the record for the TOCS article will be merged into a single union record.
The general strategy just described improves significantly with a few refinements. Three refinements have been implemented: The author field is treated separately, as it was during cluster identification, because of its special formatting. Two simple filters are used to prevent field values with detectable errors from being counted. Several other refinements are suggested, but have not been implemented.
The author list is constructed differently because all of the author fields in the source records must match (with the approximate match described in Section 3.2.1) for the records to be placed in the same cluster. The merge algorithm extracts as much information about each name as possible and creates a new author list. When names are compared, each part (i.e. first, middle, last name) is expanded wherever possible; a blank entry becomes an initial or a full name, and an initial becomes a full name.
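The expansion of name parts can be sketched as below. This is a hypothetical illustration of the rule stated above (a blank part yields to anything, and an initial yields to a full name); the function names and the (first, middle, last) tuple representation are assumptions, not the system's actual code.

```python
def expand_part(a, b):
    """Return the more complete of two name parts: a blank part
    yields to anything, and a single initial yields to a full name."""
    if not a:
        return b
    if not b:
        return a
    a_initial = len(a.rstrip('.')) == 1
    b_initial = len(b.rstrip('.')) == 1
    if a_initial and not b_initial:
        return b
    if b_initial and not a_initial:
        return a
    return max(a, b, key=len)

def merge_name(n1, n2):
    """Merge two (first, middle, last) tuples into the most
    complete form available."""
    return tuple(expand_part(a, b) for a, b in zip(n1, n2))
```

So merging ``T. Anderson'' with ``Thomas E. Anderson'' would yield the fuller form, with the initial expanded and the missing middle name filled in.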
Filters, which validate field values before creating the union record, prevent invalid data from being included in the union record; they can also normalize field values, by testing for common mistakes and cataloging variants and attempting to correct them.
The month and year fields are merged using filters. Common problems in these fields include inconsistent month names and abbreviations (e.g. ``Sept.'' versus ``September'') and mis-formatted or incomplete years (e.g. ``199?'').
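Filters of this kind might look like the following sketch. The exact validation rules and the set of accepted abbreviations are assumptions for illustration; the actual filters may differ.

```python
import re

# Map three-letter prefixes to canonical month names (an assumed
# normalization; the real filter's rules may differ).
MONTHS = {m[:3].lower(): m for m in
          ["January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November",
           "December"]}

def filter_month(value):
    """Normalize a month name or abbreviation to a canonical
    form; return None if the value is unrecognized."""
    key = value.strip().rstrip('.').lower()[:3]
    return MONTHS.get(key)

def filter_year(value):
    """Accept only four-digit years; reject values like '199?'."""
    return value if re.fullmatch(r"\d{4}", value.strip()) else None
```

Values rejected by a filter are excluded from the count, so a single well-formed year can win over several malformed ones.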
More powerful heuristics for identifying mis-formatted and incorrect data, and either discarding it or converting it to the correct format, would further improve the quality of the union records. For example, it may be profitable to identify a group of field values that are similar but not exactly the same, e.g. two titles that differ in only a few positions or years that are similar, like ``199?'' and ``1991''. The merge process would then determine the most frequently occurring group, and then choose a representative element from that group. If three source records contained the year values ``1989'', ``1991'', and ``199?'', this strategy would choose ``1991'' as the most frequently occurring, correctly-formatted value.
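The grouping heuristic suggested above can be sketched as follows. This is a toy version that handles only the year example: the similarity test (position-by-position match with ``?'' as a wildcard) is an assumption standing in for a proper edit-distance comparison, which a real implementation would use.

```python
def similar(a, b):
    """Two values match position-by-position, treating '?' as a
    wildcard.  (A real system would use a proper edit distance.)"""
    return len(a) == len(b) and all(
        x == y or '?' in (x, y) for x, y in zip(a, b))

def merge_with_groups(values):
    """Group similar values, pick the largest group, then return
    a representative element with no wildcard characters."""
    groups = []
    for v in values:
        for g in groups:
            if any(similar(v, w) for w in g):
                g.append(v)
                break
        else:
            groups.append([v])
    best = max(groups, key=len)
    clean = [v for v in best if '?' not in v] or best
    return max(clean, key=len)
```

On the year example from the text, ``1991'' and ``199?'' fall into one group of size two, ``1989'' stands alone, and the well-formed member of the larger group is chosen.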
When several values occur with equal frequency, the longest value is chosen for the union record. There are many other heuristics that could be used instead, like a strictly random choice or choosing the shortest value; heuristics could also be applied on a per-field basis, e.g. choosing the highest number in the year field.
A brief analysis of the author-title clusters in the DIFWICS suggests two broad observations. First, enough related records were found to justify the effort involved in identifying them. Second, within a cluster there is substantial variation among the source field values.
The DIFWICS collection consists of 243,000 source records and 162,000 author-title clusters. More than half of the records belong to a cluster that contains two or more records. Fewer clusters contain more than one record of the same type: about 30,000 clusters, or 20 percent of the collection.
Table: Cluster sizes
Table 4-1 shows how many clusters of a particular size there are and what percentage of the total number of records are in clusters of that size. The average number of records in a cluster is 1.49.
The distribution of record types within the entire collection is basically the same as the distribution within clusters: articles are most common, followed by papers in conference proceedings. Table 4-2 shows how many source records of a particular type exist. Most of the clusters contain one type of record.
Table: Source records, by type
Clusters with more than one type of source record represent less than 10 percent of the total number of clusters. The three most common combinations are Article and InProceedings (4,349 clusters), Article and TechReport (1,808 clusters), and InProceedings and TechReport (1,536 clusters). Fewer than 1,000 clusters contain three different record types and none contain four or more.
Within clusters that contained two or more records of the same type, I examined individual fields to see how often all the source records had the same value. Field values were compared by normalizing them to lowercase alphanumeric strings, eliminating formatting and punctuation.
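One plausible reading of this normalization, keeping only lowercase letters and digits, is sketched below; the precise rules used in the analysis (e.g. whether whitespace is collapsed or deleted) are not stated, so this is an assumption.

```python
import re

def normalize(value):
    """Reduce a field value to a lowercase alphanumeric string,
    discarding formatting and punctuation, for comparison."""
    return re.sub(r'[^a-z0-9]', '', value.lower())
```

Under this scheme, ``53--68'' and ``53-68'' compare equal, as do titles differing only in capitalization or punctuation.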
The statistics were gathered using a 10 percent sample of the clusters, considering only those clusters that had two or more records of the same type. The sample included 3,753 Article clusters (11,836 records), 3,113 InProceedings clusters (9,705 records), and 1,335 TechReport clusters (3,345 records).
Table 4-3 summarizes the results of the analysis on several standard Bibtex fields. It shows the number of times the field appeared in the sample clusters and the number of times all the records in a cluster had the same normalized value.
Table: Variation in field values within author-title clusters of the same type
The variation in the title field is interesting because it suggests how many more clusters would exist if approximate string matching had not been used. About 10 to 15 percent of the titles do not contain the same normalized string, even though they are considered to be the same under the approximate string match.
Among the other fields, the year is the most consistent; records have the same year more than 90 percent of the time. The month field shows much more variation because there are several different abbreviations used for each month. Abbreviations cause similar problems in several other fields with high variation-journal, institution, booktitle, and publisher. The filters for improving the merger process, described above, could increase the number of fields with uniform values.