"What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results"
Tina Gross and Arlene Taylor. College & Research Libraries, 66(3): May 2005, pp. 212-230.
Mark Lindner
LIS 577 6 February 2006
Previous take
Can we do away with subject headings? Only if we keep 'Moral minimalism and libraries' Late May 2005
- "Let me be up front here, it is the authors who baffle me. Certainly this study has merit in its own right, but it seems to have been generated by an unattributed "suggestion." If so, then I am baffled about why someone would go to the trouble of doing this research for a 'simple' comment. If it isn't so simple—say, was it the dean of a major ARL institution?—then why not finger them? I know, I know. Politics. ..."
¤
I first wrote about this article in late May 2005 immediately after receiving it in the mail. I commented mostly on the "politics" of not naming who made the forthcoming helpful suggestion.
¤
Introduction
Someone suggested that most users search by keywords and SHs "could be removed from catalog records to save space and cost" (212, emphasis mine).
This study asks:
- "What proportion of records retrieved by a keyword search has a keyword only in a SH field and thus would not be retrieved if there were no subject headings" (212)
¤
There is an assumption that most searches are by keyword (as we'll see...)
Suggestion
"This attitude has led to the suggestion (in at least one academic library) that subject headings should be stripped from the bibliographic records in the catalog. The argument was that thousands of subject headings needlessly take up gigabytes of space because users hardly ever search for subject headings. (And an unspoken cost saving, of course, would be that catalogers would not need to provide subject headings for new records.)" (213)
¤
Context
- "Any intelligent man..."
- Received wisdom
- Surprised librarians
- Prevalence of received wisdom
- Keyword searching in 2005
¤
"Any intelligent man..." was a prevalent attitude in the 1830s, as reported, and it remained prevalent through most of the 20th century
"Any intelligent man who was sufficiently interested in a subject to want to consult material in it could just as well use author entries as subject, for he would, of course, know the names of all the authors who had written in his field" (212, as cited by Ruth French Strout 1956).
Received wisdom
"Many catalog use studies have shown that most searches are for known items or at least for a known author" (212).
- What is this saying/not saying? Does 50.1% = most?
- A few show subject searches as "majority"
- Primarily from public libraries
- Tendency is to ignore them.
Surprised librarians
Early 90s - OPACs: "Many librarians were quite surprised to learn from various transaction log studies that a high proportion of searches in catalogs was for subject matter" (212-3).
Early "subject searching" - apparatus of today not in place: Fewer fields were included in keyword searches. At one time, subject headings, along with a few other fields besides title, were not included in keyword searches. What is the impact of that knowledge on our research question?
Prevalence of received wisdom
Assumption of "known item or author" search is prevalent in our literature. Carole Palmer was discussing it in her Use and Users class last Thursday. I showed her this article right after class. Then as I was leaving she told me to give the Gross and Taylor citation to Alan Renear who had just brought up the prevalence of this received wisdom in relation to some of his work.
These are our stories, and often our motivating stories, and our assumptions. It is critical to understand them as such, and to understand their reach.
Keyword searching in 2005
Point: Most fields can be searched as keywords BUT which are searched as keywords is highly variable. [Applies to how widely applicable the results are, among larger implications.]
¤
Literature Review
- Assorted peripheral studies
- General and specific lit reviews
- 1998 Voorbij: 2 studies
- 1 Compared Descriptors vs. Title keywords:
- 37% of records were "considerably enhanced" and
- 12% were "slightly enhanced" by the addition of a descriptor (215)
- 2 Subject descriptor search vs. Title keyword search:
- Descriptors 86.9% recall
- Keyword 48.2% recall (215)
¤
Research question restated
Take an initial step towards finding the answer to:
- "What proportion of records retrieved by a keyword search has a keyword only in a subject field and thus would not be retrieved if there were no subject headings" (215)?
How?
- "Using captured searches from a transaction log, a series of keyword searches was performed to determine what proportion of the records retrieved by each user's search had a keyword only in a subject heading field and thus would not be retrieved if the subject headings were not there" (215).
¤
Methodology
Terms taken from keyword searches in a SC university library
- 3397 keyword searches
- 2270 unique single or multiple word strings
- Sample of 227
Re-ran as keyword searches in PittCat (University of Pittsburgh OPAC)
¤
Mo' Methods
Stopwords removed: a, an, and, by, for, from, in, of, on, or, the, to, with (216)
Limit to English language only
- e.g., literature Brazil
- in English loses 33.2% of hits w/o SH
- in all languages loses 56.7% of hits w/o SH
- Thus, SHs (more?) important in case of a high percentage of non-English-language materials in collection
Provisional acquisition records with minimal bibliographic records
¤
Stopwords
Impact of/on non-English-language materials
Limit to English-language-only materials because "the vast majority of bibliographic records for foreign-language materials with English-language subject headings could only contain many of the English-language search terms from the sample in their subject headings" (216).
In some cases, 100%
SHs (more?) important in case of a high percentage of non-English-language materials in collection
- Removed to broaden applicability of study to libraries with fewer non-English-language materials
- But decreases result being looked at
- BUT, what if we were interested in the importance of SHs for the retrieval of English-language searches in a primarily non-English-language collection?
Provisional acquisition records
- Could not be excluded
- Also decreases result being looked at
¤
Data retrieved
Number of hits with all keyword(s) anywhere
Number of hits with all keyword(s) and at least one in subject, but not all in title
Number of records (or of the first fifty records) with at least one keyword in subject only
¤
Making data manageable
Second search to reduce hits
| Search for: | | Search by: |
| --- | --- | --- |
| metal sculpture | all of these | Keyword Anywhere |
| AND | | |
| metal sculpture | any of these | Subject |
| NOT | | |
| metal sculpture | all of these | Title |
"Because keywords can still appear in many fields (subject, title, author, series, notes, publication, physical description, etc.) it was still necessary for us to view the remaining hits" (217).
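The set logic of this second search can be sketched in a few lines. This is only an illustration under stated assumptions: the five records, the field names, and the `tokens` and `matches` helpers are hypothetical, and nothing here reflects how PittCat itself evaluates a query.

```python
import re

# Hypothetical toy records; not data from the study.
records = [
    {"id": 1, "title": "Metal sculpture today", "subject": "Metal sculpture", "notes": ""},
    {"id": 2, "title": "Welded forms", "subject": "Metal sculpture -- Technique", "notes": ""},
    {"id": 3, "title": "Bronze casting", "subject": "Bronze founding", "notes": "Covers metal sculpture"},
    {"id": 4, "title": "Heavy metal music", "subject": "Rock music", "notes": ""},
    {"id": 5, "title": "Garden art", "subject": "Sculpture, American", "notes": "Includes metal sculpture"},
]

def tokens(text):
    """Lowercased word set of a field value."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(record, fields, keywords, mode):
    """mode 'all': every keyword occurs in the given fields; 'any': at least one does."""
    found = set()
    for field in fields:
        found |= tokens(record[field])
    test = all if mode == "all" else any
    return test(k in found for k in keywords)

keywords = ["metal", "sculpture"]
anywhere = ["title", "subject", "notes"]

second_search = [
    r["id"] for r in records
    if matches(r, anywhere, keywords, "all")        # all of these, Keyword Anywhere
    and matches(r, ["subject"], keywords, "any")    # AND any of these, Subject
    and not matches(r, ["title"], keywords, "all")  # NOT all of these, Title
]
print(second_search)  # [2, 5]
```

Record 5 survives the filter even though both keywords also sit in its notes field, so it would still be retrieved without SHs. That is the kind of hit that still had to be viewed by hand.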
¤
Reproduction of Figure 2 "Second Search Performed to Reduce Hits Needing to be Viewed Manually"
¤
Mo' management
If retrieved set still over 50 hits, used first 50 hits (not sampling)
- PittCat displays results in reverse chronological order
- "Thus, the most recent, and presumably the most useful, hits appear first" (218).
Assumption: Recent = More Relevant
¤
At least 2 issues with this assumption.
¤
- 1 What motivates this assumption of recent = useful? How does it vary across disciplines and other factors?
- 2 Assumes that the variables being looked for/at have been consistent over time
¤
Assumption of Recent = Useful. Another unexamined story we tell ourselves?
Diachronic consistency:
Were titles more (or less) descriptive in the past?
Are as many subject headings assigned? More? Fewer?
Are older SHs updated to reflect current terminology?
...
¤
Final methods
Determine (or extrapolate [sets > 50]) number of hits with all keywords in a record, and with at least one in SH, but not all in title
Final step: Determine percentage of hits missed out of total number if there were no SHs
- 227 Searches
- 18% rejected for No valid results
- 9 had >10,000 records
- 32 had No hits at all
- 186 valid searches (all data in Appendix, pp. 225-30)
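The arithmetic of this final step can be sketched as follows. The function name, its arguments, and the counts in the usage line are hypothetical; only the 50-record cap and the sample figures (227, 9, 32) come from the slide. The extrapolation formula is not spelled out in these notes, so simple proportional scaling is assumed.

```python
def percent_lost(total_hits, candidate_hits, viewed, subject_only_in_viewed):
    """Estimate the % of a search's hits that would be lost without SHs.

    total_hits: hits with all keywords anywhere in the record
    candidate_hits: hits remaining after the second (reducing) search
    viewed: candidate records actually inspected (at most 50)
    subject_only_in_viewed: inspected records with a keyword only in a SH
    """
    if total_hits == 0 or viewed == 0:
        return 0.0
    # Assumption: the proportion found among the viewed records holds
    # for the whole candidate set.
    estimated_subject_only = candidate_hits * subject_only_in_viewed / viewed
    return 100 * estimated_subject_only / total_hits

# Hypothetical search: 400 hits, 120 candidates, first 50 viewed, 30 qualify.
print(round(percent_lost(400, 120, 50, 30), 1))  # 18.0

# Sample bookkeeping, using the figures on the slide.
sample, over_limit, no_hits = 227, 9, 32
rejected = over_limit + no_hits
print(sample - rejected)               # 186
print(round(100 * rejected / sample))  # 18
```

Because the first 50 hits are the most recent ones rather than a random sample, the scaling step inherits the "recent = representative" assumption questioned earlier.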
¤
PittCat limits display to 10k
"Given that the total number of hits for these searches was unknown, the proportion of hits lost could not be determined" (218).
But are all these moves valid? What is being included or excluded?
What are our further unstated assumptions?
At a minimum, how do these moves impact the answer being derived?
A human-, or even machine-manipulable data set does not guarantee or even imply an "accurate" empirical answer. To look is not enough; one must look in certain ways.
¤
Findings
Hits lost in the absence of SHs
- Mean: 35.9%
- Median: 30.2%
Average proportion of lost hits increases as number of keywords goes up to 3
- Thus, what is the impact on lost hits by number of keywords?
¤
Mean: Average - quotient of the sum of several quantities and their number
Median: Middle value of a series of values arranged in order of size
¤
Table 3 (220)
Results by Number of Keywords in Search
| | All Searches | 1 KW | 2 KW | 3 KW | 4 or More KW |
| --- | --- | --- | --- | --- | --- |
| # of searches | 186 | 44 | 98 | 30 | 14 |
| Median # of hits | 66 | 390 | 57.5 | 39.5 | 9 |
| Avg % lost | 35.9% | 26.0% | 37.3% | 44.9% | 38.0% |
| Median % lost | 30.2% | 19.7% | 36.6% | 34.7% | 26.5% |
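As a quick consistency check on Table 3, the overall average should equal the per-group averages weighted by the number of searches in each group. A short sketch using only the figures in the table (the variable names are mine):

```python
# (number of searches, average % lost) per keyword-count group, from Table 3
groups = {
    "1 KW": (44, 26.0),
    "2 KW": (98, 37.3),
    "3 KW": (30, 44.9),
    "4 or more KW": (14, 38.0),
}

total_searches = sum(n for n, _ in groups.values())
weighted_mean = sum(n * avg for n, avg in groups.values()) / total_searches

print(total_searches)           # 186
print(round(weighted_mean, 1))  # 35.9
```

The same trick does not work for the medians: the overall median (30.2%) cannot be recovered from the group medians and needs the per-search data in the Appendix.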
¤
Outliers or exceptions: Table 4 (220)
Individual Searches with High % of Hits Lost w/o SHs
| Keywords | # of Hits | % of Hits Retrieved That Would Be Missed w/o SHs |
| --- | --- | --- |
| airplanes military parts | 23 | 100% |
| businesswomen | 173 | 98.8% |
| divorced people | 55 | 92.7% |
| baptist united states | 916 | 92.7% |
| horror films | 402 | 82.8% |
| mass media politics | 372 | 78.6% |
| history slang | 22 | 77.3% |
| storytelling books | 65 | 71.4% |
| hispanic americans | 762 | 71.4% |
¤
Left out column "Number of hits with a keyword in SHs only"
"For about 31.7% of the searches, the percentage of hits with a KW only in a subject field was 50 percent or greater. This means that for about 3 out of every 10 successful KW searches, half or more would not be retrieved if there were no SHs. For about four of every ten successful searches, more than 40% of hits would be lost; and for half of all successful searches, more than a third would be lost" (219-20).
¤
TOCs and Summaries
Positive:
- Substantial augmentation of record by providing chapter-level access
- Easier for users to assess relevance to needs of individual records, but see Negative
- Include highly specific terms not normally present in a MARC record
Negative:
- Reduces precision, that is, increases # of irrelevant hits
¤
Since study was conducted, many English-language monograph records have been augmented with Blackwell's Table of Contents Enrichment Service.
Easier for user to assess relevance of individual records
but, more irrelevant ones are recalled to wade through
Thus, TOCs and summaries
- Increase the number of hits
- Decrease the chance of zero hits
- Reduce precision
For example: metal sculpture (220-1)
"Now yields considerably more hits"
Many among the 1st 25 are there solely due to TOCs and summaries:
- Jazz modernism: from Ellington and Armstrong to Matisse and Joyce
- Rapid prototyping casebook
- Animaculture [book of poems]
- The wound-dressers dream
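The recall/precision trade-off described here can be made concrete with a toy calculation. All counts below are hypothetical and chosen only to show the direction of the effect; the study did not measure precision.

```python
def precision(relevant_retrieved, retrieved):
    """Share of retrieved records that are relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0

def recall(relevant_retrieved, relevant_in_collection):
    """Share of the relevant records in the collection that were retrieved."""
    return relevant_retrieved / relevant_in_collection if relevant_in_collection else 0.0

relevant_in_collection = 40  # hypothetical number of truly relevant records

# Before enrichment (hypothetical): 30 hits, 24 of them relevant.
before = (precision(24, 30), recall(24, relevant_in_collection))
# After TOC/summary enrichment (hypothetical): 120 hits, 36 of them relevant.
after = (precision(36, 120), recall(36, relevant_in_collection))

print(before)  # (0.8, 0.6)
print(after)   # (0.3, 0.9)
```

In this made-up case recall rises from 0.6 to 0.9 while precision falls from 0.8 to 0.3, which is the pattern the metal sculpture example suggests.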
Questions
- What % of records are enhanced?
- What % of results are non-monographic?
- How does this translate to other catalogs?
- Often these sorts of augmentations are used to argue for no longer needing "expensive" cataloging techniques — thus, doubly important to understand the effects of these sorts of augmentations
Shows other quirks and positives and minuses of TOCs and summaries augmenting our records
This article is a good argument for why we need people educated à la Williamson. We can't just accept that the addition of TOCs and summaries (and other augmentations) is a good thing.
"... it is essential that all information professionals have a basic knowledge of the principles of subject analysis, and an understanding of their application in indexing and retrieval in online systems of various kinds. ... who are conversant with the characteristics of the catalogs and databases they search and are familiar with their vocabularies (natural language and controlled) and the ways in which they can be manipulated in retrieval (Kesselman 1984)" (Williamson, 82)
¤
More sophisticated searching...?
- Typical users and phrase searching
- 23.7% of searches were for one word terms
- "Athletes" now includes 7 out of 10 highly specific entries for such a general search
- Could see an early relevant hit
- Click through to record
- See Athletes—Biography as subject heading and click through to subject...
Oh, No! They can't.
¤
Thus, less effective ways to reduce # of irrelevant hits
¤
Future research
- "Determine the full effect of the addition of TOC data and summaries to catalog records."
- "Especially important will be an attempt to determine the effect on precision of the dramatic increase in recall that is occurring with this addition" (222-30).
- Replicate this study in other contexts
¤
Replicate study in other libraries
- public and academic (others?)
- Varying collection sizes
- Varying amounts of non-English-language materials
- In catalogs where the native language is not English
Study impact of "augmented" records - comes in varying "strengths"
Study impact on precision of increased recall in various contexts.
¤
Conclusions (223)
If SHs are removed from, and no longer added to, bibliographic records:
- Users lose more than 1/3 of the hits currently retrieved by KW searches
- Loss of other functions provided by SHs and controlled vocabularies (as summarized by Voorbij)
- Enhance bibliographic record of a publication
- Grouping synonyms, other ways to express a topic, and terms in foreign languages under the same heading
- Suggest other entries by cross-references
- Reducing irrelevant hits
- Users would have few options to narrow results with a high proportion of "false" hits
¤
Precision not determined for this study - assumption: this 35.9% includes a high proportion of relevant hits
Users doing keyword searches with a high proportion of "false" hits would have few options for reducing this set w/o SHs; thus, SHs are a powerful tool for narrowing searches
Ethical implications
If one accepts Ranganathan's 5 Laws as a professional motivating force, and/or believes that an attempt like Blair's "Toward a Code of Ethics for Cataloging" is a positive development, AND accepts the conclusions of this study:
Then one has an ethical/moral obligation to advocate for:
- The retention of SHs
- The replication of this study in other settings and contexts
- And further study as suggested by the authors
¤
Blair, S. Toward a Code of Ethics for Cataloging. Technical Services Quarterly, 23(1) (2005) p. 13-26.
Sources
"What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results"
Tina Gross and Arlene Taylor. College & Research Libraries, 66(3): May 2005, pp. 212-230.
"The Importance of Subject Analysis in Library and Information Science Education"
Nancy J. Williamson. Technical Services Quarterly, 15(1/2): 1997, pp. 67-87.
Technology Credits
- S5: A Simple Standards-Based Slide Show System based entirely on XHTML, CSS, and JavaScript.
- It is the creation of Eric Meyer.
- It belongs to the Public Domain.
- Is small, lightweight and will run on any current standards-based system.
- Further info and download available at http://www.meyerweb.com/eric/tools/s5/