Using SQL to Reduce Demand Driven Acquisition Duplications

Date: March 09, 2020

Using SQL reduced time spent on removing duplicate records by 75%

The Problem

Brown University uses the Demand Driven Acquisitions plans from JSTOR and GOBI Sciences, it also has an e-book plan with EBSCO. JSTOR and EBSCO were originally being delivered through OCLC onto a single bib record with two different 856 links. There was a lot of overlap between JSTOR, GOBI, and EBSCO which resulted in a lot of time cleaning up records via the Sierra Heading’s Report which unfortunately was unable to limit to just these three plans.

Sierra’s Headings Report resulted in more irrelevant results than relevant - a couple hundred results would have less than 10 relevant bibs. Staff was spending up to 4 hours a week sorting through these results which included duplications between print and electronic resources that were deliberate. Further complicating this was the fact that several of our eresource loads were temporary resources from EBSCO and Serials Solutions whose availability changed over time. As these resources were not purchases, and would eventually leave our catalog, duplications between them and our DDA plans were not a concern.

The Solution

We began by separating our EBSCO and JSTOR DDA records onto separate bibs, which required a lot of clean up and reloading of records into the catalog upwards of 100k. Then we decided on a prioritization between the DDA plans.

Since purchase loads could take a while to become available to download, there was a risk of creating duplication by deleting the purchased’s bib record in our catalog in preference of a new bib before the purchased bib could be updated. Therefore, it was decided that if a non-purchased DDA file resulted in duplications, priority would be given to the file already in the system.

Once vendor prioritization was decided, a SQL query was developed to look for more than one bib number linked to a single isbn. DDA records were separated from other eresources and print resources by using a local 910 tag.

Once the query was successful, it was made into an executable program using python. The executable program would take several minutes to run before resulting in a text file containing a list of isbns and how many bib records were linked to them.

The Results

After uploading 227 jstor dda records: SQL returned 16 results while SIERRA returned 126 records - only 16 were relevant.

Both Sierra and SQL methods were used side by side for over a month, with SQL consistently returning concise and equal amounts of relevant results to Sierra. The executable program took longer to run, averaging 6 minutes, but could run in the background and reduced overall time spent on cleanup.

Staff went from spending 1-4 hours reviewing records to less than 30 minutes. This averaged 50%-88% reduction in time spent each week on this process, with an overall average of 75% reduction of time spent each week.

The Resources

Codecademy.com

Learn SQL and PYTHON for free.

Github.com

Repositories of code, including librarian made code. Download and edit for use or look for examples to learn from.

Deduplication Repository

My python files that run the SQL query. In the DDA file: You will need to edit line 9 with your Sierra host information, and change the v.field and v.content field in the SQL to whatever local field you use to indicate a DDA. In the Physical file: You will just need to edit line 9. Only the Physical.py file currently saves a text file.

Share on

Twitter Facebook LinkedIn

Kylene Kristine Hutchinson

Using SQL to Reduce Demand Driven Acquisition Duplications

The Problem

The Solution

The Results

The Resources

Codecademy.com

Github.com

Deduplication Repository

Sierra DNA

PGAdmin

My Sierra SQL repository

Public Library of Cincinnati & Hamilton County sierra-sql Wiki

UNC Chapel Hill Libraries Sierra SQL repository

Jeremy Goldstein Sierra SQL repository

Share on