Two imports down, how many to go?

As I posted yesterday, we did our first significant test import with 130,000 records. We spent pretty much all of last night analyzing the results of that import and coming up with little tricks to fix some of the inconsistencies in the data. The main things we looked out for were:

  • Different books by the same author linked under the same author record.
  • Different editions of the same work linked to the same work record.
  • Different books in the same series linked to the same series record.
  • …and so on.

Things looked pretty good, but definitely not perfect. This is largely due to the data we’re working with not being perfect and needing a lot of cleaning up!

Here are a few of the improvements we came up with:

  • We convert “&” to “and” when doing book title matching. This one came about when we found two different records for “The Wrath and the Dawn”. One was just like that, “The Wrath and the Dawn”, but the other was “The Wrath & the Dawn”. This was preventing us from linking them together as one work.
  • We fixed a bug that was preventing two different editions of the same work from being linked together under certain circumstances (different from the above issue). That one was my fault!
  • We made huge improvements on 1) stripping duplicate series information from the book title; and 2) using that stripped series name as the actual series name if one wasn’t provided.
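As a rough illustration of the first item in that list (the “&” to “and” conversion), here is a minimal Python sketch. It is not our actual import code: the function name `title_match_key` is made up for this example, and the lowercasing and whitespace collapsing are extra assumptions on top of the ampersand swap.

```python
import re


def title_match_key(title: str) -> str:
    """Build a key used only for comparing titles (hypothetical helper).

    The stored title is left untouched; only the comparison key changes.
    """
    # Assumption: matching ignores letter case.
    key = title.casefold()
    # The tweak described above: treat "&" the same as the word "and".
    key = key.replace("&", " and ")
    # Collapse the extra spaces the replacement may have introduced.
    return re.sub(r"\s+", " ", key).strip()


same_work = title_match_key("The Wrath & the Dawn") == title_match_key(
    "The Wrath and the Dawn"
)
```

With a key like this, both records for “The Wrath and the Dawn” compare as equal, so they can be linked to one work while each record keeps its original title.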

To elaborate on that last improvement, the series clean-up…

Titles and series are meant to be in separate fields, but sometimes they are both provided in the title field. For example, here are two sets of data as provided (in their original formats before we work our magic!):

Title: The Kissing Booth #2: Going the Distance
Series: (not provided)

Title: The Book of Dust: The Secret Commonwealth (Book of Dust, Volume 2)
Series: The Book of Dust #2

In that first example, no series information was provided at all. Instead, it was put into the title as “The Kissing Booth #2”. Prior to our tweaks last night, this meant the book was inserted without a series at all, because the real series field was blank!

In the second example, the series information was duplicated. It appeared in the dedicated series field but was then repeated in the title itself. This meant it was displayed super repetitively: “The Book of Dust: The Secret Commonwealth (Book of Dust, Volume 2) (The Book of Dust #2)”. Yuck!

So we worked on parsing the series information out of the title so we could 1) fill in a missing series name field; and 2) remove repetitive data from the title.

After making our changes and re-running the import, we now have these results:

Title: Going the Distance
Series: The Kissing Booth #2

Title: The Secret Commonwealth
Series: Book of Dust #2
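For the curious, here is a heavily simplified Python sketch of the idea. It only handles the two title shapes from the examples above, and everything in it (the `split_series` name, the two regular expressions, and the choice to let a series parsed from the title win over the one in the series field) is an assumption made for illustration, not a copy of our real code.

```python
import re
from typing import Optional, Tuple

# Shape 1: "Series Name #2: Actual Title"
PREFIX_RE = re.compile(r"^(?P<series>.+?)\s*#(?P<num>\d+)\s*:\s*(?P<title>.+)$")

# Shape 2: "Actual Title (Series Name, Volume 2)"
SUFFIX_RE = re.compile(
    r"^(?P<title>.+?)\s*\((?P<series>[^()]+?),\s*(?:Volume|Vol\.?|Book)\s*(?P<num>\d+)\)$",
    re.IGNORECASE,
)


def split_series(title: str, series: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (clean_title, series), pulling series info out of the title."""
    m = PREFIX_RE.match(title)
    if m:
        return m["title"].strip(), f"{m['series'].strip()} #{m['num']}"

    m = SUFFIX_RE.match(title)
    if m:
        name = m["series"].strip()
        clean = m["title"].strip()
        # Also drop a leading "Series Name:" or "The Series Name:" from the title.
        lead = re.match(
            rf"^(?:the\s+)?{re.escape(name)}\s*:\s*(?P<rest>.+)$", clean, re.IGNORECASE
        )
        if lead:
            clean = lead["rest"].strip()
        # Assumption: the series parsed from the title replaces the provided one.
        return clean, f"{name} #{m['num']}"

    # Nothing recognisable in the title: keep both fields as provided.
    return title, series


first = split_series("The Kissing Booth #2: Going the Distance")
second = split_series(
    "The Book of Dust: The Secret Commonwealth (Book of Dust, Volume 2)",
    "The Book of Dust #2",
)
```

Here `first` comes out as the title “Going the Distance” with the series “The Kissing Booth #2”, and `second` as “The Secret Commonwealth” with “Book of Dust #2”. A title that legitimately contains “#3:” would trip this up, which is exactly the kind of edge case that makes the real thing messier.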

The code for making this all work has become so messy and convoluted, but we have the results we’re looking for, so I consider that a win!