Frustrations from our first book data import

Last night we ran our first significant test import of book data. Until now we’d been testing with smaller data sets of between 5,000 and 25,000 book records. This time we imported about 130,000 records.

I spent this morning analyzing how the data came through and have a few things weighing on my mind.

Editions in different languages

In terms of end result, I want different translations of the same book to be grouped under the same “work”. So if you search for “Lady Smoke” by “Laura Sebastian”, you get one single result. Then if you click through, you can see the English edition and the Spanish edition.

However, the way the data is presented to us means I don’t think we can make this happen automatically, and that frustrates me.

The Spanish edition of Lady Smoke comes in with this data:

  • ISBN13: 9786073184823
  • Title: Dama de humo / Lady Smoke
  • Series: PRINCESA DE CENIZAS
  • Author: Laura Sebastian

The English edition of Lady Smoke comes in with this data:

  • ISBN13: 9781984851918
  • Title: Lady Smoke
  • Series: Ash Princess
  • Author: Laura Sebastian

When we attempt to group editions together, we match on title and author, and group editions together if both match. But of course, in this case the titles don’t match (“Dama de humo / Lady Smoke” vs. “Lady Smoke”). Therefore, the editions get created as two separate works, which also means that if you search for “Lady Smoke” you see both results.

[Screenshot: two search results for “Lady Smoke”, one for the English edition and a second for the Spanish edition.]
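To make the mismatch concrete, here is a minimal sketch of title-and-author grouping. It is not our actual import code: the `work_key` and `group_into_works` functions and the dictionary field names are hypothetical, and the case-insensitive comparison is an assumption. Only the record values come from the two editions listed above.

```python
from collections import defaultdict

# The two edition records, with values copied from the import data above.
editions = [
    {"isbn13": "9786073184823", "title": "Dama de humo / Lady Smoke",
     "series": "PRINCESA DE CENIZAS", "author": "Laura Sebastian"},
    {"isbn13": "9781984851918", "title": "Lady Smoke",
     "series": "Ash Princess", "author": "Laura Sebastian"},
]


def work_key(edition):
    """Hypothetical grouping key: case-insensitive title plus author."""
    return (
        edition["title"].strip().casefold(),
        edition["author"].strip().casefold(),
    )


def group_into_works(editions):
    """Group ISBNs under a shared key; each distinct key becomes one work."""
    works = defaultdict(list)
    for edition in editions:
        works[work_key(edition)].append(edition["isbn13"])
    return dict(works)


works = group_into_works(editions)
print(len(works))  # 2: the titles differ, so each edition becomes its own work
```

The authors match, but the Spanish title carries the extra “Dama de humo / ” prefix, so the two keys differ and the editions never end up in the same group.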

This kind of thing can be fixed manually after the fact, but I’m feeling a little disappointed that I can’t think of a way to do it automatically. That means the data will feel a bit messy because I imagine it will take a long time to manually fix records like this.

Tie-ins and graphic novels

Tie-in editions and graphic novels cause a similar problem, and the “His Dark Materials” series is a good example. The first book in the series has:

  1. An “HBO tie-in” edition.
  2. A graphic novel.
  3. The regular edition.

I immediately knew something was wrong when I saw the series listed with “14 works”!

I don’t like having the series page polluted like this. When I click through to the series, I just want a simple list of “Book 1”, “Book 2”, “Book 3”. I don’t want three entries for Book 1!

I think the ideal result here is:

  • The HBO tie-in edition would be linked with the regular edition (instead of appearing as a separate work, as it does now).
  • The graphic novel would perhaps move to its own series record (“His Dark Materials Graphic Novel” or similar).

The struggle, again, is that I can’t find a way to do this automatically. The HBO tie-in is a separate work because its title doesn’t match that of the main edition, “The Golden Compass” (the publishers insert “(HBO Tie-In Edition)” at the end of the title). Therefore, our matching algorithm doesn’t see them as the same book.

And for the graphic novel, the publishers specifically list that as being in the “His Dark Materials” series, so we’d have to manually move it to a different one.

Those are things we can manually do, but I imagine there will be a lot of little “errors” like this, which means a lot of wonky data. I’m just feeling frustrated that there will be so many manual changes we have to make! I’m torn between releasing the site sooner with “messy” data, or waiting to clean things up and releasing it later on. I’m definitely being a bit of a perfectionist with this data because it’s so important to me to get it right.