Book data parsing woes

Building Bookhype has given me an interesting look into the world of book metadata for the first time. My goals for Bookhype seemed pretty simple:

Books by the same author should be correctly linked to the same author record, so that when you view that author’s profile you see all their books.
Books in the same series should be linked to the same series record.
Books in a series should have the correct position within the series recorded, so you can determine the series order.
Different editions of the same work should be linked to the same work record, so that if you search for “Six of Crows”, you just see the result of the overall work, but when you click through you can see all the editions.
We have distinct fields for: book title, series name, and position. There shouldn’t be any overlap between them (so the series name should not appear in the book title; it should only be in a separate series name field).

Unfortunately, these fairly simple requirements turned into a big challenge, largely due to inconsistencies in data provided by publishers. Every time we found an inconsistency, we had to try to account for it in our code. For a while, our lives were pretty much importing, testing, noting issues, rewriting our parser to fix those issues, then rinse and repeat!

Here are a few examples that drove me bonkers!

Inconsistently spelled author names

Sarah J. Maas – author of Throne of Glass – had her name presented in two different ways:

Sarah J. Maas
Sarah J Maas

One had a period after the “J” and the other didn’t. It seems like a minor thing, but with that difference the names are no longer an exact match! That’s something we have to account for in our code.

Mary E. Pearson, author of The Remnant Chronicles series, had the exact same problem with her name.
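To give a rough idea of what "accounting for it" can look like, here is a minimal Python sketch. This is not Bookhype's actual code. The function name `author_match_key` is made up for illustration, and it assumes that periods, extra spaces, and letter case are the only differences between the two spellings.

```python
import re


def author_match_key(name: str) -> str:
    """Build a comparison key so 'Sarah J. Maas' and 'Sarah J Maas' match.

    Hypothetical helper: the key is only used for matching records,
    never for display.
    """
    # Treat periods as spaces so "J." and "J" look the same.
    without_periods = name.replace(".", " ")
    # Collapse any run of whitespace into a single space.
    collapsed = re.sub(r"\s+", " ", without_periods).strip()
    # Ignore letter case when comparing.
    return collapsed.casefold()
```

With this key, "Sarah J. Maas" and "Sarah J Maas" both become "sarah j maas". It would not catch initials written without any separator (for example "JK" versus "J.K."), so a real matcher would likely need more rules.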

Inconsistently presented series names

This one was a bit trickier than just a missing period or two. It would appear that publishers love to occasionally (but not always) tack “Trilogy” or “Duology” onto the end of series names. That wouldn’t be a problem if they were consistent, but they’re not! A few we ran into:

Divergent Trilogy
Divergent

Mistborn Trilogy
Mistborn, 2 (including the position with the series name, rather than putting it in the separate position field)
Mistborn

We had to work to normalize these names to ensure they’d still be linked as one series, rather than entered as separate ones. This took a lot of trial and error as we identified all these little quirks.

But, I will also say this ended up being useful because if they do give us the word “Duology” then we can automatically determine the planned series length as being two books and store that! 😉 Small wins!
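As an illustration only, the normalization described above might be sketched like this in Python. The names `normalize_series` and `IMPLIED_LENGTH` are invented for the example, and it assumes "Duology" and "Trilogy" are the only suffixes worth handling and that a trailing ", 2" always means a position.

```python
import re

# Assumption: suffix words that imply a planned series length.
IMPLIED_LENGTH = {"duology": 2, "trilogy": 3}


def normalize_series(raw: str):
    """Return (name, position, planned_length) from a raw series string."""
    name = raw.strip()
    position = None
    planned_length = None

    # "Mistborn, 2" -> name "Mistborn", position 2
    match = re.fullmatch(r"(.+?),\s*(\d+)", name)
    if match:
        name = match.group(1).strip()
        position = int(match.group(2))

    # "Mistborn Trilogy" -> name "Mistborn", planned length 3
    match = re.fullmatch(r"(.+?)\s+(duology|trilogy)", name, flags=re.IGNORECASE)
    if match:
        name = match.group(1).strip()
        planned_length = IMPLIED_LENGTH[match.group(2).lower()]

    return name, position, planned_length
```

With this sketch, "Mistborn Trilogy", "Mistborn, 2" and "Mistborn" all come back with the name "Mistborn", so they could be linked to one series record. A series whose real name ends in "Trilogy" would be wrongly shortened, which is exactly the kind of quirk that takes trial and error to find.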

Series names in the book title instead of listed separately

There are separate fields for book title and series name, yet it seems very common to find the series name mashed in with the book title. I wrote about this already when we did our first test imports. But to recap, here were a few examples:

Title: The Book of Dust: The Secret Commonwealth (Book of Dust, Volume 2)
Series: The Book of Dust #2

The series name actually appears a total of three times!

As a prefix of the title (“The Book of Dust: …”)
In parentheses, after the title (“(Book of Dust, Volume 2)”)
Then separately in the series name field.

What a mess!

Here’s another example:

Title: The Maze Runner (Maze Runner, Book One)
Series: The Maze Runner Series

Again, the series name appeared twice: once unnecessarily in the title itself, and a second time in the actual series field.

This duplication was a bit of a nightmare because we had to very carefully try to pull the series name out of the title. Some things we had to be aware of were:

Sometimes the book title is legitimately the same as the series name. The Maze Runner is an example. The title of the book is “The Maze Runner” and that’s also the series name. So we had to make sure we were only removing the series name if it appeared in addition to the actual title.
Sometimes the series name was tacked onto the book title but not in exactly the same format. Again, The Maze Runner is a perfect example. It was included in the title as “Maze Runner”, but the actual series name is “The Maze Runner Series”. It’s not an exact match, which means we have to try to look for stray stop words and words like “Series” to remove those as well.
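Those two rules can be sketched roughly as follows. This is a simplified Python illustration, not our real parser. The helper names `series_key` and `clean_title` are hypothetical, and the sketch assumes the series field has already had any position (like "#2") split off, that the only stray words are a leading article and a trailing "Series", and that the duplicate shows up either as a "Prefix: " or in trailing parentheses.

```python
import re


def series_key(name: str) -> str:
    """Loose comparison key: ignore case, a leading article, a trailing 'Series'."""
    key = name.casefold().strip()
    key = re.sub(r"^(the|a|an)\s+", "", key)
    key = re.sub(r"\s+series$", "", key)
    return key


def clean_title(title: str, series: str) -> str:
    """Remove extra copies of the series name from a title."""
    target = series_key(series)
    cleaned = title.strip()

    # Trailing parenthetical such as "(Maze Runner, Book One)".
    match = re.fullmatch(r"(.+?)\s*\(([^()]*)\)", cleaned)
    if match:
        # Compare only the part before the first comma with the series name.
        inside_name = match.group(2).split(",")[0]
        if series_key(inside_name) == target:
            cleaned = match.group(1).strip()

    # Prefix such as "The Book of Dust: ...". Only strip it when something
    # is left over, so a title that equals the series name survives.
    prefix, separator, rest = cleaned.partition(":")
    if separator and rest.strip() and series_key(prefix) == target:
        cleaned = rest.strip()

    return cleaned
```

Given the title "The Maze Runner (Maze Runner, Book One)" and the series "The Maze Runner Series", this returns "The Maze Runner": the parenthetical goes, but the title itself is kept even though it matches the series name.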

This is still something we haven’t gotten perfect because of all the different possibilities. But it’s something we worked very hard to try to account for because Bookhype stores book title and series name separately. I hated seeing results appearing on book pages like this:

Search result for "The Maze Runner (Maze Runner, Book One) (The Maze Runner Series, #1)"

It just looks so messy with the series name (and position) being duplicated.

This also brings me to a related point…

Series positions – sometimes numeric, sometimes written out… sometimes roman numerals?

Most of the time the position in the series is inserted as an actual number. For example:

Series Name: Aven Cycle
Series Position: 1

But sometimes it’s written out instead:

Series Name: Aven Cycle
Series Position: One

And sometimes the position isn’t provided separately and is just included in the name… sometimes with the word “volume” or “book”:

Series Name: Aven Cycle Volume 1
Series Position: (not provided)

Series Name: Aven Cycle Book One
Series Position: (not provided)

And sometimes we get roman numerals!

Series Name: Aven Cycle
Series Position: Vol. I

Ugh! This makes it very hard for us to extract the actual numerical value so we can list books in the correct order.

But probably most unfortunately: sometimes we don’t get a series position at all, even though I know there is one. 🙁 It’s a lot easier for us to extract the position from a wonky format than to insert one where it doesn’t even exist!
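For the formats that do include a position somewhere, the extraction could be sketched like this in Python. It is only an illustration under some assumptions: the names `parse_position`, `roman_to_int` and `split_series` are made up, number words are only covered up to ten, and roman numerals only up to 39 (XXXIX).

```python
import re

# Assumption: written-out positions above ten are rare enough to skip here.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
ROMAN_PATTERN = re.compile(r"x{0,3}(ix|iv|v?i{0,3})")
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10}


def roman_to_int(text: str):
    """Convert a roman numeral from I to XXXIX, or return None."""
    text = text.lower()
    if not text or not ROMAN_PATTERN.fullmatch(text):
        return None
    total = 0
    for current, following in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[current]
        # A smaller numeral before a larger one is subtracted (IV = 4).
        if ROMAN_VALUES.get(following, 0) > value:
            total -= value
        else:
            total += value
    return total


def parse_position(raw):
    """Turn '1', 'One', 'Vol. I', 'Book One', '#2' into an int, else None."""
    if not raw:
        return None
    # Drop a leading label such as "Volume", "Vol.", "Book" or "#".
    token = re.sub(r"^(volume|vol\.?|book|#)\s*", "", raw.strip(),
                   flags=re.IGNORECASE)
    if re.fullmatch(r"[0-9]+", token):
        return int(token)
    if token.lower() in NUMBER_WORDS:
        return NUMBER_WORDS[token.lower()]
    return roman_to_int(token)


def split_series(name: str, position=None):
    """Return (series_name, numeric_position) from the two raw fields."""
    parsed = parse_position(position)
    if parsed is not None:
        return name.strip(), parsed
    # No usable position field: look for "... Volume 1" / "... Book One".
    match = re.fullmatch(r"(.+?)\s+((?:volume|vol\.?|book)\s+\S+)",
                         name.strip(), flags=re.IGNORECASE)
    if match:
        parsed = parse_position(match.group(2))
        if parsed is not None:
            return match.group(1), parsed
    return name.strip(), None
```

All five Aven Cycle variants above come out of `split_series` as the name "Aven Cycle" with position 1. When there is no position anywhere, it returns None for the position rather than guessing.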

Even publisher names!

As just one example, here’s the huge list of different ways HarperCollins chose to present their company name:

HarperCollins
HarperCollins Publishers
HarperCollins Publishers Ltd
HarperCollins Publishers Inc

My data-organized brain hates having HarperCollins books spread across so many records. Surely they should all be under “HarperCollins”?

However, this one I wasn’t totally sure of. I know there’s a legitimate reason for “HarperCollins Publishers Ltd” (sounds like the UK division) versus “HarperCollins Publishers Inc” (US division). But the others seem like duplicates to me.
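One cautious way to act on that hunch, sketched in Python: merge only the variants that differ by a trailing "Publishers", and leave anything ending in a legal suffix alone. The function name `publisher_key` and the `LEGAL_SUFFIXES` list are assumptions for this example, and whether "Ltd" and "Inc" should really stay separate is still an open question.

```python
# Assumption: these suffixes may signal a distinct regional company.
LEGAL_SUFFIXES = ("ltd", "inc")


def publisher_key(name: str) -> str:
    """Comparison key that merges only the 'safe' publisher duplicates."""
    words = name.replace(".", "").casefold().split()
    if words and words[-1] in LEGAL_SUFFIXES:
        # Keep "… Publishers Ltd" and "… Publishers Inc" as separate records.
        return " ".join(words)
    if len(words) > 1 and words[-1] == "publishers":
        # "HarperCollins Publishers" -> "harpercollins"
        words = words[:-1]
    return " ".join(words)
```

Under this sketch, "HarperCollins" and "HarperCollins Publishers" share a key, while the "Ltd" and "Inc" names each keep their own.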

And sometimes you get a date that doesn’t exist

When we get a date, we also get told what format it’s in. Most commonly for publication dates, this is “YYYYMMDD”. But you can’t rely on that actually being accurate. The most common issue was being told it was “YYYYMMDD”, but then only being given the year. That one’s easy enough to deal with. But the funniest one was being told it was “YYYYMMDD”, then being provided with…

19870200

For easier reading, that’s: 1987-02-00. Um… 00? That’s not a date! That’s when I learned I needed to have very strict validation and sanitization on the dates provided!
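Here is a minimal sketch of that kind of validation in Python, using the standard library's `datetime.date` to reject impossible dates. The function name `parse_publication_date` is hypothetical, and treating "00" as "this part wasn't provided" is my assumption for the example rather than anything the data format promises.

```python
from datetime import date


def parse_publication_date(raw: str):
    """Parse 'YYYY', 'YYYYMM' or 'YYYYMMDD' into (year, month, day).

    Missing or '00' parts come back as None. Returns None when the
    value cannot be a real date at all.
    """
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (4, 6, 8):
        return None

    year = int(digits[:4])
    month = int(digits[4:6]) if len(digits) >= 6 else 0
    day = int(digits[6:8]) if len(digits) == 8 else 0

    # A day without a month makes no sense (e.g. "19870015").
    if month == 0 and day != 0:
        return None

    try:
        # Validate the known parts, using 1 as a stand-in for unknown ones.
        date(year, month or 1, day or 1)
    except ValueError:
        return None

    return (year, month or None, day or None)
```

With this, "19870200" is kept as February 1987 with no day, a bare "1987" is kept as just the year, and something like "19870230" is rejected outright.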

We’ve done a lot, but it’s still not perfect

We have a ton of logic in place for trying to normalize this data and group related books together nicely. But despite all our efforts, things on Bookhype are not perfect. Sadly, the more popular a book is, the more of a mess its data is (because popular books tend to have more editions, which then invite more inconsistencies)… and because the book is more popular, you’ll probably notice some of this data mess!

I am sorry if you come across any wrong/messy data. I’d love for Bookhype data to be perfect, but if we waited until then, we’d never have launched! We do have “Report” links on most pages across the site, so if you do find an issue feel free to report it and I’ll get it fixed!