Extracting series names from book titles – wins and failures

We’re still working hard on perfecting our code for removing series names from book titles. This definitely requires a lot of trial and error!

Essentially, our goal is to end up with three separate pieces of data:

  1. Title of this book only.
  2. Name of the series it’s part of.
  3. This book’s position within that series.
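To make the target concrete, here is a minimal sketch of how those three pieces could be held together. The class name `ParsedTitle` and its field names are made up for illustration and are not our actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedTitle:
    """Hypothetical container for the three pieces we want per book."""
    title: str                              # title of this book only
    series_name: Optional[str] = None       # None when the book is standalone
    series_position: Optional[int] = None   # None when the position is unknown


book = ParsedTitle("Lost Crow Conspiracy", "Blood Rose Rebellion", 2)
standalone = ParsedTitle("Some Standalone Book")
```

Making the series fields optional lets standalone books pass through without inventing a series for them.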

The tricky part is that the series name and/or position are often embedded in the title itself, and we have to pull them out.

The formatting is all really inconsistent, which makes this a difficult process to get right.

Here are a few wins I’m pretty proud of:

Successfully removed series names

Original title: The Maze Runner (Maze Runner, Book One)
Our modified title: The Maze Runner
Series name: The Maze Runner
Series position: 1

Original title: Lost Crow Conspiracy (Blood Rose Rebellion, Book 2)
Our modified title: Lost Crow Conspiracy
Series name: Blood Rose Rebellion
Series position: 2

Original title: Unicorn Academy #4: Isabel and Cloud
Our modified title: Isabel and Cloud
Series name: Unicorn Academy
Series position: 4

Original title: Yellow Knight of Oz (Wonderful Oz Book, No 24)
Our modified title: Yellow Knight of Oz
Series name: Wonderful Oz Book
Series position: 24
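The wins above follow two conventions: a trailing "(Series, Book N)" parenthetical and a leading "Series #N:" prefix. Below is a rough sketch of how patterns like these could be matched with regular expressions. It is not our production code, and the names `PAREN_SERIES`, `HASH_SERIES`, `WORD_NUMBERS`, `to_position` and `parse_title` are invented for this example.

```python
import re

# Spelled-out positions such as "Book One" (assumed list, extend as needed).
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Matches "Title (Series, Book 2)" or "Title (Series, No 24)".
PAREN_SERIES = re.compile(
    r"^(?P<title>.+?)\s*\((?P<series>[^()]+?),?\s+"
    r"(?:Book|No\.?|Vol\.?|Volume)\s+(?P<pos>\w+)\)\s*$",
    re.IGNORECASE,
)

# Matches "Series #4: Title".
HASH_SERIES = re.compile(r"^(?P<series>.+?)\s*#(?P<pos>\d+):\s*(?P<title>.+)$")


def to_position(text):
    """Convert "2" or "One" to an int; return None if unrecognized."""
    if text.isdigit():
        return int(text)
    return WORD_NUMBERS.get(text.lower())


def parse_title(raw):
    """Return (title, series_name, series_position) for a raw title."""
    for pattern in (PAREN_SERIES, HASH_SERIES):
        m = pattern.match(raw.strip())
        if m:
            return (m.group("title").strip(),
                    m.group("series").strip(),
                    to_position(m.group("pos")))
    # No known convention matched: leave the title untouched.
    return raw.strip(), None, None
```

Note that this sketch returns the series name exactly as it is written in the title, so the Maze Runner example comes back as "Maze Runner". Any normalization of series names would be a separate step.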

But sometimes we massively screw up!

Most of the results are coming out pretty well, but we definitely have some screw-ups that we need to look into.

Original title: La caída de los gigantes (The Century 1) / Fall of Giants (The Century, Book 1)
Our modified title: La caída de los gigantes
Series name: The Century 1) / Fall of Giants (The Century
Series position: 1

We got the modified title and position okay, but that series name is messed up!
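The bad series name runs from the first opening parenthesis to the last comma, which suggests the match is spanning both parenthetical groups. One possible fix, sketched below under that assumption, is to pull out each parenthetical on its own (never letting a match cross a parenthesis) and take the first one that looks like a series. The names `PAREN`, `SERIES_POS` and `parse_multi_paren` are hypothetical.

```python
import re

# Each parenthetical on its own; [^()]* cannot cross into another group.
PAREN = re.compile(r"\(([^()]*)\)")

# Inside a group: "The Century 1" or "The Century, Book 1".
SERIES_POS = re.compile(
    r"^(?P<series>.+?),?\s+(?:Book\s+)?(?P<pos>\d+)$", re.IGNORECASE
)


def parse_multi_paren(raw):
    """Return (title, series_name, series_position) for titles that may
    contain several parenthetical groups, e.g. bilingual editions."""
    title = raw.split("(", 1)[0].strip()  # text before the first "("
    for group in PAREN.findall(raw):
        m = SERIES_POS.match(group.strip())
        if m:
            return title, m.group("series"), int(m.group("pos"))
    return title, None, None


raw = ("La caída de los gigantes (The Century 1) / "
       "Fall of Giants (The Century, Book 1)")
result = parse_multi_paren(raw)
```

For the title above, this sketch keeps "La caída de los gigantes" as the title and reads "The Century" and position 1 from the first parenthetical.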

Original title: Louis L’Amour’s Lost Treasures: Volume 1
Our modified title: Volume 1
Series name: Louis L’Amour’s Lost Treasures
Series position: 1

This one is tricky. It follows a common convention of (Series name): (title), which is why we parsed it out that way, but obviously “Volume 1” isn’t the title of the book. Additionally, I’m not sure if it’s technically part of a series called “Louis L’Amour’s Lost Treasures” at all.
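One conservative option for cases like this is to refuse the split whenever the leftover "title" is nothing but a volume label. The sketch below assumes a simple split on the first colon. `GENERIC_TITLE` and `split_on_colon` are made-up names, and the list of label words is a guess.

```python
import re

# Leftovers such as "Volume 1", "Book 2" or "Part 3" are labels, not titles.
GENERIC_TITLE = re.compile(
    r"^(?:volume|vol\.?|book|part|no\.?)\s+\d+$", re.IGNORECASE
)


def split_on_colon(raw):
    """Split "Series: Title" into (title, series_name).

    If there is no colon, or the part after it is only a volume label,
    keep the whole string as the title and report no series.
    """
    series, sep, rest = raw.partition(":")
    rest = rest.strip()
    if not sep or not rest or GENERIC_TITLE.match(rest):
        return raw, None
    return rest, series.strip()
```

With this guard, "Louis L’Amour’s Lost Treasures: Volume 1" stays intact as the title. Whether the number should still be recorded as a series position is a separate decision that this sketch leaves open.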

Original title: Legend: the Graphic Novel
Our modified title: the Graphic Novel
Series name: Legend

Another tricky one. It is part of the “Legend” series, which is why we picked that out, but the book’s title happens to be the same as the series name, so “Legend” should have stayed in the title in this case.
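A possible guard here is to check whether the part after the colon is just a format descriptor rather than a real title, and if so leave the title alone while still recording the series. This is only a sketch: the `FORMAT_SUBTITLES` list and the function name `split_series_prefix` are invented, and it assumes we already believe the prefix is a series name.

```python
# Subtitles that describe the format rather than name the book (assumed list).
FORMAT_SUBTITLES = {
    "the graphic novel",
    "a graphic novel",
    "graphic novel",
    "a novel",
}


def split_series_prefix(raw):
    """Split "Series: Title" into (title, series_name).

    When the part after the colon is only a format descriptor, the full
    original string is kept as the title so the series word is not lost.
    """
    prefix, sep, rest = raw.partition(":")
    prefix, rest = prefix.strip(), rest.strip()
    if not sep or not rest:
        return raw, None
    if rest.lower() in FORMAT_SUBTITLES:
        return raw, prefix
    return rest, prefix
```

For "Legend: the Graphic Novel" this returns the full original string as the title and "Legend" as the series name.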

As you can probably see, this is extremely difficult to get right, and frankly there’s no way it’s going to be perfect. Either we’re more conservative with our replacements and end up with series names left in some titles, or we’re more heavy-handed and end up removing words that should have stayed.

We have some tough decisions to make!