Why building a Goodreads alternative is not so easy…

This post may get a little technical in some places, but you might learn something about the world of book data!

If you’ve ever been involved in a discussion about how subjectively bad some people think Goodreads is – whether because it’s slow, has bad recommendations, doesn’t get active development, or simply because it’s Amazon-owned – you may have seen people say things like “why does no-one just make a Goodreads alternative, it can’t be that hard”. Is there any truth to this? On the face of it at least, yes. For someone with a bit of web programming time under their belt, creating a site with the same basic functionality as Goodreads is not a huge feat. So why do we not see Goodreads alternatives springing up all over the place and creating healthy competition? Mostly, the answer is book data.

The biggest problem any new site faces is simply where to obtain its data: a book site with no books gets no users and will die. So let’s take a look at some places a new site could get book data, and why those sources may or may not work.

Free Book Data Sources

In case you are not familiar with the term “API”, it is simply a method of access that allows programmers to easily extract data from a service. Rather than the data appearing on a page that is nicely designed and readable for humans, it is provided in a format that programming languages can easily make use of.
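To make that concrete, here is a tiny sketch of what API data typically looks like. The response below is invented for illustration – real APIs (Goodreads, Google Books, and so on) each use their own field names – but the principle is the same: structured data instead of a designed page.

```python
import json

# A hypothetical API response for a single book lookup. The field
# names here are made up; real services structure their data differently.
response_body = '{"title": "Dracula", "author": "Bram Stoker", "year": 1897}'

book = json.loads(response_body)  # parse the JSON text into a dictionary
print(book["title"], "by", book["author"])  # Dracula by Bram Stoker
```

A program can pick out exactly the fields it needs, which is far easier than trying to extract the same information from a human-readable web page.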

Goodreads API

The Goodreads API is a service Goodreads provides that allows you to retrieve book data from their site, and it is one of the richest sources of book information around. So why not just use it? Well, let’s take a look at some of the terms you have to agree to in order to use it (the numbering matches the agreement):

1. Not request any method more than once a second. Goodreads tracks all requests made by developers.

2. Clearly display the Goodreads name or logo on any location where Goodreads data appears

3. Link back to the page on Goodreads where the data appears.

4. Not use the API to harvest or index Goodreads data without our explicit written consent.

5. You may store information obtained from the Goodreads API for up to 24 hours. Goodreads needs the ability to modify, remove, and update the order of our data, which caching would prevent.

9. Not use the Goodreads data as part of a commercial product without our explicit written consent.

https://www.goodreads.com/api/terms

So you can see that in order to use their API you have to provide prominent linkbacks to Goodreads (including directly to the book page for any book data you used) – but that isn’t even the biggest problem. The most immediate issue for any programmer is term 4. When the terms refer to “indexing” the data, they mean doing something like making it searchable – which any Goodreads competitor is obviously going to need.

Terms 1 and 5 are particularly problematic in combination. You are only allowed to make one request per second, and you are not allowed to store data you’ve retrieved for more than 24 hours. Running the numbers is easy: there are 86,400 seconds in a day (which is also the maximum number of API calls you can make), so if you have 100,000 books in your database that came from the Goodreads API, you will have to delete 13,600 of them within 24 hours, because you simply don’t have enough API calls to go back and refresh them all.
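The arithmetic above can be sketched in a few lines (the 100,000-book catalogue is a hypothetical figure, and this is the best case, where every single call is spent refreshing and none are left for actually serving users):

```python
# One request per second, and stored data expires after 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 -- also the daily API call budget
books_in_db = 100_000            # hypothetical catalogue size

refreshable = min(books_in_db, SECONDS_PER_DAY)  # most books you can re-fetch in a day
must_delete = books_in_db - refreshable
print(must_delete)  # 13600 books expire before they can ever be refreshed
```

And that budget still has to cover searches, page loads, and new additions, so in practice the sustainable catalogue size is far smaller.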

Term 9 states the data cannot be used in a commercial product. What exactly does that mean? The definition of a “commercial product” is deliberately grey, but it probably covers anything where you expect to make money – maybe even just displaying affiliate links.

Finally, sites using this API are still relying on an Amazon-owned company to provide their data. If your goal is to move to a site not affiliated with Amazon, does this fulfil that? That’s something only you can answer.

Amazon Product Advertising API

This API is, as the name implies, provided by Amazon to access all of their products. This is probably the most complete product collection in the world.

The problem is that it’s very restrictive about what you can and can’t do, and some sites seem to simply ignore parts of the agreement. Since Amazon’s agreement is much longer and more verbose than Goodreads’, I won’t include the whole thing, just some paraphrasing.

Here are some of the restrictions:

  • You may not use the API to create “any site or application designed or intended for use with a mobile phone or other handheld device”. The terms go on to clarify that this essentially means any site that looks good on mobile, not just mobile-only sites. That covers practically any site a developer is going to build these days, and makes this API totally unusable while abiding by the terms.
  • You cannot use the data for any aggregation or analysis, which generally means you won’t be able to do things like book recommendations, because those require analyzing the data.
  • You may not store any data from the API for more than 24 hours. This presents the same problem noted above with Goodreads: you can’t hold onto the product data long enough to do anything useful with it.
  • You are only allowed to provide Amazon links for purchases, no other affiliate links or store links are allowed.

Just like Goodreads, a site using this API is still relying on Amazon for its entire operation.

Before being purchased by Amazon, Goodreads themselves actually stopped using the Amazon API due to its restrictive terms.

Google Books API

Google Books API is actually one of the more reasonable options in terms of what you can do with the data. Again, their terms are quite long, but here are some paraphrased restrictions:

  • You can only store the data for the length of time specified in each API response, which can vary.
  • You must provide prominent link backs to Google Books on the pages where the data is used, including an image linkback.

The biggest issue with the Google Books API is that by default you only get 1,000 requests per day (equivalent to only being able to add 1,000 books per day to your site). Once you factor in that you may need to “refresh” this data daily, as with the other APIs, you are back to square one: you will not have enough API requests to both refresh the data and maintain basic functionality for your site. You can ask for a higher quota, but as your site grows you will likely need millions of requests per day.
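Even the initial import is painful under the default quota. A quick back-of-envelope calculation (the catalogue size here is a hypothetical figure for illustration):

```python
DAILY_QUOTA = 1_000       # default Google Books API request limit per day
catalogue_size = 250_000  # hypothetical target catalogue

# Best case: every request adds one new book, with nothing left over
# for refreshing data or serving live searches.
days_to_import = catalogue_size // DAILY_QUOTA
print(days_to_import)  # 250 days just to load the catalogue once
```

That is months of importing before you even reach the refresh problem described above.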

There’s at least one alternative site out there right now that appears to be using the Google Books API without the required linkbacks.

Open Library

Open Library is an amazing project which seeks to index and catalogue as much book data as possible. You can also download their entire database in a useful format from their website.

There are two main issues with this data. The first is that because it’s user-provided, it can sometimes be a little quirky – for example, you might get a cat picture that a user has uploaded as a book cover instead of the actual cover! Secondly, the data is not frequently updated with new releases, which are generally something visitors to a Goodreads alternative are going to want to see.

The data can be good as a starting point or to supplement other data, but on its own it’s probably not complete enough to sustain a site like this.

Data Scraping

Data scraping is where, instead of using an API with all of its troublesome terms, you simply programmatically request the actual page from the website, and use tools to extract the data you want directly from it.
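A minimal sketch of the idea, using Python’s standard-library parser (the page markup and the `book-title` class name below are invented – a real site’s structure would differ, and can change without notice, which is exactly why scrapers are fragile):

```python
from html.parser import HTMLParser

# A pretend book page fetched from some site; in reality this would
# come from an HTTP request for the page's HTML.
page = '<html><body><h1 class="book-title">Dracula</h1></body></html>'

class TitleScraper(HTMLParser):
    """Pulls the text of the first <h1 class="book-title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and ("class", "book-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

scraper = TitleScraper()
scraper.feed(page)
print(scraper.title)  # Dracula
```

The scraper only works as long as the site keeps that exact markup; any redesign, or a simple block on your requests, breaks it.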

This may sound pretty perfect – you can use Goodreads data without having to abide by the API terms, right? Well, maybe. Setting aside the technical problems of scraping, in some countries it is not legal to scrape data when your goal is to create a direct competitor to the site you scraped from. It is also not difficult for the site being scraped to block your access, which effectively cuts off your source of book data.

There is also the question of whether getting your data from Goodreads like this actually removes you from relying on Amazon, which seems to be a big trigger point for people wanting to move away from Goodreads right now. Again, whether using Amazon sources for your data counts as moving away from Amazon is your call as the person using the site.

One of the most popular Goodreads competitors right now appears to be using this method and lifting their data directly from Goodreads.

Why not just use the data anyway?

Some of you may have read the above terms and thought: why not just use the data anyway? It’s only Amazon or Google getting “hurt” by this, right? They are big and they can take the hit.

That may be true, but it’s important to think about the damage they can do to you, not the damage you are doing to them. If your goal is to create a genuine alternative to Goodreads, then you need to look to the future and assume you are successful. What do you think will happen then? Amazon and Goodreads may well overlook breaches of their API terms (like storing/indexing data, using it commercially, or not providing linkbacks) while you are a small site, but what about when you actually get on their radar? If you have been using their APIs in breach of their terms, they may suddenly start enforcing them or cut off your access entirely. What if their legal team, which probably has more lawyers than you have people working on your entire site, sends you a cease and desist along with a demand to delete all data you pulled from the APIs? Your site is destroyed overnight.

In fact there has already been at least one case of exactly this happening. FictFact unknowingly violated Amazon’s API terms in a small way, and one day Amazon decided to enforce the API terms and pulled the plug on them, killing their entire project and making it impossible for them to continue operating.

Then there’s the legally grey option of data scraping. You sort of have the same problem, it doesn’t really matter if it’s definitely legal or not, if there’s any argument to be made then a huge international corporation will simply tie you up in legal paperwork until you are drained of funds and can no longer operate.

These may seem like very pessimistic views of what could happen, but they are definitely possibilities, and in my view it is simply irresponsible to use any of these options (while breaching their terms) to create your alternative site.

So what can I use?

The answer here is commercial book data feeds. These are the data feeds provided to libraries and book stores, and they can be used to build a relatively complete data set. They can also be used for indexing, and they generally don’t come with restrictive terms like the APIs above.

So why doesn’t everyone use these? The answer is in the word “commercial”: these feeds are not free. At Bookhype we personally put our own money up to pay for contracts for these data feeds, in the hope that our site will succeed, and they cost in the high four figures to low five figures USD per year. That’s right – usually around $10,000/year is required just to purchase the necessary book data. Now perhaps you can see why Goodreads competitors with rich book data do not pop up overnight.

Amazon’s Monopoly & Compromise

If you want to switch away from Goodreads then perhaps try to have everything I’ve said above in mind when you do so.

Amazon has a total monopoly on all self-published Kindle books and on Audible audiobooks. The only way to even learn that these books exist is either from Goodreads or from the Amazon Product Advertising API. Amazon holds onto this data and doesn’t provide it even to commercial feeds. Goodreads, of course, gets this data directly from its parent company, Amazon, giving it a massive advantage for tracking books only available on Kindle.

Bookhype has over 18 million works and over 28 million editions, but as explained above, we do not get any Kindle or Audible editions in our data feeds. We genuinely try not to rely on Amazon for any of our data, and will not use the Product Advertising API to get any of it. This means that when you search for a Kindle-only book on our site, you may not find that edition unless it has been manually added, because Amazon is holding all the cards here. So when you are considering switching to a Goodreads alternative, you may have to accept that the data won’t be quite as feature-complete as Amazon’s.

If another competitor site has all of the Kindle or Audible specific editions you search for, then ask yourself – how do they have those editions available? Where are they getting their data? Does this site fulfil the reasons I want to switch away from Goodreads? Does the data behind this site come from a sustainable and ethical source? Perhaps now you have your answers.
