The lengths you go to... tips to improve Discogs searches via the API

12 Sep 2011

bliss has been using Discogs as an online source of album art for some time. It offers enormous coverage of music releases, both mainstream releases and rarities. It even includes records for magazine mounted freebie CDs and the like!

bliss uses the Discogs API to find album art given the album and artist names tagged inside a digital music collection. The album and artist names are built into a query that is sent to Discogs, and Discogs replies with the results. But blindly passing in these names does not guarantee that results are returned or that the results are accurate. Indeed, the process of tuning searches for music releases against music databases (bliss also queries MusicBrainz and Wikipedia via DBpedia) is a constant balancing act between accuracy and quantity. Make the query too general and you get lots of matches which aren't necessarily the correct release. Make the query very specific and the number of matches falls. Neither is a desirable situation for an automated tool like bliss.

Edit (12th June, 2013): I've written an updated version of these tips on the OneMusicAPI blog.
Edit (29th September, 2011): It was pointed out to me that the original tips here were based on the old Discogs API. API v2 has been released (and here's the documentation). I'm happy to report that all these tips continue to work in the new API. I've updated the URLs to reflect the new API's host, api.discogs.com. Also, one of the new features is a choice of JSON or XML results. I chose XML so that I could re-use old code. Finally, an API key is no longer needed, so you can now click directly on these sample URLs and see the results Discogs returns.

A further nuance of the way Discogs works is that there's a daily limit to the number of queries you are permitted from any one computer (well, technically, IP address). If you hit this limit: no more queries. This is important because if you try to compensate for the low number of results from highly specific requests by making lots of highly specific requests, you can soon run out of queries.
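To make the limit concrete, here is a minimal sketch of a client-side counter that refuses further queries once a daily budget is spent. The class name, the `limit` value and the injectable `today` function are all hypothetical; the actual limit Discogs applies isn't stated here, so it is left as a parameter.

```python
import datetime


class DailyBudget:
    """Counts queries and resets the count when the calendar day changes."""

    def __init__(self, limit, today=datetime.date.today):
        self.limit = limit      # assumed daily allowance, not a Discogs figure
        self._today = today     # injectable clock, handy for testing
        self._day = today()
        self._used = 0

    def try_spend(self):
        """Return True and count one query if budget remains, else False."""
        day = self._today()
        if day != self._day:
            # A new day has started, so start counting from zero again.
            self._day, self._used = day, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

Calling `try_spend()` before each request, and skipping the request when it returns False, keeps the more speculative queries from starving the important ones.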

Over the past few weeks I've been slowly improving the Discogs search within bliss by adopting various mechanisms. I hope these tips are useful to other users of the Discogs API.

Please note: where I've shown URLs below I have not URL encoded them to make it easier to read what's going on. You'll have to do the URL encoding yourself.
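As a rough illustration of that encoding step, here is a small Python sketch using the standard library. The `search_url` helper is a hypothetical name of my own; the host and the `type`, `q` and `f` parameters are the ones used in the sample URLs in this post.

```python
from urllib.parse import urlencode

API_SEARCH = "http://api.discogs.com/search"  # host used by the sample URLs


def search_url(query, release_type="releases", fmt="xml"):
    """Return a search URL with the raw query string safely URL encoded."""
    params = {"type": release_type, "q": query, "f": fmt}
    # urlencode percent-encodes ':' and '"', and turns spaces into '+'.
    return API_SEARCH + "?" + urlencode(params)


print(search_url('title:thriller AND genre:"Funk / Soul"'))
```

Note that a literal `+` inside the query is encoded as `%2B`, so it survives as a `+` when the server decodes it rather than turning into a space.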

Know the API: learn Lucene search syntax

Under the covers, Discogs uses Lucene, via Solr, to index the Discogs database and provide free text searching. The advantage of this is that search becomes extremely powerful. The downside is that you need to learn Lucene syntax to make use of it.

The use of Lucene isn't explained explicitly in the API docs. Instead, the documentation for the search function simply says to pass the query as the 'q' querystring parameter. An example is:

http://api.discogs.com/search?type=all&q=thriller&f=xml

This simply searches for all occurrences of 'thriller' in all of Discogs' enormous database. This brings back labels, artists and a little-known album by a chap called Michael Jackson. What if it's this album you wanted to find?

Well, this 'q' can be any query made using the advanced Discogs search syntax which is actually Lucene search syntax. So, try prefixing the 'thriller' text with the field name 'title':

http://api.discogs.com/search?type=releases&q=title%3Athriller&f=xml

This brings back only releases with the word 'thriller' in the title. Now, you might have noticed this in the Discogs search guide:

You can add a field name to your query to restrict the search to certain release fields. format: "field:word". The following fields can be used: catno:, format:, artist:, label:, country:, track:, style:

And yet 'title' isn't there. It appears this works for more fields than the guide mentions. For instance:

http://api.discogs.com/search?type=releases&q=title:thriller AND genre:"Funk / Soul"&f=xml

Returns all albums with "thriller" in the title which also have "Funk / Soul" in the genre field. Genre is also not mentioned in the search syntax, yet appears to work.

And now we go mad. Lucene supports fuzzy searching with the '~' (tilde) operator.

http://api.discogs.com/search?type=releases&q=title:thriller~&f=xml

... gives us albums called "Thrillzz" as well as "Thriller". This can be useful if you do not 100% trust the exact accuracy of the album name or artist name you are using to search Discogs.
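The fielded and fuzzy queries above can be assembled with a couple of small helpers. This is only a sketch: `field_term` and `all_of` are hypothetical names, and single words are passed through without escaping Lucene's special characters, which a real client would need to handle.

```python
def field_term(field, text, fuzzy=False):
    """Build one Lucene clause, e.g. title:thriller or genre:"Funk / Soul"."""
    words = text.split()
    if len(words) == 1:
        clause = f"{field}:{words[0]}"
        return clause + "~" if fuzzy else clause
    if fuzzy:
        # In Lucene, ~ after a quoted phrase is a proximity search instead.
        raise ValueError("fuzzy matching only applies to single words")
    # Multi-word values are quoted; escape backslashes and embedded quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def all_of(*clauses):
    """Require every clause to match by joining them with AND."""
    return " AND ".join(clauses)


print(all_of(field_term("title", "thriller"), field_term("genre", "Funk / Soul")))
print(field_term("title", "thriller", fuzzy=True))
```

The resulting strings still need URL encoding before they go into the 'q' parameter.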

And now the warning. The fact that Lucene is not mentioned in the Discogs docs may make it a bad idea to rely on it. Its absence from the docs could be interpreted as meaning that Lucene syntax is 'unsupported'. The same goes for using fields in queries other than those explicitly mentioned. If Discogs change their search server, that may well break your queries. Don't say I didn't warn you.

Check your results - sanity check the tracks

File this one under 'accuracy'. When you're firing queries against Discogs for albums, you may get a lot of results back. Don't blindly accept the first one that's returned; check that the album looks like the one you are asking for. Discogs' database is enormous, and for any one album there can be many, many entries for different releases of the same album. For instance, check out these different releases for Is This It. You'll also notice album art differences between different releases. Any one of these could've been returned by your query, and they could arrive in any order.

There are a few ways of sanity checking the results. One useful way is to check that the tracks for a release match what you expect. This appears to be a good way of improving accuracy where you have common album names for compilations (there appear to be many different releases of The Best of Nina Simone, for example, with different art and track listings).
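One way to sketch that track check is to compare normalised track titles and accept a release only if enough of them match. The function names and the 0.8 threshold below are assumptions of mine, not what bliss actually does, and the normalisation only handles ASCII titles.

```python
import re


def normalise(title):
    """Lower-case and collapse punctuation so small tagging differences vanish."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def track_overlap(expected, candidate):
    """Fraction of the expected track titles found in the candidate release."""
    wanted = {normalise(t) for t in expected}
    found = {normalise(t) for t in candidate}
    if not wanted:
        return 0.0
    return len(wanted & found) / len(wanted)


def looks_like_same_release(expected, candidate, threshold=0.8):
    """True when enough expected tracks appear in the candidate (assumed cutoff)."""
    return track_overlap(expected, candidate) >= threshold
```

Here `expected` would be the track titles tagged in the local files and `candidate` the track titles parsed from a Discogs release. Using a fraction rather than exact equality tolerates bonus tracks on one side.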

Specify release formats

Discogs is so exhaustive it includes bootlegs, promo releases, cassette releases and more. If you know your audience, you may know they are unlikely to own these. In this case, it is possible to exclude such releases like so:

http://api.discogs.com/search?type=releases&q=-format:"promo"+format:"album"+format:"CD"+title:"thriller"&f=xml

This looks for all releases entitled "thriller" which are CD albums and NOT promos. This, again, is Lucene syntax allowing us to specify negative searches.
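A query like this can be generated from lists of wanted and unwanted formats. The sketch below assumes the `+` in the URL above is Lucene's 'required' operator (if it is really just an encoded space, drop the `+` prefixes); `format_filtered_query` is a hypothetical helper name.

```python
def format_filtered_query(title, require=(), exclude=()):
    """Build a Lucene query requiring some formats and excluding others."""
    clauses = [f'-format:"{name}"' for name in exclude]   # must NOT match
    clauses += [f'+format:"{name}"' for name in require]  # must match
    clauses.append(f'+title:"{title}"')
    # Clauses are separated by spaces; URL encode the result before sending.
    return " ".join(clauses)


print(format_filtered_query("thriller", require=["album", "CD"], exclude=["promo"]))
```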

Try synonyms

Time to increase the number of matches! One approach I've found useful is to try synonyms. I was searching, for example, for "7 Drunken Nights" by The Dubliners. It turns out this is stored in Discogs as "Seven Drunken Nights", probably correctly. By swapping "7" for "Seven" I got a hit. The point is that sometimes neither your source data nor Discogs' data can be 100% trusted for such amorphous concepts as 'album titles'. For instance, Sufjan Stevens' "Illinoise" album seems to have many different canonical titles.

Of course, results from such a strategy should be sanity checked to make sure they remain accurate.

One concern here is to balance the number of attempts using synonyms with the request limit. It may be advisable to temper the attempts you make if you have a lot of requests queued up.
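As an illustration of the digit and number-word swap, here is a small sketch that generates alternative titles and caps how many it returns, so the extra attempts stay within the request limit. The word table, the function name and the default cap are all assumptions for the example.

```python
DIGIT_TO_WORD = {
    "1": "One", "2": "Two", "3": "Three", "4": "Four", "5": "Five",
    "6": "Six", "7": "Seven", "8": "Eight", "9": "Nine", "10": "Ten",
}
WORD_TO_DIGIT = {word.lower(): digit for digit, word in DIGIT_TO_WORD.items()}


def title_variants(title, max_variants=3):
    """Return alternative titles with one number swapped per variant."""
    words = title.split()
    variants = []
    for i, word in enumerate(words):
        swap = DIGIT_TO_WORD.get(word) or WORD_TO_DIGIT.get(word.lower())
        if swap:
            variants.append(" ".join(words[:i] + [swap] + words[i + 1:]))
    # Cap the list so synonym attempts do not eat the daily request budget.
    return variants[:max_variants]


print(title_variants("7 Drunken Nights"))
```

Each variant is a separate query, so lowering `max_variants` when many requests are queued is one simple way to temper the attempts.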

Split album titles around a token

It's common to see artifacts within an album title separating what may be titles, subtitles and sometimes artist names. This happens where the provider of the title is not 100% sure of the canonical name of the release, and is quite common in online tagging databases. For instance, I tried to query for one album with the title "After Hours: Northern Soul Masters", where the title had been provided by FreeDB, but this is recorded in Discogs as simply "After Hours".

By choosing common delimiters to split titles around (I chose colons, forward slashes and hyphens) and then querying using the separate parts it can be possible to improve the number of matches. Again, sanity check to make sure the results are accurate.
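The splitting step might look something like the sketch below. It uses the three delimiters mentioned above, with one assumption added: a hyphen only counts as a delimiter when it has whitespace on both sides, so hyphenated words stay intact. `title_parts` is a hypothetical name.

```python
import re

# Colons and slashes split anywhere; hyphens only when surrounded by spaces.
DELIMITERS = re.compile(r"\s*[:/]\s*|\s+-\s+")


def title_parts(title):
    """Return the pieces of a title split around common delimiters."""
    parts = [part.strip() for part in DELIMITERS.split(title)]
    # Drop empty pieces, and return nothing if no delimiter was present.
    return [part for part in parts if part and part != title]


print(title_parts("After Hours: Northern Soul Masters"))
```

Each part can then be tried as its own title query, ideally the first part first, since that is usually the main title.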

Mixed caps

It turns out that a multiword query within quote marks, let's say "Ally McBeal", is case sensitive. For instance,

http://api.discogs.com/search?type=releases&q=title:"ally mcbeal"&f=xml

Gives zero results, while:

http://api.discogs.com/search?type=releases&q=title:"ally mcBeal"&f=xml

Gives the lot. Note that http://api.discogs.com/search?type=releases&q=title:"Ally mcbeal"&f=xml also gives no results, which suggests the case sensitivity only applies in the middle of strings, not the start. So, if you see a similar title, try preserving the case in the middle of strings.

And in case you were wondering what an Ally McBeal album was doing in my collection... it's my wife's, right?

I hope these nuggets have helped you improve your Discogs searching!
