Hacking Azure Table Storage to do LIKE queries

We use Azure Table Storage as our database for Report URI and it's an incredibly simple yet powerful storage solution. It scales transparently, has amazing performance and is cheap to use too. There was one limitation with querying data in Table Storage that we recently worked around and that's the lack of LIKE queries.


Azure Table Storage

It was way back in June 2015 when I first talked about Choosing and using Azure Table Storage for report-uri.io and whilst we released many updates to Report URI since then, we're still using Azure Table Storage as our underlying storage provider. If you want more information about Table Storage I have a blog on Working with Azure Table Storage - The Basics and also details on how to Build a cloud scale PHP session store with Azure Table Storage and Optimising for performance with Azure Table Storage. Despite everything though, there are some limitations to how you can query Table Storage as it's not your typical relational database, it's a key-value store.


Querying Table Storage

There are a defined set of Comparison Operators that you can use when querying Table Storage and they are:


Equal - eq
GreaterThan - gt
GreaterThanOrEqual - ge
LessThan - lt
LessThanOrEqual - le
NotEqual - ne

So for a quick example when we're quering reports out for Report URI if we wanted all reports sent by Chrome and logged against my blog, I'd use the following filter:


hostname eq 'scotthelme.co.uk' and browser eq 'chrome'

Simple and easy, that'd return all of the reports that I wanted. You simply pass it the filter to apply to your data and it returns any entities (what SQL people call rows) that match that filter.


More powerful queries

Report URI also allows users to query against the path and you can specify an exact path to look for reports on a specific page:


path eq '/how-we-turned-average-joe-into-psychic-joe/'

That would show me all reports sent from the page of the last blog post I published and does the job if you have a fairly simple structure to your site. A lot of people don't have a simple structure to their URLs and one of our enterprise customers needed slightly more powerful queries. Their URLs looked a little like this:


/account/{$userId}

This represented a problem because they couldn't look at the specific URL for every user to see if they had any CSP reports and they were a high volume site in terms of the number of reports so sifting through reports manually wasn't an option. What they really needed to do in the query was this:


/account/*

That doesn't fit exactly into one of the comparison operators listed above in Table Storage so we had to adapt our Filter Factory (that's the class that generates all of our filter strings for us) to produce an appropriate filter string.


path ge '/account/' and path lt '/account0'

This filter now does exactly what we need it to do using the ge and lt operators in tandem. The ge comparison will find anything that is greater than /account/ but that would include a lot of things that we don't want like /whatever and so on. The lt operator ensures that doesn't happen by incrementing the last character in the query, in this case if we "ASCII increment" / we get 0, and using that as a cap for the top end of the filter. Simple! This is the equivalent to doing the following in SQL:


SELECT * FROM reports WHERE path LIKE '/account/%'

We now had exactly what was needed and could release an update to allow the use of wildcards in the path component. Of course, we all know what comes next don't we!


Wildcards in the hostname

Searching on wildcards on the path was great and worked exactly as inteneded so the customer asked if we could also enable the same functionality on the hostname field too. The problem was, it's not so easy for hostnames. Think of how you might want to run a query like that on a hostname:


SELECT * FROM reports WHERE hostname LIKE '%.scotthelme.co.uk'

There is no way that we could see to replicate that functionality using the comparison operators listed above that Table Storage allows us to use. I spent some good time thinking about a way we could do this and ultimately arrived at the conclusion that we couldn't, the wildcard component had to be the rightmost component of the string we were searching with, it couldn't be on the left. Then it hit me, just store the hostname string in reverse! Now we could easily do exactly the query we needed to do:


hostname ge 'ku.oc.emlehttocs.' and hostname lt 'ku.oc.emlehttocs/'

Similar to the path query above we use an "ASCII increment" on the last character of the query string to allow the ge and lt comparison operators to function exactly as we need them and the query works perfectly.


More to come

In the end it turned out to be pretty easy to take Table Storage and use it in a way that at first didn't seem like it was possible. With the tiniest of changes to how we store reports we were able to add some significant features to Report URI and make CSP reporting even more useful with powerful filtering. As we release new features throughout 2018 I will write more about how we use Table Storage and some of the optimisations we've made along the way.