Reading 400k+ key/values from Redis fast

Lights on the highway

At Wehkamp we use Redis a lot. It is fast, highly available, and runs as a managed AWS service called ElastiCache. Sometimes we need to extract data from Redis, and usually I use the redis-cli to interact with it from the command line. But what if you need to get the values of 400k+ keys? What would you do? Is there an effective way to query multiple key/values from Redis?


Only redis-cli+bash is sloooooow

When you use redis-cli and bash, your script might look a bit like this:

URL=redis://my.redis.url
echo "KEYS product:*" | \
redis-cli -u $URL | \
sed 's/^/GET /' | \
redis-cli -u $URL >> test.txt

For small queries this works like a charm. Note that the keys themselves are not written to the file - only the values - which, depending on your use case, might be okay. The biggest problem I have with this approach is not that it is slow, but that it shows no progress at all. I have no clue when the process will finish!

Python scripting to the rescue

Let's write a small Python script that uses KEYS and MGET to write each key and value to a file while showing progress.

First, we need to install the Python Redis client:

pip install redis

Our script will do the following:

1. Read all keys matching the query using KEYS.
2. Partition the keys into chunks of 10,000.
3. Fetch the values of each chunk with a single MGET call.
4. Write each key and value to the output file, printing the progress along the way.

When we turn it into a script, it looks like this:

#!/usr/bin/env python3

import redis

file='result.txt'
url='redis://my.redis.url'
query='product:*'

print('Reading keys... ', end='')
client = redis.StrictRedis.from_url(url, decode_responses=True)
keys = client.keys(query)
print(f'{len(keys):,} keys found.')

def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

partitions = list(chunks(keys, 10000))

with open(file, 'w', newline='\n', encoding='utf-8') as f:
    for i, partition in enumerate(partitions):

        progress = ((i + 1) / len(partitions)) * 100
        print(f'\rProcessing values... {progress:.2f}%', end='')

        values = client.mget(partition)
        for key, value in zip(partition, values):
            f.write(key)
            f.write('\n')
            f.write(value)
            f.write('\n')

print('\nDone!')

It shows the following progress:

Reading keys... 1,069,715 keys found.
Processing values... 55.14%
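The core of the script - chunked MGET plus the two-lines-per-pair write - can be exercised without a live Redis by swapping in a stub client. This is a hypothetical refactor for illustration only: the dump_pairs function and FakeRedis class are made-up names, not part of the script or the redis package.

```python
import io

def chunks(lst, n):
    # Same helper as in the script: yield n-sized slices of lst.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def dump_pairs(client, keys, f, chunk_size=10000):
    # Fetch values per chunk with MGET and write one key and one value
    # per line, exactly like the script above.
    for partition in chunks(keys, chunk_size):
        values = client.mget(partition)
        for key, value in zip(partition, values):
            f.write(key + '\n')
            f.write(value + '\n')

class FakeRedis:
    # Minimal stand-in that answers MGET from a dict.
    def __init__(self, data):
        self.data = data

    def mget(self, keys):
        return [self.data[k] for k in keys]

client = FakeRedis({'product:1': 'shoe', 'product:2': 'shirt'})
buf = io.StringIO()
dump_pairs(client, ['product:1', 'product:2'], buf, chunk_size=1)
print(buf.getvalue())
# product:1
# shoe
# product:2
# shirt
```

With a real redis-py client in place of FakeRedis, the same function writes the same two-line format to an actual file.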

But we can do you one better...

Introducing: redis-mass-get cli

The script above writes two lines per key/value pair. This might not suit your use case, especially when the value itself contains newlines. Sometimes CSV or JSON output is better. Based on this Python script I've created the Python redis-mass-get CLI, which can be installed with:

pip install redis-mass-get

JSON format

It can even parse the JSON value (-jd) before writing its output to a file, all while showing the progress:

redis-mass-get -d results.json -jd redis://my.redis.url product:*

CSV format

Since I started working with Spark, I've seen CSV take flight again. The CLI can also generate a CSV file with a key,value header:

redis-mass-get -d results.csv redis://my.redis.url product:*
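Once written, such a file can be consumed with Python's standard csv module. The sketch below assumes only the key,value header mentioned above; the sample rows are made up, and an in-memory string stands in for the real results.csv:

```python
import csv
import io

# Simulated file contents with the key,value header the CLI writes;
# the product rows are made-up examples.
sample = 'key,value\nproduct:1,shoe\nproduct:2,shirt\n'

# In a real script this would be open('results.csv', newline=''):
with io.StringIO(sample) as f:
    rows = {row['key']: row['value'] for row in csv.DictReader(f)}

print(rows)
# {'product:1': 'shoe', 'product:2': 'shirt'}
```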

Pipeline CLI commands

Sometimes you want to pipe the output to another program. This example shows how to pipe the key/values in the CSV format, ignoring the CSV header (-och):

redis-mass-get -f csv -och redis://my.redis.url product:* | less

When no destination is specified, the data is written to stdout.
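On the receiving end of such a pipe, a Python program could parse the header-less CSV rows. This is a minimal sketch, not part of the CLI; the parse_pairs name is made up, and a list of strings simulates what sys.stdin would deliver:

```python
import csv

def parse_pairs(lines):
    # -och removed the header, so every row is a key,value pair.
    return list(csv.reader(lines))

# In a real pipeline this would be parse_pairs(sys.stdin);
# here we simulate the piped input:
sample = ['product:1,shoe', 'product:2,shirt']
print(parse_pairs(sample))
# [['product:1', 'shoe'], ['product:2', 'shirt']]
```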

Conclusion

Querying multiple key/values from Redis is easy using KEYS and MGET. If you need to write key/values to a JSON, TXT or CSV file, just use the Python redis-mass-get CLI. Quick and easy.

Improvements

2020-08-17: added the Pipeline CLI commands section
2020-08-16: Initial article