At Wehkamp we use Redis a lot. It is fast, highly available, and runs as a managed AWS service called ElastiCache. Sometimes we need to extract data from Redis, and I usually use redis-cli to interact with it from the command line. But what if you need to get the values of 400k+ keys? What would you do? Is there an effective way to query multiple keys/values from Redis?
- Intro
- Only redis-cli+bash is sloooooow
- Python-scripting to the rescue
- Introducing: redis-mass-get cli
- Conclusion
- Improvements
Only redis-cli+bash is sloooooow
When you use redis-cli and bash, your script might look a bit like this:
URL=redis://my.redis.url
echo "KEYS product:*" | \
redis-cli -u $URL | \
sed 's/^/GET /' | \
redis-cli -u $URL >> test.txt
For small queries this works like a charm. Note that only the values are written to the file, not their keys, which - depending on your use case - might be okay. The biggest problem I have with this approach is not that it is slow, but that it does not show any progress. I have no clue when the process will finish!
Python-scripting to the rescue
Let's write a small Python-script that uses KEYS and MGET to write the key and value to a file while showing the progress.
First, we need to install the Python Redis client:
pip install redis
Our script will do the following:
- Get all the keys that match the query.
- Partition the list of keys into chunks of 10,000 items.
- Perform an MGET per chunk of 10,000 items.
- Zip the keys and values and write them line by line to a file.
- Show the progress.
When we turn it into a script, it looks like this:
#!/usr/bin/env python3
import redis

file = 'result.txt'
url = 'redis://my.redis.url'
query = 'product:*'

print('Reading keys... ', end='')
client = redis.StrictRedis.from_url(url, decode_responses=True)
keys = client.keys(query)
print(f'{len(keys):,} keys found.')

def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

partitions = list(chunks(keys, 10000))

with open(file, 'w', newline='\n', encoding='utf-8') as f:
    for i, partition in enumerate(partitions):
        progress = ((i + 1) / len(partitions)) * 100
        print(f'\rProcessing values... {progress:.2f}%', end='')
        values = client.mget(partition)
        for key, value in zip(partition, values):
            f.write(key)
            f.write('\n')
            f.write(value)
            f.write('\n')

print('\nDone!')
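The chunks generator is what keeps each MGET call bounded. It can be sanity-checked on its own with a small list (the sizes below are illustrative only):

```python
def chunks(lst, n):
    """Yield successive n-sized slices of lst; the last slice may be shorter."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# 25 fake keys, partitioned into chunks of 10
partitions = list(chunks(list(range(25)), 10))
print([len(p) for p in partitions])  # → [10, 10, 5]
```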
The script shows the following progress:
Reading keys... 1,069,715 keys found.
Processing values... 55.14%
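That overwriting progress line is nothing more than a carriage return plus an f-string format spec; in isolation it works like this (the numbers are made up):

```python
done, total = 5514, 10000
progress = (done / total) * 100
line = f'\rProcessing values... {progress:.2f}%'
print(line, end='')  # \r moves the cursor to the start, so the line is redrawn in place
```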
But we can do one better...
Introducing: redis-mass-get cli
The script above generates two lines per key/value. This might not suit your use case, especially when the value contains newlines as well. Sometimes CSV or JSON output is better. Based on this Python script I've created the Python redis-mass-get CLI, which can be installed with:
pip install redis-mass-get
JSON format
It can even parse the JSON value (-jd) before writing its output to a file, all while showing the progress:
redis-mass-get -d results.json -jd redis://my.redis.url product:*
CSV format
Since working with Spark, I've seen CSV taking flight again. The CLI can also generate a CSV file with a key,value header:
redis-mass-get -d results.csv redis://my.redis.url product:*
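Such a file is easy to consume with Python's standard csv module. A minimal sketch, assuming the key,value header described above (the sample rows are made up):

```python
import csv
import io

# stand-in for a results.csv produced by the CLI (hypothetical rows)
sample = 'key,value\r\nproduct:1,"{""name"": ""shoe""}"\r\nproduct:2,"{""name"": ""sock""}"\r\n'

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row['key'], row['value'])
```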
Pipeline CLI commands
Sometimes you want to pipe the output to another program. This example shows how to pipe the key/values in the CSV format, ignoring the CSV header (-och):
redis-mass-get -f csv -och redis://my.redis.url product:* | less
When no destination is specified, the data is written to stdout.
Conclusion
Querying multiple keys/values from Redis is easy using KEYS and MGET. If you need to write keys/values to a JSON, TXT or CSV file, just use the Python redis-mass-get CLI. Quick and easy.
Improvements
2020-08-17: added the Pipeline CLI commands section
2020-08-16: Initial article