Part IV: Parsing GDELT with Spark/Shark on EC2

This tutorial walks through how to query the GDELT dataset using Spark/Shark.

I highlighted a few of Spark’s advantages in a prior post, but two bear repeating here. First, since Spark holds your dataset in memory across a cluster, queries run up to 100x faster than they would in a more common engine like Hive. Second, since there’s no need to reload data with each iteration, machine learning and maximum likelihood estimation both run much faster in Spark. (I hope to show how to take advantage of this in a future tutorial.)

Before getting started I should probably also note that this tutorial was written for use on Mac OS X, but it should be fairly trivial to replicate on Linux. With a bit of work it should also be adaptable for use with Windows.

Also, although I try not to assume familiarity with the command line, there may be some things I’ve overlooked. If you’re new to working with the shell and run into any problems, let me know.

SET UP

This first step is optional, but it will make the rest of the tutorial easier to follow if you’re just starting out on the command line.

To start off, open Terminal (you’ll find it in Applications > Utilities) and enter the following at the $ prompt, one line at a time:

mkdir ~/workspace
mkdir ~/workspace/spark
mkdir ~/workspace/scala

If you type ls you should see ‘workspace’ listed. Likewise, if you type ls workspace, you should now see both ‘spark’ and ‘scala’ listed.

In case you’re curious, the logic here is that since Spark and Scala are under active development, a directory structure like this makes it easy to switch between versions as they’re released.

BUILD SPARK LOCALLY

Now that we’ve got a basic workspace set up, let’s download the latest stable version of Spark, and the version of Scala it depends on. At present, that means downloading Spark 0.8.0 and Scala 2.9.3.

Once you’ve got each downloaded, unpack them to their appropriate directories:

tar -xf ~/Downloads/spark-0.8.0-incubating.tgz -C ~/workspace/spark
tar -xf ~/Downloads/scala-2.9.3.tgz -C ~/workspace/scala

Now we need to tell Spark how to find Scala. You can use any text editor for this, but for the sake of consistency, I’m going to use vim here. (We’ll also use vim later on.)

To edit the file, run the following:

cd ~/workspace/spark/spark-0.8.0-incubating/conf/
vim spark-env.sh.template

The first line changes the shell to the appropriate directory. The second opens the file in vim. Type G to jump to the end of the file, then type o to open a new line and enter insert mode.

Then enter:

SCALA_HOME=~/workspace/scala/scala-2.9.3/

Now, hit esc and then :wq. That will save the file and return you to your shell session.

Once you’ve edited the template file, you need to rename it as just ‘spark-env.sh’. You can do this from the command line as follows:

mv spark-env.sh.template spark-env.sh

Finally, since the build process for Spark requires Git, you’ll also need to install Git if you haven’t already. You can install it very easily here.

At this point you should be ready to build Spark. To do that, run:

cd ~/workspace/spark/spark-0.8.0-incubating
sbt/sbt assembly

You should see a bunch of lines start whizzing by. Basically, the command is downloading a slew of libraries and dependencies and then compiling everything for you. Depending on your internet connection, it could take a little while, so you may want to go grab a cup of coffee.

Once the build process is done, you should be able to run Spark locally. Run:

cd ~/workspace/spark/spark-0.8.0-incubating
./spark-shell

If a Scala interpreter opens, you’ve got everything running. Type exit (or hit ctrl-D) to close the interpreter.

Alternately, if you prefer Python to Scala, you can run:

./pyspark

For more on using PySpark (especially how to use it with any of Python’s scientific computing libraries), see the official docs here.
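
To give you a feel for it, here’s a minimal sketch of a PySpark session. It assumes you’ve launched ./pyspark from the Spark directory (so the sc variable already exists), and the file name and tab-delimited format are just placeholders for whatever data you have lying around:

import numpy as np

# sc, the SparkContext, is created automatically by the pyspark shell
lines = sc.textFile("2000.csv")                    # hypothetical: any local text or TSV file
fields = lines.map(lambda l: len(l.split("\t")))   # count the tab-separated fields on each line
print(np.mean(fields.collect()))                   # pull the counts back and hand them to numpy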

CONFIGURING SPARK AND AWS

It’s nice to have Spark running locally, but to fully make use of it we want to get it going on a computer cluster.

There are several ways to do this, but using Amazon’s Web Services is probably the easiest. To set up a cluster with Amazon, you’ll need to do three things:

  1. Create an AWS account.
  2. Get your security credentials.
  3. Generate a keypair.

Each of those steps is fairly straightforward. The first step takes the longest, but you can speed it up by linking your AWS account to an existing Amazon account.

At the end of steps 2 and 3, your browser should download a .csv file and a .pem file, respectively.

Once the .pem file has downloaded, return to Terminal and run the following:

mkdir ~/.aws
mv ~/Downloads/YOURKEYNAME.pem ~/.aws/ 
chmod 400 ~/.aws/YOURKEYNAME.pem

The first command creates a ‘hidden’ folder to hold your AWS key, the second moves the key into it, and the third sets permissions on the key. Don’t skip that last step; later on you won’t be able to launch an EC2 instance unless your key has the proper permissions.

Once you’ve set permissions, you then need to tell your shell how to access each file. The variables you create here are what your local install of Spark uses under the hood to talk to Amazon’s servers. To do this, in your Terminal session, enter the following:

export AWS_KEY_FILE='/Users/YOURUSERNAME/.aws/YOURKEYNAME.pem'
export AWS_KEY_PAIR='YOURKEYNAME'

Note that in the second command, the ‘.pem’ should NOT appear in the key name. You just enter the name itself.

Finally, open the .csv file that you downloaded in step 2 above in a spreadsheet or text editor. Then go back to Terminal and, using the access key ID and secret access key listed in that file, enter the following:

export AWS_ACCESS_KEY_ID='YOURACCESSKEYID'
export AWS_SECRET_ACCESS_KEY='YOURSECRETACCESSKEY'

Again, those are variables that Spark will use later on to talk with your Amazon account. Note that if you’ll be running Spark or EC2 frequently, you’ll probably want to store all the AWS security variables more permanently, possibly in a .bash_profile or .bashrc file. That way you won’t have to enter them each time you want to spin up a cluster.

Finally, one option I would HIGHLY recommend before going any further is setting billing alerts on your AWS account. If you start a cluster and forget to stop or terminate it, you could be in for a nasty surprise at the end of the month. With an alert set, Amazon will email you any time your monthly bill rises above a specified threshold. To set an alert, see here.

CONNECTING TO AWS

Now that you’ve defined your AWS credentials for your environment, you can finally try to get a cluster up and running.

Before you begin though, you’ll want to decide what kind of EC2 instances to use in your cluster. The available instance types, and their prices, are listed here. Since Spark holds data in memory, the number to pay attention to is the memory per instance: make sure the total memory across your cluster is a comfortable multiple of the size of the data you’ll be working with.

For this tutorial, to keep the costs down, we’re only going to analyze a few GB of data, so a small cluster will do: one master and one slave, both m1.large instances (all told, it’ll cost less than $1/hr). To launch your cluster, run:

cd ~/workspace/spark/spark-0.8.0-incubating/ec2/
./spark-ec2 -k $AWS_KEY_PAIR -i $AWS_KEY_FILE -s 1 --instance-type m1.large launch YOURINSTANCENAME

You should see a bunch of status updates scroll by over a couple minutes. At the end, if everything works, you should see output like this:

Ganglia started at http://ec2-XXX-XX-XXX-XXX.compute-1.amazonaws.com:5080/ganglia
Done!

There are a few cases in which the launch script doesn’t work, most notably when there aren’t any instances available at a given moment. In that case, you’ll see a Boto message such as “The requested Availability Zone is currently constrained…” If that happens, either specify a different availability zone or just wait a bit and try again. I’ve never had to wait longer than a minute or so for instances to become available.

Once you’ve got a cluster up and running, you’ll want to log in to the master node. To do that, run:

./spark-ec2 -k $AWS_KEY_PAIR -i $AWS_KEY_FILE login YOURINSTANCENAME

Alternately, you can ssh in manually, using the same hostname that’s printed at the very end of the output when Spark finishes launching the cluster:

ssh -i $AWS_KEY_FILE root@ec2-XXX-XX-XXX-XXX.compute-1.amazonaws.com

Either way, that should put you through to your EC2 instance.

LOAD YOUR DATA

At this point we’ve got everything we need, except we don’t yet have data. There are a lot of different ways to handle this, but for this tutorial we’re going to store some data in S3. To do that, set up an S3 bucket; you can name it whatever you want. Then create one subdirectory named ‘tutorial_data’ and another named ‘tutorial_output’. You can read how to create an S3 bucket and subdirectories here.

Once you’ve set up your S3 bucket and subdirectories, you need to get the data from GDELT into the S3 bucket.

One way to do that is to download GDELT to your desktop/laptop, and then upload to S3. I don’t recommend that. Unless you work at Amazon it will take forever. Instead, download GDELT to your EC2 instance, and then upload to S3. That way, the upload trip happens entirely within Amazon’s internal network, which is much faster.

First we need to set up a way to send data to S3. Once you’re logged into your EC2 shell, run:

curl https://raw.github.com/timkay/aws/master/aws -o aws
chmod +x aws
export EC2_ACCESS_KEY='YOURACCESSKEYHERE'
export EC2_SECRET_KEY='YOURSECRETACCESSKEYHERE'
perl aws --install

Now that we can talk with S3, let’s download some GDELT data:

curl -O http://chrismeserole.com/code/gdelt_downloader.sh
sh gdelt_downloader.sh

That script downloads the files for 2000 and 2001. If you’d like to download different years, just edit gdelt_downloader.sh. For most years the changes are self-explanatory, but from 2006 on GDELT releases the files by month rather than by year, so you’ll need to account for that; check the GDELT downloads page to see what the URL pattern is.

Once the files have all downloaded, upload them to your bucket:

s3put YOURBUCKETNAME/tutorial_data/ 2000.csv
s3put YOURBUCKETNAME/tutorial_data/ 2001.csv

When the uploads are finished, you can check that the files are where they should be by running s3ls YOURBUCKETNAME/tutorial_data or visiting the S3 console in your browser.

CONFIGURING SPARK/SHARK AND S3

Now that we’ve got our data, we need to tell Shark how to access it. Hopefully in the future Spark will automatically pass your AWS keys to Shark, but for now we need to use the following work-around:

vim ephemeral-hdfs/conf/core-site.xml

That opens a config file in vim. Type G to jump to the end of the file, then move up so the cursor sits on the line just above </configuration>. Press o to open a new line, then paste the following into the file, using your own key and ID:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>XXXXXXXXXXXX</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>XXXXXXXXXXXX</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>XXXXXXXXXXXX</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>XXXXXXXXXXXX</value>
</property>

Press esc, then :wq. That will save the config file with the changes you made. Don’t worry if the tab-spacing gets distorted.

LAUNCHING SHARK

Now that everything is set up, you can finally run Shark. In your shell, enter:

export MASTER=`cat /root/spark-ec2/cluster-url`
./shark/bin/shark

The first command above specifies where the master node is located. If you leave it out, the ‘slave’ nodes can’t communicate with the master, and the master node will effectively run everything on its own. Be sure to include it.

The second command launches the Shark shell. By default, I *think* it allocates each machine’s total RAM minus 2GB, but if need be you can allocate memory manually.

At this point, the Shark prompt should be open. (Just a tip: Shark is still fairly immature and will often seem to hang. Sometimes you may need to hit enter to make the shell interactive again.)

Also, if you need to set the number of mapreduce tasks, you can use something like:

set mapred.reduce.tasks=10;

QUERYING SHARK

Now that Shark is working, let’s import our data from S3.

First we need to create a table to house the data. Fortunately the syntax for Shark is effectively identical to Hive, which in turn is very similar to basic SQL syntax. To set up a table for GDELT, just run:

CREATE EXTERNAL TABLE IF NOT EXISTS gdelt (
 GLOBALEVENTID BIGINT,
 SQLDATE INT,
 MonthYear INT,
 Year INT,
 FractionDate DOUBLE,
 Actor1Code STRING,
 Actor1Name STRING,
 Actor1CountryCode STRING,
 Actor1KnownGroupCode STRING,
 Actor1EthnicCode STRING,
 Actor1Religion1Code STRING,
 Actor1Religion2Code STRING,
 Actor1Type1Code STRING,
 Actor1Type2Code STRING,
 Actor1Type3Code STRING,
 Actor2Code STRING,
 Actor2Name STRING,
 Actor2CountryCode STRING,
 Actor2KnownGroupCode STRING,
 Actor2EthnicCode STRING,
 Actor2Religion1Code STRING,
 Actor2Religion2Code STRING,
 Actor2Type1Code STRING,
 Actor2Type2Code STRING,
 Actor2Type3Code STRING,
 IsRootEvent INT,
 EventCode STRING,
 EventBaseCode STRING,
 EventRootCode STRING,
 QuadClass INT,
 GoldsteinScale DOUBLE,
 NumMentions INT,
 NumSources INT,
 NumArticles INT,
 AvgTone DOUBLE,
 Actor1Geo_Type INT,
 Actor1Geo_FullName STRING,
 Actor1Geo_CountryCode STRING,
 Actor1Geo_ADM1Code STRING,
 Actor1Geo_Lat FLOAT,
 Actor1Geo_Long FLOAT,
 Actor1Geo_FeatureID INT,
 Actor2Geo_Type INT,
 Actor2Geo_FullName STRING,
 Actor2Geo_CountryCode STRING,
 Actor2Geo_ADM1Code STRING,
 Actor2Geo_Lat FLOAT,
 Actor2Geo_Long FLOAT,
 Actor2Geo_FeatureID INT,
 ActionGeo_Type INT,
 ActionGeo_FullName STRING,
 ActionGeo_CountryCode STRING,
 ActionGeo_ADM1Code STRING,
 ActionGeo_Lat FLOAT,
 ActionGeo_Long FLOAT,
 ActionGeo_FeatureID INT,
 DATEADDED INT,
 SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3n://YOURBUCKETNAME/tutorial_data/' ;

Now we’ve got a table defined over our data in S3. But that data is still sitting on disk, and as I mentioned above, what we want is for the data to be stored in RAM.

For that, all you have to do is run:

CREATE TABLE gdelt_cached as SELECT * FROM gdelt;

All the magic happens in that ‘_cached’ suffix! Or put differently, it’s the whole point of this tutorial. Any table with that suffix appended will automatically be stored in memory rather than to disk, which makes it much faster to query.

Once Shark finishes building the cache – in this case, it’ll probably take seven or eight minutes – try the following:

SELECT count(*) from gdelt_cached;

That should produce output like this:

shark> SELECT count(*) FROM gdelt_cached;
OK
9536449
Time taken: 1.86 seconds

Not too bad.

EXPORTING SUBSETS

Oftentimes you’ll work with your data and want to export some portion of it. I don’t generally think it’s best practice to subset with Shark (Hive or Impala is a lot more cost-effective), but there are times you might need to.

Shark currently doesn’t have the ability to export to S3 directly, so instead let’s grab all the protest events in our data and store them locally on your cluster:

INSERT OVERWRITE LOCAL DIRECTORY '/home/tmp/protestdata' select * from gdelt_cached WHERE EventRootCode=='14';

Now type exit; to quit Shark and return to the EC2 shell. Then run:

cd /home/tmp/protestdata/
s3put YOURBUCKETNAME/tutorial_output/ 000000_0
s3put YOURBUCKETNAME/tutorial_output/ 000001_0

END EC2 SESSION

Once your subset is exported, you’ll definitely want to pause or terminate the EC2 cluster when you’re done, so that you don’t have to pay an exorbitant bill.

If you think you’ll use the cluster again and want your data to persist, type exit in the EC2 shell and then run:

./spark-ec2 -k $AWS_KEY_PAIR -i $AWS_KEY_FILE stop YOURINSTANCENAME 

If you do this, you won’t be charged for EC2, but you will be charged for any data in an attached EBS volume.

If you’re not going to be using this again for a while, you may want to terminate the cluster altogether. In that case, run:

./spark-ec2 -k $AWS_KEY_PAIR -i $AWS_KEY_FILE destroy YOURINSTANCENAME 

And that’s it. If you created a subset and exported it to S3, you can download it to your own machine via the AWS command-line tools, or just grab it from the AWS web console.

Part III: Big Data in the Cloud

As I mention in the previous post, there are a few ways you can speed up analysis of big data on your desktop or laptop.

However, the fastest/most efficient solution is often to just shift data storage and processing to a networked computer cluster. Generally speaking, this solves the I/O problem by splitting the processing across multiple computers rather than one.

Fortunately there are now several options available for large-scale data storage and analysis, with varying degrees of cost-effectiveness. Most solutions use Hadoop for data storage and then one of a handful of open-source engines for analysis: Hive, Impala, Shark, or Presto.

With the exception of Presto, which was recently released by Facebook and is designed more for petabyte-scale analysis, I profile each of those options below.

Hadoop + Hive

At this point, probably the most straightforward way to analyze your data in the cloud is to use Hadoop to store your data and then Hive to query it.

Fortunately, companies such as Amazon have now abstracted away most of the pain points involved with setting up your own Hadoop cluster, so getting up and running with Hadoop is relatively painless. If you’re looking for a way to try out Hadoop and Hive, I highly recommend this excellent tutorial by John Beieler.

Hadoop + Impala + Parquet

Although Hadoop and Hive are great, for even a 50GB dataset it can still take a few minutes for queries to finish.

However, since exploring data is generally an iterative process, getting the query time down as low as possible makes a huge difference.

At present probably the most cost-effective way to reduce the query time is to use a combination of Impala and Parquet. The former is an improved query engine developed by Cloudera. The latter is a novel storage format developed by both Cloudera and Twitter that stores data in columns rather than rows, and as a result allows queries to be run as vectorized operations. (There’s a lot of other fancy stuff it does too, but the gist is that Parquet is a storage format that makes data retrieval/processing really, really efficient.)

All told, the performance gains are significant: using Impala and Parquet, Travis Pinney was able to get query times down to a few seconds or less.

To try Impala and Parquet yourself, either follow Travis’s instructions on Github here or a more fleshed-out tutorial by Pierre-Yves Taunay on the GDELT blog.

Spark/Shark

Another way to reduce the query time dramatically (for GDELT, also down to a few seconds) is to use Hadoop coupled with either Spark or Shark.

Developed at Berkeley’s AMPLab, Spark/Shark loads your dataset into memory across an entire cluster. In a way, it’s very similar to Hive (Shark even uses identical syntax), except that instead of having each machine query the portion of the data on its hard drive, each machine queries the data in its RAM.

The one catch is that RAM is comparatively expensive, so if all you’re interested in are basic counts, subsetting, and the like, Spark is not cost-effective compared with Impala and Parquet, much less Hive. It can cost up to 10x more for the same performance.

However, the real value of Spark isn’t just its basic query speeds. Instead, it’s that because the entire dataset resides in memory, you can, for instance, run a k-means algorithm or fit a hierarchical model without constantly shuffling data in and out of memory. Further, since Spark has a Python API, you can use any of Python’s scientific computing libraries to analyze the in-memory datastore as well.
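
To make that concrete, here’s a rough PySpark sketch (decidedly not MLlib’s implementation) of the kind of iterative loop the in-memory model speeds up: a toy k-means over two numeric columns. The file name, column positions, and parameters are all placeholders:

import numpy as np

def parse(line):
    fields = line.split("\t")
    return np.array([float(fields[0]), float(fields[1])])   # two hypothetical numeric columns

# cache() is the key step: the parsed points stay in cluster memory,
# so every pass in the loop below reads from RAM rather than re-scanning the file
points = sc.textFile("coords.tsv").map(parse).cache()

k = 5
centers = points.take(k)                                    # use the first k points as initial centers
for step in range(10):
    # assign each point to its nearest center, then average each cluster to get new centers
    assigned = points.map(lambda p: (int(np.argmin([np.sum((p - c) ** 2) for c in centers])),
                                     (p, 1)))
    totals = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
    centers = [s / n for (idx, (s, n)) in sorted(totals)]

print(centers)

Without the cache() call, each of those ten iterations would have to re-read and re-parse the raw file, which is exactly the shuffling in and out of memory that Spark lets you avoid.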

If you want to try out Spark/Shark, see the next post or the Shark documentation. For more about the project itself, this Wired article is a good place to start.

Best Practices

At this point, for basic data exploration, using Impala and Parquet is probably the most cost-effective way to query or subset large datasets.

However, once you have a small dataset that you want to query quickly or run more sophisticated operations on, it may be worth spinning up a small cluster and switching to Spark/Shark.

Accordingly, the next post is a tutorial for getting Shark up and running with the popular GDELT dataset.

Big Data and Social Science: Introduction

Big datasets are increasingly common in social science today, and understandably generating a lot of excitement.

However, for anyone just beginning to work with such data, the task of merely managing large datasets – let alone analyzing them – can be daunting.

Since I’ve been working a lot lately with relatively big datasets, I thought I’d offer up a series of posts here on how to go about dealing with them. Although I’m not a computer scientist, hopefully anyone just starting out will find the overview useful.

The series runs as follows:

  Part I: The I/O Problem, Or Why Big Data Takes Forever to Process
  Part II: Big Data on the Desktop
  Part III: Big Data in the Cloud
  Part IV: Parsing GDELT with Spark/Shark on EC2

If you have any corrections or suggestions for the series, please be in touch.

Part I: The I/O Problem, Or Why Big Data Takes Forever to Process

As I mentioned in the introduction to this series, large datasets are increasingly common in social science, but they’re also difficult to deal with efficiently.

In the next few posts I’ll look at various solutions to that problem. First though we need to figure out what’s causing it to begin with.

Big Data: What’s the Problem?

So: why exactly does it take so long to parse large datasets, even with modern computers?

The intuitive answer is that the processor must be the problem. Get a newer, snappier processor, and surely the query time would come down.

Yet it turns out the processor isn’t really the problem. Even a slow processor today runs at 1GHz, meaning it can run through 1 billion cycles per second. At even a handful of cycles per observation, a dataset with “only” 200 million observations should in theory take just a few seconds to subset.

Yet on a modern MacBook, a single pass through a dataset of that size can take 45 minutes rather than 4 to 5 seconds. So what’s going on?

In technical terms, parsing large datasets is I/O-bound rather than CPU-bound. Or in plain English: the bottleneck isn’t in the processor, it’s in getting the data to and from the processor.

For most of your computer’s billion-plus cycles each second, the processor is just twiddling its thumbs while waiting for the data to come in. Too much data has too far to travel (and on too few potential routes) to take advantage of how fast the processor is.

Hard Drives and … Kitchens?

By way of (a very rough) analogy, to understand the problem here think of how you get food in a buffet restaurant. In general there are three places where such a restaurant “stores” food: your plate, the buffet table, and the back kitchen.

Similarly, in a computer there are three places that data is “stored.” The first is a small set of caches on the processor itself. If the data the processor needs is already in the processor’s cache, it can just take that data and immediately use it – just like when the food you want is already on your plate, you can just go ahead and eat it. The only catch is that the speed comes with a cost: you can’t store much data in your processor’s cache, just as you can’t store much food on a single plate.

Meanwhile, the second place the processor searches is “main memory”, or RAM. In this case, data in RAM is analogous to food on a buffet table. Getting to it is still pretty fast, but it’s at least an order of magnitude slower than if the data was already in the processor’s cache. Moreover, this level of storage is also constrained by size; you can fit a lot more data in memory than in the cache, but not as much data as you may need.

Finally, the third place your computer stores data is your hard drive. As with a restaurant kitchen, the advantage of this layer is that you can store a lot of data in it. Once again though, the catch is that it takes a lot more time to access. Think about what happens when you want some salad but there’s none on your plate or the buffet table. In that case, someone has to go back into the kitchen, open the freezer, figure out where the salad is, and then grab it. The end result is that it’s going to take several orders of magnitude longer to get your food. The same is true for your hard drive. Although you can speed things up somewhat by using an SSD, getting your data to and from the hard drive is still going to be orders of magnitude slower than getting the same data to and from the CPU cache or RAM.

GOING FORWARD

To sum up: the reason it takes so long to query big datasets is that shuffling data to and from your processor takes a while, particularly if the data is all the way over on the hard drive.

Obviously there’s a bit more to the problem than that, but the gist is that the sheer size of big data compounds whatever I/O constraints your computer may have. Fortunately there are a few things you can do to at least cut down on the query times, which we’ll begin to look at in the next post.

Part II: Big Data on the Desktop

As the previous post in this series notes, large datasets take a while to process not because processors are too slow, but because getting data from the hard drive to the processor is slow.

Ultimately the best way to deal with such data will probably be to shift your processing to a computer cluster, but if that’s not an option, there are still a few things you can do to cut down query times dramatically.

This post profiles two in particular: adding RAM, and parallelizing your code.

ADDING RAM

As the prior post mentions, there are three places a computer “stores” data: the CPU cache, RAM (also known as memory), and the hard drive. Each holds orders of magnitude more data than the one before it, but is also orders of magnitude slower to access – meaning the hard drive can hold way more data than the cache or RAM, but also takes way longer to reach.

One obvious way to speed things up then is to just increase the amount of RAM on your computer. The more RAM you have, the more data you can store in memory as opposed to the hard drive, and the faster you can process it.

Fortunately upgrading RAM is pretty straightforward, with the notable exception of unibody laptops. Most modern computers can also handle between 8GB and 32GB of total RAM, so if your data is big but not that big, buying more RAM is probably the way to go.

That said, there is one caveat to all this: datasets can “blow up” by several multiples when you store them in memory, meaning a 5GB dataset can easily take up more than 10GB of RAM. Add in the fact that the operating system and other programs take up a significant amount of RAM as well, and you’ll quickly hit limits on how much data you can actually hold in memory on your personal computer.
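
If you’re curious how big the blow-up is for your own data, here’s a rough way to check a single file with pandas. The file name is a placeholder, and I’m assuming a tab-delimited file with no header row, which is how GDELT’s raw files come:

import os
import pandas as pd

path = "2000.csv"                                  # hypothetical: any large delimited file you have locally

disk_mb = os.path.getsize(path) / 1e6              # size on disk
df = pd.read_csv(path, sep="\t", header=None, low_memory=False)
ram_mb = df.memory_usage(deep=True).sum() / 1e6    # size once loaded into memory

print("on disk: %.0f MB, in RAM: %.0f MB" % (disk_mb, ram_mb))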

PARALLELIZE WHERE POSSIBLE

If upgrading RAM isn’t an option, another thing you can do is to ‘parallelize’ your queries, or spread them across all the cores of your processor. In the restaurant analogy I used in the last post, it’s (very loosely) akin to having multiple waiters go and get your food rather than just one. As a result it can dramatically reduce query times on a local computer.

To take advantage of all your computer’s cores, two things have to be true: first, the operations you’re running need to be ones that can actually be spread across multiple cores; second, the software you’re using needs to support parallelization.

For basic data exploration, parallelizing your operations is usually possible. In common languages like R or Python, it’s just a question of installing the right packages and structuring both your code and data appropriately. By way of example, see John Beieler’s tutorial on how to subset the popular GDELT dataset using Python’s joblib library.
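
As a rough illustration (this is just a sketch, not Beieler’s code), here’s what parallel subsetting with joblib might look like: filter the protest events out of a directory of GDELT files, one file per worker. The directory name is a placeholder, and the column index assumes GDELT’s tab-delimited layout, where EventRootCode is the 29th field:

import glob
from joblib import Parallel, delayed

def protest_events(path):
    # keep only rows whose EventRootCode (the 29th tab-separated field) is '14', i.e. protests
    keep = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 28 and fields[28] == "14":
                keep.append(line)
    return keep

files = glob.glob("gdelt_data/*.csv")              # hypothetical local copies of the GDELT files
# n_jobs=-1 uses every available core; each file gets filtered in its own worker
chunks = Parallel(n_jobs=-1)(delayed(protest_events)(f) for f in files)
protests = [row for chunk in chunks for row in chunk]
print(len(protests))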

Parallelizing your operations won’t get you down to second-level latency, but the improvements can still be impressive. On my desktop, for example, parallelizing alone cut the time it takes to subset GDELT in half, while adding RAM and parallelizing together cut the same operation’s time by over two-thirds.

BEST PRACTICES

If there’s a specific subset of your data that you’re interested in, you’ll probably want to make a slow initial cut through the data, and then explore the subset using Python or R.

By contrast, if you don’t know which subset you want, or the subset won’t fit in RAM, then it’s going to be tough to get the latency down to the point where you can quickly interact with and explore the data on your local machine alone.

In that case, to explore your data efficiently, you can either draw and explore a random subset of your data, or you can shift the processing to a cluster, which I discuss in the next post.