Drilling for 21st Century Oil: Creating a Data Pipeline in AWS

Data has been called the oil of the 21st century by many. An interesting topic of research is therefore the ingestion of new information, or, to stay with the metaphor, to keep drilling for more oil. We call this a data pipeline: we take data from a source and transform it to our liking in a data storage environment of our choice, ideally keeping it as raw as possible so we do not lose any data we might want later!

Data pipeline

Enhancing your private data with public data can improve the quality and accuracy of your machine learning models. If you want to add such information in a live environment, it is inevitable that you collect, organize and structure the data flow. Historical weather data, for example, could be a strong predictor for ice cream sales; if we aim to predict sales based on weather, we need an accurate weather forecast.

In this post we will briefly show how a weather API can be used to retrieve a 7-day weather forecast. We will walk you through the following steps:

  • How to use a specific API to request forecasted weather data.
  • How to store the requested data in a cloud environment.
  • How to schedule the process so the forecasted data stays up to date.

Some design choices we made during implementation, which will be essential if you would like to follow along with the code, are:
- An AWS account: our preferred cloud solutions provider, which will handle the scheduling and storage of the data.
- A DarkSky developer account, which provides us with a freemium API for weather predictions.
These choices are based on personal preference, and setting both up should not take long.

Note: everything in this tutorial should keep you within the free tier of these services. We also have no affiliation with either of these companies!

Weather data

We like DarkSky because they have good API documentation, their free tier supplies plenty of requests (1,000 per day) and they provide hourly predictions. Furthermore, we can gather meteorological parameters that include the values we are after (visibility, predicted rain, temperature and wind speed).

Details of weather data

In order to request data through this API, we need an API Secret Key combined with a location (latitude and longitude) for which we wish to make weather predictions. An API key can be requested by creating a developer account (one minute of work!) here. The GET request will look as follows:
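As a sketch, the request URL combines the key and coordinates like this (the key below is a made-up placeholder; the coordinates match the location used later in this post):

```python
# Build the Dark Sky forecast URL from an API key and a location.
api_key = "0123456789abcdef"  # hypothetical placeholder -- use your own secret key
latitude, longitude = "52.11", "5.18056"

url = "https://api.darksky.net/forecast/{}/{},{}?extend=hourly&units=si".format(
    api_key, latitude, longitude)
# A GET request to this URL returns the forecast as JSON.
```

The `extend=hourly` parameter requests hour-by-hour data and `units=si` returns metric units.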


In this tutorial we only predict the weather for a single location in the centre of the Netherlands, but this API could easily be used to predict the weather for multiple locations (e.g. for a set of big cities).
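Extending to several locations is a matter of looping over a set of coordinates. A minimal sketch, with approximate (illustrative) coordinates and a placeholder key:

```python
# Approximate coordinates for a few Dutch cities (illustrative values).
locations = {
    "amsterdam": ("52.37", "4.90"),
    "rotterdam": ("51.92", "4.48"),
    "utrecht":   ("52.09", "5.12"),
}

api_key = "0123456789abcdef"  # hypothetical placeholder key

# One forecast URL per city; each can be requested and stored separately.
urls = {
    city: "https://api.darksky.net/forecast/{}/{},{}?extend=hourly&units=si".format(
        api_key, lat, lon)
    for city, (lat, lon) in locations.items()
}
```

With 1,000 free requests per day, even a daily pull for hundreds of cities stays comfortably within the free tier.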

In order to schedule such a process, for example to pull in a 7-day forecast every morning, there are various options nowadays. One of them is an online scheduler from a cloud solutions provider like AWS. The key benefit of an online scheduler is that the provider is responsible for running your code: a laptop with an empty battery will simply not happen. We like AWS simply because we have the most experience with it.
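As a sketch, CloudWatch accepts cron-style schedule expressions; a rule firing every morning could use the following (06:00 UTC is an arbitrary choice here):

```
cron(0 6 * * ? *)
```

CloudWatch cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), and one of day-of-month or day-of-week must be `?`.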

From AWS we will use the following services for this task:

  • AWS Lambda: runs our code, requesting the data and writing it to storage.
  • AWS S3: stores our data.
  • AWS CloudWatch: schedules and monitors when the Lambda function should be triggered.

The code for the Lambda function is quite brief and can be found below. We set some basic variables, like the location we want to predict the weather for and our secret API key from the Dark Sky website. A new Lambda function can be created by going to Services/Lambda in the AWS Management Console. Remember that you will need to sign up for an AWS account to do so.

The Lambda handler function contains our main code. It sends a GET request to pull the forecasted data and checks whether the request returned a proper status code (200). If so, it encodes the response as a string.

Once the Lambda function has loaded the data, we write it to our S3 data storage location (the data bucket) and into our desired folder. For us this folder is raw/weather.

import time
import boto3
from botocore.vendored import requests

api_key = '{secret_api_key}'

# Information for 'De Bilt', the Dutch centre of meteorology.
latitude = '52.11'
longitude = '5.18056'
api_string = 'https://api.darksky.net/forecast/{}/{},{}?extend=hourly&units=si'.format(api_key, latitude, longitude)

def lambda_handler(event, context):
    # Request the 7-day hourly forecast from the Dark Sky API.
    response = requests.get(api_string)
    print("status code:", response.status_code)
    if response.status_code == 200:
        # Encode the raw JSON response and write it to S3 with a timestamped key.
        encoded_string = response.text.encode("utf-8")
        cur_timestamp = int(time.time())
        bucket_name = "{bucket_name}"
        file_name = "{}-7d_hourly_forecast.json".format(cur_timestamp)
        folder_name = "raw/weather"
        s3_path = "{}/{}".format(folder_name, file_name)
        print("writing as:", s3_path)
        s3 = boto3.resource("s3")
        s3.Bucket(bucket_name).put_object(Key=s3_path, Body=encoded_string)
    else:
        print("Bad Request. Check your API key / API request")

In order to make this executable, we need an S3 bucket. One can be created by going to Services/S3 in the console; the default settings will suffice.

In addition, we need to set permissions for the Lambda function so that it can write to S3 and be triggered by CloudWatch. If your Lambda function does not have the correct rights, it will not be able to write to S3.
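A minimal sketch of what the Lambda execution role's policy could look like; the bucket placeholder matches the one in the code above, and the logging statement is the standard permission Lambda needs to write its own logs:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::{bucket_name}/raw/weather/*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

The CloudWatch Events rule that triggers the function needs a corresponding permission on the Lambda side, which the console adds automatically when you attach the rule as a trigger.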
