PostgreSQL is and remains one of our most popular and growing storage platforms; other storage technologies come and go but modern Postgres is a solid choice for so many applications. When you spin up your first Aiven PostgreSQL, you'll want to take some time to play with the features ... but there's a problem. Your new shiny database is empty.
Finding and using some open datasets is a great way to fill this gap, and one option is go try the Kaggle platform. It's a place to find open data, advice about data science, and some competitions you can participate in to hone your skills. There's quite a selection of datasets to choose from, but today we'll be using the exoplanets data from the Kepler mission. You'll need a (free) account to log in and download the data. Go ahead and extract the zip file too, we'll be using
cumulative.csv for the example in this post.
Here's the process in brief:
- Get started with Aiven
- Create PostgreSQL service
- Adding CSV data to PostgreSQL
- Dreaming of Exoplanets
Get started with Aiven
If you are not already an Aiven user, you can sign up for an Aiven account to follow the steps in this post - we will wait for you!
We will also be using the Aiven CLI. This tool requires Python 3.6 or later, and can be installed from PyPI:
pip install aiven-client
You will also need to authenticate your Aiven account against the CLI tool. Replace your own details in the command below:
avn user login <email@example.com>
You have everything you need to create an Aiven database in the cloud.
Create PostgreSQL service
A good first step for a new project is to create a project to keep the services in. All it needs is a name:
avn project create exoplanets
Aiven offers many options when creating services but to get us going quickly, we'll use the newest postgres available and the smallest package, called hobbyist. One of the most fun things though is being able to choose any cloud platform you like so take a moment to check the list and copy the
CLOUD_NAME field of your favorite:
avn cloud list
google-europe-west1 for my example, but you can replace that with the cloud you chose. Here is the command to run to create the postgres database:
avn service create -t pg -p hobbyist --cloud google-europe-west1 pg-exoplanets
It takes a few minutes for the node to be ready, but the Aiven CLI has a handy "wait" command that doesn't return until the service is ready to talk to us. This is less critical when we're running the commands by hand as we do here, but it's super useful when your CI system is spinning up the data platforms by itself!
avn service wait
When the command returns, our PostgreSQL cluster is ready to use. Let's create a database to hold the sample data; the command below creates one named "exoplanets":
avn service database-create --dbname exoplanets pg-exoplanets
Now we have our own sad and empty database, let's look at the sample data and get it imported.
Adding CSV data to PostgreSQL
PostgreSQL has built-in support for importing CSV data into an existing table, but we don't have the table structure, just a CSV. Luckily there's a tool for that - ddlgenerator is another Python commandline tool.
Here's how to install the
ddlgenerator tool and then generate the
CREATE TABLE statement from the CSV we downloaded earlier:
pip install ddlgenerator ddlgenerator postgres cumulative.csv > create.sql
Have a look inside the file and you will see that we have the structure we need to explain to PostgreSQL how to hold the data. The
avn service cli command will give us a
psql prompt on the new database:
avn service cli pg-exoplanets
psql we can connect to the database we created, and then run the SQL file to create the table structure:
\c exoplanets \i create.sql
Adding the final piece to the puzzle, and still from the
psql prompt, the next command brings in the CSV data:
\copy cumulative from data/cumulative.csv csv header
Nice work! The
cumulative table should now have some data for you to play with!
Dreaming of Exoplanets
Now you have a database full of measurements of exoplanets taken by the Kepler Space Telescope. If you're not already familiar with the project, the NASA mission page is worth a read. The mission went into a second phase when one of the controls failed, which serves to remind us that engineering systems we can see and touch, or at least ssh into, is much easier gig than operating in space!
You can explore the dataset, which describes observations and compares the Kepler assessment of each exoplanet with its official status in the pre-existing literature. For example, try this to see the false-positives identified by Kepler:
select kepler_name, koi_pdisposition from cumulative where koi_disposition = 'CONFIRMED' and koi_pdisposition = 'FALSE POSITIVE';
You can also connect this data to other tools to use the dataset further. Either grab the connection details from the web console, or use jq with
avn for a one-liner:
avn service get pg-exoplanets --json | jq ".service_uri"
Good cloud experimentation practice suggests that if you've finished with your exoplanets database, you can delete it:
avn service terminate pg-exoplanets
For even more fun and learning, how about one of these resources:
- Kaggle Open Datasets in case you don't fancy exoplanets, there are some excellent alternatives here
- In our documentation you can find instructions for migrating your existing PostgreSQL to Aiven
- More about the Aiven CLI,
- Need more PostgreSQL? Check out our introduction to PostgreSQL post
- Go on to connect to your Aiven for PostgreSQL service
Not using Aiven services yet? Sign up now for your free trial at https://console.aiven.io/signup!