Where to Host Public Datasets?

April 26, 2021 #california-traffic-data #data-science

An overhead photo of a Department of Energy supercomputer showing the ethernet interconnects.

Cleaning a dataset is tough work. I spent weeks figuring out what all the columns of the California Statewide Integrated Traffic Records System (SWITRS) dataset meant and additional time writing scripts to parse and fix it. I wanted other people to be able to make use of the data without going through the same hassle, so I released the scripts.

It wasn’t enough. People still had to request the data, download it, and then run the scripts. Too much of a hurdle for most people. And worse, California no longer provided some of the oldest data. It was suddenly impossible for other people to reproduce my earlier work!

Luckily, I had saved all of the data. So I decided to host the dataset online to make it easy to start using right away.

But there was a problem: the dataset was so large that finding a site to host it was not easy. In the end I chose two places: Kaggle and Zenodo. In this post I’ll share the lessons I learned, the benefits of each site, and why I think using both is the right thing to do.

My Requirements

I had four requirements:

Free: The service had to be free for me, and free for the end users as well. Requiring someone to pay for access to an open source dataset that I had volunteered to curate is unfair and would certainly drastically reduce the number of people making use of it.
Easy: It had to be easy for me to set up. I wanted a service where I could get started without having to email someone for permission. I also wanted it to be easy for the end user to get the data so that the largest number of people could make use of it.
Discoverable: It was important to me that people could easily find the dataset. There are dozens of sites that host large files, but they aren’t places where people would go look for data. To help people find it I would have to set up a web page pointing to the download, which wasn’t something I wanted to maintain.
Permanent: Finally, there was no point in going through all this work if it was just going to disappear tomorrow. I wanted the data to be available for years and years.

AWS Open Data was one option I considered, but it looked like a lot of work to set up and it was unclear exactly how free it was. Further, getting the data wasn’t easy if you had never worked with S3 before. Ideally, I wanted a service that had a big button that said “Download this data!”

I also considered self-hosting on my Raspberry Pis, but quickly dismissed it. Availability would be terrible, download speeds would be even worse, and it would force me to perform a lot of maintenance to keep it running.

In the end I settled on two services:

Kaggle

Kaggle is a great place to host a dataset. It’s free, easy to use (with the exception that you need an account), and a well-known place to find datasets, which makes it discoverable. The one downside is that Google, which owns Kaggle, is infamous for killing services, so the data might not last.

Kaggle allows users to download the file or work with the data directly in Kaggle’s hosted notebooks. The author can even set up a demo notebook to demonstrate how to work with the data. Kaggle will also help you set up a DOI for your data. Mine is: 10.34740/kaggle/dsv/1671261

Kaggle supports really deep data documentation. You can write an introduction for each table and each column. Additionally, Kaggle will automatically generate histograms of each column and some summary statistics.

Kaggle is more than just hosting; it is a community. Other people can share their work, set up challenges using the data, and ask questions in the forum. This community makes it easy for people to find the data and get started working with it.

Zenodo

Zenodo, hosted by CERN, is a much simpler and smaller service than Kaggle. It does not have a community built up around it. It does not have attached cloud compute. It has almost no users.¹

So why use Zenodo? Simple: Google kills products left and right while CERN knows a little something about keeping websites online. I trust CERN’s stewardship of the dataset. I am far more confident that you will be able to download it from Zenodo in 10 years than from Kaggle.

Zenodo has great support for academic dataset usage. It allows you to use any valid DOI, so I was able to reuse the one from Kaggle, although it will also generate one for you if you wish. It will track citations to your dataset and provide links to the papers. It will even let you export the citation to BibTeX or generate a text citation on the website. Zenodo also lets you link your identity to your Open Researcher and Contributor ID (ORCID).

Unlike Kaggle, Zenodo makes downloading easy. You do not need an account; you just go to the page and click the download button. It also shows an MD5 hash of the file so you can verify your download is exactly the same as the file on the server.

Zenodo has a problem in addition to its low usage though: uploading a large dataset often fails. I originally tried to upload the uncompressed database, but that failed multiple times. After reaching out to their support (who were very responsive and helpful), I compressed the database and tried again. The smaller file succeeded where the larger one had failed.
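Compressing a file before uploading can be done with any tool. As an illustration only, here is a short Python sketch that uses gzip from the standard library; gzip is my assumption for the example, and the function name is hypothetical.

```python
import gzip
import shutil


def gzip_file(source_path, dest_path=None):
    """Write a gzip-compressed copy of source_path and return its path.

    The data is streamed, so a large database file never has to fit in
    memory. The original file is left untouched.
    """
    if dest_path is None:
        dest_path = str(source_path) + ".gz"
    with open(source_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path
```

How much this shrinks the upload depends on the data. Text-heavy database files tend to compress well, while data that is already compressed will barely shrink.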

Conclusion

I think using both Kaggle and Zenodo is the perfect way to host a public dataset. Kaggle has a great community and lets people quickly discover your dataset and make use of it. The downsides are the uncertain longevity and the fact that you need an account to download the dataset. Zenodo perfectly complements Kaggle’s weaknesses: it’s backed by CERN, an organization that takes data hosting seriously, and it makes downloading the dataset very easy.

¹ As of this post, my dataset has been viewed 161 times on Zenodo and downloaded just 47 times. It has been viewed 47,800 times on Kaggle and downloaded 3,017 times. In addition, there have been 24 Kaggle notebooks posted that make use of the data.