Google's Dataset Search Engine

The one-stop shop for all things data

The Problem

As the summary suggests, my PI had tasked me with scraping websites for keywords so that we could gather data for ophthalmology projects. The endeavor overwhelmed me, so before typing the first line of code, I searched the web for existing projects. I searched during my winter break but didn’t find anything. I then began writing code for a simple search, but kept running into errors. At that point, Google’s Dataset Search Engine was in beta, so it wasn’t really on my radar. I kept fumbling with my own code for weeks, making no progress.
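A minimal sketch of the kind of keyword scan described above might look like the following. It is an illustration under assumptions, not the original code: the function name `count_keywords`, the sample HTML, and the keyword list are all hypothetical, and it uses only Python's standard library so it runs without a network connection.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_keywords(html, keywords):
    """Count case-insensitive, whole-word occurrences of each keyword."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    counts = Counter()
    for kw in keywords:
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        counts[kw] = len(re.findall(pattern, text))
    return counts


# Hypothetical page content. A real run would fetch the HTML first,
# for example with urllib.request.urlopen(url).read().decode().
sample_html = """
<html><head><style>p {color: red}</style></head>
<body><h1>Retina imaging dataset</h1>
<p>OCT scans of the retina from glaucoma patients.</p>
<script>var retina = 1;</script></body></html>
"""

print(count_keywords(sample_html, ["retina", "glaucoma", "cornea"]))
```

Even a toy version like this shows why the do-it-yourself route gets tedious: a keyword hit only says that a page mentions a term, not that it actually hosts usable data.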

What makes the search engine so unique?

Unlike a simple query in Google or PubMed, this dataset engine is unique in that it searches specifically for raw or semi-processed data. Machine learning models and most experiments rely on such datasets, which are usually buried in the supplemental files of major publications, making them hard to find and use. The engine surfaces datasets that publishers and repositories have described with structured metadata and presents the results in a concise, user-friendly format. Below is an example of a query for “P53”, a tumor suppressor gene whose mutations are frequently associated with common cancers.


I would describe this new engine as a hub for all things data-related. I even recommended that students in my deep learning class rely on it when searching for datasets to build the machine learning models for their final projects. It’s extremely convenient and has a very pleasant interface. Moreover, it saved me days of work building something similar on my own. This resource is very valuable, but it does have some marked shortcomings, some of which I discuss in the next section.


While this is definitely a useful tool, it does have some major drawbacks. Often, the links it returns point not to the datasets themselves but to the original research publications, which leaves the additional step of locating the supplemental data files. Sometimes the engine does isolate the supplemental data from a publication and return it directly, but this is the exception rather than the norm. This setback is, of course, understandable. The engine was in beta throughout 2019 and was not officially released until late January 2020. Moving forward, I am sure there will continue to be improvements.
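One way to tell the two kinds of results apart programmatically is to check whether a page carries dataset metadata with a direct download link. The sketch below assumes the page embeds schema.org `Dataset` markup as JSON-LD and lists files under `distribution` with a `contentUrl`. Not every result will do so, and the function name `direct_download_urls` and both sample pages are hypothetical.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def direct_download_urls(html):
    """Return contentUrl values from any schema.org Dataset described on the page."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    urls = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore malformed metadata
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            # Simplification: only handles "@type" given as a plain string.
            if not isinstance(item, dict) or item.get("@type") != "Dataset":
                continue
            dist = item.get("distribution", [])
            if isinstance(dist, dict):
                dist = [dist]
            for d in dist:
                if isinstance(d, dict) and "contentUrl" in d:
                    urls.append(d["contentUrl"])
    return urls


# Hypothetical pages: one describes a dataset with a download link,
# the other is a publication page with no dataset metadata.
dataset_page = """
<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example retinal scans",
 "distribution": [{"@type": "DataDownload",
                   "contentUrl": "https://example.org/scans.csv"}]}
</script></head><body></body></html>
"""
paper_page = "<html><body><h1>A research article</h1></body></html>"

print(direct_download_urls(dataset_page))
print(direct_download_urls(paper_page))
```

An empty list for a result is a hint that the link leads to a publication or landing page, so the supplemental files still have to be tracked down by hand.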


At least for the questions I ask in my work and the projects I work on, this search engine will accelerate my progress. Although I don’t think the datasets here will replace clinical datasets collected at major institutions, I do think that mining them will be a starting point for drawing conclusions that can then be validated with new data that has yet to be collected. By meshing the worlds of new and existing data, this engine will expand the breadth of insights we as scientists and engineers can make!