Geneia's Data Science Sandbox

May 9, 2017
The healthcare industry needs the kind of innovation made possible by a data science sandbox.
Chief Analytics and Technology Officer

Innovation in the American healthcare industry lags behind just about every other sector. This has been true throughout the more than 20 years I’ve worked as a healthcare data scientist and software architect. Yet it still surprises me.

After all, the healthcare system touches every American throughout their life, national healthcare expenditures are rapidly approaching 20 percent of gross domestic product (GDP), and Americans’ health is declining as the rates of obesity, diabetes and high blood pressure climb. On top of these challenges comes the so-called Silver Tsunami, with 10,000 Americans turning 65 every day.

If any industry sorely needs innovation, it’s healthcare. I’m pleased to see that, slowly but surely, innovation is coming to healthcare.

Companies like Geneia are leading the way by using:

Just as importantly, we’re accelerating the culture of innovation for our employees and giving them the environment and tools to experiment and create change in healthcare.

Geneia’s data science sandbox, our experimentation and exploration environment that is separated and protected from client-facing servers and products, was designed from the ground up to encourage collaboration among our data scientists and data engineers and to support the data science workflow. In technical terms, the sandbox:

  • Is agnostic towards programming languages and tools, and supports Java, Julia, Jupyter, Python, RStudio and SAS, meaning our data scientists can work in their favorite tool or the one best suited to the problem they’re trying to solve.
  • Has advanced machine learning libraries, including H2O, scikit-learn, MXNet, TensorFlow and caret.
  • Contains a robust healthcare analytic data model with support for financial, administrative and clinical data sets, and can be augmented with other data models.
  • Is pre-configured with a full set of medical vocabularies and code sets, including LOINC, SNOMED, ICD-9/10, Revenue Codes, NDC and First Databank.
  • Uses the Amazon Redshift data warehouse, which lets users query large amounts of data many times faster than a traditional row-based relational database and therefore test hypotheses quickly.
  • Includes a Spark cluster and its library of machine learning algorithms to support unstructured and semi-structured data as well as parallel processing for more complex algorithms, both of which shorten the time to generate insights.

De-Identified Data for One Million Lives

The Geneia sandbox now has a large corpus of de-identified medical, pharmacy, dental, vision, clinical and enrollment data for one million lives. Because the data is de-identified, we can form and test hypotheses without the risk of exposing protected health information (PHI). We have also made sure the de-identified data retains its analytical value.
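
One way data can be de-identified while keeping its analytical value is sketched below in Python. This is an illustration only, not a description of Geneia's actual process: the field names, the secret key, the 180-day shift window and the helper functions are all assumptions. The sketch replaces the member ID with a keyed hash so that one person's records still link together, shifts every date for that person by the same offset so that intervals between events are preserved, truncates the ZIP code to three digits and caps age at 90.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret key; in practice it would be kept outside the sandbox.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(member_id: str) -> str:
    """Replace a member ID with a keyed hash so records still link together."""
    digest = hmac.new(SECRET_KEY, member_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def date_offset_days(member_id: str, max_shift: int = 180) -> int:
    """Derive a stable per-member shift in the range [-max_shift, max_shift]."""
    digest = hmac.new(
        SECRET_KEY, b"shift:" + member_id.encode("utf-8"), hashlib.sha256
    )
    value = int.from_bytes(digest.digest()[:4], "big")
    return value % (2 * max_shift + 1) - max_shift


def deidentify(record: dict) -> dict:
    """Return a new record with direct identifiers dropped or coarsened."""
    shift = timedelta(days=date_offset_days(record["member_id"]))
    return {
        "person_key": pseudonymize(record["member_id"]),
        "service_date": record["service_date"] + shift,  # same shift per member
        "zip3": record["zip"][:3],                       # coarsen geography
        "age": min(record["age"], 90),                   # cap extreme ages
        "diagnosis_code": record["diagnosis_code"],      # clinical content kept
    }


# Made-up example record, for illustration only.
claim = {
    "member_id": "M001",
    "name": "Jane Doe",
    "service_date": date(2016, 1, 10),
    "zip": "17101",
    "age": 93,
    "diagnosis_code": "E11.9",
}
safe_claim = deidentify(claim)
```

Because the shift is derived from the member ID, two visits by the same member stay the same number of days apart after de-identification, which keeps utilization and time-to-event analyses meaningful.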

As part of the process of de-identifying our data, we moved it to the Observational Health Data Sciences and Informatics (OHDSI) data model. OHDSI, pronounced “Odyssey,” is “a multi-stakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics.” About 650 million lives are stored in this data model, which supports faster development because it is highly normalized and comes with a comprehensive set of medical vocabularies.
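
To illustrate why a normalized model with a shared vocabulary can speed up development, here is a minimal Python sketch using an in-memory SQLite database. The table and column names follow the OMOP Common Data Model that OHDSI maintains, but only a small subset of columns is shown, and the concept IDs, vocabulary name, codes and rows are made-up placeholders rather than real vocabulary entries or sandbox data.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with three OMOP-style tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (
            concept_id    INTEGER PRIMARY KEY,
            concept_name  TEXT,
            vocabulary_id TEXT,
            concept_code  TEXT
        );
        CREATE TABLE person (
            person_id     INTEGER PRIMARY KEY,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id               INTEGER REFERENCES person(person_id),
            condition_concept_id    INTEGER REFERENCES concept(concept_id),
            condition_start_date    TEXT
        );
        """
    )
    # Placeholder vocabulary entries (IDs and codes are NOT real).
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?)",
        [
            (1, "Type 2 diabetes mellitus", "DEMO", "DEMO-1"),
            (2, "Essential hypertension", "DEMO", "DEMO-2"),
        ],
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(101, 1950), (102, 1962), (103, 1975)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [
            (1, 101, 1, "2015-02-01"),
            (2, 101, 1, "2016-06-15"),  # same person, repeat diagnosis
            (3, 102, 1, "2016-01-20"),
            (4, 103, 2, "2016-03-05"),
        ],
    )
    conn.commit()
    return conn


def persons_with_condition(conn: sqlite3.Connection, concept_name: str) -> int:
    """Count distinct persons with at least one occurrence of a condition."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT co.person_id)
        FROM condition_occurrence AS co
        JOIN concept AS c ON c.concept_id = co.condition_concept_id
        WHERE c.concept_name = ?
        """,
        (concept_name,),
    ).fetchone()
    return row[0]


conn = build_toy_db()
diabetes_count = persons_with_condition(conn, "Type 2 diabetes mellitus")
```

Because every clinical event points at a standard concept instead of a raw source code, the same query works regardless of which coding system the original claim used.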

The magnitude of the data, with its one million lives and more than two years of history, is sufficient to support even some of the most complex healthcare analytic tasks.

For example, our data scientists are currently using the sandbox to better understand variations in treatment and to identify patterns across patients and providers. From a predictive analytics perspective, we’re also exploring opioid dependency and identifying patients with a high propensity for it. These are just two of many instances where we are using the sandbox to find areas in healthcare where we can advance the Triple Aim.

Without a doubt, the introduction of the Geneia sandbox is an exciting development for our industry, our clients and our employees. In the words of one of our young data scientists, “I feel like a kid in a candy store. For the first time, we can readily take an idea, create a hypothesis and quickly test it. We’re able to gain more insights much faster about patients than was previously possible, and also more easily collaborate with other data scientists on the Geneia team.”