Developing Data Management Strategies in the Age of Cloud Computing

June 30, 2017


Chris Dwan

Owner, Dwan Consulting LLC

Chris Dwan’s experience shepherding the Broad Institute’s transition from an on-premise data solution to a hybrid cloud infrastructure, and his design of the New York Genome Center’s computing, data storage, and network infrastructure, have given him a front-row seat in managing genomic data. In his current role as an industry consultant, he helps lead clients through cloud transformations, with a focus on the data. Here he provides advice on best practices for developing data management strategies at start-up and maturing life science organizations.

What are some of the pertinent considerations companies need to make when considering whether to use the cloud and cloud-based services?

Dwan: Everybody knows that data is the lifeblood of business, and we want to have this data-science, machine-learning approach to what we do. When you are starting a biotech company, or you have a research lab, an important question is: what are the practices that make it feasible to do what you do? I think the framing of on-prem versus cloud begs the question: “For what?” If a technologist does not at least consider public clouds in 2017, you have to wonder why. We are in the second decade of public, exascale clouds. It’s not tenable for someone to believe they are just emerging. In considering technology choices, you need to consider the purpose for which you are building.

For startups, there are many areas where public cloud providers are a no-brainer. It is not so much in scientific data, but simply in productivity software. Things like Office 365 or the G Suite from Google, you can’t beat them, and they have all of the good properties of clouds: they do things that your in-house staff don’t need to do anymore; the software updates are fully automated and invisible; and you license in terms of head count rather than opaque version numbers.

So the cloud makes sense for the basic business tools, what about considerations for scientific data?

Dwan: If you are generating data from a laboratory, one of the things you might consider is: do you want your laboratory to be idled if your internet connection is down? Because if you have gone absolutely, purely public cloud, then your instrument is in your lab, and you need your internet service provider to be perfect to achieve perfect uptime. Most labs copy the data off in batch at the end of the day and, if they miss a day, they will do it the next day. So that alone doesn’t push them away from public cloud. But this is usually the first place where my clients hit a real conversation driven by requirements: when you are capturing data off lab instruments, you need to decide where you want that data to live long term, and you may not want to push that data to a public cloud only to pull it back later.

What kind of cost considerations should newer companies consider?

Dwan: Honestly, a lot of people get distracted by costs when they should first consider potential operational constraints. While the cost of data storage is a major issue for companies that hold large amounts of information, startups in their first year usually don’t. They are worried more about velocity and the ability to rapidly adjust their pipelines, so pre-optimizing on notional costs is usually a mistake.

At what point in a company’s progress does it come time for them to consider an on-premise versus cloud solution to managing their data, or do they tend to adopt a hybrid approach?

Dwan: I’m glad you mentioned the hybrid approach. Most of the time when people talk about on-premise, it is a strawman based on the very words. On-premise, I assume, would mean “in the office space”: you have a network closet and you are going to stuff a lot of servers in there. But if a company is deliberate when making the choice to capitalize and own the metal that implements its computing infrastructure, it would probably host that hardware at a co-location facility with 24-hour staff, generators, high-performance networks, and all the associated security features. So it’s not really on-premise, on-premise. As you go further down that road, the solution that you wind up building looks a lot more cloud-like.

What are the implications of this distinction?

Dwan: We should architect our solutions for a cloud architecture, whether or not we choose to buy the hardware, because then we have a choice. If we architect our information assuming it is always going to be on metal that we own, then we have cut ourselves off from the possibility of using public clouds. So even if we choose to buy metal, we should use it in a cloud-y way.

How would that work?

Dwan: The big one is using object storage, rather than file-based storage, as early in the process as possible. That usually means adopting the S3 standard that most of the cloud providers and storage vendors support. What that gives you is the ability to have information natively available at a URL endpoint rather than as a file system mount. It’s a big lift to make that change, because almost all of the software we have as a legacy assumes that your data is in a file system. But a file system traps you: there is only one path to get to the data, there is only one owner, and you can’t do anything around role-based permissioning, or replicating and moving data around. Going to object storage, rather than the file system, is really important, especially for the scientific data that you are going to want to keep, and use, and replicate in the future.
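To make the contrast concrete, here is a minimal sketch of the two addressing modes Dwan describes. The bucket name, mount point, and file names are invented for illustration; the URL shape follows Amazon's documented virtual-hosted-style S3 addressing.

```python
# Hypothetical illustration: the same dataset addressed two ways.
# All names (mount point, bucket, run ID) are invented for this sketch.

def file_path(mount: str, run_id: str, fname: str) -> str:
    """A traditional file-system address: valid only on hosts
    where the storage is mounted, with a single path and owner."""
    return f"{mount}/sequencing/{run_id}/{fname}"

def object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """An S3-style virtual-hosted URL: reachable from anywhere the
    bucket policy allows, with no file-system mount required."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

print(file_path("/mnt/labdata", "run_0042", "sample_A.fastq.gz"))
print(object_url("example-genomics-data",
                 "sequencing/run_0042/sample_A.fastq.gz"))
```

The point is not the string formats themselves, but that the object address decouples the data from any one machine's mount table, which is what enables role-based access and replication later.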

How would managing object storage versus a file system differ?

Dwan: It requires additional work up front. With a file system, everyone knows how it works: you make top-level directories, say a proteomics directory, a sequencing directory, or a clinical directory, and under those you have year, month, day, or a protocol, and from that structure you can essentially read the organization. When you go to object storage, you need to do the work of creating that data model before you can put any of the data in, because you are not given a directory hierarchy to just populate. But the organization that chooses to do this work early in its life cycle is miles ahead of one that doesn’t. You can prevent the siloing at the beginning.
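One simple form the up-front data model can take is an explicit key-naming convention, agreed on before any data is ingested. The domains, protocol name, and sample ID below are invented for this sketch; a real scheme would be defined by the organization's own metadata standards.

```python
# A minimal sketch of an up-front object-key convention.
# Fields and layout are hypothetical, not from the interview.
from datetime import date

def object_key(domain: str, protocol: str, d: date,
               sample_id: str, fname: str) -> str:
    """Build a deterministic object key so every dataset is findable
    by domain, protocol, and date without a directory hierarchy."""
    allowed = {"sequencing", "proteomics", "clinical"}
    if domain not in allowed:
        raise ValueError(f"unknown domain: {domain}")
    return f"{domain}/{protocol}/{d:%Y/%m/%d}/{sample_id}/{fname}"

key = object_key("sequencing", "wgs-v2", date(2017, 6, 30),
                 "S1234", "reads.fastq.gz")
print(key)  # sequencing/wgs-v2/2017/06/30/S1234/reads.fastq.gz
```

Because the convention is enforced in code rather than read off a folder tree, every team writes data the same way from day one, which is how the early siloing Dwan warns about gets prevented.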