What is data science? Transforming data into value
Data science definition
Data science is a method for gleaning insights from structured and unstructured data using approaches ranging from statistical analysis to machine learning. For most organizations, data science is employed to transform data into value in the form of improved revenue, reduced costs, business agility, improved customer experience, the development of new products, and the like. Data science gives the data collected by an organization a purpose.
Data science vs. analytics
While closely related, data analytics is a component of data science, used to understand what an organization’s data looks like. Data science takes the output of analytics to solve problems.
“Data science is coming to conclusions that drive your data forward,” says Adam Hunt, CTO at RiskIQ. “If you’re not solving a problem with data, if you’re just doing an investigation, that’s just analysis. If you’re actually going to use the outcome to explain something, you’re going from analysis to science. Data science has more to do with the actual problem-solving than looking at, examining, and plotting [data].”
The difference between data analytics and data science is also one of timescale. Data analytics describes the current state of reality, whereas data science uses that data to predict and/or understand the future.
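A small sketch can make the distinction concrete. The monthly order counts below are invented for illustration. The descriptive summary stands in for analytics, and the straight-line trend is just one simple way to produce a forecast, not a recommended method.

```python
import statistics

# Hypothetical monthly order counts (illustrative values, not real data).
months = [1, 2, 3, 4, 5, 6]
orders = [100, 110, 120, 130, 140, 150]

# Analytics: describe the current state of the data.
average_orders = statistics.mean(orders)
latest_orders = orders[-1]

# Data science: fit a least-squares trend line and use it to look ahead.
mean_month = statistics.mean(months)
covariance = sum((m - mean_month) * (o - average_orders) for m, o in zip(months, orders))
variance = sum((m - mean_month) ** 2 for m in months)
slope = covariance / variance
intercept = average_orders - slope * mean_month
forecast_month_7 = slope * 7 + intercept

print(f"Average so far: {average_orders}, latest month: {latest_orders}")
print(f"Forecast for month 7: {forecast_month_7:.1f}")
```

The first two numbers only summarize what has already happened. The last one is a claim about a month that has not happened yet, which is why it needs to be validated before anyone acts on it.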
Data science vs. big data
Data science and big data are often viewed in concert, but data science can be used to extract value from data of all sizes, whether structured, unstructured, or semi-structured. Of course, big data is useful to data scientists in many cases, because the more data you have, the more parameters you can include in a given model.
That said, more isn’t always better. As Hunt says, “If you take the stock market and try to fit it to a line, it’s not going to work. But maybe, if you only look at it for a day or two, you can set it to a line.”
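Hunt’s point about window size can be sketched with a synthetic series. The sine-wave "prices" below are an assumption chosen only because they swing up and down. They are not market data, and fit_line is a helper written for this example.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic "prices": 200 days of a wave that rises and falls four times.
days = list(range(200))
prices = [100 + 10 * math.sin(2 * math.pi * d / 50) for d in days]

# Fit one line to the whole history, and another to the first few days only.
_, _, r2_full = fit_line(days, prices)
_, _, r2_short = fit_line(days[:6], prices[:6])

print(f"R^2 over all 200 days: {r2_full:.3f}")
print(f"R^2 over the first 6 days: {r2_short:.3f}")
```

An R^2 near 1 means the line explains almost all of the variation, and a value near 0 means it explains almost none. On this series the short window is close to linear while the full history is not, which mirrors the quote: the same simple model can be adequate or useless depending on how much data it is asked to cover.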
The business value of data science
The business value of data science depends on organizational needs. Data science could help an organization build tools to predict hardware failures, allowing the organization to perform maintenance and prevent unplanned downtime. It could help predict what to put on supermarket shelves, or how popular a product will be based on its attributes.
Ted Dunning, CTO for MapR at HPE, says enterprises can get the most value out of data science when data analysts or data scientists are embedded in business teams.
“Almost by definition, a novelty-seeking person, someone who really innovates, is going to find value or leakage of value that is not what people otherwise expected,” Dunning says. “Often they’ll surprise the people in the business. The value wasn’t where people thought it was at first.”
Data science teams
Data science is generally a team discipline. Data scientists are the forward-looking core of most data science teams, but moving from data to analysis, and then transforming that analysis into production value, requires a range of skills and roles. For example, data analysts should be on board to investigate the data before presenting it to the team and to maintain data models. Data engineers are necessary to build data pipelines to enrich data sets and make the data available to the rest of the company.
For further insight into building data science teams, see “How to assemble a highly effective analytics team” and “The secrets of highly successful data analytics teams.”
The embedded approach to data science
Some organizations opt to commingle data specialists with other functions. DataOps is an increasingly common approach in which data engineers are embedded in DevOps teams with business line responsibilities. These DataOps teams tend to be cross-functional — cutting across “skill guilds” such as operations, software engineering, architecture, and product management — and can orchestrate data, tools, code, and environments from beginning to end. DataOps teams tend to view analytic pipelines as analogous to manufacturing lines.
According to Michele Goetz, vice president and principal analyst at Forrester, DataOps teams include:
- Data specialists who support the data landscape and development best practices
- Data engineers who provide ad hoc and system support to BI, analytics, and business applications
- Principal data engineers who are developers working on product and customer-facing deliverables
For further insight into DataOps, see “What is DataOps? Collaborative, cross-functional analytics.”
Data science goals and deliverables
The goal of data science is to construct the means for extracting business-focused insights from data. This requires an understanding of how value and information flow in a business, and the ability to use that understanding to identify business opportunities. While that may involve one-off projects, more typically data science teams seek to identify key data assets that can be turned into data pipelines that feed maintainable tools and solutions. Examples include credit card fraud monitoring solutions used by banks, or tools used to optimize the placement of wind turbines in wind farms.
Along the way, presentations that communicate what the team is up to are also important deliverables. “Making sure they’re communicating out results to the rest of the company is incredibly important,” RiskIQ’s Hunt says. “When a data science team goes dark for too long, it starts to get in a little trouble. Product managers take work for granted unless we’re talking about it all the time, selling it internally.”
Data science processes and methodologies
Production engineering teams work on sprint cycles, with projected timelines. That’s often difficult for data science teams to do, Hunt says, because a lot of time upfront can be spent just determining whether a project is feasible.
“A lot of times, the first week, or even first month, is research — collecting the data, cleaning it,” Hunt says. “Can we even answer the question? Can we do it efficiently? We spend a ton of time doing design and investigation, much more than a standard engineering team would perform.”
For Hunt, data science should follow the scientific method, though he notes that it’s not always the case, or even feasible.
“You’re trying to extract some insight out of data. In order to do that repeatedly and confidently, and to make sure you’re not just blowing smoke, you have to use the scientific method to accurately prove your hypothesis,” Hunt says. “But I don’t think many data scientists actually use any science whatsoever.”
Real science takes time, Hunt says. You spend a little bit of time confirming your hypothesis and then a lot of time trying to disprove yourself.
“With data science, you’re almost always in a for-profit company that doesn’t want to take the time to dive deeply enough into the data to validate these hypotheses,” Hunt says. “A lot of the questions we’re trying to answer are short-lived. In security, for instance, we’re trying to find the threat actor tomorrow, not next year — tomorrow, before he can release his threat to the wild.”
As a result, data science can often mean going with the “good enough” answer rather than the best answer, Hunt says. The danger, though, is that results can fall victim to confirmation bias or overfitting.
“If it’s not actually science, meaning you’re using scientific method to confirm a hypothesis, then what you’re doing is just throwing data at some algorithms to confirm your own assumptions.”
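One common guard against overfitting is to check a model on data it was not fitted to. The sketch below is a minimal illustration under stated assumptions: the data is synthetic (a straight line plus random noise), and fit_line, memorize, and mean_squared_error are helpers written for this example, not functions from any library.

```python
import random


def fit_line(xs, ys):
    """Return a predictor for the least-squares line through the training data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(xs, ys):
    """Return a predictor that copies the y value of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable


def noisy_samples(xs):
    # Assumed "true" relationship: y = 2x, observed with random noise.
    return [2 * x + rng.gauss(0, 1) for x in xs]


train_x = [i * 0.1 for i in range(100)]
test_x = [i * 0.1 + 0.05 for i in range(100)]  # held-out points between the training points
train_y = noisy_samples(train_x)
test_y = noisy_samples(test_x)

line = fit_line(train_x, train_y)
lookup = memorize(train_x, train_y)

line_train = mean_squared_error(line, train_x, train_y)
line_test = mean_squared_error(line, test_x, test_y)
lookup_train = mean_squared_error(lookup, train_x, train_y)
lookup_test = mean_squared_error(lookup, test_x, test_y)

print(f"line:      train error {line_train:.2f}, held-out error {line_test:.2f}")
print(f"memorizer: train error {lookup_train:.2f}, held-out error {lookup_test:.2f}")
```

The memorizer reproduces its training data exactly, so its training error is zero by construction, yet it has learned the noise rather than the relationship. Judging a model only on the data used to build it is one way a team ends up confirming its own assumptions.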
Data science tools
Data science teams make use of a wide range of tools, including SQL, Python, R, Java, and a cornucopia of open source projects such as Hive, Oozie, and TensorFlow. These tools are used for a variety of data-related tasks, ranging from extracting and cleaning data to subjecting data to algorithmic analysis via statistical methods or machine learning. Some common tools include:
- Apache Hadoop. Hadoop is used to solve complex computational problems and data-intensive tasks using parallel processing across clusters of nodes.
- SAS. This venerable, proprietary statistical tool is used for data mining, statistical analysis, BI applications, clinical trial analysis, and time series analysis.
- Tableau. Now owned by Salesforce, Tableau is a data visualization tool.
- TensorFlow. Developed by Google and licensed under the Apache License 2.0, TensorFlow is a software library for machine learning often used for training and inference of deep neural networks.
- DataRobot. This automated machine learning platform is used for building, deploying, and maintaining AI.
- BigML. BigML is another machine learning platform. It’s focused on simplifying the building and sharing of datasets and models.
- Knime. Knime is an open source data analytics, reporting, and integration platform.
- Apache Spark. This unified analytics engine is designed for processing large-scale data, with support for data cleansing, transformation, model building, and evaluation.
- RapidMiner. This data science platform is geared to support teams, with support for data prep, machine learning, and predictive model deployment.
- Matplotlib. This open source plotting library for Python offers tools for creating static, animated, and interactive visualizations.
- Excel. Microsoft’s spreadsheet software is perhaps the most extensively used BI tool around. It’s also handy for data scientists working with smaller datasets.
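To give a feel for how the extract, clean, and analyze steps fit together, here is a minimal sketch that uses only Python’s standard library, with SQLite standing in for a production data store. The readings table, its values, and the plausibility bounds are all hypothetical.

```python
import sqlite3
import statistics

# An in-memory database stands in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.5), ("a", None), ("b", 21.0), ("b", 999.0), ("a", 19.5)],
)

# Extract and clean in SQL: skip missing values and physically implausible readings.
rows = conn.execute(
    "SELECT temperature FROM readings "
    "WHERE temperature IS NOT NULL AND temperature BETWEEN -50 AND 60"
).fetchall()
conn.close()

# Analyze in Python: a simple statistic over the cleaned values.
temperatures = [row[0] for row in rows]
mean_temperature = statistics.mean(temperatures)

print(f"Kept {len(temperatures)} of 5 readings, mean temperature {mean_temperature:.2f}")
```

In practice each step is usually handled by one of the heavier tools listed above, but the shape of the work is the same: pull the data, remove what cannot be trusted, then compute.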
Data science salaries
Here are some of the most popular job titles related to data science and the salary range for each position, according to data from PayScale:
- Analytics manager: $68K-$127K
- Associate data scientist: $60K-$102K
- Business intelligence analyst: $50K-$95K
- Data analyst: $43K-$85K
- Data architect: $76K-$155K
- Data engineer: $65K-$132K
- Data scientist: $67K-$134K
- Data scientist, IT: $60K-$134K
- Lead data scientist: $98K-$177K
- Research analyst: $41K-$81K
- Research scientist: $50K-$120K
- Senior data scientist: $93K-$160K
- Statistician: $50K-$108K
Data science skills
While the number of data science degree programs is increasing at a rapid clip, they aren’t necessarily what organizations look for when seeking data scientists. Candidates with a statistics background are popular, especially if they can demonstrate they know whether they are looking at real results, have the domain knowledge to put results in context, and have the communication skills to convey results to business users.
Many organizations look for candidates with PhDs. “I’m biased toward people who have PhDs, but I wouldn’t pass up someone who has a lot of experience,” Hunt says. “What a PhD tells me is you’re capable of doing very deep research on a topic, and you’re able to disseminate that information to others. But having a solid background or personal project is incredibly interesting.”
Hunt says he particularly looks for PhDs in physics, math, computer science, economics, or even social science. He wouldn’t turn his nose up at applicants with degrees in data science or analytics, but he does have reservations. “My personal experience is I find they’re very useful, but they focus too much on the operations of the models and not the mindset,” he says.
Some of the best data scientists or leaders in data science groups have non-traditional backgrounds. HPE’s Dunning says that some of the best he’s worked with include someone who spent six years working as a gardener before going to college, a person with a background in fine arts, another with a French literature degree, and yet another who was a journalism student with very little formal computer training.
“You want to test people in terms of data perception, not knowing formulas,” Dunning says. “You want the ability to look at things and understand them.”
For further information about data scientist skills, see “What is a data scientist? A key data analytics role and a lucrative career,” and “Essential skills and traits of elite data scientists.”
Data science training
Given the current shortage of data science talent, many organizations are building out programs to develop internal data science talent.
Bootcamps are another fast-growing avenue for training workers to take on data science roles. For more details on data science bootcamps, see “15 best data science bootcamps for boosting your career.”
Data science degrees
According to U.S. News & World Report, these are the top graduate degree programs in data science:
- Master of Science in Statistics: Data Science at Stanford University
- Master of Information and Data Science: Berkeley School of Information
- Master of Science in Data Science: Harvard University John A. Paulson School of Engineering and Applied Sciences
- Master of Science in Analytics: University of Chicago Graham School
- Master of Computational Data Science: Carnegie Mellon University
- Master of Science in Data Science: University of Washington
- Master in Interdisciplinary Data Science: Duke University
- Master of Applied Data Science: University of Michigan School of Information