data scientist

Data science is the study of the generalizable extraction of knowledge from data,[1] and the key word is science.[2] It builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high-performance computing, with the goal of extracting meaning from data and creating data products. The subject is not restricted to big data, although the growing scale of data makes big data an important aspect of data science. Another key ingredient that has boosted the practice and applicability of data science is the development of machine learning, a branch of artificial intelligence that is used to uncover patterns in data and to develop practical, usable predictive models.

A practitioner of data science is called a data scientist. Data scientists solve complex data problems by employing deep expertise in some scientific discipline. It is generally expected that data scientists be able to work with various elements of mathematics, statistics and computer science, although expertise in these subjects is not required.[3] In practice, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. Data science is therefore usually practiced in teams whose members bring a variety of expertise.

Data scientists find and interpret rich data sources, manage large amounts of data despite hardware, software and bandwidth constraints, merge data sources, ensure the consistency of data sets, create visualizations to aid in understanding data, and build mathematical models from the data. They then present and communicate their insights and findings to the specialists and scientists on their team and, if required, to a non-expert audience.
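
A few of the steps above (merging sources, checking consistency, building a model) can be sketched in a handful of lines. This is only an illustrative sketch using the Python standard library. The two data sources, their values, and the helper names `merge_sources`, `check_consistency` and `fit_line` are hypothetical and do not come from any particular toolkit. Real projects would typically use dedicated data-handling and modeling libraries.

```python
from statistics import mean

# Two hypothetical data sources keyed by customer id (made-up values).
ages = {"c1": 25, "c2": 40, "c3": 55, "c4": 30}
spend = {"c1": 120.0, "c2": 180.0, "c3": 240.0, "c5": 90.0}


def merge_sources(left, right):
    """Inner-join two dicts on the keys they share."""
    shared = sorted(left.keys() & right.keys())
    return [(key, left[key], right[key]) for key in shared]


def check_consistency(rows):
    """Raise if any merged record has a missing or negative value."""
    for key, age, amount in rows:
        if age is None or amount is None or age < 0 or amount < 0:
            raise ValueError(f"inconsistent record: {key}")
    return rows


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


rows = check_consistency(merge_sources(ages, spend))
slope, intercept = fit_line([r[1] for r in rows], [r[2] for r in rows])
print(f"{len(rows)} merged rows; spend = {slope:.2f} * age + {intercept:.2f}")
```

Only the customers present in both sources survive the merge, which is why the consistency check and the fit see three records rather than four.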

Data science techniques affect research in many domains, including the biological sciences, medical informatics, health care, social sciences and the humanities. Data science also heavily influences economics, business and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.[4]

Domain Specific Interests

Data Science Venn Diagram

Data science is the practice of deriving valuable insights from data. It is emerging to meet the challenges of processing very large data sets, i.e. “Big Data”, consisting of the structured, unstructured or semi-structured data that large enterprises produce. At center stage is the explosion of new data generated by smart devices, the web, mobile and social media. Data science requires a versatile skill set. Many practicing data scientists specialize in specific domains such as marketing, medicine, security, fraud and finance. Regardless of domain, data scientists rely heavily on elements of statistics, machine learning, optimization, signal processing, text retrieval and natural language processing to analyze data and interpret results.

Criticism

Although use of the term data science has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in contexts such as graduate degree programs.[13] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician…. Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”[14]

Security Data Science

Data science has a long and rich history in security and fraud monitoring.[citation needed] Security data science is focused on advancing information security through practical applications of exploratory data analysis, statistics, machine learning and data visualization. Although the tools and techniques are no different than those used in data science in any other domain, this field has a narrow focus on reducing risk and on identifying fraud or malicious insiders. The information security and fraud prevention industries have been developing security data science to tackle the challenges of managing and gaining insights from huge streams of log data, discovering insider threats and preventing fraud. Data science companies like Feedzai[15] use a mix of big data, machine learning, and human intelligence to identify fraudulent payment transactions. Security data science is “data driven,” meaning that new insights and value come directly from data.[16]
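
As a toy illustration of how a statistical technique can surface a suspicious transaction, the sketch below flags payment amounts that lie far from the mean. It is not the method used by Feedzai or any other vendor. The function name `flag_outliers`, the threshold and the amounts are assumptions chosen for the example, and production systems rely on far richer features and models.

```python
from statistics import mean, stdev


def flag_outliers(amounts, threshold=2.0):
    """Return indices of amounts more than `threshold` sample standard
    deviations away from the mean (a simple z-score rule)."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, a in enumerate(amounts) if abs(a - mu) / sigma > threshold]


# Hypothetical payment amounts; the last one is deliberately extreme.
amounts = [20.0, 22.5, 19.0, 21.0, 23.0, 18.5, 20.5, 950.0]
print(flag_outliers(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a z-score rule needs a modest threshold. Robust statistics such as the median are a common alternative.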

Clinical Data Science

Data science has always been prominent in the field of clinical trials. Timely insight into clinical data provides answers to medical questions, documenting the safety and efficacy of novel and existing therapeutic compounds. Working with large and complex data, clinical data scientists have produced statistical analyses of clinical trials for marketing applications for as long as clinical development has been required. In the early 2000s, the clinical data scientist's role evolved from that of a consultant to statisticians into a strategic one. The clinical data scientist now assists in the planning, collection, transformation, analysis and reporting of clinical trial data and in the communication of the results. These scientists are crucial to determining the safety and efficacy of novel therapeutic compounds.