Data Science Introduction

Notes from the Johns Hopkins Coursera Data Science Specialisation

Types of Data Science Questions

  • Descriptive
    • Describe a set of data
    • Commonly applied to census data (for example, describing the population)
    • It is not trying to infer anything
  • Exploratory
    • Find relationships you don’t know about
    • Exploratory analysis alone should not be used for generalising / predicting
    • Correlation does not imply causation
  • Inferential
    • Use a relatively small sample of data to say something about a bigger population
    • Involves estimating both the quantity of interest and the uncertainty around that estimate
  • Predictive
    • To use the data on some object to predict values for another object
    • If X predicts Y it does not mean that X causes Y
    • More data and simple models tend to work well
    • Prediction is hard - especially when it involves the future
  • Causal
    • To find out what happens to one variable when you make another variable change
    • Generally done through randomised studies
    • Generally an average effect
  • Mechanistic
    • Understand the exact changes in variables that lead to changes in other variables for individual objects
    • Incredibly hard to infer
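
The difference between a descriptive and an inferential question can be sketched in Python. This is a minimal illustration, not part of the course material: the population (heights with a mean of 170 and a spread of 10) is made up, the helper name estimate_mean_with_ci is hypothetical, and the interval uses a normal approximation with z = 1.96, which is one common choice among several.

```python
import math
import random
import statistics


def estimate_mean_with_ci(sample, z=1.96):
    """Return (estimate, lower, upper) for the population mean.

    Normal-approximation interval: mean +/- z * standard error.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


rng = random.Random(42)

# Hypothetical population of 100,000 values (made-up parameters).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Descriptive: summarise the data in hand, with no inference involved.
population_mean = statistics.fmean(population)

# Inferential: use a small sample to say something about the population,
# reporting both the estimate and the uncertainty around it.
sample = rng.sample(population, 200)
estimate, lower, upper = estimate_mean_with_ci(sample)

print(f"population mean (descriptive): {population_mean:.2f}")
print(f"sample estimate (inferential): {estimate:.2f} [{lower:.2f}, {upper:.2f}]")
```

The interval is what separates the inferential answer from a plain summary: it states how far the sample estimate could plausibly be from the population value.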

What is Data

“Data are values of qualitative or quantitative variables, belonging to a set of items”

  • The set of items is sometimes called the population
  • Data is the second most important thing; the question is the most important
    • With that said, the data may limit the question later.

Experimental Design

  • Know the analysis plan before you start
  • Formulate your question in advance
  • Confounding
    • Pay attention to other variables that could be causing the correlation. An observed correlation between two variables does not mean that one causes the other (correlation does not imply causation)
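
Confounding can be illustrated with a small simulation. The sketch below assumes a made-up setup: a hidden variable z drives both x and y, while x has no effect on y at all. The variable names, noise levels and sample size are all hypothetical. It then compares the observed correlation with what happens when x is assigned at random, as it would be in a randomised study.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n = 5_000

# Hypothetical confounder z drives both x and y; x never influences y.
z = [rng.gauss(0, 1) for _ in range(n)]
x_observed = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Randomised assignment: x is set by a coin flip, independent of z.
x_randomised = [rng.choice([0, 1]) for _ in range(n)]

corr_observed = pearson(x_observed, y)
corr_randomised = pearson(x_randomised, y)

print(f"observed x vs y:   {corr_observed:.2f}")
print(f"randomised x vs y: {corr_randomised:.2f}")
```

Under these assumptions the observed x and y are strongly correlated purely because of z, and the correlation falls to roughly zero once x is randomised. This is why causal questions are generally answered through randomised studies.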