Data science cheat sheet


I originally wrote this as preperation for a job interview (that I never needed in the end), so the contents are in a format that makes sense to me. I am going to keep adding to this as I find data-related concepts that I need to understand. This is basically my warning to say that there may be errors within this page. This was converted from markdown using pandoc so it may be terrible formatting wise, but should improve over time


  1. Stats Theory
  2. Probability Theory
  3. Data
  4. Aphorisms and wisdom

1. Statisitcal Theory

Personal statistical philosophy

Look at the data by plotting it and make a decision from that. For classification, use a model that is easy to explain e.g. decision tree. For regression use as simple as model as possible. If you can sample all or close to all of the data, use a neural network to make a model because they can be made so large they are able to model real-world problems - driving for example.

ANOVA - A form of regression. Generally you use this if you have more than two groups. Two independant variables such as exercise and calorie consumption on the outcome of obesity for example. Essentially you’re asking the question, is the mean of any of these groups different from any other. IF all means are equal then the null hypothesis is true. The F-statistic is calculated by Between group variability / Within group variability and if the value is low it means that the variation in the groups is greater than the variation between them. You use F-statistic and compare it to a critical value from an F-table similar to the t-table etc.

General Model assumptions (look up particulars for your model)

Independant and dependent variable
- What you change. Your experimental variable, that you think will have an affect on the dependent variable. - The thing you measure.

Linear Regression - Fit a regression line to your data by minimising the error of the residuals (difference between line and actual data) by a method e.g. MSE. When doing hypothesis testing you’re comparing the “slopes”, which is one of the “coefficients” along with the error, bias, intercept etc. So you compare the slopes to get a t-statistic which can then be used to get a p-value from “one of those tables”.

Logistic Regression - When you want a binary outcome, yes or no etc. Essentially you get a probability from the log function and can then determine if it is one outcome or the other (0 or 1). You don’t have to concept of a residual with log regress, so instead of R-squared you use maximum likelihood

Fit an s-shape logistic function to the data rather than a line. With both linear regression and logisitic regression, if you have more than one predictor variable, the reuslting line is a linear combination of the variables providing they have a non-zero affect, e.g. a x 0.8 + b x 0.15 + c x 0.05, where a,b,c could be height, weight, age etc

Maximum likelihood - Hard to explain this one. But the distribution is important. Essentially you look at all the values, do a calulation and calcualte the likelihood of observing the data given a distribution e.g. normal dist. You move through all the points and record the likelihood of the data given a x distribution and choose the max - which is the point where the data is most likely given a distribution e.g. normal. It is easier to visualise than explain.

p-value - Can think of it as the probability that the null hypothesis is true. Typically you set a value for which you will concede that, given your data and model, the null hypothesis is unlikely and that there is another explanation for your results. My personal view, and from talking to statistitions, is that they are not very usefull. If you plot your data, you can usually see if there are differences in population, and if there aren’t there probably is no difference.

If the null is true: The p-value represent to probability of getting an outcome as, or more, extreme than the data your experiment as produced.

T-statistic - is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error. The T-statistic is used in estimating the population mean from a sampling distribution of sample means of the population std is unknown. The greater the value for T, the greater the evidence against the null hypothesis.

Z-score - Gives you an idea of how far from the mean a data point is. It is (the observed sample - mean of the sample) / std of sample.

2. Probability

Odds - provide a measure of likelihood. For example rolling a six with a normal die is 1:5. It is key to remember it is not a fraction although it can be the same. rolling a 4 or 6 is 2:4 - there are two chances of getting what you want and 4 chances of not getting what you want. Written as x:y is a ratio (odds ratio)

Probability - likelihood of an event occuring. The number of ways of achieving something / the total number of possible outcomes. For a die, the chance, or likelihood, of rolling a six is 1/6.

And or Rule - The probability for getting a six AND then a five is 1/6 * 1/5. The probability of getting a six or a five is 1/6 + 1/6 or 2/6 or 1/3. Generally the formula for or is P(a or b) = P(a) + P(b), but if both are simultaneuosly possible you add P(a or b) = P(a) + P(b) - P(a and b)

Conditional probability - Based around AND OR, but has the added (with or without replacement). With a die you cannot remove the number after you have rolled it but with cards for example, the probability of getting two aces in a row (providing you’re not replacing the cards - keeping them in your hand) is 4/51 * 3/50

Combinations - How many combinations are there if you have three things of three different colours? the answer is 3! (3 factorial) 3 2 1 combinations = 6.

If there are multiple different items, say 6 sweets, 2 chocolate bars, and 3 apples, you multiply them all, so 6 * 2 * 3 = 36 different combinations.

When you’re choosing n combinations from r choices the genneral formula is. n! / (n-r)! * r! where n is the number of choices and r is the number of ways things can be selected for the above example it is.

11! / (11-2)! * 2! = 55. Seems to often be permutations /2

Permutations - If you could only pick combinations of two from the above situation where there are 11 different items. The general formula is n! / (n-r)! where n is the number of combinations and r is number of ways things can be selected from a group of n things. in this case 11! / (11-2)! = 110. Order counts in permutations, that is choosing red and yellow is different from choosing yellow then red. When order of choice doesn’t matter you use combinations.

General theoretical stuff

Monte carlo simulation - is essentially where you repeatedly sample from a population to try and understand the underlying data, bootstrapping is an example.

Programs and software

Python - Very confident with this, probably don’t need to do any prep.

R - Have used this enough to be able to navigate it, but I only really use it if I need to do slightly more obscure statistical tests.

SQL et al - May have to do some research on database structures, but I think they are all very similar. Technology wise there are two paradigms I think. SQL and NoSQL. Microsoft Access is slightly different, may be worth having a little look at this.


Maybe should have something prepared about PyTorch, Keras and Tensorflow etc.

3. Data

Before collecting data


SQL versus (not only)SQL - noSQL was developed for fast scaling, frequent changes (agile) and being simpler. SQL was eveloped in the early days of computing and focuses on reducing data duplication (as storage used to be much more expensive) and have rigid, complex schemas.


Aphorisms and Wisdom

For statisitical models the Anna Karenina principle applies - a model has only one way in which it can be valid (all assumptions met) but there are many ways in which it can be incorrect (each one of the assumptions not met). Basically all good models (families) are good in the same way, whereas as bad models (families) are bad in many different ways.

Counterfactual - How do you factor in to a model what people will do when they see the model? If you have a model of food waste predicition, how do you factor in the

Goodheart’s law - As soon as you make a statistical metric it becomes useless, as people game the system toward that measure.

From Robert Grant’s blog - Usually where there is a clear signal from the noise humans have usually already acted on it.