Data science cheat sheet

About

I originally wrote this as preperation for a job interview (that I never needed in the end), so the contents are in a format that makes sense to me. I am going to keep adding to this as I find data-related concepts that I need to understand. This is basically my warning to say that there may be errors within this page. This was converted from markdown using pandoc so it may be terrible formatting wise, but should improve over time

Stats Theory
Probability Theory
Data
Aphorisms and wisdom

1. Statisitcal Theory

Personal statistical philosophy

Look at the data by plotting it and make a decision from that. For classification, use a model that is easy to explain e.g. decision tree. For regression use as simple as model as possible. If you can sample all or close to all of the data, use a neural network to make a model because they can be made so large they are able to model real-world problems - driving for example.

ANOVA - A form of regression. Generally you use this if you have more than two groups. Two independant variables such as exercise and calorie consumption on the outcome of obesity for example. Essentially you’re asking the question, is the mean of any of these groups different from any other. IF all means are equal then the null hypothesis is true. The F-statistic is calculated by Between group variability / Within group variability and if the value is low it means that the variation in the groups is greater than the variation between them. You use F-statistic and compare it to a critical value from an F-table similar to the t-table etc.

General Model assumptions (look up particulars for your model)

Distributional assumptions about the residuals of the model. This is partiticularly true in frequentist statistics where each individual point is not that important and should be random, but the whole of the data tends toward the central limit theorem (if you take sufficiently large random samples the dist will tend toward normal distribution.) a.k.a stochastic

Independence: A value from the sample does not affect another. Non-independant examples introduce bias. If two samples are correlated then it means there is something else going on with the data that won’t be captured by your model (unless you know about it). One Example offered by Peter Flom is that of the 2000’s housing crisis where lenders assumed that each person’s risk of defaulting was indpendent of each others, which turned out not to be true.
Normality: This is because of the maths and central limit theorem I think. The mean value is what is used to compare two populations along with the assumptions about the normal distribution (sd and stats tables for example), so if the data do not follow normality then it all falls down.
Constant variance in a linear model - Ff the variance of the residuals is not random then it is likely the regression line does not fit the data correctly, may need to use a model with more paramters. If you continue the model is likely to be quite inaccurate.

Independant and dependent variable
- What you change. Your experimental variable, that you think will have an affect on the dependent variable. - The thing you measure.

Linear Regression - Fit a regression line to your data by minimising the error of the residuals (difference between line and actual data) by a method e.g. MSE. When doing hypothesis testing you’re comparing the “slopes”, which is one of the “coefficients” along with the error, bias, intercept etc. So you compare the slopes to get a t-statistic which can then be used to get a p-value from “one of those tables”.

Logistic Regression - When you want a binary outcome, yes or no etc. Essentially you get a probability from the log function and can then determine if it is one outcome or the other (0 or 1). You don’t have to concept of a residual with log regress, so instead of R-squared you use maximum likelihood

Fit an s-shape logistic function to the data rather than a line. With both linear regression and logisitic regression, if you have more than one predictor variable, the reuslting line is a linear combination of the variables providing they have a non-zero affect, e.g. a x 0.8 + b x 0.15 + c x 0.05, where a,b,c could be height, weight, age etc

Maximum likelihood - Hard to explain this one. But the distribution is important. Essentially you look at all the values, do a calulation and calcualte the likelihood of observing the data given a distribution e.g. normal dist. You move through all the points and record the likelihood of the data given a x distribution and choose the max - which is the point where the data is most likely given a distribution e.g. normal. It is easier to visualise than explain.

p-value - Can think of it as the probability that the null hypothesis is true. Typically you set a value for which you will concede that, given your data and model, the null hypothesis is unlikely and that there is another explanation for your results. My personal view, and from talking to statistitions, is that they are not very usefull. If you plot your data, you can usually see if there are differences in population, and if there aren’t there probably is no difference.

If the null is true: The p-value represent to probability of getting an outcome as, or more, extreme than the data your experiment as produced.

T-statistic - is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error. The T-statistic is used in estimating the population mean from a sampling distribution of sample means of the population std is unknown. The greater the value for T, the greater the evidence against the null hypothesis.

Z-score - Gives you an idea of how far from the mean a data point is. It is (the observed sample - mean of the sample) / std of sample.

2. Probability

Odds - provide a measure of likelihood. For example rolling a six with a normal die is 1:5. It is key to remember it is not a fraction although it can be the same. rolling a 4 or 6 is 2:4 - there are two chances of getting what you want and 4 chances of not getting what you want. Written as x:y is a ratio (odds ratio)

Probability - likelihood of an event occuring. The number of ways of achieving something / the total number of possible outcomes. For a die, the chance, or likelihood, of rolling a six is 1/6.

And or Rule - The probability for getting a six AND then a five is 1/6 * 1/5. The probability of getting a six or a five is 1/6 + 1/6 or 2/6 or 1/3. Generally the formula for or is P(a or b) = P(a) + P(b), but if both are simultaneuosly possible you add P(a or b) = P(a) + P(b) - P(a and b)

Conditional probability - Based around AND OR, but has the added (with or without replacement). With a die you cannot remove the number after you have rolled it but with cards for example, the probability of getting two aces in a row (providing you’re not replacing the cards - keeping them in your hand) is 4/51 * 3/50

Combinations - How many combinations are there if you have three things of three different colours? the answer is 3! (3 factorial) 3 2 1 combinations = 6.

If there are multiple different items, say 6 sweets, 2 chocolate bars, and 3 apples, you multiply them all, so 6 * 2 * 3 = 36 different combinations.

When you’re choosing n combinations from r choices the genneral formula is. n! / (n-r)! * r! where n is the number of choices and r is the number of ways things can be selected for the above example it is.

11! / (11-2)! * 2! = 55. Seems to often be permutations /2

Permutations - If you could only pick combinations of two from the above situation where there are 11 different items. The general formula is n! / (n-r)! where n is the number of combinations and r is number of ways things can be selected from a group of n things. in this case 11! / (11-2)! = 110. Order counts in permutations, that is choosing red and yellow is different from choosing yellow then red. When order of choice doesn’t matter you use combinations.

General theoretical stuff

Monte carlo simulation - is essentially where you repeatedly sample from a population to try and understand the underlying data, bootstrapping is an example.

Programs and software

Python - Very confident with this, probably don’t need to do any prep.

R - Have used this enough to be able to navigate it, but I only really use it if I need to do slightly more obscure statistical tests.

SQL et al - May have to do some research on database structures, but I think they are all very similar. Technology wise there are two paradigms I think. SQL and NoSQL. Microsoft Access is slightly different, may be worth having a little look at this.

Other

Maybe should have something prepared about PyTorch, Keras and Tensorflow etc.

3. Data

Before collecting data

Does it help answer the question?
Will it be pheasible to collect quality data all the way through. ### Once you know what data you want
Make good documentation about how to collect the data
Archive data as if you may need to use it in a legal case one day. (Just a way of saying back stuff up)
Don’t do anything manual if you don’t have to.
Related to the point above, have formatting in excell etc. But Ideally have it automatic before the excell stage.
Have a column for each peice of data
Only collect data that is relevant
don’t put zeroes in for missing values, only when it is a legitimate option.
Minimise copies of data so you don’t confuse data sets, which leads to the next point.
have a good naming convention Descrition_of_data_day_month_year

Databases

SQL versus (not only)SQL - noSQL was developed for fast scaling, frequent changes (agile) and being simpler. SQL was eveloped in the early days of computing and focuses on reducing data duplication (as storage used to be much more expensive) and have rigid, complex schemas.

SQL -

Aphorisms and Wisdom

For statisitical models the Anna Karenina principle applies - a model has only one way in which it can be valid (all assumptions met) but there are many ways in which it can be incorrect (each one of the assumptions not met). Basically all good models (families) are good in the same way, whereas as bad models (families) are bad in many different ways.

Counterfactual - How do you factor in to a model what people will do when they see the model? If you have a model of food waste predicition, how do you factor in the

Goodheart’s law - As soon as you make a statistical metric it becomes useless, as people game the system toward that measure.

From Robert Grant’s blog - Usually where there is a clear signal from the noise humans have usually already acted on it.