Often, when handling a large dataset and (desperately) searching for some sense in this big pile of numbers, you will find yourself looking for connections: variables which seem to change at the same time, in the same or opposite directions. Most of the time you will start by looking at large values to try to define a pattern, and later check whether this pattern applies to the rest of the dataset. Even if this is not really a sound way to find trends or connections in a dataset, you have at least started to check for correlation between two (or more) variables. But how to proceed?

Plotting the data is a good start. The overall shape of the “cloud of data points” might reveal a trend (linear, inverse…).

Here, on the left plot, we see a clear, positive linear correlation between the two variables. The plot in the middle also shows a form of linear correlation, this time negative and possibly “weaker” than in the previous case. The right plot, by contrast, does not appear to show any form of correlation between the variables.
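Three panels like the ones described above can be reproduced with synthetic data. The following is only a minimal sketch in Python: the function names `make_datasets` and `plot_datasets`, the slopes, and the noise levels are all illustrative assumptions, not the data behind the original plots. Plotting assumes that matplotlib is installed.

```python
import random


def make_datasets(n=100, seed=1):
    """Return three synthetic (x, y) pairs; all parameters are arbitrary choices."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    positive = [2.0 * xi + rng.gauss(0, 1.0) for xi in x]   # tight upward trend
    negative = [-1.0 * xi + rng.gauss(0, 3.0) for xi in x]  # noisier downward trend
    unrelated = [rng.gauss(0, 3.0) for _ in x]              # y does not depend on x
    return {
        "positive": (x, positive),
        "negative": (x, negative),
        "none": (x, unrelated),
    }


def plot_datasets(datasets):
    """Draw one scatter plot per dataset, side by side."""
    # Imported here so that the data generation above works without matplotlib.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(datasets), figsize=(12, 4), squeeze=False)
    for ax, (label, (x, y)) in zip(axes[0], datasets.items()):
        ax.scatter(x, y, s=10)
        ax.set_title(label)
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    plt.show()
```

Calling `plot_datasets(make_datasets())` should display a tight rising cloud, a more diffuse falling cloud, and a shapeless cloud, in that order.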

So, once a form of correlation has been revealed, what can we do to confirm it?

Three “tools” exist to help you come to a conclusion. All three measure the degree of relationship between two variables:

Pearson’s product-moment correlation coefficient r (parametric),
Kendall rank correlation coefficient (non-parametric),
Spearman rank correlation coefficient (non-parametric).
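
To make the three coefficients listed above more concrete, here is a minimal pure-Python sketch of each, written for illustration only. The function names (`pearson_r`, `average_ranks`, `spearman_rho`, `kendall_tau_b`) and the sample data are assumptions introduced here. The Kendall function uses the tau-b variant, which corrects for ties, and a simple pairwise loop that is fine for small samples but slow for large ones. None of the functions computes a p-value. For real analyses a statistics library is preferable, for instance `scipy.stats.pearsonr`, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` in Python.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, comparing every pair of observations (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both variables: ignored
            elif dx == 0:
                ties_x += 1       # tied in x only
            elif dy == 0:
                ties_y += 1       # tied in y only
            elif dx * dy > 0:
                concordant += 1   # both variables move in the same direction
            else:
                discordant += 1   # the variables move in opposite directions
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))


# Hypothetical sample: mostly increasing, with two neighbouring pairs swapped.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson_r(x, y), spearman_rho(x, y), kendall_tau_b(x, y))  # 0.8 0.8 0.6
```

All three values lie between -1 and +1. On this sample the two rank-based coefficients differ (0.8 against 0.6) because Kendall's coefficient counts concordant and discordant pairs, whereas Spearman's works on rank differences.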