My Life Stats: I Tracked My Habits for a Year, and This Is What I Learned | by Pau Blasco i Roca

I first looked at the individual time series for four variables: Sleep, Studying, Socializing and Mood. I used Microsoft Excel to quickly draw some plots. They show the daily number of hours spent (blue) and the five-day moving average¹, MA(5) (red), which I considered a good measure for my situation. The mood variable was rated from 0 (awful!) to 10 (the greatest!).
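As a rough illustration of what MA(5) means, here is a minimal pandas sketch. The numbers are made up, and I am assuming a trailing window (the current day plus the four before it); the Excel plots may have used a different alignment.

```python
import pandas as pd

# Hypothetical daily hours of sleep, not the real data set.
hours = pd.Series([7.0, 6.5, 8.0, 7.5, 6.0, 7.0, 9.0])

# Trailing 5-day moving average: each value is the mean of that day
# and the four days before it, so the first four entries are NaN.
ma5 = hours.rolling(window=5).mean()

print(ma5.tolist())
```

The smoothed series starts on the fifth day, which is why the red line in a plot like this begins slightly later than the blue one.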

Regarding the data contained in the footnote of each plot: the total is the sum of the values of the series, the mean is the arithmetic mean of the series, the STD is the standard deviation and the relative deviation is the STD divided by the mean.
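These four footnote values are easy to reproduce. The sketch below uses made-up numbers and assumes the sample standard deviation (the pandas default, ddof=1); whether the original plots used the sample or the population version is not stated.

```python
import pandas as pd

# Hypothetical daily hours, not the real data set.
hours = pd.Series([7.0, 6.5, 8.0, 7.5, 6.0])

total = hours.sum()                # sum of the values of the series
mean = hours.mean()                # arithmetic mean
std = hours.std()                  # sample standard deviation (ddof=1)
relative_deviation = std / mean    # STD divided by the mean

print(f"Total: {total:.1f}h. Mean: {mean:.1f}h. STD: {std:.1f}h. "
      f"Relative deviation: {relative_deviation:.1%}")
```

The same arithmetic can be checked against the sleep figures reported with the first plot: 1.1 / 7.1 is roughly 15.5%.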

Sleep plot. Total: 2361h. Mean: 7.1h. STD: 1.1h. Relative deviation: 15.5% (image by author).

All things accounted for, I did well enough with sleep. I had rough days, like everyone else, but I think the trend is pretty stable. In fact, it is one of the least-varying of my study.

These are the hours I dedicated to my academic career. It fluctuates a lot — finding balance between work and studying often means having to cram projects on the weekends — but still, I consider myself satisfied with it.

Regarding this table, all I can say is that I’m surprised. The grand total is greater than I expected, given that I’m an introvert. Of course, hours with my colleagues at college also count. In terms of variability, the STD is really high, which makes sense given how difficult it is to keep an established routine for socializing.

This is the least variable series: the relative deviation is the lowest among my studied variables. A priori, I’m satisfied with the observed trend. I think it’s positive to keep a fairly stable mood, and even better if it’s a good one.

After looking at the trends for the main variables, I decided to dive deeper and study the potential correlations² between them. Since my goal was being able to mathematically model and predict (or at least explain) “Mood”, correlations were an important metric to consider. From them, I could extract relationships like the following: “the days that I study the most are the ones that I sleep the least”, “I usually study languages and music together”, etc.

Before we do anything else, let’s open up a Python file and import some key libraries for series analysis. I normally use aliases for them, as it is common practice and makes the actual code less verbose.

import pandas as pd               #1.4.4
import numpy as np                #1.22.4
import seaborn as sns             #0.12.0
import matplotlib.pyplot as plt   #3.5.2
from pmdarima import arima        #2.0.4

We will carry out two different correlation studies. We will look into the Pearson Correlation Coefficient³ (for linear relationships between variables) and the Spearman Correlation Coefficient⁴ (which studies monotonic relationships between variables). We will be using their implementation⁵ in pandas.

Pearson Correlation matrix

The Pearson Correlation Coefficient between two variables X and Y is computed as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where cov is the covariance, $\sigma_X$ is std(X) and $\sigma_Y$ is std(Y).
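To make the definition concrete, here is a small sketch that computes the coefficient by hand (covariance divided by the product of the standard deviations) and compares it with what pandas returns. The data and the column names `sleep` and `study` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real measurements.
df = pd.DataFrame({
    "sleep": [7.0, 6.5, 8.0, 7.5, 6.0, 7.0],
    "study": [3.0, 4.0, 1.5, 2.0, 5.0, 3.5],
})
x, y = df["sleep"], df["study"]

# Sample covariance, then divide by the product of the sample STDs.
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
rho_manual = cov_xy / (x.std() * y.std())

# The same value as computed by pandas.
rho_pandas = df.corr(method="pearson").loc["sleep", "study"]

print(rho_manual, rho_pandas)
```

Both numbers agree, and in this toy example the coefficient is negative: the days with more study are the days with less sleep.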

We can quickly calculate a correlation matrix, where every possible pairwise correlation is computed.

#read the data and keep only the numeric columns
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')
#compute the correlation matrix
corr = numerics.corr(method='pearson')
#generate the heatmap
sns.heatmap(corr, annot=True)
#draw the plot
plt.show()

This is the raw Pearson Correlation matrix obtained from my data.

And these are the significant values⁶, the ones that are, with 95% confidence, different from zero. We perform a t-test⁷ with the following formula. For each correlation value rho, we discard it if:

$$|\rho| < \frac{2}{\sqrt{n}}$$

where n is the sample size. We can recycle the code from before and add in this filter.

#constants
N = 332  #number of samples
STEST = 2/np.sqrt(N)  #cutoff for |rho|

def significance_pearson(val):
    #returns True when the value is NOT significant, so that it gets masked
    return np.abs(val) < STEST

#read data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')
#calculate correlation
corr = numerics.corr(method='pearson')
#prepare masks
mask = corr.copy().applymap(significance_pearson)
mask2 = np.triu(np.ones_like(corr, dtype=bool)) #remove upper triangle
mask_comb = np.logical_or(mask, mask2)
#plot the results
c = sns.heatmap(corr, annot=True, mask=mask_comb)
c.set_xticklabels(c.get_xticklabels(), rotation=-45)
plt.show()

Those that have been discarded could just be noise, and wrongly represent trends or relationships. In any case, it’s better to treat a true relationship as meaningless than to treat as meaningful one that isn’t (that is, we favor type II errors over type I errors). This is especially true in a study with rather subjective measurements.
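For a sense of scale, the cutoff can be computed directly for N = 332. The sketch below also compares it with the correlation at which the t-statistic t = rho * sqrt((N-2)/(1-rho²)) reaches 1.96. Treating 2/sqrt(N) as a rule-of-thumb approximation of that t-test is my reading, not something stated explicitly here.

```python
import numpy as np

N = 332          # sample size used in this study
T_CRIT = 1.96    # two-sided 95% critical value (normal approximation)

# Rule-of-thumb cutoff used in the Pearson filter.
cutoff = 2 / np.sqrt(N)

# Solving t = r * sqrt((N-2)/(1-r^2)) for r at t = T_CRIT gives
# r = T_CRIT / sqrt(N - 2 + T_CRIT^2).
r_from_t = T_CRIT / np.sqrt(N - 2 + T_CRIT**2)

print(round(cutoff, 4), round(r_from_t, 4))
```

Both cutoffs land close to 0.11, so with this sample size any correlation weaker than about ±0.11 is treated as indistinguishable from zero.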

Spearman’s rank correlation coefficient

The Spearman correlation coefficient can be calculated as follows:
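One standard way to write it, given here as an assumption about notation rather than as the exact form used in this study, is as the Pearson coefficient of the rank variables R(X) and R(Y). When all ranks are distinct, it reduces to the well-known shortcut based on the rank differences d_i = R(X_i) - R(Y_i):

```latex
r_s = \rho_{R(X),R(Y)}
    = \frac{\operatorname{cov}\big(R(X), R(Y)\big)}{\sigma_{R(X)} \, \sigma_{R(Y)}},
\qquad
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
\quad \text{(no ties)}
```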

As we did before, we can quickly compute the correlation matrix:

#read the data and keep only the numeric columns
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')
#compute the correlation matrix
corr = numerics.corr(method='spearman') #pay attention to this change!
#generate the heatmap
sns.heatmap(corr, annot=True)
#draw the plot
plt.show()

This is the raw Spearman’s Rank Correlation matrix obtained from my data:

Let’s see what values are actually significant. The formula to check for significance is the following:

$$t = \rho \sqrt{\frac{n - 2}{1 - \rho^{2}}}$$

Here, we will filter out all t-values lower (in absolute value) than 1.96. Again, the reason they are discarded is that we are not sure whether they are noise (random chance) or an actual trend. Let’s code it up:

#constants
N = 332  #number of samples
TTEST = 1.96  #critical t-value for 95% confidence

def significance_spearman(val):
    #returns True when the value should be masked (hidden) in the heatmap
    if val == 1:
        return True  #diagonal entries; also avoids dividing by zero below
    t = val * np.sqrt((N-2)/(1-val*val))
    return np.abs(t) < TTEST

#read data
raw = pd.read_csv("final_stats.csv", sep=";")
numerics = raw.select_dtypes('number')
#calculate correlation
corr = numerics.corr(method='spearman')
#prepare masks
mask = corr.copy().applymap(significance_spearman)
mask2 = np.triu(np.ones_like(corr, dtype=bool)) #remove upper triangle
mask_comb = np.logical_or(mask, mask2)
#plot the results
c = sns.heatmap(corr, annot=True, mask=mask_comb)
c.set_xticklabels(c.get_xticklabels(), rotation=-45)
plt.show()

These are the significant values.
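To get a feel for where the 1.96 cutoff bites with N = 332, here is a small sketch that applies the same t formula to a few correlation values. The values themselves are hypothetical, chosen only to sit on either side of the cutoff.

```python
import numpy as np

N = 332          # number of samples
T_CRIT = 1.96    # cutoff for |t|

def t_statistic(r, n=N):
    """t-value for a correlation r observed over n samples."""
    return r * np.sqrt((n - 2) / (1 - r * r))

# Hypothetical correlation values, not taken from the real matrix.
for r in (0.05, 0.10, 0.11, 0.30):
    t = t_statistic(r)
    kept = abs(t) >= T_CRIT
    print(f"r={r:.2f}  t={t:.2f}  kept={kept}")
```

A coefficient of 0.10 falls just short of the cutoff and is masked, while 0.11 passes, so the filter removes roughly the same weak correlations as the Pearson filter did.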

I believe this chart better explains the apparent relationships between variables, as its criterion is more “natural” (it considers monotonic⁹, and not only linear, functions and relationships). It’s not as impacted by outliers as the other one (a couple of very bad days related to a certain variable won’t impact the overall correlation coefficient).
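The difference between "monotonic" and "linear" is easy to see on synthetic data. In this sketch (made-up values, not part of the study) y grows exponentially with x: the relationship is perfectly monotonic, so Spearman reports 1, while Pearson comes out clearly lower because the relationship is far from a straight line.

```python
import numpy as np
import pandas as pd

# Synthetic example: y always increases with x, but not linearly.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")

# Spearman is the Pearson correlation of the ranks.
spearman_via_ranks = x.rank().corr(y.rank(), method="pearson")

print(pearson, spearman, spearman_via_ranks)
```

Because only the ordering of the values matters, stretching or compressing one variable does not change the Spearman coefficient as long as the order is preserved.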

Still, I will leave both charts for the reader to judge and extract their own conclusions.

My Life Stats: I Tracked My Habits for a Year, and This Is What I Learned | by Pau Blasco i Roca | Nov, 2023