Currently developing: **Harvest** integration

# The Science Behind Crow's Nest's Burn Rate Types

In the previous post I presented Crow’s Nest’s central idea: project burn rates aren’t constant and can be divided into certain categories. That certainly sounds plausible, but you don’t have to take my word for it; there’s actual evidence.

### Preprocessing Harvest Time Entries

So I put on my data scientist’s hat and started an investigation. I exported my Harvest data from 2018 till today and fired up Jupyter Lab. Using the pandas library, time entries can easily be grouped by project:

```
import pandas as pd

time_entries = pd.read_csv("data/harvest_time_report_from2018-01-01to2020-09-30.csv", decimal=",", index_col=0, parse_dates=True)
only_billable = time_entries[time_entries['Billable?'] == 'Yes']
grouped_by_project = only_billable.groupby("Project")
```

Resampling the data into an equal number of bins and plotting it led to the conclusion that in some long-running projects (mostly retainers that I don’t close), this grouping wasn’t as informative as I needed it to be.

```
import math
from datetime import timedelta

import matplotlib.pyplot as plt

length = 8  # number of equal-width bins per project
fig, axes = plt.subplots(2, 4)  # assuming a 2x4 grid of project plots
for index, (name, group) in enumerate(grouped_by_project):
    # bin each project's entries into `length` equal-width spans of days
    timeSpan = group.index[-1] - group.index[0] + timedelta(days=1)
    rule = math.ceil(timeSpan.days / length)
    group.resample(str(rule) + "D", closed='right')['Hours'].sum().plot.bar(ax=axes[math.floor(index / 4), index % 4])
```

The problem, as you can observe below, is that long-running projects are bound to contain gaps, sometimes of several weeks, between consecutive workloads:

### Split Projects Into Workloads

So we need to split up our data into singular workloads, a sub-grouping of projects. To do this, I first resampled the data to business days (`'B'`). Next we `mask` the data-frame rows that contain no recorded hours **and** are preceded or followed by another empty row. Those hours entries become `NaN` and can later be dropped, serving as split boundaries:

```
# `filtered_grouped` is the grouped billable data prepared above
hours_each_business_day = filtered_grouped.resample('B').sum()
# mask zero-hour days that border another zero-hour day
masked_hours_each_business_day = hours_each_business_day.mask(
    (hours_each_business_day['Hours'] == 0)
    & ((hours_each_business_day['Hours'].shift(1) == 0)
       | (hours_each_business_day['Hours'].shift(-1) == 0)))
# Hours
# Date
# 2018-07-20 2.51
# 2018-07-23 1.96
# 2018-07-24 3.97
# 2018-07-25 0.00
# 2018-07-26 0.93
# 2018-07-27 1.57
# 2018-07-30 NaN
# 2018-07-31 NaN
# 2018-08-01 NaN
# 2018-08-02 NaN
# 2018-08-03 4.94
# 2018-08-06 3.88
# 2018-08-07 4.88
# 2018-08-08 1.28
```

We can now split at those breaks and drop workloads that are empty or too short:

```
workloads = []
for name, hours_by_project in masked_hours_each_business_day.groupby('Project'):
    breaks = hours_by_project.isnull().all(axis=1)
    workloads_by_project = [group.dropna(how='all') for _, group in hours_by_project.groupby(breaks.cumsum())]
    # remove empty workloads and too short ones
    non_empty_workloads_by_project = [workload for workload in workloads_by_project if not (workload.empty or len(workload.index) < 4)]
    workloads += non_empty_workloads_by_project
```

Resampling to equal-width time bins again gives us a hint at how typical workloads are distributed:
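This per-workload binning can be sketched as a small helper (a sketch: the helper name and the default of eight bins are my assumptions, mirroring the per-project resampling above):

```
import math
from datetime import timedelta

import pandas as pd

def resample_to_bins(workload, length=8):
    """Bin a workload's hours into roughly `length` equal-width day bins."""
    # same idea as the per-project resampling: span in days divided by bin count
    span = workload.index[-1] - workload.index[0] + timedelta(days=1)
    rule = math.ceil(span.days / length)
    return workload.resample(str(rule) + "D", closed='right')['Hours'].sum()
```

Applied to every entry in `workloads`, this yields fixed-length hour curves that can be compared across projects regardless of their duration.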

### Clustering Workloads

To arrive at a meaningful conclusion, we have to cluster our workloads and look for burn rate types. I chose the DBSCAN clustering algorithm because it’s relatively insensitive to outliers (i.e., signal noise) and determines the number of clusters on its own, without supervision. Here are some typical curves from the four resulting burn rate categories:

Observe that the four clusters are not only distinctly different; they also display two types of **hill** burn rates (clusters 1 and 2), an **oscillating** one (cluster 0), and a **valley** type (cluster 3).
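The clustering step itself can be sketched as follows; the `eps` and `min_samples` values and the synthetic example curves are illustrative assumptions, not the actual parameters or data behind the plots above:

```
import numpy as np
from sklearn.cluster import DBSCAN

def to_feature_vector(binned_hours):
    """Normalize a binned workload so curves are compared by shape, not total hours."""
    v = np.asarray(binned_hours, dtype=float)
    return v / v.sum()

# illustrative binned workloads: two hill-shaped and two valley-shaped curves
binned_workloads = [
    [1, 3, 6, 8, 8, 6, 3, 1],  # hill
    [1, 4, 6, 9, 7, 6, 2, 1],  # hill
    [8, 5, 2, 1, 1, 2, 5, 8],  # valley
    [9, 4, 3, 1, 1, 3, 5, 7],  # valley
]
features = np.array([to_feature_vector(w) for w in binned_workloads])
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(features)
# points labelled -1 would be outliers; the remaining labels are the clusters
```

Normalizing each curve to unit sum lets DBSCAN group workloads by the shape of their burn rate rather than by total hours.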

Certainly the source data here is biased, since it was produced by a single person (me), and the data preparation and processing techniques can be refined further. What we can take as certain, though, is that we deal with different burn rate types in our day-to-day freelance work, and that there is statistical evidence to corroborate this.