How big is the data used here?

Data size

Short answer:

  • ~1 TB of data,
  • ~20 billion rows

are used in the blog exercises.

The data were resampled, replicated 10 times, and given altered date-time stamps. As a result, the data set is big enough for the exercises, and any match with the real data is highly unlikely.
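The preparation steps described above can be illustrated with a minimal Python sketch. It is not the code actually used for the blog data: the row layout (a timestamp plus a value), the function name `inflate_rows`, resampling with replacement, and the one-year shift window are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def inflate_rows(rows, copies=10, max_shift_days=365, seed=0):
    """Resample, replicate, and time-shift rows (hypothetical helper).

    rows: list of (timestamp, value) tuples; this schema is an assumption.
    """
    rng = random.Random(seed)
    max_shift_seconds = max_shift_days * 86400
    out = []
    for _ in range(copies):
        # Resample with replacement, so each copy differs from the original.
        sample = rng.choices(rows, k=len(rows))
        for ts, value in sample:
            # Move every timestamp by a random offset within the window.
            offset = rng.randint(-max_shift_seconds, max_shift_seconds)
            out.append((ts + timedelta(seconds=offset), value))
    return out


if __name__ == "__main__":
    original = [(datetime(2020, 1, 1, 12, 0), i) for i in range(5)]
    bigger = inflate_rows(original)
    print(len(original), "rows in,", len(bigger), "rows out")
```

Because every copy is drawn at random and every timestamp is moved independently, an output row is unlikely to coincide with a row of the real data, which is the property the paragraph above relies on.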

To reduce cost, the ~1 TB of raw CSV files was compressed to ~250 GB using the Linux gzip utility.
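For readers who prefer to do the same step from Python rather than from the shell, here is a minimal sketch using the standard `gzip` module. The helper name `gzip_csv` and the choice to write the archive next to the source file are assumptions; the blog data themselves were compressed with the command-line utility.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv(src, keep_original=True):
    """Compress one file to <name>.gz beside it (hypothetical helper)."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream in chunks so files larger than memory can be handled.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if not keep_original:
        src.unlink()
    return dst
```

Going from ~1 TB to ~250 GB corresponds to a compression ratio of roughly 4:1.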

Data Sources

The data come primarily from Kaggle competitions:

Following Kaggle policy, the data will not be used to retrieve any personal identifiers.