Understanding Dataset Formats

Parquet Format

Recommended

Advantages

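  • Much smaller files - Columnar compression; roughly 10x smaller than JSON for the example dataset below (~15 MB vs ~150 MB)
  • Faster to read - Readers can load only the columns you need
  • Preserves data types - Column types are stored in the file, so no conversion is needed after loading
  • Better for large datasets - Handles millions of rows efficiently
  • Widely supported - Read and written by pandas, Apache Spark, and Apache Arrow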

Example

btcusdt_1d_2020-2024.parquet: about 15 MB for 1,800 daily candles with 200+ features

JSON Format

Alternative

Advantages

  • Human-readable - Easy to inspect in text editors
  • Universal support - Works in any language
  • Simple structure - No special libraries needed
  • Web-friendly - Native JavaScript support

Disadvantages

  • Much larger files (~150 MB for the same dataset)
  • Slower to parse for large datasets
  • Standard parsers must load the entire file into memory before any of it can be used
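
To make the parsing and memory points concrete, here is a small sketch using only Python's standard library. It assumes the JSON file is an array of candle objects with an ISO 8601 "timestamp" string. That layout, the field names, and the helper name load_candles are assumptions for illustration, not the documented schema.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def load_candles(path):
    """Parse a JSON array of candle objects and restore timestamp types."""
    with open(path, "r", encoding="utf-8") as handle:
        # json.load reads and parses the whole document before returning.
        candles = json.load(handle)
    for candle in candles:
        # JSON has no date type, so timestamps arrive as plain strings.
        candle["timestamp"] = datetime.fromisoformat(candle["timestamp"])
    return candles


# Hypothetical records standing in for the real dataset.
records = [
    {"timestamp": "2020-01-01T00:00:00", "close": 7200.0},
    {"timestamp": "2020-01-02T00:00:00", "close": 6985.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_1d.json"
    sample_path.write_text(json.dumps(records), encoding="utf-8")
    candles = load_candles(sample_path)

print(candles[0]["timestamp"].year)  # 2020
```

Every row is parsed and held in memory even if only one field is needed, and the timestamp conversion has to be repeated on each load.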

Which Format Should I Use?

Use Parquet if:

  • You're working with Python (pandas), Spark, or other data science tools
  • You need to analyze large datasets efficiently
  • You want to minimize storage and download time
  • You're building production trading systems

Use JSON if:

  • You need to inspect data manually in a text editor
  • You're using a language without Parquet support
  • You're working with small datasets (< 1,000 rows)
  • You need web browser compatibility