TradeXil

Parquet is a compressed, binary, columnar format: compact and fast to query, but you cannot open it in a text editor. This guide shows several ways to view and analyze Parquet data with Python.

Quick Peek - View First Rows

Quickly inspect the structure and first few rows of your dataset

import pandas as pd

# Read Parquet file
df = pd.read_parquet('btcusdt_1d_2020-2024.parquet')

# Display basic info
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")
print(f"Date Range: {df['datetime'].min()} to {df['datetime'].max()}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

# Show first 5 rows
print("\n" + "="*60)
print("FIRST 5 ROWS")
print("="*60)
print(df.head())

# Show basic statistics
print("\n" + "="*60)
print("PRICE STATISTICS")
print("="*60)
print(df[['open', 'high', 'low', 'close', 'volume']].describe())

Detailed Analysis - Explore All Features

Comprehensive analysis of all 200+ features in the dataset

import pandas as pd

# Read Parquet file
df = pd.read_parquet('btcusdt_1d_2020-2024.parquet')

print("="*80)
print("COMPLETE DATASET ANALYSIS")
print("="*80)

# 1. Column Information
print("\nCOLUMN INFORMATION")
print("-"*80)
print(f"Total Columns: {len(df.columns)}")
print("\nColumn Names:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i:3d}. {col}")

# 2. Data Types
print("\nDATA TYPES")
print("-"*80)
print(df.dtypes.value_counts())

# 3. Missing Values Analysis
print("\nMISSING VALUES ANALYSIS")
print("-"*80)
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing': missing.values,
    'Percentage': missing_pct.values
})
missing_df = missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)

if len(missing_df) > 0:
    print(f"Columns with missing values: {len(missing_df)}")
    print("\nTop 10 columns with most nulls:")
    print(missing_df.head(10).to_string(index=False))
    print("\nNote: Nulls in first ~200 rows are normal (lookback period for indicators)")
else:
    print("No missing values found!")

# 4. Sample Data
print("\n" + "="*80)
print("SAMPLE DATA (First 3 rows)")
print("="*80)
print(df.head(3).T)  # Transpose for better readability

print("\nAnalysis complete!")
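
To check that missing values really are confined to the indicator warm-up period, you can look at where each column's first non-null value sits and count any nulls after that point. This is a minimal sketch on synthetic data; SMA_20 and SMA_200 are illustrative column names computed here with a rolling mean, and may not match the names in your file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 300 daily closes plus two rolling indicators.
# SMA_20 and SMA_200 are illustrative names, not necessarily the dataset's.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "datetime": pd.date_range("2020-01-01", periods=300, freq="D"),
    "close": 10_000 + rng.normal(0, 100, 300).cumsum(),
})
df["SMA_20"] = df["close"].rolling(20).mean()
df["SMA_200"] = df["close"].rolling(200).mean()

# Index label of the first non-null value in each column.
# With the default RangeIndex this equals the number of leading nulls.
first_valid = df.apply(pd.Series.first_valid_index)
print(first_valid)

# Nulls that occur after a column's first valid value are not warm-up gaps.
unexpected = {
    col: int(df[col].loc[start:].isna().sum())
    for col, start in first_valid.items()
}
print(unexpected)
```

A non-zero count in unexpected points to gaps in the middle of the series, which are worth investigating before you use that column.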

Filter & Search - View Specific Columns

Load only the columns you need for faster performance

import pandas as pd

# Efficient: Load only specific columns
columns_to_load = [
    'datetime', 'close', 
    'RSI_14', 'MACD_12_26_9_line', 
    'BB_20_2_upper', 'BB_20_2_lower',
    'VWAP', 'volume'
]

df = pd.read_parquet(
    'btcusdt_1d_2020-2024.parquet',
    columns=columns_to_load
)

print(f"Loaded {len(df)} rows with {len(df.columns)} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")
print("\nData Preview:")
print(df.head(10))

# Filter by date (everything from 2024-01-01 onward)
print("\nFILTER BY DATE")
df_2024 = df[df['datetime'] >= '2024-01-01']
print(f"Rows from 2024-01-01 onward: {len(df_2024)}")

# Filter by condition
print("\nFILTER BY CONDITION")
overbought = df[df['RSI_14'] > 70]
print(f"Overbought periods (RSI > 70): {len(overbought)}")
print(overbought[['datetime', 'close', 'RSI_14']].head())

Save Script to File

Create a Python file to run the viewing script

# WINDOWS:
# 1. Open Notepad or any text editor
# 2. Copy the code from above
# 3. Save as: view_data.py
# 4. Open Command Prompt in same folder
# 5. Run: python view_data.py

# LINUX/MAC:
# 1. Open terminal
# 2. Create file: nano view_data.py
# 3. Paste the code, press Ctrl+X, then Y, then Enter to save
# 4. Run: python3 view_data.py

# Alternative - Run directly from terminal:
# python -c "import pandas as pd; df=pd.read_parquet('file.parquet'); print(df.head())"

Pro Tips

Missing Values: The first ~200 rows may have nulls for indicators requiring lookback periods (e.g., SMA_200 needs 200 candles). This is normal and expected.
Memory Efficiency: Use columns parameter when reading Parquet to load only what you need.
Compression: 'snappy' (the pandas default) favors fast reads and writes with moderate file size. 'zstd' or 'gzip' usually produce smaller files at some extra CPU cost.
Data Types: Parquet preserves exact data types (float32, int64, etc.), unlike JSON which may require conversion.
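Tools: Use pandas, polars, or duckdb to work with Parquet files in Python. pandas needs a Parquet engine installed, either pyarrow or fastparquet (for example: pip install pandas pyarrow).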
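Windows Users: Use the python command. Linux/Mac Users: Use python3.
File Paths: Windows accepts backslashes \ or forward slashes /; Linux/Mac uses forward slashes / only. In Python code, write Windows paths as raw strings (e.g. r'C:\data\file.parquet') or use forward slashes.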

Check Python Installation

Verify Python is installed and check version

# WINDOWS (Command Prompt or PowerShell):
python --version
pip --version

# LINUX/MAC (Terminal):
python3 --version
pip3 --version

# If not installed:
# Windows: Download from python.org
# Linux (Debian/Ubuntu): sudo apt install python3 python3-pip
# Mac: brew install python3
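
Once Python itself works, it can help to confirm that the packages the scripts rely on are importable. This is a small sketch using only the standard library; check_environment is a hypothetical helper name, and the package list is an assumption based on the tools mentioned in this guide.

```python
import importlib.util
import sys


def check_environment(packages=("pandas", "pyarrow", "fastparquet")):
    """Return {package_name: bool} for whether each package can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


print(f"Python {sys.version.split()[0]}")
status = check_environment()
for name, found in status.items():
    print(f"  {name}: {'installed' if found else 'missing'}")

# pandas needs at least one Parquet engine to read or write .parquet files.
if not (status["pyarrow"] or status["fastparquet"]):
    print("No Parquet engine found. Try: pip install pyarrow")
```

Only one of pyarrow or fastparquet needs to be present, so a single "missing" line for either of them is not a problem.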