Overview
- Data warehouse
- Storage plus analytics
- SQL queries
- Serverless
- Multi-regional
- SQL column store
- Terabytes to petabytes
- BigQuery ML, geospatial analysis and BI
- Public datasets available
- Real-time analytics
- Automatic replication, 7 day storage of changes
- Cloud Monitoring—e.g. number of jobs running, bytes scanned during a query, distribution of query times
- Encrypted at rest
- Built in machine learning
- Write ML models directly in BigQuery using SQL
Pricing
- Two options:
- Pay by amount of data queries process
- BigQuery Slots
- Up-front purchase of processing capacity
- Flat-rate pricing
- Useful for CapEx optimization model
Data Ingestion
- Data sources:
- Internal data
- External data
- Multi-cloud data—AWS, Azure
- Public datasets
- Replicated
- Backed up
- Autoscaled
External Data Sources
- Query data stored in other locations, e.g.
- No need to ingest into BigQuery
- Note: inconsistencies could form from saving and processing data separately
- Consider using Dataflow to build streaming pipeline
Query Jobs
Interactive Query Jobs
- Default
- Query run as soon as possible
- Count towards concurrent rate limit and daily limit
Batch Query Jobs
- Queued by BigQuery
- Query started as soon as idle resources available in shared resource pool
- If not started within 24 hours—automatically changed to interactive
- Don’t count towards concurrent rate limit
Authorized Views
- Grant specific users access to subsets of data via authorized views
- Source dataset contains the source data
- Create separate dataset to container authorized view
- Users granted access to authorized view, but not underlying source dataset
- Authorized view granted access to the source dataset
- Apply to columns
- Define access to data when using:
- Column-level access control
- Dynamic data masking, e.g. PII, financial data, customer order history
References