Following Tino Tereshko's advice (he is Big Data Lead at Google Cloud Office of CTO), I added the metrics for BigQuery standard SQL and re-calculated the data loading time (from Google Cloud Storage to BigQuery) following their recent optimizations. With Joe Harris' help (he is a Redshift Database Engineer at AWS), I measured the performance of an optimized schema on both dc1.large and ds2.xlarge Redshift instances. The benchmark below has been updated with these new tests.

6/22/17 update

Almost 3,000 people read the article and I have received a lot of feedback. This is the first update of the article and I will try to update it further later.

I converted the CSV format to Parquet and re-tested Athena, which did give much better results, as expected (thanks to Rahul Pathak, Alex Casalboni, openasocket, Robert Synnott, the Amazon Redshift team with Joe Harris, Jenny Chen and Maor Kleider, and the Amazon Athena/EMR team with Abhishek Sinha).

Using Redshift admin tables, I was able to add the data scanned per query for Redshift (thanks, rockostrich).

I added a links section with useful articles.

Finally, a few people reached out asking for the dataset so they could load it and benchmark the performance on other databases. I will definitely share a link to their article if they publish one!

This article is a basic comparison of data loading and simple queries between Google BigQuery and Amazon Redshift and its cousin Athena.

For this test we will be loading a CSV/Parquet file which is basically an enlarged version of the STAR Experiment star2002 dataset. All columns are either integers, double precision or floats. To calculate load time, I initially sent the files to both Amazon S3 and Google Cloud Storage, then loaded them into each datastore.
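To make the load-time measurement concrete (stage the files in object storage, then load them into each datastore), here is a minimal sketch of how the two steps could be timed separately. It is not the script used for this benchmark: `time_step`, `benchmark_load`, `fake_upload` and `fake_load` are hypothetical names, and the two placeholder functions stand in for whatever upload and load commands you actually run.

```python
import time
from typing import Callable, Dict


def time_step(step: Callable[[], None]) -> float:
    """Run one step and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    step()
    return time.perf_counter() - start


def benchmark_load(upload: Callable[[], None],
                   load: Callable[[], None]) -> Dict[str, float]:
    """Time the staging upload and the datastore load as separate steps."""
    upload_s = time_step(upload)
    load_s = time_step(load)
    return {"upload_s": upload_s, "load_s": load_s,
            "total_s": upload_s + load_s}


# Hypothetical placeholders (assumptions, not real client calls):
# swap in the actual upload to S3 / Cloud Storage and the actual
# load into Redshift / BigQuery.
def fake_upload() -> None:
    time.sleep(0.01)


def fake_load() -> None:
    time.sleep(0.02)


if __name__ == "__main__":
    timings = benchmark_load(fake_upload, fake_load)
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f}")
```

Keeping the upload and the load as separate timings makes it easier to compare datastores on the load step alone, since the upload depends mostly on your own network.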