Introduction
Priocept consultants have recently been participating in the Google Cloud Platform beta certification exams. We have been working with Google Cloud Platform for many years – since the original launch of Google App Engine – but the new certification scheme allows us to formalise our consultants’ expertise on the platform.
The “beta” nature of the exams means that our consultants have acted as Google guinea pigs to some degree. Very little study material and few practice questions are available at the moment for the certification exams, so you have to rely on prior practical experience and reading the core documentation. This blog article is therefore intended to give an overview of the content for the Certified Data Engineer exam, as taken by our consultants in January 2017.
The Data Engineer certification covers a wide range of subjects including Google Cloud Platform data storage, analytical, machine learning, and data processing products. Below we have given an overview, product-by-product, of what we were subjected to in the exam.
Cloud Storage and Cloud Datastore
Surprisingly, these products are not covered much in the exam, perhaps because they are covered more extensively in the Cloud Architect exam. Just know the basic concepts of each product and when it is appropriate (or not appropriate) to use each product, and you should be fine.
Cloud SQL
There were surprisingly few questions on this product in the exam. If you have practical experience using the product, you should be fine to answer any questions that do come up. As with questions related to other data storage products, be sure to know in what scenarios it is appropriate to use Cloud SQL and when it would be more appropriate to use Datastore, BigQuery, Bigtable, etc.
Bigtable
This product is covered quite extensively in the exam. You should at least know the basic concepts of the product, such as how to design an appropriate schema, how to define a suitable row key, whether Bigtable supports transactions and ACID operations, and you should also know (at least approximately) what the size limits for Bigtable are (cell and row size, maximum number of tables, etc).
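To make the row key point concrete, here is a minimal sketch of one commonly documented pattern: prefixing the key with an entity identifier and appending a reversed timestamp, so that a single entity's rows sit together and the newest row sorts first. It is plain Python and does not call the Bigtable client library. The function name `make_row_key`, the `#` separator, the device id and the `MAX_TS_MILLIS` bound are all our own illustrative assumptions, not anything prescribed by the exam or by Google.

```python
# Bigtable stores rows sorted lexicographically by row key, so the key layout
# decides which rows are neighbours and where writes land.
# Everything named here is hypothetical and for illustration only.

MAX_TS_MILLIS = 10**13 - 1  # assumed upper bound for millisecond timestamps


def make_row_key(device_id: str, event_millis: int) -> str:
    """Build a key of the form '<device_id>#<reversed timestamp>'.

    Leading with the device id spreads writes across devices rather than
    funnelling them all to the "latest timestamp" end of the table.
    Reversing the timestamp makes the newest event for a device sort first.
    """
    reversed_ts = MAX_TS_MILLIS - event_millis
    # Zero-pad so that string ordering matches numeric ordering.
    return f"{device_id}#{reversed_ts:013d}"


timestamps = (1483228800000, 1483232400000, 1483236000000)
keys = [make_row_key("sensor-42", t) for t in timestamps]

# Sorting the keys as strings mimics the order in which rows would be stored.
for key in sorted(keys):
    print(key)
```

With this layout, a scan over the prefix `sensor-42#` would return the most recent reading first, which suits "latest value" lookups. Whether that is the right trade-off depends entirely on the queries you need to serve.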
BigQuery
Lots of questions on BigQuery in the Data Engineer exam, as expected. You should know about the basic capabilities of BigQuery and what kind of problem domains it is suitable for. You should also know about BigQuery security and the level at which security can be applied (project and dataset level, but not table or view level). Partitioned tables, table wildcard queries (“backtick” syntax), streaming inserts, query planning and data skew are also covered. You should also have an understanding of the methods available to connect external systems or tools to BigQuery for analytics purposes, how the BigQuery billing model works, and who gets billed when queries cross project and billing account boundaries.
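As an illustration of the wildcard table syntax mentioned above, the sketch below builds a standard SQL query that scans a range of date-sharded tables, using the backtick-quoted wildcard and the `_TABLE_SUFFIX` pseudo column. It only constructs the query text and does not call BigQuery. The project, dataset and table prefix (`my-project`, `analytics`, `events_`) and the helper name `wildcard_query` are made up for the example.

```python
def wildcard_query(project: str, dataset: str, prefix: str,
                   start: str, end: str) -> str:
    """Return a standard SQL query over tables named <prefix>YYYYMMDD.

    The values are interpolated directly into the string for readability.
    This is a sketch only; real code should validate its inputs and use
    query parameters where the API allows them.
    """
    return (
        "SELECT COUNT(*) AS row_count\n"
        f"FROM `{project}.{dataset}.{prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'"
    )


# Hypothetical names: one table per day, e.g. events_20170101.
query = wildcard_query("my-project", "analytics", "events_",
                       "20170101", "20170131")
print(query)
```

Restricting `_TABLE_SUFFIX` limits which tables are scanned, which matters for cost because on-demand queries are billed according to the amount of data they read.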
Pub/Sub
The exam contains lots of questions on this product, but all reasonably high level so it’s just important to know the basic concepts (topics, subscriptions, push and pull delivery flows, etc). Most importantly you should know when it is appropriate to introduce Pub/Sub as a messaging layer in an architecture, for a given set of requirements.
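The relationship between topics and subscriptions is easier to remember with a toy model. The sketch below is an in-memory imitation written for this article, not the Pub/Sub client library. The class and method names (`Topic`, `Subscription`, `pull`, `ack`, `nack`) are our own and only loosely mirror the real service. It shows two ideas: every subscription receives its own copy of each published message, and a pulled message that is not acknowledged becomes available again.

```python
from collections import deque


class Subscription:
    """Toy pull subscription: holds its own copy of each message."""

    def __init__(self, name):
        self.name = name
        self._queue = deque()   # messages waiting to be pulled
        self._unacked = {}      # ack_id -> message pulled but not yet acked
        self._next_id = 0

    def _deliver(self, data):
        self._queue.append(data)

    def pull(self):
        """Return (ack_id, data), or None if nothing is waiting."""
        if not self._queue:
            return None
        data = self._queue.popleft()
        ack_id = self._next_id
        self._next_id += 1
        self._unacked[ack_id] = data
        return ack_id, data

    def ack(self, ack_id):
        """Acknowledge a message so it is not delivered again."""
        self._unacked.pop(ack_id, None)

    def nack(self, ack_id):
        """Give a message back so that it can be pulled again."""
        if ack_id in self._unacked:
            self._queue.append(self._unacked.pop(ack_id))


class Topic:
    """Toy topic: fans each published message out to every subscription."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = []

    def subscribe(self, name):
        sub = Subscription(name)
        self.subscriptions.append(sub)
        return sub

    def publish(self, data):
        for sub in self.subscriptions:
            sub._deliver(data)


topic = Topic("orders")
billing = topic.subscribe("billing")
analytics = topic.subscribe("analytics")
topic.publish("order-1")

first = billing.pull()   # billing receives its copy of the message
billing.nack(first[0])   # processing failed, so hand the message back
retry = billing.pull()   # the same message is delivered again
billing.ack(retry[0])    # this time it is acknowledged
```

The publisher never needs to know which consumers exist, and that decoupling is usually the reason to introduce a messaging layer in the first place.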
Apache Hadoop
Technically not part of Google Cloud Platform, but there are a few questions around this technology in the exam, since it is the underlying technology for Dataproc. Expect some questions on what HDFS, Hive, Pig, Oozie or Sqoop are, but basic knowledge on what each technology is and when to use it should be sufficient.
Cloud Dataflow
Lots of questions on this product, which is not surprising as it is a key area of focus for Google with regard to data processing on Google Cloud Platform. In addition to knowing the basic capabilities of the product, you will also need to understand concepts like windowing types, triggers, PCollections, etc.
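If windowing is new to you, the following plain Python sketch shows how an event timestamp maps to fixed (tumbling) windows and to sliding windows. It does not use the Dataflow SDK. The helper names `fixed_window` and `sliding_windows` and the sample events are invented for the example. Windows are half-open intervals `[start, end)` measured in seconds.

```python
from collections import defaultdict


def fixed_window(ts, size):
    """Return the single [start, end) window of length `size` containing ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Return every [start, end) window of length `size`, with a new window
    starting every `period` seconds, that contains ts."""
    windows = []
    start = ts - (ts % period)   # latest window start at or before ts
    while start > ts - size:     # window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# Hypothetical events as (timestamp in seconds, payload).
events = [(5, "a"), (42, "b"), (65, "c"), (119, "d"), (120, "e")]

# Count events per one-minute fixed window.
counts = defaultdict(int)
for ts, _payload in events:
    counts[fixed_window(ts, 60)] += 1

for window in sorted(counts):
    print(window, counts[window])

# With 60-second windows starting every 30 seconds, an event belongs to
# two overlapping windows.
print(sliding_windows(65, size=60, period=30))
```

Each event falls into exactly one fixed window but into several sliding windows, which is why sliding windows suit rolling aggregates such as "average over the last minute, updated every 30 seconds". Triggers then control when the result for a window is emitted, which this sketch does not model.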
Cloud Dataproc
Not many questions on this besides the Hadoop questions mentioned above. Just be sure to understand the differences between Dataproc and Dataflow and when to use one or the other. Dataflow is typically preferred for a new development, whereas Dataproc would be required if migrating existing on-premise Hadoop or Spark infrastructure to Google Cloud Platform without redevelopment effort.
TensorFlow, Machine Learning, Cloud Datalab
The exam contains a significant number of questions on this – more than we were expecting. Fortunately we have been busy working with TensorFlow and Cloud Datalab at Priocept for a while now. You should understand all the basic concepts of designing and developing a machine learning solution on TensorFlow, including concepts such as data correlation analysis in Datalab, and overfitting and how to correct it. Detailed TensorFlow or Cloud ML programming knowledge is not required but a good understanding of machine learning design and implementation is important.
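As a reminder of what overfitting looks like in practice, the sketch below applies early stopping, which is one common remedy among several (others include regularisation, dropout and gathering more training data). It is plain Python with made-up loss values and does not use TensorFlow. The function name `early_stopping_epoch` and the `patience` parameter are our own choices for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss seen before stopping.

    Training is considered stopped once `patience` consecutive epochs have
    passed without the validation loss improving on its best value.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs, so stop
    return best_epoch


# Made-up numbers: training loss keeps falling while validation loss starts
# rising after epoch 2, the classic sign that the model is overfitting.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_losses = [0.95, 0.70, 0.58, 0.61, 0.66, 0.50]

stop_at = early_stopping_epoch(val_losses, patience=2)
print("keep the model from epoch", stop_at)
```

The gap between falling training loss and rising validation loss is the signal to look for. Note that a small `patience` stops before the later dip at epoch 5 is ever seen, which is the usual trade-off with this technique.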
Stackdriver
A surprising number of questions on this, given that Stackdriver is more of an “ops” product than a “data engineering” product. Be sure to know the sub-products of Stackdriver (Debugger, Error Reporting, Alerting, Trace, Logging), what they do and when they should be used.
Conclusion
The Data Engineer certification exam is a fair assessment of the skills required to work effectively with Google Cloud Platform on analytics, big data, data processing, or machine learning projects. If you have used the majority of these products already on real-world projects, the exam should not present you with too many problems. If you haven’t yet used some of the products above, then get studying!
If you take the exam and get caught out in any areas that we haven’t covered above, please let us know.
Priocept provides both consultancy and bespoke training services for Google Cloud Platform, so please get in touch if we can help your organisation on your journey towards adopting the platform.
Hi,
I am planning to sit the Data Engineer certification exam.
Can you please let me know if we have to know Python as well to take this exam?
Also, I am new to Machine Learning. Is a statistics background mandatory for this exam?
Hi Team,
I would appreciate your input on how to begin preparing, as I do not have hands-on experience but would like to attempt the Data Engineer certification. I have a solid 4 years of experience in Hadoop and the software in its ecosystem.
Any pointer will be helpful.
Much Thanks!!
I took the exam today and there were 50 questions. The notes in the blog helped me to focus my study a lot; thanks!
Here are my revisions as the exam isn’t a beta anymore:
Cloud Storage and Cloud Datastore – ~2 questions
Cloud SQL – ~2 questions
Bigtable – ~8 questions. How to optimize perf or troubleshoot slowdowns. What use cases would fit, etc.
BigQuery – ~8 questions. Data partitioning techniques, optimising performance. Sharing data with other orgs. How to give users the least privilege they need. How to avoid costly queries. Loading data, and the differences in data availability for streaming vs. batch.
Pub/Sub – ~3 questions. Basic knowledge was enough
Apache Hadoop – ~3 questions that were not GCP knowledge but Hadoop ecosystem knowledge; specifically pig and hive and what scenarios would push you to one or the other
Cloud Dataflow – ~8 questions. Understand batch and streaming designs. How to integrate with BigQuery and constraints you might have.
Cloud Dataproc – ~3 questions.
TensorFlow, Machine Learning – ~10 questions that were mostly ML domain (and not TensorFlow specific), basically about training. Nothing about the Cloud ML service!
Cloud Datalab – ~2 questions about visualisation, permissions/restricting access, creating dynamic dashboards
Stackdriver – ~1 question about auditing and viewing who did what in BigQuery
Hi, thanks very much for this post. I am planning to take the exam in the next couple of weeks.
What is the pass percentage? I could not find this detail anywhere.
In the exam, do we get SQL or Python code snippets to answer, or are all the questions theoretical? I just want to make sure.
Only one practice exam is available on the Google site. Any suggestions on where to look for more practice questions?
thanks a lot.
Has anyone taken the exam recently? Are there any case study questions in there related to the 2 case studies given on the Google website?
Two of my colleagues cleared both Google certifications last week, Data Engineer and Professional Cloud Architect. They used study questions from marks4sure.com, were happy with them, and said most topics were covered and that they helped in the exam. You could study with them and get some help.
Best of Luck
Hi, thank you for the high-level details of the exam. I am planning to take the exam in a week or so. By now the exam is in good shape, and you may have a better idea about the question paper and so on. Could you shed some more light on the exam? Thank you in advance.