What if you could use the power of two of the most successful open source data projects to mix and load data coming from different sources (on-premises or cloud) using only SQL queries?
Why not use the best-of-breed dbt and Trino solutions together as a powerful ETL tool?
This post accompanies the GitHub repository https://github.com/victorcouste/google-data-catalog-dataprep, which explains how to create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep metadata and column profiles via a Python Cloud Function.
The 2 Data Catalog tags created or updated:
Building a modern data stack to manage analytic pipelines — such as Google Cloud and a BigQuery data warehouse or data lake — has many benefits. One such benefit is the ability to automatically monitor the quality of your data pipelines. You can ensure that accurate data is fueling your analytics, track data quality trends, and, if a data quality issue does arise, react quickly to resolve it.
Let’s review how to create a beautiful Data Studio dashboard to monitor Cloud Dataprep data quality pipelines.
For this article, let’s assume that you’re responsible for managing a data pipeline for a…
All code can be found on GitHub https://github.com/victorcouste/google-cloudfunctions-dataprep
Here you will find examples and use cases of Trifacta flows to clean, transform, and manipulate your data. These flows can be used with Trifacta Enterprise (deployed on-premises, in Azure, or in AWS), Trifacta SaaS, or Google Cloud Dataprep.
If you manage your data analytics pipeline on a modern data stack such as Google Cloud, with BigQuery as your data warehouse or data lake, you probably want to monitor it and get a comprehensive view of the end-to-end analytics process, so that you can react quickly when something breaks or simply have peace of mind when everything works properly.
In a previous article, I demonstrated how to monitor your Dataprep jobs with a simple Google Sheet, but why not take it to the next level with a beautiful and actionable dashboard in Data Studio?
This article explains how to capture…
If you manage a data and analytics pipeline in Google Cloud, you may want to monitor it and obtain a comprehensive view of the end-to-end analytics process in order to react quickly when something breaks.
This article shows how you can capture Cloud Dataprep job statuses via the API using Cloud Functions. We then write the statuses to a Google Sheet, which gives you an easy way to check on the jobs. Using the same principle, you can combine the statuses of other Google Cloud services in Google Sheets to obtain a comprehensive view of your data pipeline.
To illustrate this concept…
With a better mastery of Cloud Functions, you can trigger a Dataprep job via API when a file lands in a Cloud Storage bucket
Ever dreamt about automating your entire data pipeline to load your data warehouse? Without automation, each user needs to manually upload their data, then manually start a transformation job or wait for a scheduled task to be executed at a specific time. This is quite tedious and resource-intensive.
After reading this article, you will be able to drag and drop a file in a folder, get your entire data pipeline executed and loaded in your…
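As a rough illustration of the idea above, here is a minimal Python sketch of a Cloud Function that reacts to a file landing in a Cloud Storage bucket and prepares the API call that starts a Dataprep job. It is not the code from the article: the endpoint URL, the request body shape, the recipe ID, the token placeholder, and the `send` flag are all assumptions made for illustration, so check them against the Dataprep API documentation before relying on them.

```python
import json
import urllib.request

# Assumption: Dataprep exposes a v4 "jobGroups" endpoint at this URL.
DATAPREP_JOBGROUPS_URL = "https://api.clouddataprep.com/v4/jobGroups"


def build_job_request(recipe_id, access_token, url=DATAPREP_JOBGROUPS_URL):
    """Build (but do not send) the HTTP request that would start a job.

    recipe_id and access_token are placeholders supplied by the caller;
    the body shape {"wrangledDataset": {"id": ...}} is an assumption.
    """
    body = json.dumps({"wrangledDataset": {"id": recipe_id}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def on_file_arrival(event, context, recipe_id=1234,
                    access_token="REPLACE_ME", send=False):
    """Hypothetical entry point for a Cloud Storage-triggered function.

    event is expected to carry the "bucket" and "name" of the new file.
    With send=False (the default here) the request is only built and
    returned, which makes the sketch safe to run locally.
    """
    print(f"New file: gs://{event['bucket']}/{event['name']}")
    request = build_job_request(recipe_id, access_token)
    if not send:
        return request
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `on_file_arrival({"bucket": "my-bucket", "name": "sales.csv"}, None)` locally only prints the file path and returns the prepared request. In a real deployment the token would come from a secret store rather than a default argument.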
Apache Zeppelin is a web-based notebook that enables interactive data analytics. The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently Zeppelin supports many interpreters, such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, CQL, and Shell.
More details and documentation can be found here https://zeppelin.incubator.apache.org/
For people who need general presentations on Apache Cassandra and DataStax, here are some resources.
On SlideShare, you will find several presentations in French and English about the NoSQL Apache Cassandra database, with links to Cloud, IoT, and Analytics.
A podcast in French explaining Apache Cassandra.
A video in French presenting DataStax Enterprise.
A video in French about the Apache Spark + Apache Cassandra integration.
Finally, some links to learn more about Apache Cassandra and DataStax:
Data Fan / Starburst Data