Before data science became mainstream, data quality was mostly a concern for reports delivered to internal or external clients.
Nowadays, because machine learning requires large amounts of training data, the internal datasets within an organization are in high demand. In addition, analytics teams are always hungry for data and constantly search for data assets that can potentially add value, which has led to the quick adoption of datasets and data sources not previously explored or used.
This trend has made data management and good practices for ensuring data quality more important than ever. Data quality is not something that can be fundamentally improved by finding problems and fixing them after the fact. Instead, every organization should start by producing good-quality data in the first place.
In this article, Stephanie Shen gives you a clear idea of how to build a data pipeline that creates and sustains good data quality from the start.