Start a new topic

Discovery as a data quality tool

IPC is the distribution arm of Subway Restaurants North America, and we are doing things differently with our partner and choice of Data Governance tooling.   We started off to build the framework for a Business Data Catalog based on the Semarchy Sales Demo Governance Model.   We had issues with another partner in completing our Redshift Data Warehouse, but we are nearly completed on that journey.   

After the partner had helped build out the Data Catalog piece. we had the training and got excited about the features of the Discovery component and the reverse engineering component.  

However, our excitement turned recently to despondency after finding out that we could not use the Redshift native database as a data source.   We installed the Redshift JDBC driver and seem to connect well - although it does say somewhere in the documentation that you CAN install your own database drivers, but they will only be useful for Semarchy features if they are the supported data platforms of Oracle, SQL Server or Postgres.

While I am fine being restricted to one of those 3 to be used as Semarchy's own data engine, I think that connectivity to other things would be very important as a source, and not have the suggestion so far from Semarchy to move the metadata into Postgres by other means. 

I initially built or had extracted metadata into spreadsheets which we loaded to initially populate the Semarchy xDM Data Catalog Governance Application we were building - however, it is all about the soon to be superceded old Redshift structures and I need a refresh.   I had worked out a possible path with the Semarchy Pre-sales team to use the Data Modeling tool we had chosen, Hackolade, that has its model details in a regular and readable JSON format, using the API's into the data model parts that matched our Domains and Entities.   I could do that to refresh the metadata with what it is, but then we learnt about the 'Reverse Engineering'   feature.  

No good if you cannot connect to one of the 3 relational databases currently supported, though.

I then wondered about my Data Quality metadata that I was collecting.   In order to make a better way for doing glue jobs, not stuck with crawlers, trigger and such but just coding into Python the SQL queries that would do the transformations, extracts and loads - no need to get into Spark or PySpark, as that was yet another skill set we did not have much access to qualified and available resources, and also was one of the problems we had before.   

So I designed, and with my strong team, created a framework of jobs linked to trains with metadata in Redshift tables to orchestrate the data movement, and soon completing that with additional metadata about data quality and performance - no digging into cloudwatch native logs to tease out this would make things easy and simple to view.

So now, ready to do Data Profiling and Quality Metrics part of the Data Governance normal range of activities, I was quite sad to find out I could not use the DIscovery feature on my data warehouse as I could not use it as a Data Source as it was Redshift.   In the errror logs, it  basically said ' not sure what you connected me to, I am not going to try to get a list of tables or rows or any other metadata for you - you're on your own'.

So this is a plea to have support and interest to see how more amazing and how many other users may like to have Semarchy take the pain of being in a Data Engine that currently is not supported to be used for reverse engineering, or Data Discovery, and even not allowed to create any dashboards against.

The successful board-room presentation I was dreaming on having in a month or so has evaporated in a whiff of smoke.   I was hoping to take the journey, with one of the Business Data Analysts trained to use our application, to show an executive how wonderful SemarchyxDM is.  THe story would be, Executive 1 asks DA 1 a business data question, DA1  searches semarchy for what he asked for in the search, found where it was defined and a link to a data set if that data already had a data set built, and skip over to the dashboard and quickly put together a view against that data - or use the link to an existing report in our Tableau framework to show the same thing  - or if there is some data in a data source but not in a data set, start a data set request workflow to request this information to be added to an existing data set, or a new data set created to support this Executive request.  

Alas, that will have to be done with smoke and mirrors if I have to do that as part of the 'begging for a budget' to complete the other parts of the data governance tooling that I need.

Hope you enjoyed reading my post, and would consider commenting in support of having other database engines able to be used as data sources for discovery, reverse engineering and light dashboarding and reporting.

Best Answer

Hi Gleigh, please refer to this other topic for our reply

1 Comment


Hi Gleigh, please refer to this other topic for our reply

Login to post a comment