BI & Analytics, MDM

Avoiding Data White Elephants

white elephant

As an IT professional working in the data management space (dare I say with Big Data) you currently face some particularly challenging & critical technology choices that can have a profound impact on your business operations so you need to avoid any more white elephants: Wikipedia, “A white elephant is a possession which its owner cannot dispose of and whose cost, particularly that of maintenance, is out of proportion to its usefulness.” – Does that ring any bells with your incumbent technology stack? Then read on.

It was my respected competitor Talend’s CMO that prompted me to write this post when he claimed: “Talend is the only one-stop shop for a modern, agile, integration platform (with all the bell & whistles) at a great price.” LinkedIn 13/8/15 which whilst being understandably narcissistic is clearly disingenuous.

A more honest approach will deliver a long & rewarding career, so allow me to set the record straight in a broader market context: As you know, every technology choice available to you has dependencies which fall into two categories; ‘outbound’ & ‘inbound’ with a question associated with each; namely “Can we get rid of this stuff without taking our business down with it?” & “What’s the catch if we get this new stuff in?” With regard to the latter, there is a bunch of really good new stuff out there but it comes with a heap of pre-requisites plus downstream costs which you’re not told about before you sign up for it!

For example why change your data store to get high performance analytic capability, any BA or IA will know you can use what you’ve got if you build the right architecture & data platform around it. Feel sorry for those that live in hope that Hadoop or SAP HANA will cure their problems, the recent 30-day trial offer is like giving Crossrail a free JCB for a month to get the Underground kick started?! Don’t you just have to love your SI for that very reason, yeah right they love you for the ratio of services to software they make out of your struggles. You really can turn that ‘norm’ on it’s head with a modern data management platform but be diligent with your choice.

There is no one stop for all your needs but what’s really needed here is an option that has zero dependency on additional components is non-intrusive from both technology and business process aspects, covers not just Data Integration but Data Quality, Master & Reference Data Management & runs anywhere (on-premise/cloud) with no mods’ plus can be implemented with incremental results in an Agile sprint like way that has an SME price tag with no multi-year TCO sting-in-the-tail like the aforementioned.

OK Sherlock, where can this be found? – Request a Proof of Value PoV for the Semarchy “Evolutionary Data Convergence Platform” & you’ll get palpable results in days – we use our own technology to run Semarchy the company, ask our CEO he’ll be happy to demo how Semarchy drink their own champagne – this is no gimmick, we’re refreshingly honest brokers – you can ask our customers…The final word

Standard
Uncategorized

Semarchy comes of age

As a newcomer to the UK market, Evolutionary MDM vendor Semarchy has earned the right to sit at the top table with the other so called ‘mega-vendors’ of Data Integration, Data Quality & Master Data Management software such as IBM, Oracle & Informatica.

This is not a self-opinionated claim, rather a result of customer driven selection criteria in response to RFI’s, competitive bids & general success in this market.

To understand their claims in more detail, please refer to this enlightening review of their software platform by a leading independent consulting firm who operate exclusively in the MDM space:-

————————————————————————————————————————————————————

This report is a summary of the considered professional observations for the Semarchy “Convergence for MDM” software product suite, made by the top UK/European independent specialist MDM consulting firm who also represent & use other market leading vendor solutions in this specific sector:

Conclusions

We liked the product.  It seems modern, well thought-out, and without legacy baggage.
Compared to the Gartner leading quadrant vendor’s solutions, it’s not easy to compare directly but the others look clunky by comparison.
As a browser delivered product rather than a physical install on each users machine, it does have appeal, and it seems to have all the right elements in all the right places (in terms of a sound architecture, deployment story, developer story, customization and so on).

What we did come away with, is that it would be an ideal product for Local Gov’t, CLA, Health & SME types rather than the corporates of this world – and that’s the market they’re focused on at present.

We were also impressed that they’ve released to the Amazon (web service) marketplace – that effectively gives anyone the ability to start up a machine with it running almost at the press of a button – meaning that it’s ideal for people to “try out” without needing any heavy infrastructure or support.

Supporting detail

  • Repository style – imports data into the hub for matching, viewing, editing etc.
  • Allows payload data as well
  • Delivered as a web application
  • Runs on Tomcat with Oracle back end.  Quick setup.
  • Design part using Eclipse RAP – so has the look and feel of “Eclipse in the Cloud”
  • End User part has a modern “Windows 8” styling.
  • Basic workflow is:
  • Design your data model (entities)
  • Design Matchers
  • Design Enrichers
  • Design Validators
  • Design Business Objects (hierarchies of entities, e.g. – Account Manager with Customers)
  • Design View / Edit forms
  • Design Workflows
  • Deploy Model and Application  (versioned)
  • Version / Snapshot Data as required.
  • Repeat
  • Data Modelling
  • GUI designer built into the tool
  • Simple, complex and composite attributes
  • Hierarchies of entities
  • Has – a
  • Is – a
  • Parent/child
  • Relationships between entities
  • Yes!  Has a, Is a.
  • Publishing events
  • There is a Data Integration layer (separate product) which sits over the MDM layer, and can track batches submitted, poll for changes and such.
  • http://www.semarchy.com/data-integration/
  • The DI layer orchestrates things like push events
  • Java API,
  • Web Service API (SOAP)
  • Not MQ – need to write that using the Java API
  • Data Quality reports
  • Yes!  Using the additional PULSE tool
  • http://www.semarchy.com/data-governance/
  • Pulse provides “Before and After” views of data quality – so that you can see the improvements that MDM has made
  • Pulse is another layer over  the core MDM tool
  • Pulse Profiling for profiling data before MDM
  • Pulse Metrics for reporting on MDM data after matching
  • See diagram and dashboard screenshots on website:
  • http://www.semarchy.com/en/convergence/pulse/
  • Data authoring
  • Yes, using custom web forms designed and deployed as part of a versioned “application”
  • Full LDAP role integration determining Read/Write/View/Edit functionality at a Field > Entity level, with custom filtering
  • Security
  • LDAP Integration
  • Semarchy only knows about roles, not users
  • Can read LDAP properties and store them in variables for use in filters and the SemQL query language
  • APIs also participate in the security.
  • State Management
  • Data has a number of states
  • Certified (i.e. it’s participating in the golden view)
  • Rejects/Errors
  • Create workflow tasks to resolve / resubmit
  • Duplicate
  • Machine Merged
  • User Merged
  • User Split
  • User actions always override machine, even for future updates.
  • Matching strategy
  • Simple match (eg, by ID)
  • Fuzzy, using: Levenshtein, Jaro-Winkler distance, Name normalization and Soundex
  • Can combine multiple fields and algorithms using SemQL language (which essentially gets translated back to Oracle SQL)
  • Match tuning capabilities
  • There isn’t a match tuning capability, except by trial and error, for example, setting the Levenshtein distance to 65, then 70, then 75 etc… & looking at the output results.
  • Internally, they take the generated SQL and use that to tune. The generated SQL is very readable (we saw examples), and with comments identifying the user definable parts.
  • Trust / Survivorship across sources
  • Yes, lots and custom
  • “Custom ranking” allows complex logic, such as “If this field is > 1 year old, trust the other system instead…”, or “If this field is supplied then also use that field”.
  • Auditing
  • Model audits (who changed what)
  • Created By / Updated By + timestamps
  • Data steward audits
  • Full data lineage.
  • Who read / changed what and when, from which source system etc…
  • Data lifecycle management
  • Soft deletes only.
  • Deletes from source systems indicated by a flag.
  • API / Interface options
  • The DI layer provides SOAP web services
  • There is a Java API also available for building custom interfaces as an alternative to the defined “Application”
  • The Java API uses the underlying workflow and entities
  • their customers have done this – using Vaadin (GWT) to write their own custom app over the top of the Semarchy Java API
  • Plugin API for Enrichers & Validators
  • g. Google Geocode address
  • Java interface, Junit tests, with good documentation
  • They also provide training for partners
  • “Code” generation from other tools
  • The Model itself is represented by XML, and you can transfer a model between environments by importing / exporting this model.
  • So, in theory, as long as you are able to create this XML file, you could codegen.
  • Deployment to different environments ( DEV / UAT / PROD)
  • Versioning
  • Both the model and data can be versioned
  • This also versions the APIs
  • Data can be snapshotted (point in time)
  • Allows things like a product catalogue to be snapshotted, so that downstream applications use the current catalogue while upstream data managers maintain the upcoming catalogue.
  • The data is viewable using multiple versions (without duplicating the data)
  • This means that a downstream application could be using V1 of the data, while another application could be using V2 of the data
  • Branching
  • If a version is applied with bugs, then you can go back to a previous version, and start changing that instead.
  • No destructive changes on the hub.
  • Production only accepts “Frozen” versions – no updates in production
  • Functional Testing options
  • Internally, test is done by having a set of known input records, and a set of expected output records, then a series of scripts to compare the output of MDM with the expected outputs.
  • No automated testing built in to the application
  • Scalability
  • Recommended architecture for clustering etc. (documented on website)
  • Essentially, multiple passive instances for end users (taking updates, viewing data etc…), and one active instance for orchestrating match jobs etc
  • Talking to an Oracle RAC cluster on the back end.
  • Cloud
  • Yes – running on AWS, RSD etc…
  • They have released to AWS Marketplace
  • Press release:

http://www.semarchy.com/news/semarchy-announces-cloud-mdm-success-at-the-gartner-enterprise-information-master-data-management-summit/

Standard