DQ and Social Big Data: an issue?
In his blog on The Dataroundtable, http://www.dataroundtable.com/?p=11270, David Loshin very rightly points out that there are a number of approaches to Data Quality that will not work in the realm of Big Data, especially of the 'unstructured' type. Below is my response to his blog post, which triggered me beyond imagination.
My question is: do you think data quality and data management professionals should respond in a serious manner to a 'business' question about the quality of 'Social Big Data'? When no one has an incentive to bring quality into the information process upstream, the battle with the beast is lost by definition.
The next question is: "Do we need to battle?" I can't imagine anybody expecting high quality from a Social Big Data source (one that they did not develop and gather themselves for a specific purpose).
The bafflement I experienced when reading the post felt like a sort of short circuit. In the end I could not find a logical scenario in which David's premise would be valid. I think I need some examples where Data Quality professionals are asked to improve the quality of this type of data.
So I absolutely disagree with your remark "we should rethink what is meant by data quality in the context of big data, and especially with streamed social media." Instead, when asked to provide quality measures for Big Data sources, we should collectively laugh in their faces: "Sorry, can't be done." We are not going to think about it. Find another hype to ride on.
In the end it all has to do with 'Purpose'. This type of data was not produced for your purpose. David painted this picture in his blog post. In his words: "there is no incentive for a data producer to care about the needs of these as of yet unknown downstream data consumers, especially because those consumers might have not even decided to consume the data." If you, as a data quality engineer or data steward, are asked to include Big Data from social media sources for analysis, you may have to make the requestor aware of how little influence he or she has over the possibilities for quality enhancement, let alone over structural measures for improving the quality of the outcomes of the analysis.
I understand that there are (a few, mostly multinational) companies that have very specific questions (purpose) they want to have investigated by counting 'Likes' on Facebook. There the purpose aligns with the expected quality of the results, because usually the numbers are big and trends are what matter, not absolute numbers. There will be no expectation of applying the three ways of fighting poor DQ that David mentioned.
It occurred to me that there is a parallel trend in the DM blogosphere about fitness for purpose vs "Real World Alignment" as the main qualifier for data quality. I'm not sure where this is coming from, but I feel the same short circuit. If it means that IT people want to have a bigger say, or think they know best on data quality issues, then it's a bad thing.
If there is no purpose, quality is never an issue.
I also made these comments on franklybi.blogspot.com.