
Data Warehousing and Data Mining

[This article was first published on r – Jonathan Fowler, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The relationship between data mining tools and data warehousing systems is most easily seen in the connector options of popular analytics software packages. For example, the image below right shows the many sources from which Tableau Desktop can pull data out of warehouse backends. Microsoft Power BI includes similar interface options. Many R packages for connecting to data warehouse backends are readily available online from proprietary and open-source vendors. Other proprietary packages such as SPSS, SAS, and JMP have similar interfaces.
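
As a rough illustration of what such a connector does under the hood, the sketch below pulls an aggregated result set from a database into an analysis session. It is only a sketch: an in-memory SQLite database stands in for a real warehouse backend, and the `sales` table, its columns, and the `load_sales` helper are hypothetical names invented for this example.

```python
import sqlite3


def load_sales(conn):
    """Pull aggregated rows from the (stand-in) warehouse into Python."""
    query = """
        SELECT region, SUM(units) AS total_units
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    return conn.execute(query).fetchall()


# Assumption: in-memory SQLite stands in for the warehouse backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10), ("West", 7), ("East", 5)],  # hypothetical sample rows
)

rows = load_sales(conn)
print(rows)
```

In practice the connection line would point at the actual warehouse through whatever driver the organization uses; the query-then-analyze pattern stays the same.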

Simply put, a data mining tool enables insights into what is stored in the data warehouse, and it is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses mistakenly fixate on a particular data mining tool, believing the paid solution will provide immediate value, without first ensuring the data warehouse (or equivalent) is in proper order. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

  1. Identifying appropriate Systems of Record (SORs)
  2. Validating the SOR accuracy and alignment with business purposes
  3. Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
  4. Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run the risk of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos, AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company does not have a unified data warehouse or data governance procedures in place. In its current state, business units that use different systems of record cannot work together from a common understanding of the data. Attempts at data mining and even simple reporting have failed across business units because of data quality. For example, sales forecasting does not translate between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining from this data without significant transformation.
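
A mismatch of this kind can be surfaced with a simple reconciliation check before any mining is attempted. The following is a minimal sketch, not a description of any particular company's process: the quarterly figures, the 1% relative tolerance, and the `reconcile` helper are all hypothetical.

```python
def reconcile(finance, marketing, tolerance=0.01):
    """Report periods whose figures are missing on one side or differ
    by more than `tolerance` (relative to the larger figure)."""
    issues = {}
    for key in sorted(set(finance) | set(marketing)):
        a = finance.get(key)
        b = marketing.get(key)
        if a is None or b is None:
            issues[key] = "missing in one system"
        elif abs(a - b) > tolerance * max(abs(a), abs(b)):
            issues[key] = f"mismatch: {a} vs {b}"
    return issues


# Hypothetical sales figures as reported by two systems of record.
finance = {"2018-Q1": 1200.0, "2018-Q2": 1350.0, "2018-Q3": 1400.0}
marketing = {"2018-Q1": 1200.0, "2018-Q2": 1510.0}

issues = reconcile(finance, marketing)
for period, problem in issues.items():
    print(period, problem)
```

A check like this does not fix the underlying data, but it makes the disagreement explicit so that a governance body can decide which system is authoritative.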

Once the foundational steps are done and the data mining tool is in production, new data points based on discovered insights can be put back into the warehouse. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to “novice” and “expert” but is unclear on both how many categories to use and which factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can be put back into the data warehouse and used as covariates in other analysis work.
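
The clustering and write-back step might look roughly like the sketch below. It covers only the k-means half (principal components analysis is omitted for brevity), uses a plain-Python version of Lloyd's algorithm with a deterministic farthest-point start rather than any specific library, and treats an in-memory SQLite table as the warehouse. The associate records, the `associate_segment` table, and all function names are hypothetical.

```python
import math
import sqlite3


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


def inertia(points, labels, centers):
    """Within-cluster sum of squares, used to compare values of k."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical associates: (id, units sold, associates recruited, years).
associates = [
    (1, 12, 0, 1), (2, 15, 1, 1), (3, 10, 0, 2),
    (4, 240, 14, 8), (5, 260, 12, 9), (6, 250, 15, 7),
]
features = standardize([a[1:] for a in associates])

# Compare the within-cluster sum of squares for a few candidate k values.
for k in (1, 2, 3):
    lab, cen = kmeans(features, k)
    print(k, round(inertia(features, lab, cen), 3))

labels, centers = kmeans(features, k=2)

# Write the segment labels back to the (stand-in) warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE associate_segment "
             "(associate_id INTEGER PRIMARY KEY, segment INTEGER)")
conn.executemany("INSERT INTO associate_segment VALUES (?, ?)",
                 [(a[0], lab) for a, lab in zip(associates, labels)])
segments = conn.execute(
    "SELECT associate_id, segment FROM associate_segment "
    "ORDER BY associate_id").fetchall()
print(segments)
```

A sharp drop in the within-cluster sum of squares that then levels off is one common heuristic for choosing the number of categories; the stored segment column can then be joined to other tables like any other attribute.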

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms. Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37-54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345-356.

Soni, D. (2018, March 22). Supervised vs. unsupervised learning. Towards Data Science. Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

This post originally appeared on my LinkedIn page: https://www.linkedin.com/pulse/data-warehousing-mining-jonathan-fowler/

The post Data Warehousing and Data Mining appeared first on Jonathan Fowler.



from R-bloggers https://ift.tt/2YUmYyW
via IFTTT
