Data Warehousing and Data Mining

[This article was first published on r – Jonathan Fowler, and kindly contributed to R-bloggers.]

The relationship between data mining tools and data warehousing systems is most easily seen in the connector options of popular analytics software packages. For example, the image below right shows the many warehouse backends from which Tableau Desktop can pull data. Microsoft Power BI includes similar interface options. Countless R packages for connecting to data warehouse backends are readily available online from proprietary and open-source vendors. Other proprietary packages such as SPSS, SAS, and JMP have similar interfaces.
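As a rough illustration of the connector pattern, the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse backend. The sales table, its columns, and the units_by_region helper are hypothetical; a real warehouse would need its own driver and credentials, but the connect-query-aggregate shape is similar.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database. This is an assumption
# for illustration only; it is not any specific vendor's connector.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120), ("West", 95), ("East", 80)],
)


def units_by_region(connection):
    """Return total units per region, as an analytics tool might request."""
    rows = connection.execute(
        "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


totals = units_by_region(conn)
print(totals)
```

The analysis tool only ever sees what the query returns, which is why the quality of the underlying tables matters so much.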

Simply put, a data mining tool provides insight into what is stored in the data warehouse, and it is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses focus erroneously on a particular data mining tool, believing the paid solution will provide immediate value, without first ensuring the data warehouse (or equivalent) is in proper order. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

Identifying appropriate Systems of Record (SORs)
Validating the SOR accuracy and alignment with business purposes
Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run the risk of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos, AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company does not have a unified data warehouse or data governance procedures in place. In its current state, business units that use different systems of record cannot work together from a common understanding of the data. Attempts at data mining and even simple reporting have failed across business units because of the quality of the data. For example, sales forecasting does not translate between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining from this data without significant transformation.
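A mismatch like the one between Finance and Marketing can be surfaced with a simple reconciliation check before any mining is attempted. The sketch below is a minimal illustration: the monthly figures, the tolerance, and the reconcile function are all hypothetical, not taken from the company described.

```python
# Hypothetical monthly sales totals reported by two systems of record.
finance = {"2018-01": 10500.0, "2018-02": 9800.0, "2018-03": 11200.0}
marketing = {"2018-01": 10500.0, "2018-02": 10150.0, "2018-04": 8700.0}


def reconcile(a, b, tolerance=0.01):
    """Compare two period -> amount mappings.

    Returns (mismatched, only_in_a, only_in_b). `mismatched` maps each shared
    period whose amounts differ by more than `tolerance` to the two amounts.
    """
    shared = a.keys() & b.keys()
    mismatched = {
        period: (a[period], b[period])
        for period in sorted(shared)
        if abs(a[period] - b[period]) > tolerance
    }
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    return mismatched, only_in_a, only_in_b


mismatched, finance_only, marketing_only = reconcile(finance, marketing)
print("Amounts disagree:", mismatched)
print("Missing from Marketing:", finance_only)
print("Missing from Finance:", marketing_only)
```

Each disagreement or missing period is a question for the business units to resolve, ideally through the kind of governance process described above, rather than something the mining tool can fix on its own.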

Assuming the foundational steps are done and the data mining tool is in production, new data points can be put back into the warehouse based on discovered insights. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to “novice” and “expert” but is unclear on both how many categories to look at and which factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can be put back into the data warehouse and used as covariates in other analysis work.
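The clustering half of this idea can be sketched in a few lines. The example below standardizes three hypothetical associate variables and runs a plain k-means (Lloyd's algorithm) for several values of k, so the drop in within-cluster sum of squares can hint at a reasonable number of categories. The data, function names, and deterministic farthest-point seeding are assumptions made for illustration; the principal components step is omitted to keep the sketch short.

```python
import math
from statistics import mean, pstdev

# Hypothetical associates: (units sold, associates recruited, years in program)
associates = [
    (12, 0, 1), (15, 1, 1), (9, 0, 2), (14, 1, 1),
    (210, 14, 8), (190, 11, 7), (230, 16, 9), (205, 12, 8),
]


def standardize(rows):
    """Scale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(
                tuple(map(mean, zip(*members))) if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia


scaled = standardize(associates)
inertia_by_k = {k: kmeans(scaled, k)[1] for k in range(1, 5)}
labels, _ = kmeans(scaled, 2)

# Rows that could be written back to the warehouse as a new covariate.
classified = [row + (label,) for row, label in zip(associates, labels)]
print(inertia_by_k)
print(classified)
```

A sharp drop in within-cluster sum of squares followed by a flattening suggests where adding more categories stops paying off. In practice an established statistics library would likely be preferable to a hand-rolled loop like this one.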

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms. Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37-54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345-356.

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science. Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

This post originally appeared on my LinkedIn page: https://www.linkedin.com/pulse/data-warehousing-mining-jonathan-fowler/
