A model PM for preprocessing and data mining proper process

  • Technical University of Madrid
  • Pace University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

Data Mining, as defined in 1996 by Piatetsky-Shapiro ([1]), is a step (crucial, but a step nevertheless) in a KDD (Knowledge Discovery in Databases) process. Piatetsky-Shapiro's definition states that the KDD process consists of the following steps: developing an understanding of the application domain; creating a target data set; choosing the data mining task, i.e. deciding whether the goal of the KDD process is classification, regression, clustering, etc.; choosing the data mining algorithm(s); data preprocessing; data mining (DM); interpreting mined patterns; deciding if a re-iteration is needed; and consolidating discovered knowledge. Since then the term Data Mining (DM) has evolved to become a name for the whole KDD process, or for some parts of it, or even for a single application of a data mining (or learning) algorithm. For example, in 1997 a Cross-Industry Standard Process for Data Mining (CRISP-DM) was proposed ([5]) to establish a standard for what its authors called, and others adopted as, a data mining process. The CRISP-DM standard was developed for business purposes and included all of the KDD process steps plus some extra steps, such as business understanding and business goal understanding, followed by the standard KDD steps. Hence the KDD process became the Data Mining process for industrial applications, and it is more and more often called simply Data Mining. To clarify this naming confusion we follow the standard terminology developed by data mining researchers, in which Data Mining (DM) denotes a KDD process whose original data mining phase is now called the data mining proper phase. In short, we say that Data Mining (DM) is a process that includes, among others, the following phases: creating the target data, data preprocessing, data mining proper, pattern evaluation, and knowledge presentation. We present here formal models DP and DMP for two essential phases of Data Mining: preprocessing and data mining proper.
They are defined in such a way that, put together, they form a Process Model PM for the sequence of preprocessing and data mining proper processes, and hence for the most essential part of the KDD (Data Mining) process. The main components of our models are a Data Mining System DMS and the preprocessing and data mining proper operators, which together form the set of all process operators of our PM model. The process operators reflect some ideas presented in [6] and [7], where some operators, called generalization operators, were defined. The generalization operators were very abstract in nature, and their definitions reflected the author's efforts to find a formal model for Data Mining viewed as a process of information generalization. The process operators defined here do not address the generalization issue; they are defined specifically, one by one, and in great detail, in an effort to cover all known preprocessing and data mining proper techniques. We discuss the relationship between our new operators and the generalization operators of [6], [7] in the last section of the paper. The Data Mining System DMS is a crucial component of all of our models and is defined as an extension of Pawlak's Information System ([3]). Following the Rough Set tradition expressed in the statement that knowledge is an ability to classify objects ([3], [4]), we observe that this is not only what Rough Set algorithms do; it is (as it should be) a common property of all data mining algorithms, methods, and models. We hence model here the data mining proper process as a process of grouping objects (records) into sets of objects. To be able to do so we need to define an extension of the notion of the information system in which the information function acts on sets of objects. In Definition 1 of our data mining system DMS we call such a function a knowledge function.
The name reflects the fact that we are modelling the data mining process as a transformation of information (a set of records as described by the information function) into higher-level knowledge. The knowledge obtained in the process (by algorithms, methods, models) comes in two forms: semantic and syntactic. The syntactic knowledge is always defined in terms of attributes and values of attributes of the initial data table, i.e. the initial information system. It takes different forms, depending on the goal of the data mining process and the methods used. While modelling the semantic knowledge, i.e. the grouping of objects (records) into sets of objects, we want to model its syntactic descriptions as well. At the end of the process we want to be able to characterize these groups of sets (semantics) in terms of attributes and values of attributes of the initial database (syntax) and, moreover, to do so, as often happens, in terms of some accuracy parameters. Our extension of the notion of the information system accommodates all these demands and is defined formally as follows.
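
The formal definition itself is not part of this abstract. Purely as an illustration of the ideas described above, the following Python sketch assumes a toy data table and shows an information function on single objects, a grouping of objects into sets (semantic knowledge), a description of each set by attribute-value pairs (syntactic knowledge), and a simple accuracy parameter. The table contents and the names `TABLE`, `information_function`, `group_objects`, and `accuracy` are hypothetical and are not taken from the paper; this is not the paper's Definition 1.

```python
from collections import defaultdict

# Hypothetical toy data table (not from the paper):
# object id -> {attribute: value}.
TABLE = {
    "x1": {"outlook": "sunny", "windy": "no", "play": "yes"},
    "x2": {"outlook": "sunny", "windy": "no", "play": "yes"},
    "x3": {"outlook": "rainy", "windy": "yes", "play": "no"},
    "x4": {"outlook": "sunny", "windy": "yes", "play": "no"},
}


def information_function(obj, attribute):
    """Information function in Pawlak's sense: (object, attribute) -> value."""
    return TABLE[obj][attribute]


def group_objects(objects, attributes):
    """Group objects that agree on the chosen attributes.

    Returns a dict mapping a syntactic description, written as a tuple of
    (attribute, value) pairs, to the set of objects it describes.
    """
    groups = defaultdict(set)
    for obj in objects:
        description = tuple(
            (a, information_function(obj, a)) for a in attributes
        )
        groups[description].add(obj)
    return {desc: frozenset(members) for desc, members in groups.items()}


def accuracy(group, target_attribute, target_value):
    """Fraction of objects in the group that carry the target value."""
    hits = sum(
        1 for obj in group
        if information_function(obj, target_attribute) == target_value
    )
    return hits / len(group)


# Group by one attribute and report how well each group predicts play = yes.
groups = group_objects(TABLE, ["outlook"])
for description, members in groups.items():
    print(description, sorted(members), accuracy(members, "play", "yes"))
```

Each dictionary key plays the role of a syntactic description, each value is the set of objects it describes, and `accuracy` stands in for the kind of accuracy parameter the abstract mentions.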

Original language: English
Title of host publication: Transactions on Rough Sets VI - Commemorating the Life and Work of Zdzislaw Pawlak
Publisher: Springer Verlag
Pages: 397-399
Number of pages: 3
Edition: PART 1
ISBN (Print): 9783540711988
State: Published - 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 4374 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
