Class-Relational Approach to Tabular and XML Data Representation

by Sergey Savushkin, 2008

Abstract

This article is directed to all who are interested in XML databases and their possibilities to represent tabular data in XML view. Author, based on the long experience in development of database system (Sav Zigzag ORDBMS), suggests new ideas considering tables and XML as forms for presentation of one common data model. Ideas enable to convert tables set in the XML without using of mapping schema. XPath language may be applied to express the navigation through the correspondent tabular data.

I. Main Features

Introduction

It is not necessary to prove or mention dominant role that Relational DBMS has now. Reason of this, it is high performance in processing of data, which have popular and simple table view representation. Computer support is based on formalization of the tabular data processing through the relational algebra or more general the set algebra. The SQL is only natural language representing or mapping that algebra. Performance and formalization by means of the set theory provide for the essential database feature a possibility to work with huge structured data mass.

 

New database systems, object-relational DBMS (ORDBMS), could not obtain analogous popularity. The fate of Informix is significant confirmation of saying. Even well-known ORDBMS PostgreSQL and Oracle have no more popularity than MySQL and Microsoft SQL Server, usually considered as RDBMS. In fact, object-oriented features may be lightly added to any RDBMS (often named as SQL-DBMS). Nature and core of such DBMS remain the same, also as SQL language reflecting tabular data model.

 

In this time we can see in principle a new approach to data representation in XML DBMS. Another hierarchy XML data structure is based on other query languages XQuery and XPath. The XML data, structure and processing may be formalized by means of the set theory and correspondent algebra, for example XML Query Algebra. Many XML database servers/systems pretend to be the most high-performance DBMS, for example Oracle Berkeley DB XML, Tamino XML Server, MarkLogic Server, eXist XML Database, Sedna XML DBMS.

 

XML DBMSs correspond to the mentioned above requirements: high-performance and formalizing by means of the set algebra. So they can work with huge data mass and then satisfy determinative feature of DBMS. Free niche, the representation of Web sites, seems very probably, will belong to the XML DBMSs, not RDBMS. What about other niches already occupied by RDBMS? The question presents great difficulties. Main reason, XML primarily was directed to replace HTML and to present hierarchy, not tabular information. Simple table in XML form looks awkward. Nevertheless, some interesting features of XML and first of all XQuery and XPath languages may be moved on tabular data representation.

 

This article considers advances of XML for representation interrelated multi-tables data. These advances are based on specific "class-relational" approach described below. In this approach both XML and tables are only forms for presentation of the same data. Normalizing and denormalizing of tables or disjoining and joining of XML do not change real data and structure in the correspondent database.

Data representation

I recommend first to read essay XML representation of a relational database, one of the typical early approaches to XML representation of tabular data. Shortly it may be expressed in the following image consisting of the table and correspondent XML.

 

Figure 1

R

A

B

C

a

b

c

...

 

 

 

<R>  
  <r>
    <A> a </A>
    <B> b </B>
    <C> c </C>
  </r>
  ...
</R>
 

This is very general light-weight mapping of tabular relation in XML, which passes the serious inaccuracy from tabular to XML data representation. What real relation or dependency is between A, B, and C? For example, it may be A -> B -> C, A -> (B, C), (A, B) -> C, A -> B & C-> A or another. Even if we suppose that relation in the table is normalized, dependency may be A -> (B, C), (A, B) -> C or be missing (A, B, C) -> NONE. It is necessary that these dependences, both single-valued (functional) and multiple-valued, were visible in XML and tables. Speech does not go even about the task of integrity constraint.

 

Main virtue of XML is tag nesting. It is possibility to represent not only the column dependency inside of one table. XML is very comfortable for representation interrelated data inside of the tables set. Denormilized relation may be more sensible in XML view rather than table view. We need to agree only that in case of dependencies like (A, B, ...) -> ANY, it is necessary to generate surrogate primary key for (A, B, ...) composition or use logical (natural) key, which should be presented in separate column in the table. In other words, primary key has to exist for any sensible and used columns combination. For example, in place of  (FIRST NAME, LAST NAME) -> AGE better to use PERSON -> (FIRST NAME, LAST NAME) & PERSON -> AGE. The approach described here may be demonstrated on the following correspondence between tables and XML.

 

Figure 2

AS

A

B

a

b1

a

b2

...

 

 

BS

B

C

b1

c1

b2

c2

...

 

 

<data>
  <A>a
    <B>b1</B>
    <B>b2</B>
  </A>
  <B>b1
    <C>c1</C>
  </B>
  <B>b2
    <C>c2</C>
  </B>
  ...
</data>

 

Here is very principal moment. We still can not say unambiguously that tables and XML express A->B, B->C dependency or A->B->C that is the same. For example, according to XML Data Model, each opening B tag, even with the same 'b1' text in the figure 2, designates new unique XML-element of the B type. The similar case is for tables or relational data model, RMD, see for example The Relational Data Model. We can not guarantee that 'b1' column values (attribute values) in B columns (attributes) of different tables (relations) denote the one common element. It is not evidently that B columns belong to the one common domain. This article considers another approach to the data representation, which supposes that all columns with the same name mean one class, also as all the tags with the same name in the correspondent XML. I'd like to name such approach as class-relational. Accordingly to such approach, tables and XML in figure 2 unambiguously present relation between classes A, B, C with dependency A -> B -> C. Interested readers can find the additional information about the class-relational data and their processing in description of already realized DBMS and correspondent language Zigzag.

 

All the tabular columns with the same name, as all XML elements with the same type, are united by one class name and mapping one class. So, in figure 2, all the "B" columns and "B" XML elements are one B class. I suppose here that value of each XML element is a text placed directly after opening tag. The value of table field (column) or XML element is a value of the correspondent class defined by the same name as field or XML element. For example, all 'b1', anywhere in figure 2, denote only one value of one B class. Sometimes, if it seems more convenient, it is possible to set up XML element value via "id" attribute, for example <A id='a'><B>b</B></A> in place of <A>a<B>b</B></A>. If the source couple of XML tags does not set up value, these value will be automatically generated in database through a unique identifier. For example I can enter in database such consecution <A><A/></A>. Then, if to print correspondent DB relation, the output XML presentation may have near such view <A id='1'><A id='2'/></A>.

 

Let's analyze now another denormalized or joined form of data presentation.

 

Figure 3

AS

A

B

C

a

b1

c1

a

b2

c2

...

 

 


<data>  
  <A>a
    <B>b1
      <C>c1</C>
    </B>
    <B>b2
      <C>c2</C>
    </B>
  </A>
  ...
</data>

 

In relational approach it is equal to natural join. As we can see, relational dependency (functional or multivalued) is correctly reflected only in XML, not in the table. Then, denormal table form is not exact for class-relational representation (as for relational). Suggested approach supposes only the normalized tables, which present the relation with A -> (B, ...) dependency, where A is a name of the first key column. Correspondent XML has structure <A> <B/> ... </A>. Order of XML elements in one level is not essential and they may have the same type name but must be different by value or id. It is not necessary to use the table names inside of XML. Moreover, table name in the correspondent tabular representation may be automatically derived from the key column or columns combination for cross-reference table.

XPath queries

Navigation through the data of class-relational tables and XML is expressed by means of XPath-like language. Interesting and useful description of XPath may be found in the work XPath 2.0 Expression Syntax of Michael H. Kay. I'd like to compare XPath queries for the same data in relational and class-relational representation. First, I will show the SQL query with the same meaning, selecting all the C values related with A='a' through the B. See figures 4, 5, and 6 having equal semantics.

 

Figure 4, Relational table and SQL

AS

A

B

a

b1

a

b2

 

BS

B

C

b1

c1

b2

c2

 

select BS.C from AS,BS where AS.A='a' and AS.B=BS.B

 

Figure 5, Relational XML and XPath/XQuery

<AS>
  <row>
    <A>a</A>
    <B>b1</B>
  </row>
  <row>
    <A>a</A>
    <B>b2</B>
  </row>
</AS>
<BS>
  <row>
    <B>b1</B>
    <C>c1</C>
  </row>
  <row>
    <B>b2</B>
    <C>c2</C>
  </row>
</BS>


XPath:

/BS/row[B=/AS/row[A='a']/B]/C
XQuery:

for $a in doc("AS.xml")/AS/row[A='a'],

    $b in doc("BS.xml")/BS/row

where $a/B=$b/B

return $b/C


 

Figure 6, Class-relational XML and XPath

<A>a
  <B>b1
    <C>c1</C>
  </B>
  <B>b2
    <C>c2</C>
  </B>
</A>


XPath:

A[.='a']/B/C
 

I hope all agree that class-relational XPath (Fig. 6) looks simpler and more comprehensible. Moreover, XPath with nested brackets (Fig.5) does not work in the modern XPath implementations. For relational XML representation such request is realized in XQuery language.

 

The class-relational XML and XPath work with the relation between class values, but not with the tables (row sets). To get C values related with A='a', it is possible to use even this expression A[.='a']//C. Constructions with beginning slash symbol, like /A, is not required here. I'd like also to mark important property of the class-relational data, which follows rather from the normalized disjoin XML or tables. Each class may be accessible directly. We can get all the C values easy by the C/. or C[.] without navigation through the A and B. Other simple queries, for example selecting A values related with C='c1', would be viewed thus:
A[./*/C='c1']
A[.//C='c1']
A[B[C[.='c1']]]
A[B[C='c1']]
A[*[C='c1']]

II. Relational Sample

From tables to XML

Let's consider the typical example of the relational database.

Figure 7

DEPARTAMENTS

DEPARTMENT_ID

NAME

BRANCH

D1

SoftS

IT

D2

HardS

IT

WORKERS

WORKER_ID

NAME

DEPARTMENT_ID

W1

Fisher

D1

W1

Fisher

D2

W2

Smit

D1

W3

Smit

D2

 

Two tables are related with each other through the DEPARTMENT_ID column. Type of this interrelation is aggregation, where DEPARTMENTS has WORKERS. We consider N:M aggregation, because one worker Fisher (as exception) works in the several departments. The DEPARTMENT_ID is foreign key in the WORKERS table. First columns in both tables are key.

 

Below is the same relation in denormalized tabular and XML forms:

 

Figure 8

data

BRANCH

DEPARTMENT_ID

NAME

WORKER_ID

NAME

IT

D1

SoftS

W1

Fisher

IT

D1

SoftS

W2

Smit

IT

D2

HardS

W1

Fisher

IT

D2

HardS

W3

Smit


<data>
  <BRANCH>
    IT
    <DEPARTMENT id='D1'>
      <NAME>
        SoftS
      </NAME>
      <WORKER id='W1'>
        <NAME>
          Fisher
        </NAME>
      </WORKER>
      <WORKER id='W2'>
        <NAME>
          Smit
        </NAME>
      </WORKER>
    </DEPARTMENT>
    <DEPARTMENT id='D2'>
      <NAME>
        HardS
      </NAME>
      <WORKER id='W1'>
        <NAME>
          Fisher
        </NAME>
      </WORKER>
      <WORKER ID='W3'>
        <NAME>
          Smit
        </NAME>
      </WORKER>
    </DEPARTMENT>
  </BRANCH>
</data>

From XML to tables

I had marked earlier that XML and tables are only form of presentation of the data stored inside of database. The same data must be lightly converted from one form to other. Nevertheless, for legacy system with existent XML or tabular applications, very desirable would be mapping schema regulating correspondence between XML and tables.

 

To transform class-relational data from XML in the most appropriate tabular presentation (4th normal form), I will transform XML in figure 8 to another "disjoined" view in figure 9. Only then, I get the correspondent set of normalized class-relational tables.

Figure 9

<data>
  <BRANCH>
    IT
    <DEPARTMENT id='D1'>
    <DEPARTMENT id='D2'>
  </BRANCH>
 
  <DEPARTMENT id='D1'>
    <NAME>
      SoftS
    </NAME>
    <WORKER id='W1'/>
    <WORKER id='W2'/>
  </DEPARTMENT>
  <DEPARTMENT id='D2'>
    <NAME>
      HardS
    </NAME>
    <WORKER id='W1'/>
    <WORKER ID='W3'/>
  </DEPARTMENT>
 
  <WORKER id='W1'>
    <NAME>
      Fisher
    </NAME>
  </WORKER>
  <WORKER id='W2'>
    <NAME>
      Smit
    </NAME>
  </WORKER>
  <WORKER ID='W3'>
    <NAME>
      Smit
    </NAME>
  </WORKER>
</data>

 

BRANCHES-DEPARTAMENTS

BRANCH

DEPARTMENT_ID

IT

D1

IT

D2

 

DEPARTAMENTS

DEPARTMENT_ID

NAME

D1

SoftS

D2

HardS

 

DEPARTAMENTS-WORKERS

DEPARTMENT_ID

WORKER_ID

D1

W1

D1

W2

D2

W1

D2

W3

 

WORKERS

WORKER_ID

NAME

W1

Fisher

W2

Smit

W3

Smit

 

Most compact and desirable representation of relation in database is normal tabular or disjoined XML form. The tables may have any other view but with obligatory separate description of their columns dependency. Joined XML form may be appropriate basically for presentation of whole bound hierarchy, for example for illustration of the query execution. In case of the data manipulation, it is better to consider disjoined XML form or correspondent normal tabular form. I say about specific representation of class-relational data for navigation and integrity control.

Data integrity

Last figure 9 shows the sample of class-relational dependency. Class-relational data model of considered XML and tables does not correlate with models of the native XML and relational tables. We can see DEPARTEMENT with ID='D1' in several tables, but it is unambiguously the same element. In our case DEPARTMENT with ID='D1' is connected also with NAME = 'SoftS' value and with WORKER_ID = ('W1', 'W2') values. Let's try to delete DEPARTEMENT_ID = 'D1'. The unique NAME = 'SoftS' value (field or tag) will be deleted also. Main reason, because NAME = 'SoftS' looses all the connections (relation) with other values. There is no meaning to keep it. Moreover, it is dependent value. Deleting of BRANCH = 'IT' does not implies deleting of the related DEPARTEMENTs, because they have other connection with other elements. (Think that example is not successful? Well, let's presume that these departments will belong in the future to other branch. So, their information must be kept.).


Type definition of class values is not so necessary for class-relational data, though very desirable. We can define database schema separately even for already existent data, that is enabled in XML databases. I knowingly used in the example ( last figure) the one class NAME related with DEPARTEMENT and WORKER classes. The NAME class is general class. It is better (but not necessary) to divide it on other classes, like WORKER_NAME and DEPARTEMENT_NAME with different value types. Nevertheless, if realization allows, types (domains) of class values may be defined also depending on their relation with other classes. For example, classes NAME(WORKER) and NAME(DEPARTEMENT), or in XPath notation WORKER / NAME and DEPARTEMENT / NAME, may have various types.

Conclusion

This work demonstrates some features and possibilities of developed database system having specific “class-relational” data organization. Correspondent class-relational database considers tabular columns with the same name as one independent class. It is main distinction with relational (or SQL) databases, which do not suppose strict association between columns with equal name in the various tables. Tables and XML are only forms for presentation of relation between class values. So one class value may be found in different fields of tables or tags of XML and simultaneously denote only one object for all the fields and tags with the same name.


In class-relational database there is possibility to manipulate with set of class values independently of relation. Moreover, content and structure of relation depend on classes and class values forming this relation. For example, deleting of class value may affect deleting of rows in all the relations for which correspondent class is key attribute. Class belonging to the several relations actually joins those relations with each other. Therefore navigation through the relations looks very simple, for example in XPath language.

Unfortunately in different approaches and data models the term "class" till now has no precise definition and understanding. More often it is associated with type. Anyhow, the type and class are different terms and finally designate and should designate different concepts. Being based on the ODMG object model, it is possible to understand class as mapping (for example implementation) of type on its extent, that is on the set of instances. The class, beside other things, can include subclasses and make up hierarchy of the classes corresponding hierarchy of certain types.

In the described approach the class is set of the current (kept in a DB or considered at present) values grouped by one name. The type of a class is definition of all possible values of the given class. Name hierarchy sets up class hierarchy. It is allowable and even is very helpful to use one name for class and a corresponding type, essentially on the top level of hierarchy. Only in case of name coincidence between class and type, the ODMG approach works for class definition. What it is the value? Anything you like. Important that it might be stored inside of a DB. The composition of class name and value may be used, for example, as the object identifier.

Given article is a next, supported by practical results, attempt to transfer object features in relational data model. Specificity and the main contradiction of class-relational approach with existing models consists in treatment of attribute. This was made not only for the sake of the practical purposes, simplification of the language interface and effective realization of data processing operations, SELECTION, NATURAL JOIN and PROJECTION in terms of relational model. Also there is underlying semantic reason related with understanding of natural language. The attribute in the given approach is considered as a class. Any attribute denoting role, name, property, etc. finally is a class. I mean both, separate attribute, considered independently of others, like NAME, GREATER, SIZE, etc., and attribute related in context with other attributes (classes J), for example the EMPLOYEE NAME, GREATER SIZE ENGINE, ….