Hello and welcome. In this final session on introduction to XML, we shall be delving primarily
into XML databases. Until now we have been talking about what XML is, what the benefits of XML
are, data interchange using XML, XML queries, schemas and so on.
But today in this session, we shall be looking into XML databases and what are the issues
that come up when XML data has to be stored or transmitted and exchanged over preexisting
or some set of existing data stores. We shall also be looking into the larger problem of
what is called semi-structured data management. Semi-structured data, as you might have guessed
from the name, is data that does not have a specific rigid structure. And we shall show that, more
and more, the data requirements of today's world, what we call the post-internet world,
are being defined by semi-structured data rather than structured, organized data. And semi-structured
data poses some unique problems, in fact some fundamental problems, in database management.
What we shall do in this session is only motivate those problems
rather than provide solutions as such, many of which are in some sense areas of active
research.
So we shall be motivating the problems with some examples, analogies and instances,
and we shall conclude this session with a list of what one could expect in terms of
semi-structured data management.
So let us proceed further with this session. As with the earlier session, some of this
material has been derived from an invited talk by Jayant Haritsa at the VLDB summer
school held in Bangalore in June 2004, including many of the analogies
and jokes that were there in the slides.
So here is an acknowledgement for the material that has been derived from those
slides. So let us look back at XML again and recap what we have learnt about XML and what
are its characteristics. First of all before we look into the points in the slide itself,
let us remember that XML is a platform independent, standardized and extensible markup language.
It is platform independent because it is written as character data, and
every platform should at the very least support character data. And it is an extensible markup
language, that is, it does not have a specific set of keywords as such. Of course
it has keywords where you give declarations, but for describing the data
itself it doesn't have any keywords.
The user who is describing the XML document can come up with his or her own set of tags or
metadata that describe how the data is organized and what semantics to attach to each data
set in an XML store. And XML syntax has a plain hierarchical structure which is easy
to navigate, easy to enforce and easy to parse as well. As a result we
have query languages like XPath, which express an XML query as a directory-like path expression.
This is quite intuitive for users who are used to traversing directories
and looking for something in a specific directory and so on.
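To make the directory analogy concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The sample document and its contents are made up for illustration, loosely following the movie examples used in this lecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the slides.
doc = """
<imdb>
  <show><title>The Fugitive</title><year>1993</year></show>
  <show><title>Seinfeld</title><year>1989</year></show>
</imdb>
"""

root = ET.fromstring(doc)

# "show/title" reads like a directory path: starting at the root element,
# step into every <show> child and then into its <title> child.
titles = [t.text for t in root.findall("show/title")]
print(titles)
```

The path is relative to the root element, much like a relative path typed inside a directory.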
And last but not the least, XML is understandable or parsable by both humans and machines.
So if everything fails, what you can do is just open an XML document in Notepad or vi
or Emacs or some such text editor, look through the document and see where there could
be a problem, especially if maybe a starting tag doesn't have an ending tag or some problem
of that sort.
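The same kind of check can of course be left to a parser. The following is a small sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name check_well_formed and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical broken document: the <row> start tag has no matching end tag.
broken = "<actors><row><lastname>Ford</lastname></actors>"

def check_well_formed(text):
    """Return None if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

error = check_well_formed(broken)
print(error)  # the message includes the line and column where parsing failed
```

The reported position tells you roughly where to look when you open the file in a text editor.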
And of course there are issues like the amount of extra space that XML
takes up. This can be answered by the fact that one need not store XML in character form
on disk. You might want to compress XML data before storing it on disk
and decompress it when reading it, so that the actual disk space that is taken
up by XML is reduced.
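As a rough illustration of this point, the sketch below compresses a generated XML document with gzip from the Python standard library. The choice of gzip and the generated data are assumptions for the example; the lecture does not name a specific compression scheme.

```python
import gzip

# Generate a repetitive, hypothetical XML document with 1000 rows.
rows = "".join(
    f"<row><lastname>Last{i}</lastname><firstname>First{i}</firstname></row>"
    for i in range(1000)
)
xml_text = f"<actors>{rows}</actors>".encode("utf-8")

compressed = gzip.compress(xml_text)    # what would be written to disk
restored = gzip.decompress(compressed)  # what a reader gets back

print(len(xml_text), len(compressed))
```

Because tag names repeat so often, markup-heavy documents like this one tend to compress well.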
So let us now address the question of data interchange using XML. XML is said to be a platform
independent form of data representation, and the data it represents is self-describing
in the sense that the metadata is embedded within the data. The advantage
of this is that you don't have to worry about
whether the end user is using the same platform as yours.
If you are using Linux, the end user could be using Windows or Solaris or Mac or whatever,
and the end user could still use your services using SOAP or some other form of XML based
message passing protocol, and you can still provide those services
via an XML wrapper. But what is the interchange problem in its entirety? For the
interchange problem, we have to first note that most data is already stored in some existing
databases. It's quite unlikely that databases that have existed for long periods
of time will be replaced by an XML data store or by something else
that is unifying.
So databases are going to continue to exist, and the question of data interchange
boils down to how to interface these databases using XML or some other common
interchange language. And databases will not only continue to exist, they will also be updated
through existing interfaces. It's again quite unlikely that all updates and all interfaces to the
databases will come through a common interchange format. So what is really required is to provide
XML wrappers around existing databases. Take SOAP as an example in this regard. Message
passing between objects existed long before XML came into the picture. There were several
different object middleware systems like CORBA or DCOM and so on which supported message passing
between objects over a distributed system.
However, a platform needed
to be CORBA compliant in order to support CORBA, that is, CORBA had to be written
for that platform. On the other hand, when using XML, it does not matter what kind
of platform is being used, because every message is passed using an XML wrapper.
So SOAP is some kind of a generalization of message passing frameworks in distributed
object based systems, extending from smaller distributed systems like
CORBA to larger web based services. A similar analogy also exists in
data integration, where data exists in different databases in preexisting forms and preexisting
applications, and they have to be integrated by wrapping the data using XML
wrappers.
So one simple way to do this is to convert databases into XML. Here we are assuming relational
databases, and for the most part we would be correct, because the majority of database implementations
around the world are relational. So assume that the existing databases are relational
and that we need to interface between these different relational databases,
which are on different platforms. A simple way to do this is to convert relational
tables into XML documents in a canonical form.
This slide shows such an example, where there is a table called actors consisting
of two columns, last name and first name. This table becomes one XML document,
<actors> ... </actors>; it is a rooted XML document. And it is a flat XML
document in the sense that the number of levels in the document is fixed. That is, the actors
document comprises several children called <row> ... </row>, where each
row in turn comprises several children, each child corresponding to a specific attribute.
The element name is the attribute name, and the element content is the attribute value.
So it's quite simple to map a given relational table onto a canonical flat XML file, and
given a flat XML file, it's again straightforward to map it back into a relational table.
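The mapping in both directions can be sketched in a few lines. This is only an illustration, assuming Python's standard xml.etree.ElementTree; the function names table_to_xml and xml_to_table and the sample rows are hypothetical, and all values are treated as strings.

```python
import xml.etree.ElementTree as ET

def table_to_xml(table_name, columns, rows):
    """Map a relational table onto the canonical flat form: table/row/column."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            ET.SubElement(row_el, col).text = value
    return root

def xml_to_table(root):
    """Map a flat XML document back to (table name, columns, rows)."""
    row_elements = root.findall("row")
    rows = [[cell.text for cell in row_el] for row_el in row_elements]
    columns = [cell.tag for cell in row_elements[0]] if row_elements else []
    return root.tag, columns, rows

columns = ["lastname", "firstname"]
rows = [["Ford", "Harrison"], ["Jones", "Tommy Lee"]]  # made-up sample rows

actors = table_to_xml("actors", columns, rows)
print(ET.tostring(actors, encoding="unicode"))
round_trip = xml_to_table(actors)
```

The document always has exactly three levels (table, row, column), which is what makes the reverse mapping straightforward.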
Now let us come to the question of storing XML documents themselves, that is, having XML databases.
In many cases, for example in several commercial implementations like Oracle 9i and
DB2, the database gives the impression that it supports XML storage and XML based
updates. However, underneath it is still a relational database, and
there is an XML wrapper around the database.
However, there are several other approaches to storing XML documents, and in many cases
it becomes necessary to treat XML not just as an interchange language but also as the
language in which the data is stored. Some of the issues that occur in this regard
are issues like data layout. How would you organize the XML data? There are again
many alternatives.
One can map XML data onto a relational database, or one can store an XML data set as a special
kind of object in an object relational database, or one can store XML data in native XML form
or in text form and so on.
And what about updates? XML databases are prone to updates, and in XML, unlike
in say a relational database, it's not just the values of attributes that can change; the
actual structure of the database can also change very frequently. In a relational database
we basically assume that once we fix the schema it stays intact, and it's only the data
that is updated. That is, new records are added or existing records might be deleted or
modified and so on. But evolution of the schema itself, that is, change in the structure
of the database, is considered to be far less frequent, if it happens at all. So
the relational database is oriented towards
fast updates and retrieval of data, and not of the structure as such.
But in an XML database that may not be the case. There could be updates not
just of the values but also of the structure in which data is stored in the database.
And in many cases there may be no explicit schema available, so well-formedness
itself is the schema. So we don't really know which is a valid structural update
and which is an invalid structural update, or what kinds of constraints exist between
different elements and so on, unless there is an XML Schema or a DTD available
for us.
Similarly, there are different kinds of queries that need to be supported by an XML
database: the standard FLWOR (pronounced "flower") queries that we saw earlier, that is,
for-let-where-return kind of queries, and also select, project and join based queries, which are
also supported in XQuery, in addition to recursion and document construction, that is,
transformation from an input XML document to an output XML document. And a far bigger
problem is indexing. How do we maintain metadata, or how do we maintain indexes into
an XML store, such that we can retrieve elements as quickly and efficiently as possible whenever
required? And it's not just the normal attribute and value index that we need to maintain;
we also need full text indexes, what are called inverted indexes, so as to
be able to search full text elements or data sets quite efficiently.
In addition to these kinds of indexes we also need what are called structural indexes.
Structural indexes essentially talk about the
structural relationships between different elements: is one a child of the other, is one
reachable from the other, what are the siblings of a given element, and so on. And in addition
there are general requirements like scalability, recovery in case of failures, concurrency
control during updates and so on.
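To see what questions a structural index has to answer, here is a deliberately simple sketch that keeps a parent pointer for every element. This is only one possible in-memory illustration, not a technique described in the lecture; the sample document and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
doc = ("<imdb><show><title>The Fugitive</title><year>1993</year></show>"
       "<show><title>Seinfeld</title></show></imdb>")
root = ET.fromstring(doc)

# The "index": a map from every element to its parent element.
parent_of = {child: parent for parent in root.iter() for child in parent}

def is_child(element, candidate_parent):
    return parent_of.get(element) is candidate_parent

def is_reachable(descendant, ancestor):
    """True if ancestor lies on the path from descendant up to the root."""
    node = parent_of.get(descendant)
    while node is not None:
        if node is ancestor:
            return True
        node = parent_of.get(node)
    return False

def siblings(element):
    parent = parent_of.get(element)
    return [] if parent is None else [c for c in parent if c is not element]

first_show = root[0]
title, year = first_show[0], first_show[1]
print(is_child(title, first_show), is_reachable(title, root), len(siblings(title)))
```

A real store would have to keep this kind of information on disk and up to date under structural updates, which is where the difficulty lies.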
So let us look at XML storage structures and the different kinds of choices
we have for XML storage. Basically we can divide the storage
paradigms or storage mechanisms that are available
into three different classes, which might be termed flat stream based storage, native
XML storage and colonial storage. So what does each of these terms mean? Flat stream
based XML storage essentially stores XML data as some kind of object in an object relational
database. The documents are stored as CLOBs. Remember what a BLOB is: a BLOB is a binary large
object, and a CLOB is a character large object.
So you just store an object comprising mostly of characters, a character large object, which
comes with its own mechanisms of access and retrieval, as one of the attributes in a relational
table. So just like storing any multimedia object or some such object, you can store
an XML data set in an object relational database itself. The advantage
of this is that you don't need to reinvent anything, in the sense that there are several
object relational databases already available, and it's just a question of using
one of them to store XML documents. However,
the flip side of this argument is that the database itself does not support any XML centric
queries. So you can query the database using object relational constructs but not
XML constructs. All those XML specific query constructs like FLWOR queries or XPath
queries and so on cannot be directly supported.
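The limitation can be seen in a small sketch. Here SQLite (through Python's standard sqlite3 module) stands in for an object relational database and a TEXT column stands in for a CLOB; both are assumptions made for illustration, and the table and sample documents are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# The XML document is just one opaque character column of the table.
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO documents (body) VALUES (?)",
             ("<show><title>The Fugitive</title><year>1993</year></show>",))
conn.execute("INSERT INTO documents (body) VALUES (?)",
             ("<show><title>Seinfeld</title><year>1989</year></show>",))

# SQL can fetch the rows, but it knows nothing about <title>; the XML-aware
# part of the query has to be done by the application after fetching each CLOB.
matches = []
for doc_id, body in conn.execute("SELECT id, body FROM documents ORDER BY id"):
    title = ET.fromstring(body).findtext("title")
    if title is not None and "Fugitive" in title:
        matches.append(doc_id)

print(matches)
```

Every document is fetched and parsed even though only one matches, because the database cannot look inside the stored text in an XML-aware way.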
The next kind of storage strategy for XML is native XML storage. Native
storage essentially means that you design a new database from scratch, optimized for storing XML
data. So here you have to worry about everything that you
would think about when designing, let's say, a relational database. One has to think about the
storage structure and block accesses, that is, the physical storage structure, indexing structures,
some kind of query mechanism, recovery mechanisms, concurrency control, transaction
throughput, how to maintain efficient transactions and so on.
So everything that goes into designing any conventional DBMS should go into designing
this native XML storage as well. And there are quite a few examples of databases supporting
native XML storage. The third strategy is what might be termed a colonial strategy.
A colonial strategy essentially means that you colonize an existing paradigm using XML. That
is, you use an existing paradigm, say the relational paradigm, and then map
every XML construct to a relational construct. Note the difference between the
colonial strategy and the first strategy, in which you just store XML documents as character large
objects in one of the attributes of a relational table.
Colonial storage, however, essentially decomposes or deconstructs an XML document into relational
constructs and reconstructs it back when needed. So whatever XML related query you provide,
like XPath expressions or FLWOR queries and so on, has to be rewritten in SQL, and
one needs to have a mapping between XQuery constructs and SQL
constructs and vice versa. And of course this can be both an advantage and a limitation.
The advantage primarily is that you don't have to worry about a number of things that
native XML storage has to worry about: storage structures, block storage structures,
concurrency control, recovery and so on.
However, there is a significant if not huge performance overhead in mapping an
XQuery construct into an SQL construct. Especially when a recursive kind of query
is given, mapping it into an SQL construct can take quite a while, and running the SQL query
on the database can also be quite inefficient.
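One way to picture the colonial strategy is the sketch below, which shreds a document into a single table of nodes with parent pointers and then answers path queries in SQL. The single node table is just one possible mapping chosen for illustration, with SQLite as the relational engine; the table layout, the function name shred and the sample document are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

doc = ("<imdb><show><title>The Fugitive</title><year>1993</year></show>"
       "<show><title>Seinfeld</title><year>1989</year></show></imdb>")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
             "tag TEXT, text TEXT)")

def shred(element, parent_id=None):
    """Decompose the XML tree into one relational row per element."""
    cur = conn.execute("INSERT INTO node (parent_id, tag, text) VALUES (?, ?, ?)",
                       (parent_id, element.tag, element.text))
    node_id = cur.lastrowid
    for child in element:
        shred(child, node_id)

shred(ET.fromstring(doc))

# The path /imdb/show/title becomes one self-join per step of the path.
titles = [r[0] for r in conn.execute("""
    SELECT t.text
    FROM node AS r
    JOIN node AS s ON s.parent_id = r.id AND s.tag = 'show'
    JOIN node AS t ON t.parent_id = s.id AND t.tag = 'title'
    WHERE r.parent_id IS NULL AND r.tag = 'imdb'
    ORDER BY t.id""")]

# A descendant query (all <year> elements at any depth) needs recursion in SQL.
years = [r[0] for r in conn.execute("""
    WITH RECURSIVE reachable(id) AS (
        SELECT id FROM node WHERE parent_id IS NULL
        UNION ALL
        SELECT n.id FROM node AS n JOIN reachable AS d ON n.parent_id = d.id
    )
    SELECT n.text FROM node AS n JOIN reachable AS d ON n.id = d.id
    WHERE n.tag = 'year'
    ORDER BY n.id""")]

print(titles, years)
```

Even in this tiny case a three-step path needs two joins and the descendant query needs a recursive definition, which hints at where the rewriting and execution overhead comes from.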
So let us look into the second strategy in a little more detail, namely the strategy
of native XML storage, that is, designing an XML database from scratch by looking
at all the different aspects that need to go into an XML store. So what are
the typical kinds of issues that one needs to worry about when designing
native XML storage? One needs to think about data layout, that is, how data is organized on the
disk. Physical data layout essentially means this: if my disk is organized in terms of
disk pages or disk blocks, what should each block contain, how should the blocks
be organized, and so on.
Indexing is again a major requirement. Like we said before, it's not just attribute and
value indexes that are important; we also need to look into full text indexing, indexing
of phrases within text, and structural indexes that describe how elements are related
to one another and so on.
One also needs to address query support: what kinds of queries are supported in this
store, how do we optimize queries, and do we need to preprocess
the data, and if so how, to make query retrieval efficient? Then there are
access control, concurrency control, updates, transactions, recovery and many
other issues.
Now let's look at the question of data storage in a little more detail, and at
the different approaches that are used for
managing the physical data storage, that is, how an XML tree is mapped onto physical disk
blocks. Essentially an XML data set is a tree. This slide shows
a particular tree, where there is an imdb element at the root and there are different
show elements at the second level, and each show element has different subtrees like
title, year, box office and so on and so forth. And finally the leaf nodes contain the data
that is available in this XML tree.
Now how do we divide this tree into physical data blocks, and which kind of division makes
sense? One way of dividing a tree into disk blocks is to
cluster based on subtrees. That is, this show subtree would go into one disk
block and that show subtree would go into another disk block, and the imdb element would just
be in an index which in turn just stores pointers to each of these disk blocks.
Now what is the advantage of such storage? The simple advantage is that an entire
subtree is in one disk block. Consider navigational queries, where the user just
navigates, that is, opens an element and looks at its subtree, then opens another element
underneath and looks at that subtree and so on, like in an Explorer kind of file system
navigation. For such navigational queries this storage structure is very efficient,
because when the user is navigating, the navigation path is quite close to the actual way in which
elements are related in the tree itself. It is quite unlikely that the user would open
this subtree and then start navigating in some unrelated part of the tree. So in one disk access,
one can read an entire subtree.
However, there is a flip side to this kind of data layout. Suppose the user gives
an XQuery kind of query saying: show me all show elements
where the title contains, say, "The Fugitive". Now the query has
to search on a particular element, say title, and even though we know which element
has to be searched, we still have to access every data block on the disk, because every
data block that stores a show element would contain
a title element. So we still have to access every data block containing a show element
in order to answer such a query.
So for answering attribute kind of queries, such a file organization
is actually counterproductive, or at least not very efficient. However, for answering navigational
kind of queries, such an organization is quite useful.
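The trade-off can be made concrete with a toy model in which a Python dictionary entry plays the role of one disk block and we simply count how many blocks a query touches. The data, the block model and the function names are all hypothetical simplifications.

```python
# Hypothetical data set: three show subtrees.
shows = [
    {"title": "The Fugitive", "year": "1993"},
    {"title": "Seinfeld", "year": "1989"},
    {"title": "Casablanca", "year": "1942"},
]

# Subtree clustering: block i holds the entire i-th <show> subtree.
blocks = {i: dict(show) for i, show in enumerate(shows)}

def navigate_show(block_id):
    """Navigational query: open one show and read all its children."""
    return blocks[block_id], 1  # the whole subtree arrives in one block read

def search_title(term):
    """Attribute query: find the shows whose title contains the term."""
    reads, hits = 0, []
    for block_id, subtree in blocks.items():
        reads += 1  # every show block has to be read to see its title
        if term in subtree["title"]:
            hits.append(block_id)
    return hits, reads

print(navigate_show(0))
print(search_title("Fugitive"))
```

With this layout the number of block reads for the title search grows with the number of shows, while navigation into any single show stays at one read.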
The second kind of storage structure is to cluster similar elements
within a data page, that is, within a disk block. This slide shows such an example, where
the same tree is taken, with imdb, show, title, year, seasons and so on. However, rather than
storing an entire subtree within a data block, here the elements at a particular level are
clustered together by name. So all title elements are clustered together and stored in one
data block, the red data block let us say. And all year elements are clustered
together and stored in one single data block, that is, the blue data block, and so on. And then
there are pointers that point to each of these data blocks. Any elements that
don't contain CDATA or PCDATA can all be clustered into one data block, which then
maintains pointers from them to each of these data blocks.
Now, this is in some sense the dual of the earlier mechanism, in that such an
organization is very efficient for answering attribute queries. So if I want
to say: return all show elements where the title contains
the term "The Fugitive", then all that you need to do is to first access the block which
contains the show elements and find the pointer to the data block containing all the title
elements. And with just one data block access, or of course more than one
depending on how many title elements there are, but generally with far fewer data block accesses
than are necessary in the previous case, we can access all title elements that are
there in this XML store.
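The dual behaviour of element clustering shows up in the same kind of toy model, where each dictionary entry again plays the role of one disk block. The data and names are hypothetical, and the read of the block holding the pointers is ignored for simplicity.

```python
# Hypothetical data set: three show subtrees.
shows = [
    {"title": "The Fugitive", "year": "1993"},
    {"title": "Seinfeld", "year": "1989"},
    {"title": "Casablanca", "year": "1942"},
]

# Element clustering: one block per element name, holding (show id, value) pairs.
blocks = {
    "title": [(i, s["title"]) for i, s in enumerate(shows)],
    "year": [(i, s["year"]) for i, s in enumerate(shows)],
}

def search_title(term):
    """Attribute query: only the block of title elements is read."""
    hits = [show_id for show_id, title in blocks["title"] if term in title]
    return hits, 1

def navigate_show(show_id):
    """Navigational query: each child of the show lives in a different block."""
    reads, subtree = 0, {}
    for name, block in blocks.items():
        reads += 1
        subtree[name] = dict(block)[show_id]
    return subtree, reads

print(search_title("Fugitive"))
print(navigate_show(1))
```

Here the title search costs one read regardless of how many shows there are (as long as the titles fit in a block), while opening a single show costs one read per kind of child element.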
So answering an attribute query is far simpler; however, answering a navigational query becomes
difficult in this case. And there are other techniques as well; it depends on how
the clustering is done. If clustering is performed based on what may be termed the lowest
level elements, where the elements contain just CDATA or PCDATA, it becomes difficult
to navigate. That is, from show you need to open title and year, and going from title to year
requires a different block access and so on. However, it might be possible to cluster
based on paths rather than on single elements. So rather than clustering similar
elements, some other techniques cluster similar paths.
That is, one would cluster all paths of the form imdb, show, title in one block
and all paths of the form imdb, show, year in another block and so on. So there are different
variants. However, if one were to ask the question which is
the best storage structure for XML databases, the answer would be: it depends. Essentially
there is no single technique that is universally the most efficient way of
storing and accessing XML databases. To a large extent it depends on what kind of
queries you expect from the users. If it is navigational queries, it might be
better to cluster the tree based on subtrees, that is, store subtrees within data blocks.
On the other hand, if it is more of attribute searches or even say full text searches, it
might make sense to cluster similar elements in a page rather than subtrees.
Let us look into the problem of indexing XML documents and what kind of indexing requirements
arise for XML documents. We said that in addition to attribute value indexing, for
which we can use traditional RDBMS indexes like say a B+ tree or a B-tree, we
need two other kinds of indexes here. One is what may be termed full text indexing, where we
should be able to efficiently search on free text that is written within an XML document.
So one might just write a paragraph within an XML document and want to be able to search for
some keyword in that paragraph. A very common mechanism for indexing full text is
what is called an inverted index. An inverted index is very similar to what you would find
at the end of a book, say a textbook, where you have an index with certain keywords
and the page numbers or section numbers in which those keywords
are either defined or used. So an inverted index on an XML document would
index the different keywords that appear in the document and then maintain links into the
XML document saying where each of these keywords can be found. Keyword based indexing can
be of two kinds: it may be either XML aware keyword indexing or XML unaware keyword
indexing. So what is the difference between the two? XML unaware keyword indexing just
supports plain keyword searches. So you just give a keyword search like "The Fugitive" or "Jerry
Seinfeld" or whatever appears in the XML document. This
table here shows an XML unaware keyword index, that is, for every term that appears, it just
stores a reference to the document which contains it, or probably an element reference
or whatever. So it doesn't bother about which kind of element the keyword appears in; it only
bothers about the keyword and where it occurs.
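A minimal sketch of an XML unaware inverted index follows. The documents, the document identifiers and the simple lowercase word tokenization are assumptions made for illustration, and for brevity only the text directly inside each element is indexed (text that follows a child element is ignored).

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents, keyed by a made-up document identifier.
documents = {
    "d1": "<show><title>The Fugitive</title><year>1993</year></show>",
    "d2": "<show><title>Seinfeld</title>"
          "<review>Funnier than The Fugitive</review></show>",
}

# term -> set of documents containing it; the element is not recorded at all.
unaware_index = defaultdict(set)
for doc_id, text in documents.items():
    for element in ET.fromstring(text).iter():
        for term in re.findall(r"\w+", (element.text or "").lower()):
            unaware_index[term].add(doc_id)

print(sorted(unaware_index["fugitive"]))
```

Both documents come back for the term fugitive, even though only one of them has it in a title; the index has no way of telling the two cases apart.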
On the other hand, an XML aware index contains not only the keywords
but also another column which says which element the keyword
appears in. So this &t1 here is a key into an element index,
which uniquely identifies each element that is present
in the XML store. So what this says is that the term
fugitive appearing in the element whose key is t1 can be accessed using this pointer.
So suppose you are looking at attribute based searches, and we want to say: return all
show elements where the title sub element of show contains the term "The Fugitive".
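An XML aware index can be sketched by recording, for every term, the element it occurs in. The element keys used here (document identifier plus position in document order) are a made-up stand-in for keys like &t1 on the slide, and the documents and tokenization are again assumptions.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents, keyed by a made-up document identifier.
documents = {
    "d1": "<show><title>The Fugitive</title><year>1993</year></show>",
    "d2": "<show><title>Seinfeld</title>"
          "<review>Funnier than The Fugitive</review></show>",
}

# term -> set of (element name, element key) pairs.
aware_index = defaultdict(set)
for doc_id, text in documents.items():
    for position, element in enumerate(ET.fromstring(text).iter()):
        key = f"{doc_id}.{position}"  # hypothetical unique element key
        for term in re.findall(r"\w+", (element.text or "").lower()):
            aware_index[term].add((element.tag, key))

def elements_containing(term, tag):
    """Keys of the elements with the given name whose text contains the term."""
    return sorted(key for t, key in aware_index.get(term, set()) if t == tag)

print(elements_containing("fugitive", "title"))
print(len(aware_index["fugitive"]))
```

The title query now returns only the title element of the first document, but the term fugitive already needs two index entries, one per element it appears in.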
Then XML aware indexing makes much more sense than XML unaware indexing. Of course,
the flip side of XML aware indexing is that as the number of elements increases and keywords
are repeated across different elements, there is a huge number of combinations
of term and element pairs. The term 1993, for example, may appear
in different kinds of elements: it might appear in a birth date, it might appear in a release
date, it might appear in a show date and so on.
Each of these has to be indexed separately, which leads to an increase in the size of
the index structure. This table shows different native XML databases that are available,
and quite a few of them are already available, like Xyleme or Natix or GoXML and so on.
Most of them have been built from scratch, and some of them like eXcelon or Tamino have
been built over existing databases. So those are not native XML storage in a pure sense,
but they are called native mainly because they do address many of the questions
that are typically addressed in native XML storage. And there is a wide
diversity in the kind of features that are supported.
So Xyleme, for example, supports full text searches, XPath searches and XQuery searches, but
what kind of APIs it provides is unknown, and Natix doesn't support
any of these but just supports some low level primitives and so on. And there is partial
support for XQuery and so on. Let us now move into the second part of this talk, where we
look into managing semi-structured data.
So until now we have been looking primarily
into XML, and XML is a very comprehensive tool
to manage semi-structured data. So what is semi-structured data, what is its significance,
and why is it important to study semi-structured data?
So, let us first define what semi-structured data is. There are several different definitions.
What we often understand by semi-structured data is data whose structuring
is not rigid, data which doesn't conform to a very rigid structuring mechanism. There
are other kinds of definitions as well, like data that is inherently self-describing.
Self-describing data has no rigid
schema; the data itself defines its schema, and that is known as semi-structured data.
And again, data which is generated offhand, without planning and so on, is also
called semi-structured data.
Now in today's world semi-structured data is getting more and more prevalent, and
structure is becoming much harder to define and impose. For example, what
is the structure of the world wide web? The world wide web is a huge data store
without any rigid structure, but one cannot even say that it is an unstructured
data store, because there is some semblance of structure present: one can
think of the metadata tags that are available in each HTML page, HREF hyperlink references,
directories and so on and so forth. But on the whole it is not possible
to define a specific structure and then impose that structure on the world wide web.
So the world wide web is the best example of a huge semi-structured data store. Most
semi-structured data stores are characterized by rapid or very frequent
changes to the data set. So not only is the
structure of the data not defined a priori, it is also changing
rapidly. And it's not possible to formalize semi-structured data using a
nice formal model like the relational model; the data structures usually used to
formalize semi-structured data are graph structures.
And there are several different examples of semi-structured data, like web information
systems and digital libraries. Even data integration from heterogeneous data sources
can be considered a semi-structured data problem, where there are several different
databases that are already defined, and there are so many such databases that it becomes
impossible or impractical to impose a common structure over all of them, and it
is easier to treat them as one large semi-structured data store. A very common example of
a semi-structured database is of course the Internet Movie Database, which we have been
seeing examples from all along when talking about XML as well.
So IMDb is a classic example of a collection of semi-structured data. What
makes it semi-structured? The fact that even though it just stores information
about movies, each movie is different from the others. Each movie may belong to a different
genre and to a different country; some may have language fields, some may have
star cast fields, and some may have other kinds of fields which may not be available
in the other records and so on. So let's consider an example from a movie database. Let's say
IMDb is the database, and of course IMDb doesn't just store movies; it also stores information
about TV serials, documentaries and other such movie related or movie like material that
has been released.
So even within the movie category, different movies could
be different. Movie one may have information about the cast and the director of the movie,
but movie three may have something called
actors and actresses, that is, it may make a distinction between who is an actor and who is an
actress, and, rather than just the director, it may talk about direction, that is, a direction
team or who directed it and so on.
So the structure of each record that makes up a movie element in IMDb may be different
from the others. Some data elements may carry more information than others,
some may have missing fields, and the kind of relationships that exist between
these different elements may also vary between different records.
And in addition to changes in structure, the way in which data is organized
itself could change. For example, one record might represent an actress's name as first name, last
name, another might represent it as last name, first name, and some other record
might have just a single name field and so on.
And data gets added to this database dynamically and from different independent
sources. So it becomes difficult to enforce a particular
kind of schema restriction on this database. So what is the
main problem in managing semi structured data? The main problem is trying to ascertain what
common structure to use for the different data sets that are being
added to the database, so as to be able to formulate queries and query languages and
so on.
And in addition we should also note that the structure of a data element is implicit. So
it's not that the user providing the data set first defines the structure and then provides
data according to it; rather, the structure is embedded within the data itself.
Now the structure has to be first discovered and then a common structure has to be evolved
over the entire data set. And this is what is called the problem of discovery of structure,
that is, the structure should be discovered such that it is indicative in nature
rather than constraining in nature.
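As a rough illustration of an indicative structure, the following Python sketch summarizes
which fields occur in a set of records, how often, and with what kinds of values. The records
and the discover_structure function are hypothetical, and this is only one simple way of
summarizing structure, not a standard algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical records; the field names are illustrative only.
records = [
    {"title": "Movie One", "cast": ["Jane Doe"], "director": "John Roe"},
    {"title": "Movie Two", "director": "Mary Major", "language": "Kannada"},
    {"title": "Movie Three", "actors": ["Richard Roe"], "year": 2004},
]


def discover_structure(records):
    """Summarize which fields occur, how often, and with which value types."""
    counts = Counter()
    types = defaultdict(set)
    for record in records:
        for field, value in record.items():
            counts[field] += 1
            types[field].add(type(value).__name__)
    return {
        field: {"seen_in": counts[field], "of": len(records),
                "types": sorted(types[field])}
        for field in sorted(counts)
    }


summary = discover_structure(records)
for field, info in summary.items():
    print(field, info)
```

The summary describes what is in the data without rejecting any record, which is the sense
in which it is indicative rather than constraining.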
That is, the common structure that evolves out of a database should not constrain the
database to adhere to a specific structure, but rather should be indicative of what kind
of data is available in the database and how they are interrelated. So here
is an example. This slide shows the main challenge in semi structured data.
In relational, or what may be termed traditional, data management what really happens is you
have a UOD or a universe of discourse like a company or an academic institute or a university
or whatever. And then you have a model of the UOD, that is, there is a schema that defines
how data in the UOD should be organized. It's not how it is organized but how the data should
be organized, and data that is collected from the UOD is first taken through this model
and then populated into the database.
So when we say that an employee should have a PAN number as the primary key, and name and
dependents and salary and so on, it's only those sets of data that are extracted from
the UOD and then sent into the database. For example, if an employee doesn't
have a PAN number and the PAN number is the primary key, then it is not possible to add
that employee record into the database, because the primary key has
a not null constraint.
So constraints are enforced when the database is being populated, and the query also is
formulated within the model of the UOD. So the query just takes the model of the UOD and
queries the database accordingly.
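The employee example can be tried out with Python's built-in sqlite3 module. This is only a
sketch: the table layout, the names and the PAN value are made up, and NOT NULL is written
out explicitly because SQLite does not imply it for a text primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL is spelled out because SQLite does not imply it for a TEXT primary key.
conn.execute(
    "CREATE TABLE employee ("
    " pan TEXT PRIMARY KEY NOT NULL,"
    " name TEXT NOT NULL,"
    " salary REAL)"
)

# A record that fits the model is accepted.
conn.execute("INSERT INTO employee VALUES (?, ?, ?)", ("PAN-0001", "Asha", 50000.0))

# A record without a PAN number is rejected when the database is populated.
try:
    conn.execute("INSERT INTO employee VALUES (?, ?, ?)", (None, "Ravi", 42000.0))
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("rejected:", error)

# The query is formulated against the same schema.
rows = conn.execute("SELECT name FROM employee ORDER BY name").fetchall()
print(rows)
```

The employee without a PAN number never reaches the database, so a query never has to deal
with him.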
However in what might be termed post internet data management, which is where the main
problem with semi structured databases lies, the universe of discourse, whatever the universe
is, the world wide web or the Internet Movie Database where users can independently add movie
data into the database, directly populates the database.
It doesn't go through any common mental model by which the database is populated. In fact
there may or may not be any mental model here as to how the data is organized; the
data is directly populated by the UOD.
Now when the query is searching the data, it should not only know what data
to search for; it should first try to find out what the mental model is, that is, how the
data is organized and what schema the data can be searched by. So that is
basically the schema discovery problem or the implicit schema discovery problem.
And in addition to that, the schema discovery problem often encounters the problem of what
is called the large schematic structure. That is, even when we discover a schema, the
discovered schema can be large. This is called the maximalist world notion, in contrast to
the minimalist world model of a traditional database system.
In a traditional database system, whatever is not explicitly
permitted by the schema is forbidden. So everything is forbidden unless explicitly allowed by
the schema. So it's a kind of exclusivist data model where things are thrown away unless
they are permitted.
However a schema discovery process follows an inclusive model, where everything should
be permitted unless it is certain that it is forbidden. That is, unless it is certain that
some kind of relationship cannot exist, all kinds of relationships between data elements
are permitted. So one cannot define a priori what kinds of relationships exist between
data elements, unless of course we know that some kinds of relationships do not or
cannot exist in this UOD.
So the associated problem is that the discovered schema can be quite large,
in contrast to a relational schema, where the schema is much smaller than the data set
itself. So in the Internet Movie Database for example, we might discover a lot of
different things that go into a movie based on what
people add into the database. And we should allow for all such relationships, unless of
course we know specifically that some kind of relationship
cannot exist.
For example we might know as a rule, I don't know if this is true, but we might know as a rule
that in the world of movies it is not possible for
a director to be the boss of the producer, that is, for the producer to report
to the director or whatever. So unless we know that some kind of relationship
does not exist, we have to accept all kinds of relationships. We might have to
accept a data set about a movie which does not contain any movie star.
We might have to accept a data set where a movie contains 10 different movie stars and
so on. So all such relationships should be accepted unless explicitly forbidden. And
as a result, the actual schema that is generated is far bigger than a typical relational schema.
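The inclusive model can be sketched in a few lines of Python. The records, the relationship
triples and the accept function are hypothetical, and the single forbidden rule simply
mirrors the director and producer example, which as noted may not even be true.

```python
# Hypothetical relationship triples: (subject role, relation, object role).
# The one forbidden rule mirrors the lecture's own unverified example.
FORBIDDEN = {("director", "boss_of", "producer")}


def accept(record):
    """Inclusive check: accept a record unless it states a forbidden relationship."""
    return not any(rel in FORBIDDEN for rel in record.get("relationships", []))


no_stars = {"title": "Movie A", "stars": []}
ten_stars = {"title": "Movie B", "stars": [f"Star {i}" for i in range(1, 11)]}
unusual = {"title": "Movie C", "relationships": [("director", "boss_of", "producer")]}
novel = {"title": "Movie D", "relationships": [("composer", "sibling_of", "editor")]}

results = {m["title"]: accept(m) for m in (no_stars, ten_stars, unusual, novel)}
print(results)
```

Only Movie C is turned away. Movie D states a relationship nobody anticipated, and it is
accepted precisely because nothing forbids it, which is why the discovered schema keeps
growing.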
There are several different application areas where this is useful and where these
ideas have been tried out. These include data integration, where you design an interface
to integrate disparate data sources coming from different locations, each having
their own schematic structures. And the second major area of application is in digital
libraries, which again consist of different kinds of semi structured data coming from
different sources.
There are several more application areas, like genome databases, or scientific databases that
hold scientific documents, similar documents, citations, references and abstracts, where
each was published, ratings and so on and so forth. And of course there are E-Commerce
applications, where the discovery of structure problem becomes very important in business to
business systems: if each business is quite big, it might be difficult or impractical to
impose a very specific schematic structure over the entire business house.
So one needs to be able to resort to semi structured data management when managing
B2B systems.
So let us skip through these slides, where the need for discovery of structure is motivated
even more and which talk about how discovery of structure can go about. Addressing
the discovery of structure problem in itself would involve a separate
session, and it is clearly beyond the scope of this particular lecture.
However we can give some rules of thumb about how implicit structure can
be discovered from a set of disparate data sources. Most of these revolve around
looking for some kind of regularity in the data set and then generalizing based on these
regularities. And so several different kinds of data mining,
machine learning and artificial intelligence techniques are explored for trying to
fit a structure on to a data set. And there is also a notion of what is a best fit. A
best fit schematic structure should be neither too general nor too specific.
The structure discovery process should be able to generalize
based on whatever examples were encountered while passing through the data.
However it shouldn't be so general that it accepts anything or any structure
in the data set. That is, it should also identify what the forbidden relationships among
data elements are, in addition to all the possible relationships among the data
elements.
And several different kinds of query languages are also supported for semi structured data,
in addition to XQuery, which is primarily meant for XML databases, and of course keyword based
searches, which are useful for full text searching. There are other kinds
of primitives like navigation based queries, searching for patterns, or temporal queries
based on how a particular data element evolves over time and so on. So XML is an embodiment
of semi structured data in the sense that XML is the natural choice in which semi structured
data can be organized. And the problem of discovery of structure over semi
structured data can be reduced to discovery of an XML schema given a disparate set of XML
documents. And queries can then be formulated using XPath and XQuery expressions based on
whatever structure has been discovered.
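A very small version of this two step idea, first discover the structure and then query with
it, can be sketched with Python's standard xml.etree.ElementTree module. The documents and
element names are made up, the element_paths helper is hypothetical, and listing element
paths is only a crude stand-in for discovering a real XML schema. ElementTree also supports
only a subset of XPath, and no XQuery.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical documents; the element names are illustrative only.
documents = [
    "<movie><title>Movie One</title><cast><name>Jane Doe</name></cast>"
    "<director>John Roe</director></movie>",
    "<movie><title>Movie Three</title><actors><name>Richard Roe</name></actors>"
    "<direction><team>John Roe</team></direction></movie>",
]


def element_paths(element, prefix=""):
    """Yield the slash-separated path of every element in the tree."""
    path = f"{prefix}/{element.tag}"
    yield path
    for child in element:
        yield from element_paths(child, path)


roots = [ET.fromstring(doc) for doc in documents]

# Step 1: discover which paths exist, and in how many documents.
path_counts = Counter()
for root in roots:
    path_counts.update(set(element_paths(root)))

for path, count in sorted(path_counts.items()):
    print(f"{path}: {count} of {len(roots)} documents")

# Step 2: query using the discovered paths (ElementTree's XPath subset).
names = [
    node.text
    for root in roots
    for xpath in ("./cast/name", "./actors/name")
    for node in root.findall(xpath)
]
print(names)
```

The two path expressions in step 2 could only be written after step 1 showed that performer
names live under cast in one document and under actors in the other.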
So let us summarize what we learnt in this session. Of course the idea of semi
structured data itself is a vast ocean, and it is beyond the scope of this lecture to
explore all of it. We looked at native XML storage and XML publishing and
different kinds of storage structures that have been proposed for XML, and mainly we touched
upon the problem of semi structured data and the larger problem of discovery of structure,
which is very important for semi structured data management. So with that we shall end
this session.