Title: Triple-Triple RDF Store with Greedy Graph Based Grouping
Name: Vinoth Chandar ( vinothc@cs.utexas.edu )
Partner's Name: -
RDBMS Platform: MySql 5.0

Triple-Triple RDF Store with Greedy Graph Based Grouping
Abstract
Triple stores, based on relational databases, have received wide attention amongst database researchers and Semantic Web enthusiasts. Improving query performance on very large RDF datasets is a challenging problem, which needs to be addressed for efficient implementation of such triple stores. In this paper, we explore promising approaches to address the problem. We explore the possibility of storing the RDF triples in different orders in three tables, and develop a query rewriting scheme for the same. We also look at optimization of the physical schema by graph clustering techniques that aim to bring related triples closer to each other on disk. We also present experimental results from the implementation of the scheme over a million triples. Our results show that our scheme can yield significant performance benefits on complex queries.

1. Introduction
In recent years, RDF [1] stores or Triple stores, which can store information about (subject, property, object) triples of ontologies, have received significant attention from database researchers. Many efforts have been made to implement RDF stores using relational databases, and to devise efficient schemes for accessing information from such stores. These efforts are focussed towards the larger vision of the 'Semantic Web'. To realize this vision, RDBMS-based Triple stores should be able to store and query enormous amounts of triples that describe web pages on the Internet. The Billion Triple challenge [2] serves as a common platform, on which the state of the art is evaluated and its progress towards the vision is assessed. Efficiently querying such Triple stores presents many challenges. Since RDF is essentially a graph-based data format, queries involve multiple joins and become really slow when scaling to billions of triples. If the 'Semantic Web' vision is to be realized, then we require very fast query retrieval techniques, since the long response times of these systems would be unacceptable to a normal Internet user.

In this paper, we explore promising new ideas for Triple store implementation. In particular, we take the Triple-Triple idea (explained later) to its logical conclusion, and develop SPARQL-to-SQL query rewriting mechanisms for the same. We further enhance the Triple-Triple idea by introducing a computationally feasible clustering scheme that attempts to reduce the number of disk pages accessed, by moving related subjects/objects/properties closer to each other on disk. In fact, this clustering scheme can be applied to any general indexing scheme for Triple stores.

Section 2 details related work in this area. Section 3 presents various approaches that were considered for improving the query performance. Section 4 presents a query rewriting technique corresponding to the Triple-Triple idea. Section 5 identifies and analyzes the benefits of grouping related triples in the same data block, to reduce the number of disk IO operations. Section 6 presents the experimental results and Section 7 concludes.

2. Related Work
[3] establishes the validity of using relational databases to store and query ontologies. The paper extends SQL with a set of ontology-related operators that can help obtain more
pertinent results for ontology-driven applications. At the same time, the applications can also benefit from the efficient storage and retrieval mechanisms in an RDBMS. Simplistically, the baseline physical model for storing ontologies in RDF format using an RDBMS consists of two tables – a Symbol table and a Triples table (refer Figure 1). An ontology describing the elements of the Web contains URLs and URIs, which are long strings (the lex field in SymbolTable). To avoid redundancy and wastage of disk space, these elements are assigned unique integer identifiers (the hash field in SymbolTable). This mapping from element to identifier is stored in the Symbol table. The Triples table has three columns – s (subject), p (property), o (object) – as per RDF conventions, and each tuple in the table represents an RDF triple. The table has a compound primary key on all three columns. Such a naive representation of the triples enables us to analyze clearly where the benefits come from when evaluating much more sophisticated physical schemas.

Figure 1 Baseline Physical model

Many research efforts have attempted to propose alternate physical schemas and improved SPARQL-to-SQL query rewriting techniques, to improve query performance over the baseline model. This is based on the realization that the baseline model can be used as a simple logical data model alone. The Kowari metastore [4] proposes an RDF store based on AVL trees, with each triple stored in three different orders – spo, osp, pos – to help lookups based on each of the three elements in the triple. However, it works with its own query language rather than the general-purpose SPARQL and RDQL. Abadi et al. [5] pursue an interesting path, arguing towards having a table for each property, with the associated subjects and objects that are related by that property. Such a vertical partitioning approach tends to reduce query response time by incorporating fast linear merge joins, when each table is sorted by subject or object. However, such an approach inherently assumes that the queries are property-bound. A non-property-bound query would require us to query across all the tables. Hexastore [6] furthers the multiple-indexing approach taken by Kowari, by storing the three elements of a triple in six different orders. For example, the spo ordering is stored as a sorted list of subjects, with each subject pointing to a sorted list of properties defined for that subject. Each property in such a sorted properties list points to a sorted list of objects defined for that (subject, property) pair. Thus, all joins can be converted into fast linear merge joins. Hexastore occupies five times more space than a single triples table. However, this is acceptable with ever-falling storage costs.

3. Promising Directions
We will now explore some promising directions in which we can achieve further improvements over the baseline physical model.
We will detail each idea and also present some motivation for pursuing [or abandoning] the idea. We also present concrete and motivating examples for our arguments, using MySql.

3.1 Applicability of Spatial Indexes
Potentially faster query execution times can be achieved if the joins between the triples are materialized in Euclidean space, in terms of fast minimum bounding rectangle (MBR) operations. For example, simply storing each triple as a (s,p)-(p,o) line segment will materialize subject-subject joins as an MBRContains operation between the line segment (s, minhash)-(s, maxhash) and the start point of each triple line segment, where minhash and maxhash are the minimum and maximum integer ids from the symbol table. However, support for R-Tree operations remains limited in commercial DBMSs: MySql does not support spatial joins, Postgresql does not support R-Trees, and only Oracle Enterprise supports spatial joins. We were unable to pursue this direction further, due to the non-availability (or rather infeasibility) of an Oracle installation.

3.2 Triple-Triple
One promising idea is to create three redundant copies of the triples table with the compound primary keys spo, pos and osp, such that each table has its tuples sorted on subject, property and object in the order of occurrence in its primary key. Figure 2 presents the Triple-Triple physical model.

Figure 2 Triple-Triple Physical model

From here on, we will refer to a table by its primary key, i.e. the spo table will denote the triples table with (s,p,o) as the compound primary key. For the spo table, the triples are clustered first on subject, then on property and then on object. Such clustering ensures that the triples are stored in sorted order on disk, and hence fast linear merge joins can be applied. Note that this scheme requires only 3 times the space of the single triples table described in the baseline model. Hence, this approach is definitely promising. The compound primary index is also useful for any query that involves a prefix of the compound key. For example, the spo table can answer select queries based on the s, sp and spo columns, using the primary index.
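The Triple-Triple layout can be sketched concretely as follows. This is a minimal illustration using SQLite's in-memory engine in place of MySql 5.0 (an assumption made purely so the sketch is self-contained; InnoDB would additionally cluster each table on its primary key on disk):

```python
import sqlite3

# Three redundant copies of the triples table, one per rotation of the
# compound primary key, as in the Triple-Triple physical model.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Triples_s (s INTEGER, p INTEGER, o INTEGER,"
            " PRIMARY KEY (s, p, o))")   # the spo table
cur.execute("CREATE TABLE Triples_p (s INTEGER, p INTEGER, o INTEGER,"
            " PRIMARY KEY (p, o, s))")   # the pos table
cur.execute("CREATE TABLE Triples_o (s INTEGER, p INTEGER, o INTEGER,"
            " PRIMARY KEY (o, s, p))")   # the osp table

# Every triple is inserted into all three copies, keeping them in sync.
triples = [(1, 10, 100), (1, 11, 101), (2, 10, 100)]
for table in ("Triples_s", "Triples_p", "Triples_o"):
    cur.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", triples)

# A select bound on s touches a prefix (s) of the spo key, so it is
# answered from Triples_s via the primary index; a select bound on p
# would instead use Triples_p, and one bound on o would use Triples_o.
rows = cur.execute("SELECT s, p, o FROM Triples_s WHERE s = 1"
                   " ORDER BY p, o").fetchall()
print(rows)
```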
Though MySql does not support merge joins [7], the idea still yields faster execution times, since the sorted nature of the data ensures efficient use of the index. Remember that, even with a secondary index, if the relevant tuples are spread across the table in different blocks, the end performance could be worse than a table scan. For the Semantic Web vision, MySql plays a pivotal role, since many web sites are powered by MySql. This gives enough motivation to continue exploring the idea using MySql. For example, a simple three-table join on 25K triples using Triple-Triple yields very encouraging benefits over the baseline model. In Figure 3, Triples_s denotes the spo table, Triples_p denotes the pos table and Triples_o denotes the osp table. The Triples table denotes the baseline triples table, and the Nodes table denotes the symbol table. All following examples in the paper use the same conventions to denote the corresponding tables.

 mysql> select * from Triples_s t_s, Triples_o t_o, Triples_p t_p
        where t_s.s = t_o.o and t_s.s = t_p.p;
 Empty set (0.28 sec)

 mysql> select * from Triples t_1, Triples t_2, Triples t_3
        where t_1.s = t_2.o and t_1.s = t_3.p;
 Empty set (2 min 10.83 sec)

Figure 3 Benefits of Triple-Triple over baseline

In comparison to the Hexastore, the Triple-Triple stores only three possible orderings of the elements in a triple. We explore whether these orderings are sufficient for answering the same range of queries answered by the Hexastore. The only possible joins are subject-subject joins, object-object joins, property-property joins and subject-object joins, with all joins being equijoins. Hence, mechanisms for using the Triple-Triple should judiciously choose which table to use for each join operation – spo, pos or osp. Also, we must be able to support selects on any combination of the three triple elements. These decisions are listed in Table 1.

        Operation                          Method
 subject-subject join                 spo JOIN spo
 property-property join               pos JOIN pos
 object-object join                   osp JOIN osp
 subject-object join                  spo JOIN osp
 subject select                       spo
 property select                      pos
 object select                        osp
 subject-property select /            spo [no need for a separate pso]
   property-subject select
 subject-object select /              osp [no need for a separate sop]
   object-subject select
 object-property select /             pos [no need for a separate ops]
   property-object select
 subject-property-object select       any table

Table 1: Answering queries using Triple-Triple

Hence, to our understanding, the three possible orderings – spo, pos, osp – are sufficient for handling the same set of queries that the Hexastore handles. The only missing piece in building a complete Triple-Triple store is to define mechanisms to convert SPARQL queries to SQL, using the appropriate tables for each triple. We explore this problem in Section 4.

3.3 Applicability of Secondary Indexes
It is tempting to create secondary indexes on other columns
that do not form a prefix of the primary key of the triples table (both in the baseline and in the Triple-Triple idea). For example, the spo table can have secondary indexes on po and o, to speed up selects on those columns. However, initial experiments showed no improvement in query response times, since the bulk of the benefit was from the primary index, or the values were split across so many disk pages that the secondary index did not make sense. For example, there are relatively very few properties in the data when compared to subjects or objects; hence, a secondary index would not be beneficial here. Along similar lines, a secondary index on the lex field of the symbol table did not yield significant benefits. Hence, we stick with the plain Triple-Triple model.

4. Add-Join
Converting a SPARQL query to SQL on the baseline is straightforward. When deciding which tables to use for each triple in the Triple-Triple store, we can be faced with interesting tradeoffs. For example, consider the SPARQL query in Figure 4.

 SELECT ?label ?producer ?comment
 WHERE {
 t1   dataFromProducer9:Product444 rdfs:label ?label .
 t2   dataFromProducer9:Product444 rdfs:comment ?comment .
 t3   dataFromProducer9:Product444 bsbm:producer ?p .
 t4   ?p rdfs:label ?producer
 }

Figure 4: Tradeoff in SQL conversion

The triples involved in the SPARQL query are marked t1, t2, t3 and t4. It is easy to observe that t1 joins t2, t3 and t4 on the subject, and t3 joins t4 using a subject-object join. Our task is to select one of the three triple tables – spo, pos, osp – judiciously, so that all joins make use of the primary index, and hence the fastest access path to the relevant data is adopted. Since all joins of t1 are subject joins, we safely choose the spo table for t1. Since all joins involving t4 are subject joins, we choose the spo table for t4. However, t3 presents a difficulty. For the t1-t3 join to be efficient, we need to use the spo table for t3; for the t3-t4 join to be efficient, we need to pick the osp table for t3. Clearly, only one of these options is possible. In general, a SPARQL-to-SQL compiler for the Triple-Triple has to make these hard decisions dynamically at runtime. From our experiments, we noticed that the MySql optimizer (like that of any other DBMS) does not do a good job of choosing the right join order for the tables, and substantial performance gains can be achieved by simply rewriting the query with an explicitly specified join order. These are hard search problems, and thus, even in the context of the Triple-Triple, the SPARQL compiler cannot be expected to do a very good job of optimizing the choice of tables.

Hence, we adopt a method we term Add-Join, which tries to achieve the best of both worlds by using multiple triples tables for a single triple in the SPARQL query. In effect, we add extra joins to the resulting SQL query. But, as we show, the cost of such additional joins is in no way prohibitive, and can be traded off in return for a deterministic, simple SPARQL-to-SQL compilation technique. For example, in the above query, we use two tables for the triple t3 – t3_o [which is an osp table] and t3_s [which is an spo table]. We join t1 with t3_s and t3_o with t4, and finally join t3_s and t3_o on all three columns. The final join is very fast, since it involves all three columns, so that the primary index can be used. In effect, all of the joins in the query can use the clustered primary index. Figure 5 shows that the additional joins are not prohibitively expensive, using the same example. Also, when rewriting the query for the Triple-Triple, we must ensure that we have as few rows to start with as possible. Since MySql uses a single-sweep multi-join algorithm [8], this ensures that we try to match as few tuples as possible in each stage of the multi-join.
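The Add-Join choice for t3 can be sketched as follows, again using SQLite and small made-up integer ids in place of MySql and the real symbol-table hashes (both assumptions for illustration). Triple t3 is read through two aliases, an spo copy joined with t1 on the subject and an osp copy joined with t4 on the object, and the two aliases are reconciled by a final join on all three columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Triples_s (s INTEGER, p INTEGER, o INTEGER,"
            " PRIMARY KEY (s, p, o))")   # spo copy
cur.execute("CREATE TABLE Triples_o (s INTEGER, p INTEGER, o INTEGER,"
            " PRIMARY KEY (o, s, p))")   # osp copy

# Toy data: subject 1 has a label (p=20), a comment (p=21) and a
# producer (p=22) pointing at subject 5, which has its own label.
data = [(1, 20, 7), (1, 21, 8), (1, 22, 5), (5, 20, 9)]
for table in ("Triples_s", "Triples_o"):
    cur.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", data)

# Add-Join rewrite: t3 appears as both t3_s (spo) and t3_o (osp); the
# extra join reconciles them on all three columns of the primary key.
add_join = cur.execute("""
    SELECT t1.o, t3_s.o, t4.o
    FROM Triples_s AS t1
    JOIN Triples_s AS t3_s ON t3_s.s = t1.s
    JOIN Triples_o AS t3_o ON t3_o.s = t3_s.s AND t3_o.p = t3_s.p
                          AND t3_o.o = t3_s.o
    JOIN Triples_s AS t4   ON t4.s = t3_o.o
    WHERE t1.s = 1 AND t1.p = 20 AND t3_s.p = 22 AND t4.p = 20
""").fetchall()

# The rewrite returns exactly the rows of the single-alias version.
single = cur.execute("""
    SELECT t1.o, t3.o, t4.o
    FROM Triples_s AS t1
    JOIN Triples_s AS t3 ON t3.s = t1.s
    JOIN Triples_s AS t4 ON t4.s = t3.o
    WHERE t1.s = 1 AND t1.p = 20 AND t3.p = 22 AND t4.p = 20
""").fetchall()
print(add_join, add_join == single)
```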
 Add-Join:
 select * from Triples_s as t1
 STRAIGHT_JOIN Triples_s as t3_s
 STRAIGHT_JOIN Triples_o as t3_o
 STRAIGHT_JOIN Triples_s as t4
 where t1.s = -1280067686087321383 and t1.s = t3_s.s
 and t3_o.o = t4.s and t3_s.s = t3_o.s
 and t3_s.p = t3_o.p and t3_s.o = t3_o.o;
 0.2 sec

 Use spo for t3:
 select * from Triples_s as t1
 STRAIGHT_JOIN Triples_s as t3
 STRAIGHT_JOIN Triples_s as t4
 where t1.s = -1280067686087321383 and t1.s = t3.s
 and t3.o = t4.s;
 0.17 sec

Figure 5 Cost of additional joins

Hence, in addition to joins, we can also involve multiple triple tables for a single triple when there is an initial select operation on the triple. For example, though a triple joins on s, it might involve a select on p as a bound input value. In such cases, selecting on p using an spo table may not be efficient. Hence, we introduce an additional pos table for the triple, perform the select upon it, and later join the pos and spo tables. We now present the algorithm to convert SPARQL to SQL, based on the Add-Join method.

Query Rewriting method:
Step 0. Convert the SPARQL query to SQL on the baseline.
Step 1. Identify the triples that have bound values for their elements, i.e. the input for the SQL query.
Step 2. In the explicit join order that we intend to provide, start with the triple with bound input values and follow it with the triples that join with that triple.
Step 3. For selects on the non-join attributes, insert entries for 'Nodes' as necessary in the explicit order.
Step 4. Once the explicit ordering is done, introduce additional triple tables for each triple, as per the Add-Join method described above.

5. Graph Based Grouping
We observe that further benefits can be achieved only through physical optimizations. One interesting observation we make is that selects on the triples tables can be improved if the related tuples are brought closer to each other on disk. For example, consider the spo table with selects using po, p or o on it. This is a common scenario when triples with different properties are joined on the subject. Remember that we introduce additional joins only for the joins between triples and the selects using bound input values. By bringing subjects with common properties and objects closer to each other, we, in effect, reduce the number of disk pages across which the result of the select operation is distributed. Thus, such a scheme results in direct reductions in the I/O cost of the operation. The same reasoning applies to the other two tables as well, bringing together related properties and objects respectively.

The integer identifiers assigned to the elements of the triples determine the order in which the triples appear on disk. Right now, these ids are assigned randomly. Hence, we cannot use general-purpose clustering schemes based on Euclidean distances to group the related triples together. Also, without the Triple-Triple, it would be impossible to give equal priority to each of subject, property and object. This is because the first element of the compound primary key determines the order on disk, and hence a clustering scheme has to choose between the three elements. Another approach is to abandon the primary keys altogether and define a physical representation that brings related triples close to each other based on all three columns. However, such a scheme would compromise on having the data in sorted order. The Triple-Triple idea lends flexibility by allowing us to optimize with respect to subject, property and
object using the spo, pos and osp tables respectively. Hence, the integer ids need to be assigned intelligently in order to leverage these benefits. We now define metrics which quantify the strength of the relationship between two elements. S-score, P-score and O-score denote the metrics for interrelating subjects, properties and objects respectively.

 S-score(s1, s2) = number of triple pairs (t1, t2) with
                   (t1.s = s1 and t2.s = s2) and (t1.p = t2.p or t1.o = t2.o)

defines the S-score for two subjects s1 and s2. P-score and O-score are defined similarly. Once we have these metrics computed, we build three graphs – the S-graph, P-graph and O-graph – that depict the relationships between subjects, properties and objects, using S-score, P-score and O-score as edge weights respectively. The S-graph has a vertex for each subject, and the S-score between two subjects as the edge weight. Note that no subject or object occurs as a property. Hence, the problem of assigning ids to properties can be solved independently of the other two elements. However, some subjects also occur as objects, and there can be only one id assigned to such an element. We therefore prune the O-graph by removing all vertices and edges corresponding to such overlapping subjects.

Each disk page can be viewed as a cluster or group, and the problem of finding the most related subjects can be formulated as an optimization problem, as described in Figure 6.

 Let S denote a cluster and Si denote a subject belonging to S.
 Objective:
   Max: Σ over all i, j (i != j) of S-score(Si, Sj)
 Constraints:
   Σ numTriples(Si) <= B
 where numTriples(Si) denotes the number of triples with subject Si,
 and B denotes the number of triples per block.

Figure 6: Optimal clustering of subjects

The formulation can be suitably generalized to the P-graphs and O-graphs also. The problem is an instance of the Knapsack-constrained Maximum Weighted Cluster problem [9]. The formulation aims to extract a cluster from the graph such that the sum of all the edge weights in the cluster is maximal, subject to the constraint that there can be at most B triples in a block. For MySql, which uses 16KB data blocks, B = 16KB / 12 ≈ 1300. Once such a maximal cluster is extracted, we assign consecutive ids to all the elements of the cluster. We then repeat the algorithm, pruning the graph by removing the assigned edges. However, in practice, this problem is NP-hard and grows computationally unrealistic with large data sets involving thousands of subjects. The S-graph is also very dense, and this complicates the solution further. For example, a 25K-triple database contains 2367 subjects and 200K edges. Hence, when we scale to a billion triples, the graph construction itself may become a very long process. There are other graph clustering tools, such as MCL [10] and Graclus [11], for unsupervised learning from graphs. Though these tools do not solve the exact problem described above, they attempt to provide clusters from the graph based on connected components. Attempts at hierarchical clustering using the MCL algorithm yield only 3 clusters, reflecting the dense nature of these graphs and the non-applicability of standard graph clustering techniques.

Hence, we attempt to develop computationally feasible greedy algorithms for the problem. One such greedy algorithm is described in Figure 7. The algorithm greedily constructs parts of the relationship graphs and assigns identifiers based on such partial graphs. The algorithm closely approximates the optimal solution for certain parts of the graph. Nonetheless, it is suboptimal, since we ignore the strength of the relationships between the discovered subjects Si.
 While there is a subject S [that has not been assigned an id]:
   - Assign the next available id to S
   - Compute all the subjects Si related to S, i.e. those that have
     a non-zero S-score with S
   - Compute S-score(S, Si) for all such discovered subjects
   - Assign ids to the Si in increasing order of S-score(S, Si),
     till Σ numTriples(Si) <= B

Figure 7: Greedy Grouping

In Section 6, we validate the effectiveness of this algorithm. Once again, the id assignment for properties can be done in a symmetric fashion. For the objects, we need to additionally ignore objects which are also subjects.

6. Empirical results
We present empirical results that demonstrate the validity of the approaches proposed in this paper. Specifically, we study the query performance compared to the baseline, and the real benefits of the grouping technique described in the previous section. Our experimental setup is detailed in Table 2.

 RDBMS                MySql 5.0
 OS                   Ubuntu 8.04
 Processor            AMD Turion TL-58
 32/64 bit            32
 Processor speed      1.9 GHz
 L1 cache (KB)        128
 L2 cache (KB)        512
 FSB speed (MHz)      800
 RAM (GB)             2
 Disk capacity (GB)   160
 Disk rotation (RPM)  5400
 Buffering (MB)       8

Table 2 Platform details

6.1 Query Performance
For each scheme that we evaluate, we define a metric called speed-up to compare it against the baseline. Speed-up is simply the query response time for the baseline divided by the query response time for the scheme on the same query. The higher the speed-up, the better the scheme, i.e. it determines how many magnitudes of performance improvement are achieved by the scheme. Figures 8 and 9 present the speed-ups for three schemes – Rewrite (simply rewriting the query by explicitly specifying the join order), Add-Join, and Add-Join with grouping. The results are presented for queries 3, 4, 5 and 6 [12]. Queries 4 and 5 are typical examples of the complex queries that cause scalability problems for applications.

The improvements on query 3 and query 6 are not significant. The baseline timings for those queries are not very large in the first place, and the queries involve a smaller number of joins. For example, query 6 involves only one triple; Add-Join uses two tables for the query and offers the same performance as the baseline. There are very significant benefits on queries 4 and 5. The benefits due to the Add-Join method account for the bulk of the benefits over the baseline, amounting to approximately a 50x improvement. The direct benefits due to the grouping technique amount to a further 10x-20x over the baseline, when compared to the Add-Join method without grouping. It remains to be seen whether better grouping techniques can yield significantly higher benefits. However, these results demonstrate the validity of the techniques described earlier.
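As a worked instance of the speed-up metric, the three-table join of Figure 3 took 2 min 10.83 sec on the baseline and 0.28 sec with Triple-Triple:

```python
def speed_up(baseline_secs, scheme_secs):
    """Speed-up = baseline response time / scheme response time."""
    return baseline_secs / scheme_secs

baseline = 2 * 60 + 10.83   # Figure 3, baseline: 2 min 10.83 sec
triple_triple = 0.28        # Figure 3, Triple-Triple: 0.28 sec
print(round(speed_up(baseline, triple_triple)))  # roughly a 467x speed-up
```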




                               Figure 8 Speedup from baseline (250K Triples)




                               Figure 9 Speedup from baseline (1M Triples)
6.2 Validation of Grouping scheme
It is also important to validate the grouping scheme
presented in section 5. We expect the grouping scheme to
reduce the number of disk I/O for selects on the triples
tables.    Table 3 presents the number of rows the query
optimizer expects to examine, for answering selects on the
triple tables. It can be seen that the grouping has resulted in a
decrease in the number of rows examined for the query.



      Table          No grouping    With grouping
      spo table      1000619        973866
      pos table      1000619        805152
      osp table      1000619        828004

      Table 3: Expected number of rows accessed for selects

We also measure the amount of interrelationship between the triples in each block, with and without grouping, to observe the effectiveness of our grouping algorithm. For each disk block (i.e., a set of 1300 tuples), we construct S-graphs, P-graphs and O-graphs representing those clusters. We then compute the sum of all the edge weights of each such cluster and average it across all data blocks. This allows us to quantify the effectiveness of our grouping scheme. Ideally, we should also compare our grouping scheme against the optimal solution. However, the optimal solution is very hard to compute, as mentioned earlier, and also cannot be predicted accurately since it depends on the nature of the triples. We divide the average edge weight for a table with grouping by the average edge weight for the corresponding table without grouping, to obtain a metric called the R-score, or relationship score, for those two tables. Figure 10 presents the R-scores for all three triples tables for the 250K and 1M triple datasets.

      Figure 10: R-score

The results indicate that the grouping algorithm has been quite effective, increasing the interrelationship to 2x-10x that of random Id assignment. The R-score for the pos table is lower since there are fewer properties, and hence, in a number of cases, a single data block is full of triples from the same property. In fact, the average edge weight for the pos table is much lower than those of the spo and osp tables. The values are higher for the osp table since there are far more objects than subjects, in which case there are more edges in the constructed cluster graph.
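The R-score computation can be sketched as follows. The block size of 1300 tuples matches our setup; the edge-weight definition used here (a unit edge between every pair of triples in a block that share the given component) is a simplification of the cluster graphs of section 5, assumed purely for illustration.

```python
from itertools import combinations

BLOCK_SIZE = 1300  # tuples per disk block, as in our setup

def block_edge_weight(block, component):
    """Sum of edge weights of the S-, P- or O-graph of one block.

    Assumed simplification: every pair of triples in the block that
    shares the given component (0 = subject, 1 = property,
    2 = object) contributes an edge of weight 1.
    """
    return sum(1 for a, b in combinations(block, 2)
               if a[component] == b[component])

def avg_edge_weight(triples, component, block_size=BLOCK_SIZE):
    """Average per-block edge weight across all data blocks."""
    blocks = [triples[i:i + block_size]
              for i in range(0, len(triples), block_size)]
    return sum(block_edge_weight(b, component) for b in blocks) / len(blocks)

def r_score(grouped, random_order, component, block_size=BLOCK_SIZE):
    """R-score: average edge weight of the grouped layout divided by
    that of the same table under random id assignment."""
    return (avg_edge_weight(grouped, component, block_size) /
            avg_edge_weight(random_order, component, block_size))
```

On a toy table ordered by subject, packing triples with the same subject into the same block raises the average edge weight of the S-graph, and hence the R-score, relative to a random ordering.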
7. Conclusions and Future work
Thus, we have explored some promising approaches to improving query performance in relational triple stores. An interesting query rewriting mechanism, which introduces additional joins to speed up query execution, has been discussed. Optimization of the physical schema by leveraging the interrelationship between the elements of a triple has been proposed. A greedy grouping algorithm, which is simple and computationally feasible, has been proposed and validated. The results show that our approach is promising and can potentially be combined with other techniques in the literature to yield faster RDF stores. As part of future work, we intend to compare the performance of the system with Hexastore, and potentially enhance Hexastore with our grouping algorithm. As mentioned earlier, we would like to develop better grouping algorithms by leveraging parallel computing techniques to overcome the computational issues. We believe that better grouping algorithms can yield significantly higher performance. Another key observation we make is that no physical schema will perform best for all types of queries. Hence, with the cheap availability of storage, multiple physical schemas can coexist within the same RDF store, and the SPARQL compiler can judiciously employ them based on the type of the query.

8. References
[1] Resource Description Framework. http://www.w3.org/RDF/
[2] http://iswc2008.semanticweb.org/calls/call-for-semantic-web-challenge-and-billion-triples-tracks/
[3] Das, S., Chong, E. I., Eadon, G., and Srinivasan, J. 2004. Supporting ontology-based semantic matching in RDBMS. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada, August 31 - September 03, 2004). M. A. Nascimento, M. T. Özsu, D. Kossmann, R. J. Miller, J. A. Blakeley, and K. B. Schiefer, Eds. VLDB Endowment, 1054-1065.
[4] David Wood, "Kowari: A Platform for Semantic Web Storage and Analysis". In XTech 2005 Conference.
[5] Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. 2007. Scalable semantic web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, September 23 - 27, 2007). VLDB Endowment, 411-422.
[6] Weiss, C., Karras, P., and Bernstein, A. 2008. Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endow. 1, 1 (Aug. 2008), 1008-1019. DOI= http://doi.acm.org/10.1145/1453856.145396
[7] Nested-Loop Join Algorithms. http://dev.mysql.com/doc/refman/5.0/en/nested-loop-joins.html
[8] Using EXPLAIN Syntax. http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
[9] Anuj Mehrotra and Michael A. Trick, "Cliques and Clustering: A Combinatorial Approach".
[10] MCL: An algorithm for clustering graphs. http://www.micans.org/mcl/
[11] Graclus. http://www.cs.utexas.edu/users/dml/Software/graclus.html
[12] Class project website. http://www.cs.utexas.edu/~jsequeda/cs386d/project.html

Triple-Triple RDF Store with Greedy Graph based Grouping

Title: Triple-Triple RDF Store with Greedy Graph Based Grouping
Name: Vinoth Chandar (vinothc@cs.utexas.edu)
Partner's Name: -
RDBMS Platform: MySQL 5.0
Abstract

Triple stores, based on relational databases, have received wide attention amongst database researchers and Semantic Web enthusiasts. Improving query performance on very large RDF datasets is a challenging problem, which needs to be addressed for efficient implementation of such triple stores. In this paper, we explore promising approaches to address the problem. We explore the possibility of storing the RDF triples in different orders in three tables, and develop a query rewriting scheme for the same. We also look at optimization of the physical schema by graph clustering techniques that aim to bring related triples closer to each other on disk. We also present experimental results from the implementation of the scheme over a million triples. Our results show that our scheme can yield significant performance benefits on complex queries.

1. Introduction

In recent years, RDF [1] stores or Triple stores, which can store information about (subject, property, object) triples of ontologies, have received significant attention from database researchers. Many efforts have been made to implement RDF stores using relational databases, and to devise efficient schemes for accessing information from such stores. These efforts are focused towards the larger vision of the 'Semantic Web'. To realize this vision, RDBMS based Triple stores should be able to store and query enormous amounts of triples that describe web pages on the Internet.

The Billion Triple challenge [2] serves as a common platform, on which the state of the art is evaluated and its progress towards the vision is assessed. Efficiently querying such Triple stores presents many challenges. Since RDF is essentially a graph based data format, queries involve multiple joins and become really slow when scaling to billion triples. If the 'Semantic Web' vision is to be realized, then we require very fast query retrieval techniques, since the long response times of these systems would be unacceptable to a normal Internet user.

In this paper, we explore promising new ideas for Triple store implementation. In particular, we take the Triple-Triple idea (explained later) to its logical conclusion, and develop SPARQL to SQL query rewriting mechanisms for the same. We further enhance the Triple-Triple idea by introducing a computationally feasible clustering scheme that attempts to reduce the number of disk pages accessed, by moving related subjects/objects/properties closer to each other on disk. In fact, this clustering scheme can be applied to any general indexing scheme for Triple stores.

Section 2 details related work in this area. Section 3 presents various approaches that were considered for improving the query performance. Section 4 presents a query rewriting technique corresponding to the Triple-Triple idea. Section 5 identifies and analyzes the benefits of grouping related triples in the same data block, to reduce the number of disk IO operations. Section 6 presents the experimental results and Section 7 concludes.

2. Related Work

[3] establishes the validity of using relational databases to store and query ontologies. The paper extends SQL with a set of ontology related operators that can help obtain more
pertinent results for ontology driven applications. At the same time, the applications can also benefit from the efficient storage and retrieval mechanisms of an RDBMS. Simplistically, the baseline physical model for storing ontologies in RDF format using an RDBMS consists of two tables – a Symbol table and a Triples table (refer to Figure 1). An ontology describing the elements of the Web contains URLs and URIs, which are long strings (the lex field in SymbolTable). To avoid redundancy and wastage of disk space, these elements are assigned unique integer identifiers (the hash field in SymbolTable). This mapping from element to identifier is stored in the Symbol table. The Triples table has three columns – s (subject), p (property), o (object) – as per RDF conventions, and each tuple in the table represents an RDF triple. The table has a compound primary key on all three columns. Such a naive representation of the triples enables us to analyze clearly where the benefits come from when evaluating much more sophisticated physical schemas.

    Figure 1: Baseline physical model

Many research efforts have attempted to propose alternate physical schemas and improved SPARQL to SQL query rewriting techniques, to improve query performance over this baseline model. This is based on the realization that the baseline model can be used as a simple logical data model alone. The Kowari metastore [4] proposes an RDF store based on AVL trees, with each triple stored in three different orders – spo, osp, pos – to help lookups based on each of the three elements in the triple. However, it works with its own query language, rather than the general purpose SPARQL or RDQL. Abadi et al. [5] pursue an interesting path, arguing towards having a table for each property, with the associated subjects and objects that are related by that property. Such a vertical partitioning approach tends to reduce the query response time by incorporating fast linear merge joins, when each table is sorted by the subject or object. However, such an approach inherently assumes that queries are property bound; a non-property-bound query would require us to query across all the tables. Hexastore [6] furthers the multiple indexing approach taken by Kowari by storing the three elements of a triple in six different orders. For example, the spo ordering is stored as a sorted list of subjects, with each subject pointing to a sorted list of the properties defined for that subject. Each property in such a sorted properties list points to a sorted list of the objects defined for that subject, property pair. Thus, all joins can be converted into fast linear merge joins. Hexastore occupies five times more space than a single triples table; however, this is acceptable with ever-falling storage costs.

3. Promising Directions

We will now explore some promising directions in which we can make further improvements over the baseline physical model.
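To make the baseline model concrete, the two tables can be sketched in a few lines. This is only an illustration: SQLite stands in for MySQL 5.0 so the snippet stays self-contained, and the example element strings are invented; the SymbolTable(hash, lex) and Triples(s, p, o) shapes follow the description above.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for MySQL 5.0
con.executescript("""
    CREATE TABLE SymbolTable (
        hash INTEGER PRIMARY KEY,   -- unique integer id for an element
        lex  TEXT UNIQUE NOT NULL   -- the long URI/URL/literal string
    );
    CREATE TABLE Triples (
        s INTEGER, p INTEGER, o INTEGER,
        PRIMARY KEY (s, p, o)       -- compound key on all three columns
    );
""")

def intern(lex):
    """Return the integer id for an element, assigning one if new."""
    row = con.execute("SELECT hash FROM SymbolTable WHERE lex = ?",
                      (lex,)).fetchone()
    if row:
        return row[0]
    return con.execute("INSERT INTO SymbolTable (lex) VALUES (?)",
                       (lex,)).lastrowid

# Store one triple as integer ids rather than long strings
# (the element names here are invented for the example).
triple = ("ex:Product444", "rdfs:label", "Product 444")
con.execute("INSERT INTO Triples VALUES (?, ?, ?)",
            tuple(intern(x) for x in triple))
```

Queries then join Triples back to SymbolTable only to translate the integer ids to and from the lexical forms.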
We will detail each idea and also present some motivation for pursuing [or abandoning] the idea. We also present concrete and motivating examples for our arguments, using MySQL.

3.1 Applicability of Spatial Indexes

Potentially faster query execution times can be achieved if the joins between the triples are materialized in Euclidean space, in terms of fast minimum bounding rectangle (MBR) operations. For example, simply storing each triple as a (s,p), (p,o) line segment will materialize subject-subject joins as an MBRContains operation between the line segment (s, minhash)-(s, maxhash) and the start point of each triple line segment, where minhash and maxhash are the minimum and maximum integer ids from the symbol table. However, support for R-Tree operations remains limited in commercial DBMSs: MySQL does not support spatial joins, PostgreSQL does not support R-Trees, and only Oracle Enterprise supports spatial joins. We were unable to pursue this direction further, due to the non-availability (or rather infeasibility) of an Oracle installation.

3.2 Triple-Triple

One promising idea is to create three redundant copies of the triples table with the compound primary keys – spo, pos, osp – such that each table has tuples sorted on subject, property and object respectively, in the order of occurrence in the primary keys of the tables. Figure 2 presents the Triple-Triple physical model.

    Figure 2: Triple-Triple physical model

From here on, we will refer to a table by its primary key, i.e. the spo table will denote the triples table with (s,p,o) as the compound primary key. For the spo table, the triples will be clustered based on subjects, then on properties and then on objects. Such a clustering ensures that the triples are stored in sorted order on disk, and hence fast linear merge joins can be applied. Note that this scheme requires only 3 times the space of the single triples table described in the baseline model. Hence, this approach is definitely promising. The primary compound index is also useful for any query that involves a prefix of the compound key.
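A minimal sketch of this physical model follows. SQLite again stands in for MySQL purely to keep the example runnable (only a clustered storage engine such as InnoDB actually lays the rows out on disk in primary-key order); the Triples_s/Triples_p/Triples_o names for the spo/pos/osp tables match the conventions used in this paper.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for MySQL 5.0

# Three redundant copies of the triples table, one per compound key.
for table, key in [("Triples_s", "(s, p, o)"),   # the spo table
                   ("Triples_p", "(p, o, s)"),   # the pos table
                   ("Triples_o", "(o, s, p)")]:  # the osp table
    con.execute(f"CREATE TABLE {table} "
                f"(s INTEGER, p INTEGER, o INTEGER, PRIMARY KEY {key})")

def insert_triple(s, p, o):
    # Every triple goes into all three tables: 3x the storage of the
    # baseline, in exchange for a useful sort order for every lookup.
    for table in ("Triples_s", "Triples_p", "Triples_o"):
        con.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (s, p, o))

insert_triple(1, 10, 100)
insert_triple(2, 10, 100)
```

A property-bound lookup, for example, runs against Triples_p and uses the (p, o, s) primary key rather than scanning an unsuitable ordering.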
For example, the spo table can answer select queries on the s, sp and spo column prefixes using the primary index.

Though MySQL does not support merge joins [7], the idea still would yield faster execution times, since the sorted nature of the data ensures efficient use of the index. Remember that, even with a secondary index, if the relevant tuples are spread across the table in different blocks, the end performance could be worse than a table scan. For the Semantic Web vision, MySQL plays a pivotal role, since many web sites are powered by MySQL. This gives enough motivation to continue exploring the idea using MySQL. For example, a simple three table join on 25K triples using Triple-Triple yields very encouraging benefits over the baseline model (Figure 3). Triples_s denotes the spo table, Triples_p denotes the pos table, and Triples_o denotes the osp table. The Triples table denotes the baseline triples table, and the Nodes table denotes the symbol table. All following examples in the paper use the same conventions to denote the corresponding tables.

    mysql> select * from Triples_s t_s, Triples_o t_o, Triples_p t_p
           where t_s.s = t_o.o and t_s.s = t_p.p;
    Empty set (0.28 sec)

    mysql> select * from Triples t_1, Triples t_2, Triples t_3
           where t_1.s = t_2.o and t_1.s = t_3.p;
    Empty set (2 min 10.83 sec)

    Figure 3: Benefits of Triple-Triple over baseline

In comparison to the Hexastore, the Triple-Triple stores only three possible orderings of the elements in a triple. We explore whether these orderings are sufficient for answering the same range of queries answered by the Hexastore. The only possible joins are subject-subject joins, object-object joins, property-property joins and subject-object joins, with all joins being equijoins. Hence, mechanisms for using the Triple-Triple should judiciously choose which table to use for each join operation – spo or pos or osp. Also, we must be able to support selects on any combination of the three triple elements. These decisions are listed in Table 1.

    Operation                      | Method
    -------------------------------+---------------------------------
    subject-subject join           | spo JOIN spo
    property-property join         | pos JOIN pos
    object-object join             | osp JOIN osp
    subject-object join            | spo JOIN osp
    subject select                 | spo
    property select                | pos
    object select                  | osp
    subject-property select /      | spo [no need for a separate pso]
      property-subject select      |
    subject-object select /        | osp [no need for a separate sop]
      object-subject select        |
    object-property select /       | pos [no need for a separate ops]
      property-object select       |
    subject-property-object select | any table

    Table 1: Answering queries using Triple-Triple

Hence, to our understanding, the three possible orderings – spo, pos, osp – are sufficient for handling the same set of queries that the Hexastore handles. The only missing piece in building a complete Triple-Triple store is to define mechanisms to convert SPARQL to SQL queries, using the appropriate tables for each triple. We explore this problem in the next section.
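The decisions of Table 1 reduce to a simple prefix rule: pick the ordering whose compound primary key begins with the bound or joined columns. A minimal sketch (the operation encoding is ours):

```python
# The join rows of Table 1 as a lookup: each join names the two
# orderings to use on either side of the equijoin.
JOIN_METHOD = {
    ("subject", "subject"):   ("spo", "spo"),
    ("property", "property"): ("pos", "pos"),
    ("object", "object"):     ("osp", "osp"),
    ("subject", "object"):    ("spo", "osp"),
}

def select_method(bound_cols):
    """Table answering a select with the given bound columns, e.g.
    {'s'}, {'p', 'o'} or {'s', 'p', 'o'}: the first ordering whose
    primary-key prefix covers exactly the bound columns."""
    for table in ("spo", "pos", "osp"):
        if set(table[:len(bound_cols)]) == set(bound_cols):
            return table
```

For the fully bound {'s', 'p', 'o'} case the rule returns spo, consistent with Table 1's "any table" entry.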
3.3 Applicability of Secondary Indexes

It is tempting to create secondary indexes on other columns that do not form a prefix of the primary key of the triples table (both in the baseline and the Triple-Triple idea). For example, the spo table can have secondary indexes on po and o, to speed up the selects on those columns. However, initial experiments showed no improvement in the query response times, since the bulk of the benefit was from the primary index, or the values were split across multiple disk pages so that the secondary index did not make sense. For example, there are relatively very few properties in the data when compared to subjects or objects; hence, a secondary index would not be beneficial here. Along similar lines, a secondary index on the lex field of the symbol table did not yield significant benefits either. Hence, we stick with the Triple-Triple baseline model.

4. Add-Join

Converting SPARQL to a SQL query on the baseline is straightforward. When deciding which tables to use for each triple in the Triple-Triple store, however, we can be faced with interesting tradeoffs. For example, consider the SPARQL query in Figure 4.

    SELECT ?label ?producer ?comment
    WHERE {
    t1    dataFromProducer9:Product444 rdfs:label ?label .
    t2    dataFromProducer9:Product444 rdfs:comment ?comment .
    t3    dataFromProducer9:Product444 bsbm:producer ?p .
    t4    ?p rdfs:label ?producer }

    Figure 4: Tradeoff in SQL conversion

The triples involved in the SPARQL query are marked t1, t2, t3 and t4. It is easy to observe that t1 joins t2, t3, t4 on the subject, and t3 joins t4 using a subject-object join. Our task is to select one of the three triple tables – spo, pos, osp – judiciously, so that all joins make use of the primary index, and hence the fastest access path to the relevant data is adopted. Since all joins of t1 are subject joins, we safely choose the spo table for t1. Since all the joins involving t4 are subject joins, we choose the spo table for t4. However, t3 presents a difficulty. For the t1, t3 join to be efficient, we need to use the spo table for t3; for the t3, t4 join to be efficient, we need to pick the osp table for t3. Clearly, only one of these options is possible. In general, a SPARQL to SQL compiler for the Triple-Triple has to make these hard decisions dynamically at runtime. From our experiments, we noticed that the MySQL optimizer (or that of any other DBMS) does not do a good job of choosing the right join order for the tables, and substantial performance gains can be achieved by simply rewriting the query, explicitly specifying a join order. These are hard search problems, and thus, even in the context of the Triple-Triple, the SPARQL compiler cannot be expected to do a very good job in optimizing the choice of tables.

Hence, we adopt a method we term Add-Join, which tries to achieve the best of both worlds by using multiple triples tables for a single triple in the SPARQL query. In effect, we add extra joins to the resulting SQL query. But, as we show, the cost of such additional joins is in no way prohibitive, and can be traded off in return for a deterministic, simple SPARQL-SQL compilation technique. For example, in the above example, we use two tables for the t3 triple – t3_o [which is an osp table] and t3_s [which is a spo table]. We join t1 with t3_s and t3_o with t4, and finally join t3_s and t3_o on all three columns. The final join is very fast, since it involves all three columns, so that the primary index can be used. In effect, all of the joins in the query can use the clustered primary index. Figure 5 shows that the additional joins are not prohibitively expensive, using the same example. Also, when rewriting the query for the Triple-Triple, we must ensure that we have as few rows to start with as possible.
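The Add-Join choice itself can be sketched as a small function: given every column set a triple is accessed by (its joins and bound selects), emit one triples table per required key ordering, with the understanding that multiple copies of the same triple are then joined with each other on all three columns. The representation is ours, and a real compiler would also emit the WHERE clauses; this only illustrates the table selection.

```python
# Primary-key orderings available in the Triple-Triple store.
ORDERINGS = ("spo", "pos", "osp")

def table_for(cols):
    """Ordering whose primary-key prefix covers the given columns
    (the prefix rule of Table 1)."""
    return next(t for t in ORDERINGS if set(t[:len(cols)]) == set(cols))

def add_join_tables(access_patterns):
    """Tables to use for one SPARQL triple accessed via the given
    column sets (its joins and bound selects). When several orderings
    are needed, the copies are later equijoined on all three columns:
    the cheap extra join of the Add-Join method."""
    tables = []
    for cols in access_patterns:
        t = table_for(cols)
        if t not in tables:
            tables.append(t)
    return tables

# The t3 triple of Figure 4 joins t1 on the subject and t4 on the
# object, so it needs both the spo copy (t3_s) and the osp copy (t3_o).
t3_tables = add_join_tables([{"s"}, {"o"}])
```

A triple accessed only through one ordering, such as t1, naturally collapses back to a single table.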
Since MySQL uses a single-sweep multi-join algorithm [8], this ensures that we try to match as few tuples as possible in each stage of the multi-join.

    Add-Join:
    select * from Triples_s as t1
    STRAIGHT_JOIN Triples_s as t3_s
    STRAIGHT_JOIN Triples_o as t3_o
    STRAIGHT_JOIN Triples_s as t4
    where t1.s = -1280067686087321383 and t1.s = t3_s.s
      and t3_o.o = t4.s and t3_s.s = t3_o.s
      and t3_s.p = t3_o.p and t3_s.o = t3_o.o;
    0.2 sec

    Use spo for t3:
    select * from Triples_s as t1
    STRAIGHT_JOIN Triples_s as t3
    STRAIGHT_JOIN Triples_s as t4
    where t1.s = -1280067686087321383 and t1.s = t3.s
      and t3.o = t4.s;
    0.17 sec

    Figure 5: Cost of additional joins

Hence, in addition to joins, we can also involve multiple triple tables for a single triple when there is an initial select operation on the triple. For example, though a triple joins on s, it might involve a select on p as a bound input value. In those cases, selecting p using a spo table may not be efficient. Hence, we introduce an additional pos table for the triple, perform the select upon it, and later join the pos and spo tables. We now present the algorithm to convert SPARQL to SQL, based on the Add-Join method.

Query Rewriting method:
Step 0. Convert the SPARQL query to SQL on the baseline.
Step 1. Identify the triples that have bound values for their elements, i.e. the input for the SQL query.
Step 2. In the explicit join order that we intend to provide, start with the triple with bound input values, and follow it with triples that join with that triple.
Step 3. For selects on the non-join attributes, insert entries for 'Nodes' as necessary in the explicit order.
Step 4. Once the explicit ordering is done, introduce additional triple tables for each triple, as per the Add-Join method described above.

5. Graph Based Grouping

We observe that further benefits can be achieved only through physical optimizations. One interesting observation we make is that the selects on the triples table can be improved if the related tuples are brought closer to each other on disk. For example, consider the spo table with selects using po, p, o on it. This is a common scenario when triples with different properties are joined on the subject. Remember that we introduce additional joins only for the joins between triples and selects using bound input values. By bringing the subjects with common properties and objects closer to each other, we, in effect, reduce the number of disk pages across which the result of the select operation is distributed. Thus, such a scheme would result in direct reductions in the I/O cost of the operation. The same scenario applies to the other two tables as well, bringing together related properties and objects respectively.

The integer identifiers assigned to the elements of the triples determine the order in which the triples appear on disk. Right now, these ids are assigned randomly. Hence, we cannot use general purpose clustering schemes based on Euclidean distances to group the related triples together. Also, without the Triple-Triple, it would be impossible to give equal priority to each of subject, property and object. This is because the first element of the compound primary key determines the order on disk, and hence a clustering scheme has to choose between the three elements. Another approach is to abandon the primary keys altogether and define a physical representation that brings related triples close to each other based on all three columns. However, such a scheme would compromise on having the data in sorted order. The Triple-Triple idea lends flexibility by allowing us to optimize with respect to subject, property and
object using the spo, pos and osp tables respectively. Hence, the integer ids need to be assigned intelligently in order to leverage these benefits. We now define metrics which quantify the strength of the relationship between two elements. S-score, P-score and O-score denote the metrics for interrelating subjects, properties and objects respectively.

    S-score(s1, s2) = number of triple pairs t1, t2 with
                      (t1.s = s1 and t2.s = s2) and (t1.p = t2.p or t1.o = t2.o)

defines the S-score for two subjects s1 and s2. P-score and O-score are defined similarly. Once we have these metrics computed, we build three graphs – the S-graph, P-graph and O-graph – that depict the relationships between subjects, properties and objects, using S-score, P-score and O-score as edge weights respectively. The S-graph has a vertex for each subject, and the S-score between two subjects as the edge weight. Note that no subject or object occurs as a property; hence, the problem of assigning ids to properties can be solved independently of the other two elements. However, some subjects also occur as objects, and there can be only one id assigned to such an element. We therefore prune the O-graph by removing all vertices and edges corresponding to such overlapping subjects.

Each disk page can be viewed as a cluster or group, and the problem of finding the most related subjects can be formulated as an optimization problem, as described in Figure 6. The formulation can be suitably generalized to P-graphs and O-graphs also. The problem is an instance of the Knapsack constrained Maximum Weighted cluster problem [9].

    Let S denote a cluster and Si denote a subject belonging to S.
    Objective:
        Max: Σ S-score(Si, Sj) over all i, j with i != j
    Constraints:
        Σ numTriples(Si) <= B
    where numTriples(Si) denotes the number of triples with subject Si,
    and B denotes the number of triples per block.

    Figure 6: Optimal clustering of subjects

The formulation aims to extract the cluster from the graph such that the sum of all the edge weights in the cluster is maximal, subject to the constraint that there can be at most B triples on a block. For MySQL, which uses 16KB data blocks, B = 16KB/12 ≈ 1300. Once such a maximal cluster is extracted, we assign consecutive ids to all the elements of the cluster. We then repeat the algorithm, pruning the graph by removing the assigned edges. However, in practice, this problem is NP hard and grows computationally unrealistic with large data sets which involve thousands of subjects. The S-graph is also very dense, and this complicates the solution further. For example, a 25K triples database contains 2367 subjects and 200K edges. Hence, when we scale to a billion triples, the graph construction itself may become a very long process. There are other graph clustering tools, such as MCL [10] and Graclus [11], for unsupervised learning from graphs. Though these tools do not solve the exact problem described above, they attempt to provide clusters from the graph based on connected components. Attempts at hierarchical clustering using the MCL algorithm yield only 3 clusters, reflecting the dense nature of these graphs and the non-applicability of standard graph clustering techniques.

Hence, we attempt to develop computationally feasible greedy algorithms for the problem. One such greedy algorithm is described in Figure 7. The algorithm greedily constructs parts of the relationship graphs and assigns identifiers based on such partial graphs. The algorithm closely approximates the optimal solution for certain parts of the graph.
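Both the S-score metric defined above and the greedy algorithm of Figure 7 are straightforward to sketch. Triples are modelled as (s, p, o) integer tuples, and the brute-force S-score is quadratic in the number of triples, so this is illustrative rather than a scalable implementation.

```python
from itertools import product

def s_score(triples, s1, s2):
    """S-score(s1, s2): number of triple pairs t1, t2 with t1.s == s1
    and t2.s == s2 that share a property or an object."""
    return sum(
        1
        for t1, t2 in product([t for t in triples if t[0] == s1],
                              [t for t in triples if t[0] == s2])
        if t1[1] == t2[1] or t1[2] == t2[2])

def greedy_group(triples, B):
    """Greedy grouping per Figure 7: for each subject S not yet
    assigned an id, assign the next id to S, discover the subjects Si
    related to S (non-zero S-score) and assign them the following ids,
    stopping once a block's worth of B triples is filled."""
    num_triples = {}
    for s, _, _ in triples:
        num_triples[s] = num_triples.get(s, 0) + 1
    ids, next_id = {}, 0
    for s in num_triples:
        if s in ids:
            continue
        ids[s], next_id = next_id, next_id + 1
        budget = B - num_triples[s]
        related = [(s_score(triples, s, si), si)
                   for si in num_triples if si not in ids]
        # Figure 7 assigns in increasing order of S-score(S, Si).
        for score, si in sorted(related):
            if score == 0:
                continue
            if num_triples[si] > budget:
                break
            ids[si], next_id = next_id, next_id + 1
            budget -= num_triples[si]
    return ids
```

Consecutive ids for related subjects then translate into co-located triples once the tables are rebuilt with the new ids.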
Nonetheless, it is suboptimal, since we ignore the strength of the relationships between the discovered subjects Si.

    While there is a subject S [that has not been assigned an id]:
        - Assign the next available id to S
        - Compute all the subjects Si related to S, i.e. those with a
          non-zero S-score with S
        - Compute S-score(S, Si) for all such discovered subjects
        - Assign ids to the Si in increasing order of S-score(S, Si),
          till Σ numTriples(Si) <= B

    Figure 7: Greedy Grouping

In Section 6, we validate the effectiveness of this algorithm. Once again, the id assignment for properties can be done in a symmetric fashion. For the objects, we need to additionally ignore objects which are also subjects.

6. Empirical results

We present empirical results that demonstrate the validity of the approaches proposed in this paper. Specifically, we study the query performance compared to the baseline, and the real benefits of the grouping technique described in the previous section. Our experimental setup is detailed in Table 2.

    RDBMS                MySQL 5.0
    OS                   Ubuntu 8.04
    Processor            AMD Turion TL 58
    32/64 bit            32
    Processor speed      1.9 GHz
    L1 cache (KB)        128
    L2 cache (KB)        512
    FSB speed (MHz)      800
    RAM (GB)             2
    Disk capacity (GB)   160
    Disk rotation (RPM)  5400
    Buffering (MB)       8

    Table 2: Platform details

6.1 Query Performance

For each scheme that we evaluate, we define a metric called speed up to compare it against the baseline. Speed up is simply the query response time for the baseline divided by the query response time for the scheme on the same query. The higher the speed up, the better the scheme, i.e. it determines how many magnitudes of performance improvement are achieved by the scheme. Figures 8 and 9 present the speed ups for three schemes – Rewrite (simply rewriting the query by explicitly specifying the join order), Add-Join, and Add-Join with grouping. The results are presented for queries 3, 4, 5 and 6 [12]. Queries 4 and 5 are typical examples of the complex queries that cause scalability problems for applications.

The improvements on query 3 and query 6 are not significant. The baseline timings for those queries are not very large in the first place, and the queries involve fewer joins. For example, query 6 involves only one triple; Add-Join uses two tables for the query and offers the same performance as the baseline. There are very significant benefits on queries 4 and 5. The benefits due to the Add-Join method account for the bulk of the improvement over the baseline, amounting to approximately 50x over the baseline. The direct benefits due to the grouping technique amount to 10x-20x over the baseline, when compared to the Add-Join method without grouping. It remains to be seen if better grouping techniques can yield significantly higher benefits.
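As a sanity check of the metric, the two timings reported in Figure 3 earlier give the speed up of the Triple-Triple rewrite over the baseline for that three-table join:

```python
def speed_up(baseline_seconds, scheme_seconds):
    """Query response time of the baseline divided by that of the
    scheme, for the same query."""
    return baseline_seconds / scheme_seconds

# Figure 3's three-table join: 2 min 10.83 s on the baseline versus
# 0.28 s with Triple-Triple.
figure3_speed_up = speed_up(2 * 60 + 10.83, 0.28)  # roughly 467x
```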
However, these results demonstrate the validity of the techniques described earlier.

    Figure 8: Speedup from baseline (250K triples)
    Figure 9: Speedup from baseline (1M triples)
6.2 Validation of Grouping scheme

It is also important to validate the grouping scheme presented in Section 5. We expect the grouping scheme to reduce the number of disk I/Os for selects on the triples tables. Table 3 presents the number of rows the query optimizer expects to examine for answering selects on the triple tables. It can be seen that the grouping has resulted in a decrease in the number of rows examined for the query.

    Table      No grouping   With grouping
    spo table  1000619       973866
    pos table  1000619       805152
    osp table  1000619       828004

    Table 3: Expected number of rows accessed for selects

We also measure the amount of interrelationship between the triples in each block, with and without grouping, to observe the effectiveness of our grouping algorithm. For each disk block [i.e. a set of 1300 tuples] we construct S-graphs, P-graphs and O-graphs representing those clusters. We then compute the sum of all the edge weights of each such cluster, and average it across all data blocks. This allows us to quantify the effectiveness of our grouping scheme. Ideally, we should also be comparing our grouping scheme against the optimal solution. However, the optimal solution is very hard to compute, as mentioned earlier, and also cannot be predicted accurately, since it depends on the nature of the triples. We divide the average edge weight for a table with grouping by the average edge weight for the corresponding table without grouping, to obtain a metric called the R-score, or relationship score, for those two tables. Figure 10 presents the R-scores for all three triples tables, for the 250K and 1M triple tables.

    Figure 10: R-score

The results indicate that the grouping algorithm has been quite effective, increasing the interrelationship by 2x-10x over the random id assignment. The R-score for the pos table is lower, since there are fewer properties, and hence in a number of cases a single data block is full of triples from the same property. In fact, the average edge weight for the pos table is much lower than those of the spo and osp tables. The values are higher for the osp table, since there are far more objects than subjects, in which case there are more edges in the constructed cluster graph.
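The R-score measurement can be sketched as follows. Blocks are modelled simply as consecutive runs of triples and the S-score is reused as the edge weight, so this mirrors the procedure described above rather than reproducing the exact measurement harness.

```python
from itertools import combinations, product

def s_score(triples, s1, s2):
    # Same S-score definition as in Section 5.
    return sum(
        1
        for t1, t2 in product([t for t in triples if t[0] == s1],
                              [t for t in triples if t[0] == s2])
        if t1[1] == t2[1] or t1[2] == t2[2])

def avg_block_weight(triples, block_size):
    """Average, across disk blocks (modelled as consecutive runs of
    block_size triples), of the total S-score edge weight between the
    subjects co-located in each block."""
    blocks = [triples[i:i + block_size]
              for i in range(0, len(triples), block_size)]
    total = 0
    for block in blocks:
        subjects = sorted({t[0] for t in block})
        total += sum(s_score(block, a, b)
                     for a, b in combinations(subjects, 2))
    return total / len(blocks)

def r_score(grouped, ungrouped, block_size):
    """R-score: in-block relatedness under the grouped id assignment
    relative to the relatedness under the random assignment."""
    return (avg_block_weight(grouped, block_size)
            / avg_block_weight(ungrouped, block_size))
```

An R-score above 1 means the grouping packed more related subjects into each block than the random assignment did.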
7. Conclusions and Future work

Thus, we have explored some promising approaches to improving query performance in relational triple stores. An interesting query rewriting mechanism, which introduces additional joins to speed up query execution, has been discussed. Optimization of the physical schema, by leveraging the interrelationship between the elements of a triple, has been proposed. A greedy grouping algorithm which is simple and computationally feasible has been proposed and validated. The results show that our approach is promising and can potentially be combined with other techniques in the literature to yield faster RDF stores. As part of future work, we intend to compare the performance of the system with the Hexastore, and potentially enhance the Hexastore with our grouping algorithm. As mentioned earlier, we would like to develop better grouping algorithms by leveraging parallel computing techniques to overcome the computational issues. We believe that better grouping algorithms can yield significantly higher performance. Another key observation we make is that no physical schema will perform best for all types of queries. Hence, with the cheap availability of storage, multiple physical schemas can coexist within the same RDF store, and the SPARQL compiler can judiciously employ them based on the type of the query.

8. References

[1] Resource Description Framework. http://www.w3.org/RDF/
[2] http://iswc2008.semanticweb.org/calls/call-for-semantic-web-challenge-and-billion-triples-tracks/
[3] Das, S., Chong, E. I., Eadon, G., and Srinivasan, J. 2004. Supporting ontology-based semantic matching in RDBMS. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada, August 31 - September 03, 2004). M. A. Nascimento, M. T. Özsu, D. Kossmann, R. J. Miller, J. A. Blakeley, and K. B. Schiefer, Eds. VLDB Endowment, 1054-1065.
[4] David Wood. "Kowari: A Platform for Semantic Web Storage and Analysis". In XTech 2005 Conference.
[5] Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. 2007. Scalable semantic web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, September 23-27, 2007). VLDB Endowment, 411-422.
[6] Weiss, C., Karras, P., and Bernstein, A. 2008. Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endow. 1, 1 (Aug. 2008), 1008-1019. DOI=http://doi.acm.org/10.1145/1453856.145396
[7] Nested-Loop Join Algorithms. http://dev.mysql.com/doc/refman/5.0/en/nested-loop-joins.html
[8] Using EXPLAIN Syntax. http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
[9] Anuj Mehrotra and Michael A. Trick. "Cliques and Clustering: A Combinatorial Approach".
[10] MCL: An algorithm for clustering graphs. http://www.micans.org/mcl/
[11] Graclus. http://www.cs.utexas.edu/users/dml/Software/graclus.html
[12] Class project website. http://www.cs.utexas.edu/~jsequeda/cs386d/project.html