1. Parallel and Distributed Databases




1   1. Parallel DB /D.S.Jagli                        2/12/2013
Parallel Database
    1. Introduction
    2. Architectures for parallel databases
    3. Parallel query evaluation
    4. Parallelizing individual operations




Introduction

     What is a Centralized Database?
    - All the data is maintained at a single site, and the processing of
      individual transactions is assumed to be essentially sequential.




PARALLEL DBMSs
      WHY DO WE NEED THEM?

    • More and more data!

      We have databases that hold huge amounts of
      data, on the order of 10^12 bytes:

      10,000,000,000,000 bytes!

    • Faster and faster access!

      We have data applications that need to process
      data at very high speeds:

      tens of thousands of transactions per second!

    SINGLE-PROCESSOR DBMSs AREN'T UP TO THE JOB!
Why Parallel Access To Data?

     At 10 MB/s, scanning 1 terabyte takes 1.2 days.
     With 1,000-way parallelism, scanning 1 terabyte takes about 1.5 minutes.

     Parallelism: divide a big problem
     into many smaller ones
     to be solved in parallel.
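The arithmetic behind those numbers can be checked directly (a sketch, assuming 1 terabyte = 10^12 bytes and 10 MB/s = 10^7 bytes/s):

```python
# Assumed units: 1 terabyte = 10**12 bytes, 10 MB/s = 10**7 bytes/s.
TERABYTE = 10 ** 12
SCAN_RATE = 10 ** 7                       # bytes/second for one disk

serial_seconds = TERABYTE / SCAN_RATE     # one disk scanning alone
parallel_seconds = serial_seconds / 1000  # 1,000-way parallel scan

print(serial_seconds / 86400)  # ~1.16 days
print(parallel_seconds / 60)   # ~1.67 minutes (the slide rounds to 1.5)
```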
Parallel DB
     A parallel database system seeks to improve performance through
       parallelization of various operations, such as loading data, building
       indexes, and evaluating queries, by using multiple CPUs and disks in
       parallel.


     Motivation for Parallel DB
     Parallel machines are becoming quite common and affordable
       Prices of microprocessors, memory, and disks have dropped sharply
     Databases are growing increasingly large
       Large volumes of transaction data are collected and stored for later
        analysis.
       Multimedia objects like images are increasingly stored in databases




PARALLEL DBMSs
           BENEFITS OF A PARALLEL DBMS


        Improves Throughput.

        INTERQUERY PARALLELISM
        It is possible to process a number of transactions in
        parallel with each other.


        Improves Response Time.

        INTRAQUERY PARALLELISM
        It is possible to process 'sub-tasks' of a single transaction in
        parallel with each other.
PARALLEL DBMSs
       HOW TO MEASURE THE BENEFITS


    Speed-Up
    – Adding more resources results in proportionally less running time for a
       fixed amount of data.
    10 seconds to scan a DB of 10,000 records using 1 CPU
    1 second to scan the same DB of 10,000 records using 10 CPUs


    Scale-Up
    – If resources are increased in proportion to an increase in data/problem
       size, the overall running time should remain constant.
    1 second to scan a DB of 1,000 records using 1 CPU
    1 second to scan a DB of 10,000 records using 10 CPUs
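The two metrics can be stated as simple ratios; an illustrative sketch (the function names are ours, not a standard API):

```python
def speed_up(old_time, new_time):
    """Running-time ratio for a FIXED problem size; the ideal (linear
    speed-up) equals the resource ratio."""
    return old_time / new_time

def scale_up(small_time, big_time):
    """Running-time ratio when data and resources grow together;
    the ideal (linear scale-up) is 1.0."""
    return small_time / big_time

# Speed-up example: 10,000 records, 1 CPU (10 s) vs. 10 CPUs (1 s).
print(speed_up(10, 1))   # 10.0 -> linear speed-up
# Scale-up example: 1,000 records/1 CPU vs. 10,000 records/10 CPUs, both 1 s.
print(scale_up(1, 1))    # 1.0 -> linear scale-up
```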


Architectures for Parallel Databases
     The basic idea behind a parallel DB is to carry out evaluation steps in
       parallel whenever possible.

     There are many opportunities for parallelism in an RDBMS.


     Three main architectures have been proposed for building parallel DBMSs:
            1. Shared Memory
            2. Shared Disk
            3. Shared Nothing




Shared Memory
                                                       Advantages:
                                                      1.   It is closest to a conventional
                                                           machine and easy to program.
                                                      2.   Communication overhead is low.
                                                      3.   OS services are leveraged to
                                                           utilize the additional CPUs.
                                                       Disadvantages:
                                                      1.    Memory contention becomes a
                                                            bottleneck as CPUs are added.
                                                      2.    Expensive to build.
                                                      3.    It is less sensitive to
                                                            partitioning.




Shared Disk
                                                       Advantages:
                                                      1.    Almost the same as shared
                                                            memory.
                                                       Disadvantages:
                                                      1.    More interference among nodes.
                                                      2.    Requires more network bandwidth
                                                            as the system scales.
                                                      3.    Shared disk is less sensitive to
                                                            partitioning.




Shared Nothing
                                                          Advantages:
                                                      1.   It provides linear scale-up
                                                           and linear speed-up.
                                                      2.   Shared nothing benefits from
                                                           "good" partitioning.
                                                      3.   Cheap to build.
                                                          Disadvantages:
                                                      1.   Hard to program.
                                                      2.   Adding new nodes requires
                                                           reorganizing the data.




PARALLEL DBMSs
      SPEED-UP

      [Chart: number of transactions/second vs. number of CPUs. The linear
      speed-up line (ideal) rises in proportion to the number of CPUs; the
      measured sub-linear speed-up curve falls below it, e.g. 1000/sec at
      5 CPUs, 1600/sec at 10 CPUs, and 2000/sec at 16 CPUs.]
PARALLEL DBMSs
      SCALE-UP

      [Chart: number of transactions/second vs. number of CPUs and database
      size. Linear scale-up (ideal) holds steady at 1000/sec as the system
      grows from 5 CPUs / 1 GB to 10 CPUs / 2 GB; sub-linear scale-up drops,
      e.g. to 900/sec.]
PARALLEL QUERY EVALUATION
     A relational query execution plan is a graph/tree of
     relational algebra operators; based on this structure, operators
     can execute in parallel.




Different Types of DBMS Parallelism
      Parallel evaluation of a relational query in a DBMS with a shared-nothing
       architecture:
     1. Inter-query parallelism
            Multiple queries run on different sites.
     2. Intra-query parallelism
            Parallel execution of a single query across different sites.
         a)    Intra-operator parallelism
                      Get all machines working together to compute a given operation (scan, sort,
                      join).
         b)    Inter-operator parallelism
                   Each operator may run concurrently on a different site (exploits
                    pipelining).
                   In order to evaluate different operators in parallel, we need to
                    evaluate each operator in the query plan in parallel.
Data Partitioning
      Types of Partitioning
     1.    Horizontal partitioning: the tuples of a relation are divided among
           many disks such that each tuple resides on one disk.
           It enables us to exploit the I/O bandwidth of the disks by reading and
            writing them in parallel.
           It reduces the time required to retrieve relations from disk by
            partitioning the relations across multiple disks.
             1. Range partitioning
             2. Hash partitioning
             3. Round-robin partitioning


     2.    Vertical partitioning
1. Range Partitioning
      Tuples are sorted (conceptually), and n ranges are chosen for
       the sort-key values so that each range contains roughly the
       same number of tuples;
      tuples in range i are assigned to processor i.
      E.g.:
         sailor_id 1-10 assigned to disk 1
         sailor_id 11-20 assigned to disk 2
         sailor_id 21-30 assigned to disk 3


      Range partitioning can lead to data skew; that is, partitions with widely
       varying numbers of tuples across them.
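A minimal sketch of a range-partitioning function, using the (hypothetical) split points from the example above and 0-based disk numbering:

```python
import bisect

# Hypothetical split points: ids 1-10 -> disk 0, 11-20 -> disk 1,
# 21-30 -> disk 2 (0-based disk numbering here).
SPLIT_POINTS = [10, 20]   # inclusive upper bound of each range but the last

def range_partition(key, split_points=SPLIT_POINTS):
    """Return the disk/processor index whose range contains `key`."""
    return bisect.bisect_left(split_points, key)

print(range_partition(7), range_partition(15), range_partition(28))  # 0 1 2
```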

2. Hash Partitioning
      A hash function is applied to selected fields of a tuple to determine its
        processor.

      Hash partitioning has the additional virtue that it keeps data evenly
        distributed even as the data grows or shrinks over time.
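A minimal sketch; Python's built-in `hash()` stands in for the DBMS's hash function (an assumption: a real system uses a fixed, stable hash over the chosen fields):

```python
# hash() stands in for the DBMS's hash function (an assumption).
def hash_partition(key, n_processors):
    return hash(key) % n_processors

# Spread 1,000 hypothetical sailor tuples over 4 processors.
tuples = [(sid, f"sailor{sid}") for sid in range(1000)]
counts = [0] * 4
for sid, _ in tuples:
    counts[hash_partition(sid, 4)] += 1
print(counts)  # roughly equal counts per partition
```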




3. Round-Robin Partitioning

      If there are n processors, the i-th tuple is assigned to processor i mod n in
       round-robin partitioning.

      Round-robin partitioning is suitable for efficiently evaluating queries that
       access the entire relation.

         If only a subset of the tuples (e.g., those that satisfy the selection
          condition age = 20) is required, hash partitioning and range partitioning
          are better than round-robin partitioning.
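The i mod n rule is one line of code; a sketch:

```python
# The i-th tuple goes to processor i mod n.
def round_robin_partition(tuple_index, n_processors):
    return tuple_index % n_processors

assignment = [round_robin_partition(i, 3) for i in range(7)]
print(assignment)  # [0, 1, 2, 0, 1, 2, 0]
```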




Comparison of the three schemes

[Figure: each scheme distributes tuples with keys A...Z across five
disks/partitions.]

Range: good for equijoins, exact-match queries, and range queries.
Hash: good for equijoins and exact-match queries.
Round-robin: good for spreading load evenly.
Parallelizing Sequential Operator Evaluation Code
     1.    An elegant software architecture for parallel DBMSs enables us to
           readily parallelize existing code for sequentially evaluating a
           relational operator.

     2.    The basic idea is to use parallel data streams.

     3.    Streams are merged as needed to provide the inputs for a relational
           operator.

     4.    The output of an operator is split as needed to parallelize subsequent
           processing.

     5.    A parallel evaluation plan consists of a dataflow network of
           relational, merge, and split operators.
PARALLELIZING INDIVIDUAL OPERATIONS
      How can various operations be implemented in parallel in a shared-
        nothing architecture?


      Techniques
           1. Bulk loading & scanning
           2. Sorting
           3. Joins




1. Bulk Loading and Scanning
      Scanning a relation: pages can be read in parallel while scanning a
        relation, and the retrieved tuples can then be merged, if the relation is
        partitioned across several disks.

      Bulk loading: if a relation has associated indexes, any sorting of data
        entries required for building the indexes during bulk loading can also
        be done in parallel.




2. Parallel Sorting
          Parallel sorting steps:
          1. First, redistribute all tuples in the relation using range partitioning.
          2. Each processor then sorts the tuples assigned to it.
          3. The entire sorted relation can be retrieved by visiting the processors in
              an order corresponding to the ranges assigned to them.


          Problem: data skew

          Solution: "sample" the data at the outset to determine good
           range-partition points.

         A particularly important application of parallel sorting is sorting the data
          entries in tree-structured indexes.
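The steps above, including the sampling fix for skew, can be simulated with Python lists standing in for processors (an illustration of the idea, not how a real engine is structured):

```python
import random

def parallel_sort(tuples, n_procs):
    # Step 0: sample the data to choose good range-partition points.
    sample = sorted(random.sample(tuples, min(100, len(tuples))))
    splits = [sample[len(sample) * i // n_procs] for i in range(1, n_procs)]
    # Step 1: redistribute all tuples using range partitioning.
    partitions = [[] for _ in range(n_procs)]
    for t in tuples:
        partitions[sum(t >= s for s in splits)].append(t)
    # Step 2: each "processor" sorts the tuples assigned to it.
    for part in partitions:
        part.sort()
    # Step 3: visit the processors in range order to get the sorted relation.
    return [t for part in partitions for t in part]

data = list(range(1000))
random.shuffle(data)
assert parallel_sort(data, 4) == sorted(data)
```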
3. Parallel Join
     1.   The basic idea for joining A and B in parallel is to decompose the join
          into a collection of k smaller joins by partitioning both relations.

     2.   By using the same partitioning function for both A and B, we ensure that
          the union of the k smaller joins computes the join of A and B.


             Hash join
             Sort-merge join




Sort-Merge Join
      Partition A and B by dividing the range of the join attribute into k disjoint
       subranges and placing A and B tuples into partitions according to the
       subrange to which their values belong.

      Each processor carries out a local join.


      In this case the number of partitions k is chosen to be equal to the number
       of processors n.

      As the result of the join of A and B is produced, the output of the join
       process may be split into several data streams.

                 The advantage is that the output is available in sorted order.
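The scheme can be sketched as follows; the split points are hypothetical, and a simple nested-loop join stands in for each processor's local sort-merge step (an assumption made for brevity):

```python
# A and B are lists of tuples whose first field is the join attribute.
def partitioned_join(a, b, splits):
    k = len(splits) + 1                     # one partition per "processor"

    def part(key):                          # subrange index for a key
        return sum(key >= s for s in splits)

    a_parts = [[] for _ in range(k)]
    b_parts = [[] for _ in range(k)]
    for t in a:
        a_parts[part(t[0])].append(t)
    for t in b:
        b_parts[part(t[0])].append(t)

    out = []                                # concatenating range-ordered local
    for ap, bp in zip(a_parts, b_parts):    # results keeps the output sorted
        ap.sort()
        bp.sort()
        out.extend((x, y) for x in ap for y in bp if x[0] == y[0])
    return out

A = [(1, "a"), (2, "b"), (3, "c")]
B = [(2, "x"), (3, "y"), (4, "z")]
print(partitioned_join(A, B, splits=[2]))
# [((2, 'b'), (2, 'x')), ((3, 'c'), (3, 'y'))]
```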
Dataflow Network of Operators for Parallel Join

     Good use of split/merge operators makes it easier to build parallel
     versions of sequential join code.
DDBMS

           I.   Introduction to DDBMS
           II. Architecture of DDBs
           III. Storing data in DDBs
           IV. Distributed catalog management
           V. Distributed query processing
           VI. Transaction Processing
           VII. Distributed concurrency control and recovery
I. Introduction to DDBMS
      Data in a distributed database system is stored across several sites.


      Each site is typically managed by a DBMS that can run independently of
       the other sites, and the sites co-operate in a transparent manner.

         Transparent implies that each user within the system may access all of
           the data within all of the databases as if it were a single database.

      There should be 'location independence': since the user is unaware of
       where the data is located, it is possible to move the data from one physical
       location to another without affecting the user.



DDBMS properties
      Distributed data independence: Users should be able to ask queries
        without specifying where the referenced relations, or copies or
        fragments of the relations, are located.

      Distributed transaction atomicity: Users should be able to write
        transactions that access and update data at several sites just as they
        would write transactions over purely local data.




DISTRIBUTED PROCESSING ARCHITECTURE

     [Diagram: four sites (Delhi, Mumbai, Hyderabad, Pune), each with a LAN
     of client machines. A single DBMS resides at one site; all clients
     access it over the network.]
DISTRIBUTED DATABASE ARCHITECTURE

     [Diagram: the same four sites (Delhi, Mumbai, Hyderabad, Pune), but now
     each site runs its own DBMS with local clients, and the sites are
     connected over the network.]
Distributed database
      Communication network: DBMS and data at each node.

      Users are unaware of the distribution of the data:
      location transparency.
Types of Distributed Databases
      Homogeneous distributed database system :
         If data is distributed but all servers run the same DBMS software.


      Heterogeneous distributed database :
         Different sites run under the control of different DBMSs,
              essentially autonomously, and are connected to enable access to data
              from multiple sites.

        The key to building heterogeneous systems is to have well-accepted
         standards for gateway protocols.
        A gateway protocol is an API that exposes DBMS functionality to
         external applications.




DDBMS

           I.   Introduction to DDBMS
           II. Architecture of DDBs
           III. Storing data in DDBs
           IV. Distributed catalog management
           V. Distributed query processing
           VI. Transaction Processing
           VII. Distributed concurrency control and recovery
2. DISTRIBUTED DBMS ARCHITECTURES
     1. Client-Server Systems
     2. Collaborating Server Systems
     3. Middleware Systems




1. Client-Server Systems
       1. A Client-Server system has one or more client
          processes and one or more server processes.
       2. A client process can send a query to any one server
          process.
       3. Clients are responsible for user-interface issues.
       4. Servers manage data and execute transactions.
       5. A client process could run on a personal computer
          and send queries to a server running on a mainframe.
       6. The Client-Server architecture does not allow a single
          query to span multiple servers.

SPECIALISED NETWORK CONNECTION

     [Diagram: dumb terminals connected to a mainframe computer, which hosts
     the presentation logic, business logic, and data logic.]
2. Collaborating Server Systems
       1. In the Client-Server architecture, the client process would have to
          be capable of breaking a multi-server query into appropriate
          subqueries.
       2. A Collaborating Server system instead has a collection of
          database servers, each capable of running transactions
          against local data, which cooperatively execute
          transactions spanning multiple servers.
       3. When a server receives a query that requires access to
          data at other servers, it generates appropriate subqueries
          to be executed by other servers,
       4. and puts the results together to compute the answer to the
          original query.

M:N CLIENT/SERVER DBMS ARCHITECTURE

     [Diagram: clients #1-#3 each connected to both SERVER #1 and SERVER #2,
     each server managing its own database. NOT TRANSPARENT!]
3. Middleware Systems:
      The Middleware architecture is designed to allow a single
        query to span multiple servers, without requiring all
        database servers to be capable of managing such multisite
        execution strategies.

      It is especially attractive when trying to integrate several
        legacy systems, whose basic capabilities cannot be
        extended.




DDBMS

           I.   Introduction to DDBMS
           II. Architecture of DDBs
           III. Storing data in DDBs
           IV. Distributed catalog management
           V. Distributed query processing
           VI. Transaction Processing
           VII. Distributed concurrency control and recovery
3. Storing Data in DDBs
      In a distributed DBMS, relations are stored across
       several sites.
      Accessing a relation that is stored at a remote site
       incurs message-passing costs.
      A single relation may be partitioned or fragmented
       across several sites.




Types of Fragmentation:
     Horizontal fragmentation: The union of the horizontal
     fragments must be equal to the original relation. Fragments
     are usually also required to be disjoint.

     Vertical fragmentation: The collection of vertical fragments
     should be a lossless-join decomposition.
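Both correctness conditions can be demonstrated on a toy relation (the sailor tuples are made up for illustration):

```python
# Toy Sailors relation: (sid, sname, rating).
sailors = [(1, "Dustin", 7), (2, "Lubber", 8), (3, "Rusty", 10)]

# Horizontal fragmentation: disjoint row subsets whose union is the relation.
frag_low = [t for t in sailors if t[2] < 9]
frag_high = [t for t in sailors if t[2] >= 9]
assert frag_low + frag_high == sailors          # union = original relation

# Vertical fragmentation: column subsets that both keep the key (sid),
# so joining them back on the key loses nothing (lossless-join).
names = {sid: sname for sid, sname, _ in sailors}
ratings = {sid: rating for sid, _, rating in sailors}
rejoined = [(sid, names[sid], ratings[sid]) for sid in sorted(names)]
assert rejoined == sailors                      # lossless-join decomposition
```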




Replication
      Replication means that we store several copies of a relation
        or relation fragment.

      The motivation for replication is twofold:
         1. Increased availability of data:
         2. Faster query evaluation:


      Two kinds of replication:
        1.    synchronous replication
        2.    asynchronous replication


DDBMS

           I.   Introduction to DDBMS
           II. Architecture of DDBs
           III. Storing data in DDBs
           IV. Distributed Catalog Management
           V. Distributed Query Processing
           VI. Transaction Processing
           VII. Distributed concurrency control and recovery
4. Distributed Catalog Management
     1.    Naming Objects
     •      If a relation is fragmented and replicated, we must be able to
            uniquely identify each replica of each fragment. Each object gets:
           1. A local name field
           2. A birth site field


     2.    Catalog Structure
           A centralized system catalog can be used, but it is vulnerable to
            failure of the site containing the catalog.
           An alternative is to maintain a copy of a global system catalog at
            every site (this compromises site autonomy).



4. Distributed Catalog Management
      A better approach:
      Each site maintains a local catalog that describes all copies
        of data stored at that site.

      In addition, the catalog at the birth site for a relation is
        responsible for keeping track of where replicas of the
        relation are stored.




DDBMS

           I.   Introduction to DDBMS
           II. Architecture of DDBs
           III. Storing data in DDBs
           IV. Distributed Catalog management
           V. Distributed Query Processing
           VI. Transaction Processing
           VII. Distributed concurrency control and recovery
5. Distributed Query Processing
       Distributed query processing: transform a high-level
           query (in relational calculus/SQL) on a distributed database
           (i.e., a set of global relations) into an equivalent and
           efficient lower-level query (in relational algebra) on
           relation fragments.

       Distributed query processing is more complex because of:
      1.    Fragmentation/replication of relations
      2.    Additional communication costs
      3.    Parallel execution




Distributed Query Processing Steps




5. Distributed Query Processing
           Sailors(sid: integer, sname: string, rating: integer, age: real)
           Reserves(sid: integer, bid: integer, day: date, rname: string)


       Assume the Reserves and Sailors relations:
           each tuple of Reserves is 40 bytes long
           a page can hold 100 Reserves tuples
           there are 1,000 pages of such tuples
           each tuple of Sailors is 50 bytes long
           a page can hold 80 Sailors tuples
           there are 500 pages of such tuples

       How do we estimate the cost?
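The relation sizes implied by these catalog statistics follow directly; the cost estimates on the next slides are built on such page counts:

```python
# Catalog statistics from the slide.
reserves_pages = 1000
reserves_tuples_per_page = 100   # 40-byte tuples
sailors_pages = 500
sailors_tuples_per_page = 80     # 50-byte tuples

n_reserves = reserves_pages * reserves_tuples_per_page
n_sailors = sailors_pages * sailors_tuples_per_page
print(n_reserves, n_sailors)     # 100000 40000
```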


5.Distributed Query Processing
      Criteria for measuring the cost of a query evaluation
        strategy
 For centralized DBMSs: the number of disk accesses (# blocks
   read/written)
         For distributed databases, additionally
             The cost of data transmission over the network
             Potential gain in performance from having several sites processing parts
               of the query in parallel




5.Distributed Query Processing
 To estimate the cost of an evaluation strategy, in addition to counting
  the number of page I/Os,
 we must count the number of pages shipped between sites: the
  communication cost.
 Communication cost is a significant component of overall cost in a
  distributed database.


         1.    Nonjoin Queries in a Distributed DBMS
         2.    Joins in a Distributed DBMS
         3.    Cost-Based Query Optimization




5.Distributed Query Processing
        1.Nonjoin Queries in a Distributed DBMS
         Even simple operations such as scanning a relation, selection, and
            projection are affected by fragmentation and replication.

                             SELECT S.age
                             FROM Sailors S
                             WHERE S.rating > 3 AND S.rating < 7
  Suppose that the Sailors relation is horizontally fragmented, with all
   tuples having a rating less than 5 at MUMBAI and all tuples having a
   rating greater than or equal to 5 at DELHI.

             The DBMS must answer this query by evaluating it at both sites and
                             taking the union of the answers.

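The union-of-local-answers strategy above can be sketched in a few lines of Python; the site contents and helper names here are hypothetical sample data, not from the slides.

```python
# Sketch: answering a selection over a horizontally fragmented relation
# by evaluating it at each site and unioning the results.
mumbai = [  # Sailors tuples with rating < 5 (hypothetical data)
    {"sid": 22, "age": 39.0, "rating": 4},
    {"sid": 29, "age": 47.0, "rating": 1},
]
delhi = [   # Sailors tuples with rating >= 5 (hypothetical data)
    {"sid": 31, "age": 55.5, "rating": 6},
    {"sid": 58, "age": 35.0, "rating": 10},
]

def local_query(fragment):
    """Evaluate SELECT S.age WHERE S.rating > 3 AND S.rating < 7 on one fragment."""
    return [s["age"] for s in fragment if 3 < s["rating"] < 7]

# The predicate overlaps both fragments, so both sites must be queried
# and the answers unioned.
answer = local_query(mumbai) + local_query(delhi)
print(answer)
```

Because the selection range (3, 7) straddles the fragmentation boundary at rating 5, neither site alone can produce the full answer.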
5.Distributed Query Processing
         Eg 1:               SELECT avg(age)
                             FROM Sailors S
                             WHERE S.rating > 3 AND S.rating < 7
 Taking the union of the answers is not enough: the global average
  cannot be computed by averaging the two sites' averages; each site must
  return a sum and a count instead.
 Eg 2:        SELECT S.age
              FROM Sailors S
              WHERE S.rating > 6
 Here the union is unnecessary: only the DELHI fragment can contain
  qualifying tuples, so the query need only be evaluated at DELHI.


      Eg 3: suppose that the Sailors relation is vertically fragmented, with the
       sid and rating fields at MUMBAI and the sname and age fields at
       DELHI
 This vertical fragmentation would be a lossy decomposition unless a
  tuple id is included in each fragment.

5.Distributed Query Processing
 Eg 4: the entire Sailors relation is stored at both the MUMBAI and
  DELHI sites.
      Where should the query be executed?
 2. Joins in a Distributed DBMS
      Eg: the Sailors relation is stored at MUMBAI, and that the Reserves
       relation is stored at DELHI.
      Joins of relations at different sites can be very expensive.
      JOIN STRATEGY
        1.   Fetch as needed
        2. Ship whole
        3. Semijoins
        4. Bloomjoins


Which strategy is better for me?
5.Distributed Query Processing
 1. Fetch As Needed
      Page-oriented Nested Loops join: For each page of R, get each page
        of S, and write out matching pairs of tuples <r, s>, where r is in R-page
        and s is in S-page.

      We could do a page-oriented nested loops join in MUMBAI with
        Sailors as the outer, and for each Sailors page, fetch all Reserves pages
        from DELHI.

 If we cache the fetched Reserves pages in MUMBAI until the join is
  complete, each page is fetched only once.




Fetch As Needed: Transferring the relation piecewise




                                 QUERY: The query asks for R ⋈ S


5.Distributed Query Processing
      Assume Reserves and Sailors relations
            each tuple of Reserves is 40 bytes long
            a page can hold 100 Reserves tuples
            1,000 pages of such tuples.
            each tuple of Sailors is 50 bytes long
            a page can hold 80 Sailors Tuples
            500 pages of such tuples
 td is the cost to read/write one page; ts is the cost to ship one page.
 The cost to scan Sailors is 500td.
 For each Sailors page, the cost of scanning and shipping all of Reserves
  is 1000(td + ts).
 The total cost is 500td + 500,000(td + ts).


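The fetch-as-needed arithmetic above can be checked with a short sketch, treating td (page read/write) and ts (page ship) as symbolic unit costs:

```python
# Fetch-as-needed cost for the Sailors (outer) / Reserves (inner) join,
# using the page counts from the slide.
SAILORS_PAGES = 500
RESERVES_PAGES = 1000

def fetch_as_needed_cost(td, ts):
    scan_sailors = SAILORS_PAGES * td
    # For each of the 500 Sailors pages, scan and ship all 1000 Reserves pages.
    per_outer_page = RESERVES_PAGES * (td + ts)
    return scan_sailors + SAILORS_PAGES * per_outer_page

# With td = ts = 1 unit: 500 + 500,000 * 2 = 1,000,500 cost units.
print(fetch_as_needed_cost(1, 1))
```

The 500,000(td + ts) term dominates, which is why fetch as needed is rarely attractive without caching.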
5.Distributed Query Processing
 The cost of shipping the result depends on the size of the result; for
  a large result, shipping it could cost more than shipping both Sailors
  and Reserves to the query site.

 2. Ship Whole (ship to one site; the join itself can use sort-merge)
 The cost of scanning and shipping Sailors, saving it at DELHI, and then
  doing the join at DELHI is

   = 500(2td + ts) + (result)td

 The cost of shipping Reserves and doing the join at MUMBAI is

   = 1000(2td + ts) + (result)td

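The two ship-whole options can be compared numerically; this sketch takes td, ts, and the result size as parameters (the unit values plugged in at the end are hypothetical):

```python
# 2*td accounts for reading a page at one site and writing it at the other;
# (result)td is the cost of writing out the join result.
def ship_sailors_to_delhi(td, ts, result_pages):
    return 500 * (2 * td + ts) + result_pages * td

def ship_reserves_to_mumbai(td, ts, result_pages):
    return 1000 * (2 * td + ts) + result_pages * td

# Shipping the smaller relation (Sailors, 500 pages) is cheaper,
# regardless of the result size, since the (result)td term is identical.
print(ship_sailors_to_delhi(1, 1, 100))    # 500*3 + 100
print(ship_reserves_to_mumbai(1, 1, 100))  # 1000*3 + 100
```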
Ship Whole: Transferring the complete relation




                                 QUERY: The query asks for R ⋈ S


5.Distributed Query Processing
      Ship Whole vs Fetch as needed:
      Fetch as needed results in a high number of messages
      Ship whole results in high amounts of transferred data
 Note: some tuples in Reserves do not join with any tuple in Sailors;
   we could avoid shipping them. This is the idea behind semijoins and
   Bloom joins.
3. Semijoins and Bloomjoins
Semijoin: 3 steps:
1. At MUMBAI, compute the projection of Sailors onto the join column
   (sid) and ship this projection to DELHI.
2. At DELHI, compute the natural join of the projection received from
   MUMBAI with the Reserves relation. The result of this join is called
   the reduction of Reserves with respect to Sailors. Ship the reduction
   of Reserves to MUMBAI.
3. At MUMBAI, compute the join of the reduction of Reserves with Sailors.
Semijoin:
      Semijoin: Requesting all join partners in just one step.




5.Distributed Query Processing
    Bloomjoins: 3 steps:
1.   At MUMBAI, a bit-vector of (some chosen) size k is computed by
     hashing each tuple of Sailors (on the sid field) into the range 0 to
     k − 1, setting bit i to 1 if some tuple hashes to i, and 0 otherwise.
     Ship this bit-vector to DELHI.

2.   At DELHI, the reduction of Reserves is computed by hashing each tuple
     of Reserves (using the sid field) into the range 0 to k − 1, using the
     same hash function used to construct the bit-vector, and discarding
     tuples whose hash value i corresponds to a 0 bit. Ship the reduction
     of Reserves to MUMBAI.

     3.   At MUMBAI, compute the join of the reduction of Reserves with Sailors.




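The three Bloom join steps can be sketched the same way; Python's built-in `hash()` stands in for the agreed hash function, and k = 8 and the data are hypothetical choices.

```python
# Toy Bloom join: ship a k-bit vector instead of the sid projection.
K = 8  # chosen bit-vector size
sailors = [{"sid": 22, "sname": "dustin"}, {"sid": 31, "sname": "lubber"}]
reserves = [{"sid": 22, "bid": 101}, {"sid": 58, "bid": 103}, {"sid": 95, "bid": 102}]

# Step 1 (MUMBAI): build the bit-vector from Sailors and "ship" it.
bits = [0] * K
for s in sailors:
    bits[hash(s["sid"]) % K] = 1

# Step 2 (DELHI): discard Reserves tuples whose hash hits a 0 bit.
# Hash collisions may let through tuples with no join partner (false
# positives), but a matching tuple is never dropped.
reduction = [r for r in reserves if bits[hash(r["sid"]) % K] == 1]

# Step 3 (MUMBAI): join the reduction with Sailors; false positives
# are eliminated here.
result = [{**s, **r} for r in reduction for s in sailors if s["sid"] == r["sid"]]
print(result)
```

The bit-vector is much smaller than the sid projection, at the price of occasionally shipping a non-matching tuple (here sid 95 collides with sid 31 modulo 8 and is shipped, then discarded by the final join).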
Bloom join:
 Also known as bit-vector join.
 Avoids transferring all join attribute values to the other node.
 Instead, a bit-vector B[1…n] is transferred.
 Transformation:
     Choose an appropriate hash function h.
     Apply h to map attribute values into the range [1…n].
     Set the corresponding bits in the bit-vector B[1…n] to 1.




Bloom join:




Cost-Based Query Optimization
      optimizing queries in a distributed database poses the following additional
       challenges:

         Communication costs must be considered. If we have several copies of a
          relation, we must also decide which copy to use.

         If individual sites are run under the control of different DBMSs, the
          autonomy of each site must be respected while doing global query
          planning.




Cost-Based Query Optimization
      Cost-based approach; consider all plans, pick cheapest; similar to
       centralized optimization.

           Difference 1: Communication costs must be considered.
           Difference 2: Local site autonomy must be respected.
           Difference 3: New distributed join methods.


      Query site constructs global plan, with suggested local plans describing
       processing at each site.
      If a site can improve suggested local plan, free to do so.




6.DISTRIBUTED TRANSACTION PROCESSING
 A given transaction is submitted at one site, but it can
   access data at other sites.

      When a transaction is submitted at some site, the transaction
         manager at that site breaks it up into a collection of one or
         more subtransactions that execute at different sites.

     




7.Distributed Concurrency Control
       Lock management can be distributed across sites in many
        ways:
1. Centralized: A single site is in charge of handling lock and
   unlock requests for all objects.
     2. Primary copy: One copy of each object is designated as the
        primary copy.
                 All requests to lock or unlock a copy of this object are handled
                  by the lock manager at the site where the primary copy is stored.

     3.       Fully distributed: Requests to lock or unlock a copy of an
              object stored at a site are handled by the lock manager at the
              site where the copy is stored.
7.Distributed Concurrency Control
      Distributed Deadlock Detection:
      Each site maintains a local waits-for graph.
      A global deadlock might exist even if the local graphs contain no cycles:




      Three solutions:
      Centralized: send all local graphs to one site ;
      Hierarchical: organize sites into a hierarchy and send local graphs to parent
       in the hierarchy;
      Timeout : abort Xact if it waits too long.

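The centralized solution above can be illustrated with a small sketch: each site ships its local waits-for graph to one site, which unions them and runs a standard cycle check. Transaction names and graphs are hypothetical.

```python
# Waits-for graphs as {txn: set of txns it waits for}.
def has_cycle(graph):
    """DFS-based cycle detection in a waits-for graph."""
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in graph if n not in visited)

# Neither local graph has a cycle...
site_a = {"T1": {"T2"}}   # at site A, T1 waits for T2
site_b = {"T2": {"T1"}}   # at site B, T2 waits for T1
print(has_cycle(site_a), has_cycle(site_b))

# ...but the union shipped to the detector site reveals a global deadlock.
merged = {"T1": {"T2"}, "T2": {"T1"}}
print(has_cycle(merged))
```

This is exactly the situation the slide warns about: a global deadlock with cycle-free local graphs.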
 Phantom Deadlocks: delays in propagating local information might
   cause the deadlock detection algorithm to identify "deadlocks" that do
   not really exist. Such situations are called phantom deadlocks.




7.Distributed Recovery
      Recovery in a distributed DBMS is more complicated than in a
          centralized DBMS for the following reasons:


     1.       New kinds of failure can arise:
               failure of communication links.
               failure of a remote site at which a subtransaction is
                executing.
     2. Either all subtransactions of a given transaction must
        commit, or none must commit.
     3. This property must be guaranteed in spite of any
        combination of site and link failures.


7.Distributed Recovery
            Two-Phase Commit (2PC):

 The site at which the Xact originates is the coordinator;
 the other sites at which its subXacts execute are subordinates.
      When the user decides to commit a transaction:
              The commit command is sent to the coordinator for the
               transaction.
              This initiates the 2PC protocol:




7.Distributed Recovery(2pc)
     1.    Coordinator sends prepare msg to each subordinate.

     2.    Subordinate force-writes an abort or prepare log record and then
           sends a no or yes msg to coordinator.

     3.    If coordinator gets all yes votes, force-writes a commit log record and
           sends commit msg to all subs. Else, force-writes abort log rec, and
           sends abort msg.

     4.     Subordinates force-write abort/commit log rec based on msg they
           get, then send ack msg to coordinator.

5.    Coordinator writes end log rec after getting ack msgs from all subs.
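The five steps above can be walked through in a toy single-process sketch. Log records are appended to per-site lists, "force-write" is simulated by appending before the next message is sent, and the site names are hypothetical; real 2PC is of course asynchronous and message-based.

```python
def two_phase_commit(coordinator_log, subordinates):
    """subordinates: list of (name, log, vote_yes) tuples (hypothetical shape)."""
    votes = []
    for _name, log, vote_yes in subordinates:            # 1. prepare msg to each sub
        log.append("prepare" if vote_yes else "abort")   # 2. sub force-writes, then votes
        votes.append(vote_yes)
    decision = "commit" if all(votes) else "abort"       # 3. all-yes => commit
    coordinator_log.append(decision)                     #    coordinator force-writes decision
    for _name, log, _ in subordinates:
        log.append(decision)                             # 4. subs force-write decision, then ack
    coordinator_log.append("end")                        # 5. end record after all acks
    return decision

coord = []
subs = [("S1", [], True), ("S2", [], True)]
print(two_phase_commit(coord, subs))  # both vote yes => commit
print(coord)
```

A single "no" vote in step 2 flips the decision to abort for every site, which is the all-or-nothing property the slide describes.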
TWO-PHASE COMMIT (2PC) - commit




TWO-PHASE COMMIT (2PC) - ABORT




7.Distributed Recovery(3pc)
     Three-Phase Commit
1.   A commit protocol called Three-Phase Commit (3PC) can avoid
     blocking even if the coordinator site fails during recovery.

2.   The basic idea: when the coordinator sends out prepare messages and
     receives yes votes from all subordinates, it sends all sites a
     precommit message, rather than a commit message.

3.   When a sufficient number of acks have been received (more than the
     maximum number of failures that must be handled), the coordinator
     force-writes a commit log record and sends a commit message to all
     subordinates.
Distributed Database
 Advantages:
     Reliability
     Performance
     Growth (incremental)
     Local control
     Transparency
 Disadvantages:
     Complexity of query optimization
     Concurrency control
     Recovery
     Catalog management




Thank you

                       Queries?
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Parallel Database

  • 1. Parallel and Distributed Databases (1. Parallel DB /D.S.Jagli, 2/12/2013)
  • 2. Parallel database: 1. Introduction 2. Architecture for parallel databases 3. Parallel query evaluation 4. Parallelizing individual operations
  • 3. Introduction  What is a centralized database? All the data is maintained at a single site, and the processing of an individual transaction is assumed to be essentially sequential.
  • 4. PARALLEL DBMSs  WHY DO WE NEED THEM? • More and more data! We have databases that hold huge amounts of data, on the order of 10^12 bytes: 10,000,000,000,000 bytes! • Faster and faster access! We have applications that need to process data at very high speeds: tens of thousands of transactions per second! SINGLE-PROCESSOR DBMSs AREN’T UP TO THE JOB!
  • 5. Why parallel access to data? At 10 MB/s it takes 1.2 days to scan 1 terabyte; with 1,000-way parallelism, 1.5 minutes. Parallelism: divide a big problem into many smaller ones to be solved in parallel.
  • 6. Parallel DB  A parallel database system seeks to improve performance by parallelizing operations such as loading data, building indexes, and evaluating queries, using multiple CPUs and disks in parallel.  Motivation for parallel DB:  Parallel machines are becoming quite common and affordable  Prices of microprocessors, memory, and disks have dropped sharply  Databases are growing increasingly large  Large volumes of transaction data are collected and stored for later analysis  Multimedia objects like images are increasingly stored in databases
  • 7. PARALLEL DBMSs  BENEFITS OF A PARALLEL DBMS  Improves throughput. INTERQUERY PARALLELISM: it is possible to process a number of transactions in parallel with each other.  Improves response time. INTRAQUERY PARALLELISM: it is possible to process ‘sub-tasks’ of a transaction in parallel with each other.
  • 8. PARALLEL DBMSs  HOW TO MEASURE THE BENEFITS  Speed-up: adding more resources results in proportionally less running time for a fixed amount of data. 10 seconds to scan a DB of 10,000 records using 1 CPU; 1 second to scan a DB of 10,000 records using 10 CPUs.  Scale-up: if resources are increased in proportion to an increase in data/problem size, the overall time should remain constant. 1 second to scan a DB of 1,000 records using 1 CPU; 1 second to scan a DB of 10,000 records using 10 CPUs.
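The two metrics above can be sketched as simple ratios. The numbers plugged in below are the slide's own examples, not measurements from a real system:

```python
def speed_up(old_time, new_time):
    # Speed-up: same problem size, more resources.
    # Linear (ideal) speed-up with k times the CPUs gives a ratio of k.
    return old_time / new_time

def scale_up(small_time, big_time):
    # Scale-up: resources grown in proportion to problem size.
    # The ideal value is 1.0 (elapsed time stays constant).
    return small_time / big_time

# 1 CPU scans 10,000 records in 10 s; 10 CPUs do it in 1 s.
print(speed_up(10.0, 1.0))   # 10.0, i.e. linear speed-up

# 1 CPU on 1,000 records takes 1 s; 10 CPUs on 10,000 records take 1 s.
print(scale_up(1.0, 1.0))    # 1.0, i.e. linear scale-up
```

Sub-linear behavior shows up as a speed-up below k or a scale-up below 1.0, which is what the graphs on the next slides depict.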
  • 9. Architectures for Parallel Databases  The basic idea behind parallel DB is to carry out evaluation steps in parallel whenever possible.  There are many opportunities for parallelism in an RDBMS.  Three main architectures have been proposed for building parallel DBMSs: 1. Shared memory 2. Shared disk 3. Shared nothing
  • 10. Shared Memory  Advantages: 1. It is closest to a conventional machine and easy to program 2. Communication overhead is low 3. OS services can be leveraged to utilize the additional CPUs  Disadvantages: 1. The shared memory becomes a bottleneck 2. Expensive to build 3. It is less sensitive to partitioning
  • 11. Shared Disk  Advantages: 1. Almost the same as shared memory  Disadvantages: 1. More interference 2. Increased network bandwidth requirements 3. Shared disk is less sensitive to partitioning
  • 12. Shared Nothing  Advantages: 1. It provides linear scale-up and linear speed-up 2. Shared nothing benefits from "good" partitioning 3. Cheap to build  Disadvantages: 1. Hard to program 2. Adding new nodes requires reorganizing the data
  • 13. PARALLEL DBMSs SPEED-UP [Graph: transactions per second vs. number of CPUs (5, 10, 16); linear speed-up (ideal) vs. sub-linear speed-up, with marks at 1000/sec, 1600/sec, and 2000/sec]
  • 14. PARALLEL DBMSs SCALE-UP [Graph: transactions per second vs. number of CPUs and database size (5 CPUs/1 GB, 10 CPUs/2 GB); linear scale-up (ideal) at 1000/sec vs. sub-linear scale-up at 900/sec]
  • 15. PARALLEL QUERY EVALUATION  A relational query execution plan is a graph/tree of relational algebra operators; based on this structure, operators can execute in parallel.
  • 16. Different Types of DBMS Parallelism  Parallel evaluation of a relational query in a DBMS with a shared-nothing architecture: 1. Inter-query parallelism  Multiple queries run on different sites 2. Intra-query parallelism  Parallel execution of a single query across different sites a) Intra-operator parallelism: get all machines working together to compute a given operation (scan, sort, join) b) Inter-operator parallelism: each operator may run concurrently on a different site (exploits pipelining)  To evaluate different operators in parallel, we need to evaluate each operator in the query plan in parallel.
  • 17. Data Partitioning  Types of partitioning: 1. Horizontal partitioning: the tuples of a relation are divided among many disks so that each tuple resides on one disk.  It exploits the I/O bandwidth of the disks by reading and writing them in parallel, and reduces the time required to retrieve relations from disk by partitioning them across multiple disks.  Variants: 1. Range partitioning 2. Hash partitioning 3. Round-robin partitioning 2. Vertical partitioning
  • 18. 1. Range Partitioning  Tuples are sorted (conceptually), and n ranges are chosen for the sort key values so that each range contains roughly the same number of tuples; tuples in range i are assigned to processor i.  E.g.:  sailor_id 1-10 assigned to disk 1  sailor_id 11-20 assigned to disk 2  sailor_id 21-30 assigned to disk 3  Range partitioning can lead to data skew, that is, partitions with widely varying numbers of tuples.
  • 19. 2. Hash Partitioning  A hash function is applied to selected fields of a tuple to determine its processor.  Hash partitioning has the additional virtue that it keeps data evenly distributed even if the data grows and shrinks over time.
  • 20. 3. Round-Robin Partitioning  If there are n processors, the i-th tuple is assigned to processor i mod n.  Round-robin partitioning is suitable for efficiently evaluating queries that access the entire relation.  If only a subset of the tuples is required (e.g., those that satisfy the selection condition age = 20), hash partitioning and range partitioning are better than round-robin partitioning.
  • 21. [Diagram: tuples keyed A...Z spread across five disks under each scheme] Range: good for equijoins, exact-match queries, and range queries. Hash: good for equijoins and exact-match queries. Round robin: good to spread load.
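The three horizontal-partitioning schemes above can be sketched in a few lines. This is an illustrative toy, with partitions as plain Python lists and made-up sailor tuples, not a real storage layer:

```python
def range_partition(tuples, key, boundaries):
    # boundaries must be sorted ascending; boundaries[i] is the
    # inclusive upper bound of partition i, with one extra partition
    # for everything above the last boundary.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        i = sum(1 for b in boundaries if key(t) > b)  # index of t's range
        parts[i].append(t)
    return parts

def hash_partition(tuples, key, n):
    # A hash function on the chosen field picks the processor.
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts

def round_robin_partition(tuples, n):
    # The i-th tuple goes to processor i mod n.
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts

sailors = [(sid, f"name{sid}") for sid in range(1, 31)]  # sids 1..30
by_range = range_partition(sailors, key=lambda t: t[0], boundaries=[10, 20])
by_rr = round_robin_partition(sailors, 3)
print([len(p) for p in by_range])  # [10, 10, 10]
print([len(p) for p in by_rr])     # [10, 10, 10]
```

With a skewed key (say, most sids clustered in 1-10), `range_partition` would produce unequal partitions, which is exactly the data-skew problem slide 18 mentions.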
  • 22. Parallelizing Sequential Operator Evaluation Code 1. An elegant software architecture for parallel DBMSs enables us to readily parallelize existing code for sequentially evaluating a relational operator. 2. The basic idea is to use parallel data streams. 3. Streams are merged as needed to provide the inputs for a relational operator. 4. The output of an operator is split as needed to parallelize subsequent processing. 5. A parallel evaluation plan consists of a dataflow network of relational, merge, and split operators.
  • 23. PARALLELIZING INDIVIDUAL OPERATIONS  How can various operations be implemented in parallel in a shared-nothing architecture?  Techniques: 1. Bulk loading and scanning 2. Sorting 3. Joins
  • 24. 1. Bulk Loading and Scanning  Scanning a relation: pages can be read in parallel while scanning a relation, and the retrieved tuples can then be merged, if the relation is partitioned across several disks.  Bulk loading: if a relation has associated indexes, any sorting of data entries required for building the indexes during bulk loading can also be done in parallel.
  • 25. 2. Parallel Sorting  Steps: 1. First redistribute all tuples in the relation using range partitioning 2. Each processor then sorts the tuples assigned to it 3. The entire sorted relation can be retrieved by visiting the processors in an order corresponding to the ranges assigned to them  Problem: data skew  Solution: "sample" the data at the outset to determine good range partition points.  A particularly important application of parallel sorting is sorting the data entries in tree-structured indexes.
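The three parallel-sorting steps above can be sketched as follows, with processors simulated by plain lists (a sequential stand-in, purely to show why concatenating range partitions in range order yields a globally sorted result):

```python
def parallel_sort(tuples, boundaries):
    # Step 1: redistribute tuples by range partitioning on the sort key.
    # boundaries must be sorted ascending.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        i = sum(1 for b in boundaries if t > b)  # index of t's range
        parts[i].append(t)
    # Step 2: each "processor" sorts its own partition locally.
    sorted_parts = [sorted(p) for p in parts]
    # Step 3: visit processors in range order to read the full result.
    out = []
    for p in sorted_parts:
        out.extend(p)
    return out

data = [17, 3, 25, 8, 12, 29, 1, 20]
print(parallel_sort(data, boundaries=[10, 20]))
# [1, 3, 8, 12, 17, 20, 25, 29]
```

No merge step is needed across partitions: every value in partition i is at most every value in partition i+1, so concatenation preserves order. That property is exactly why range partitioning, not hash or round-robin, is used for sorting.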
  • 26. 3. Parallel Join 1. The basic idea for joining A and B in parallel is to decompose the join into a collection of k smaller joins by partitioning. 2. By using the same partitioning function for both A and B, we ensure that the union of the k smaller joins computes the join of A and B.  Hash join  Sort-merge join
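A minimal sketch of this decomposition, hash-partitioning both relations with the same function and joining matching partitions independently (each of the k small joins could run on its own node; here they run in a loop, and the sample tuples are made up):

```python
def partitioned_join(A, B, key_a, key_b, k):
    # Partition both inputs with the SAME function so that tuples that
    # can join always land in the same partition number.
    parts_a = [[] for _ in range(k)]
    parts_b = [[] for _ in range(k)]
    for a in A:
        parts_a[hash(key_a(a)) % k].append(a)
    for b in B:
        parts_b[hash(key_b(b)) % k].append(b)
    result = []
    for i in range(k):  # the k small joins are independent of each other
        lookup = {}
        for a in parts_a[i]:
            lookup.setdefault(key_a(a), []).append(a)
        for b in parts_b[i]:
            for a in lookup.get(key_b(b), []):
                result.append((a, b))
    return result

sailors = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
reserves = [(1, 101), (3, 102), (3, 103)]
out = partitioned_join(sailors, reserves, lambda s: s[0], lambda r: r[0], k=4)
print(sorted(out))  # three matching (sailor, reserve) pairs
```

The union of the per-partition joins equals the full join precisely because no joining pair is ever split across partitions.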
  • 27. Sort-Merge Join  Partition A and B by dividing the range of the join attribute into k disjoint subranges and placing A and B tuples into partitions according to the subrange to which their values belong.  Each processor then carries out a local join.  In this case the number of partitions k is chosen to be equal to the number of processors n.  The output of the join process may be split into several data streams; the advantage is that the output is available in sorted order.
  • 28. Dataflow Network of Operators for Parallel Join  Good use of split/merge makes it easier to build parallel versions of sequential join code.
  • 29. DDBMS I. Introduction to DDBMS II. Architecture of DDBs III. Storing data in DDBs IV. Distributed catalog management V. Distributed query processing VI. Transaction processing VII. Distributed concurrency control and recovery
  • 30. I. Introduction to DDBMS  Data in a distributed database system is stored across several sites.  Each site is typically managed by a DBMS that can run independently of the other sites, which co-operate in a transparent manner.  Transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database.  There should be 'location independence': as the user is unaware of where the data is located, it is possible to move the data from one physical location to another without affecting the user.
  • 31. DDBMS properties  Distributed data independence: users should be able to ask queries without specifying where the referenced relations, or copies or fragments of the relations, are located.  Distributed transaction atomicity: users should be able to write transactions that access and update data at several sites just as they would write transactions over purely local data.
  • 32. DISTRIBUTED PROCESSING ARCHITECTURE [Diagram: client LANs at Delhi, Mumbai, Hyderabad, and Pune all connected to a single central DBMS]
  • 33. DISTRIBUTED DATABASE ARCHITECTURE [Diagram: client LANs at Delhi, Mumbai, Hyderabad, and Pune, each site with its own DBMS]
  • 34. Distributed database  Communication network, with a DBMS and data at each node  Users are unaware of the distribution of the data (location transparency)
  • 35. Types of Distributed Databases  Homogeneous distributed database system: data is distributed but all servers run the same DBMS software.  Heterogeneous distributed database: different sites run under the control of different DBMSs, essentially autonomously, and are connected to enable access to data from multiple sites.  The key to building heterogeneous systems is to have well-accepted standards for gateway protocols.  A gateway protocol is an API that exposes DBMS functionality to external applications.
  • 36. DDBMS I. Introduction to DDBMS II. Architecture of DDBs III. Storing data in DDBs IV. Distributed catalog management V. Distributed query processing VI. Transaction processing VII. Distributed concurrency control and recovery
  • 37. 2. DISTRIBUTED DBMS ARCHITECTURES 1. Client-server systems 2. Collaborating server systems 3. Middleware systems
  • 38. 1. Client-Server Systems 1. A client-server system has one or more client processes and one or more server processes. 2. A client process can send a query to any one server process. 3. Clients are responsible for user-interface issues. 4. Servers manage data and execute transactions. 5. A client process could run on a personal computer and send queries to a server running on a mainframe. 6. The client-server architecture does not allow a single query to span multiple servers.
  • 39. [Diagram: dumb terminals connected over a specialised network to a mainframe computer holding the presentation logic, business logic, and data logic]
  • 40. 2. Collaborating Server Systems 1. With client-server, the client process would have to be capable of breaking such a query into appropriate subqueries. 2. A collaborating server system instead has a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers. 3. When a server receives a query that requires access to data at other servers, it generates appropriate subqueries to be executed by the other servers 4. and puts the results together to compute answers to the original query.
  • 41. M:N CLIENT/SERVER DBMS ARCHITECTURE [Diagram: clients #1-#3 connected to servers #1 and #2, each server with its own database] NOT TRANSPARENT!
  • 42. 3. Middleware Systems  The middleware architecture is designed to allow a single query to span multiple servers, without requiring all database servers to be capable of managing such multisite execution strategies.  It is especially attractive when trying to integrate several legacy systems, whose basic capabilities cannot be extended.
  • 43. DDBMS I. Introduction to DDBMS II. Architecture of DDBs III. Storing data in DDBs IV. Distributed catalog management V. Distributed query processing VI. Transaction processing VII. Distributed concurrency control and recovery
  • 44. 3. Storing Data in DDBs  In a distributed DBMS, relations are stored across several sites.  Accessing a relation that is stored at a remote site incurs message-passing costs.  A single relation may be partitioned or fragmented across several sites.
  • 45. Types of Fragmentation  Horizontal fragmentation: the union of the horizontal fragments must be equal to the original relation; fragments are usually also required to be disjoint.  Vertical fragmentation: the collection of vertical fragments should be a lossless-join decomposition.
  • 47. Replication  Replication means that we store several copies of a relation or relation fragment.  The motivation for replication is twofold: 1. Increased availability of data 2. Faster query evaluation  Two kinds of replication: 1. Synchronous replication 2. Asynchronous replication
  • 48. DDBMS I. Introduction to DDBMS II. Architecture of DDBs III. Storing data in DDBs IV. Distributed catalog management V. Distributed query processing VI. Transaction processing VII. Distributed concurrency control and recovery
  • 49. 4. Distributed Catalog Management 1. Naming objects  If a relation is fragmented and replicated, we must be able to uniquely identify each replica of each fragment: 1. A local name field 2. A birth site field 2. Catalog structure  A centralized system catalog can be used, but it is vulnerable to failure of the site containing the catalog.  An alternative is to maintain a copy of a global system catalog, but this compromises site autonomy.
  • 50. 4. Distributed Catalog Management  A better approach:  Each site maintains a local catalog that describes all copies of data stored at that site.  In addition, the catalog at the birth site for a relation is responsible for keeping track of where replicas of the relation are stored.
  • 51. DDBMS I. Introduction to DDBMS II. Architecture of DDBs III. Storing data in DDBs IV. Distributed catalog management V. Distributed query processing VI. Transaction processing VII. Distributed concurrency control and recovery
  • 52. 5. Distributed Query Processing  Distributed query processing: transform a high-level query (of relational calculus/SQL) on a distributed database (i.e., a set of global relations) into an equivalent and efficient lower-level query (of relational algebra) on relation fragments.  Distributed query processing is more complex because of:  Fragmentation/replication of relations  Additional communication costs  Parallel execution
  • 53. Distributed Query Processing Steps
  • 54. 5. Distributed Query Processing  Sailors(sid: integer, sname: string, rating: integer, age: real)  Reserves(sid: integer, bid: integer, day: date, rname: string)  Assume the Reserves and Sailors relations, where:  each tuple of Reserves is 40 bytes long  a page can hold 100 Reserves tuples  there are 1,000 pages of such tuples  each tuple of Sailors is 50 bytes long  a page can hold 80 Sailors tuples  there are 500 pages of such tuples  How do we estimate the cost?
  • 55. 5. Distributed Query Processing  Criteria for measuring the cost of a query evaluation strategy:  For centralized DBMSs: number of disk accesses (# blocks read/written)  For distributed databases, additionally:  the cost of data transmission over the network  the potential gain in performance from having several sites process parts of the query in parallel
  • 56. 5. Distributed Query Processing  To estimate the cost of an evaluation strategy, in addition to counting the number of page I/Os, we must count the number of pages that are shipped; this is a communication cost.  Communication cost is a significant component of overall cost in a distributed database. 1. Nonjoin queries in a distributed DBMS 2. Joins in a distributed DBMS 3. Cost-based query optimization
  • 57. 5. Distributed Query Processing  1. Nonjoin Queries in a Distributed DBMS  Even simple operations such as scanning a relation, selection, and projection are affected by fragmentation and replication. SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7  Suppose that the Sailors relation is horizontally fragmented, with all tuples having a rating less than 5 at Mumbai and all tuples having a rating of 5 or higher at Delhi. The DBMS must answer this query by evaluating it at both sites and taking the union of the answers.
  • 58. 5. Distributed Query Processing  Eg 1: SELECT avg(age) FROM Sailors S WHERE S.rating > 3 AND S.rating < 7  Taking the union of the answers is not enough.  Eg 2: SELECT S.age FROM Sailors S WHERE S.rating > 6  Given the fragmentation, all qualifying tuples are at one site, so only the Delhi fragment needs to be examined.  Eg 3: suppose that the Sailors relation is vertically fragmented, with the sid and rating fields at MUMBAI and the sname and age fields at DELHI.  This vertical fragmentation would be a lossy decomposition.
  • 59. 5. Distributed Query Processing  Eg 4: the entire Sailors relation is stored at both the MUMBAI and DELHI sites.  Where should the query be executed?  2. Joins in a Distributed DBMS  Eg: the Sailors relation is stored at MUMBAI, and the Reserves relation is stored at DELHI.  Joins of relations at different sites can be very expensive.  JOIN STRATEGIES: 1. Fetch as needed 2. Ship whole 3. Semijoins 4. Bloomjoins  Which strategy is better?
  • 60. 5. Distributed Query Processing  1. Fetch As Needed  Page-oriented nested loops join: for each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in the R-page and s is in the S-page.  We could do a page-oriented nested loops join in MUMBAI with Sailors as the outer relation, and for each Sailors page, fetch all Reserves pages from DELHI.  If we cache the fetched Reserves pages in MUMBAI until the join is complete, pages are fetched only once.
  • 61. Fetch As Needed: transferring the relation piecewise  QUERY: the query asks for R ⋈ S
  • 62. 5. Distributed Query Processing  Assume the Reserves and Sailors relations as before:  each tuple of Reserves is 40 bytes long; a page holds 100 Reserves tuples; 1,000 pages of such tuples  each tuple of Sailors is 50 bytes long; a page holds 80 Sailors tuples; 500 pages of such tuples  td is the cost to read/write a page; ts is the cost to ship a page.  The cost is 500td to scan Sailors, plus, for each Sailors page, the cost of scanning and shipping all of Reserves, which is 1000(td + ts).  The total cost is 500td + 500,000(td + ts).
  • 63. 5. Distributed Query Processing  This cost depends on the size of the result.  The cost of shipping the result can be greater than the cost of shipping both Sailors and Reserves to the query site.  2. Ship to One Site (sort-merge join)  The cost of scanning and shipping Sailors, saving it at DELHI, and then doing the join at DELHI is 500(2td + ts) + (result)td.  The cost of shipping Reserves and doing the join at MUMBAI is 1000(2td + ts) + (result)td.
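Plugging the formulas from these two slides into code makes the trade-off concrete. The relative values of td and ts below, and the assumed result size, are illustrative choices rather than figures from the slides:

```python
def fetch_as_needed(sailors_pages, reserves_pages, td, ts):
    # Scan Sailors once; for each Sailors page, scan and ship all of
    # Reserves: 500td + 500,000(td + ts) with the slide's numbers.
    return sailors_pages * td + sailors_pages * reserves_pages * (td + ts)

def ship_whole_to_delhi(sailors_pages, result_pages, td, ts):
    # Scan and ship Sailors, save it at DELHI, then write the join
    # result: 500(2td + ts) + (result)td with the slide's numbers.
    return sailors_pages * (2 * td + ts) + result_pages * td

td, ts = 1, 10            # assumed relative costs (shipping 10x a page I/O)
result_pages = 1000       # assumed size of the join result

print(fetch_as_needed(500, 1000, td, ts))              # 5500500
print(ship_whole_to_delhi(500, result_pages, td, ts))  # 7000
```

With these assumptions, shipping Sailors whole is orders of magnitude cheaper than fetching Reserves repeatedly, which is why the slides go on to ask whether we can avoid shipping even the tuples that will never join.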
  • 64. Ship Whole: transferring the complete relation  QUERY: the query asks for R ⋈ S
  • 65. 5. Distributed Query Processing  Ship whole vs fetch as needed:  Fetch as needed results in a high number of messages  Ship whole results in high amounts of transferred data  Note: some tuples in Reserves do not join with any tuple in Sailors; we could avoid shipping them.  3. Semijoins and Bloomjoins  Semijoin, in 3 steps: 1. At MUMBAI, compute the projection of Sailors onto the join columns (i.e., sid) and ship this projection to DELHI. 2. At DELHI, compute the natural join of the projection received from the first site with the Reserves relation. The result of this join is called the reduction of Reserves with respect to Sailors; ship the reduction of Reserves to MUMBAI. 3. At MUMBAI, compute the join of the reduction of Reserves with Sailors.
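The three semijoin steps above can be sketched with the two sites simulated as plain Python lists (the sample tuples are made up; sid is the join column):

```python
sailors = [(1, "Ann", 7), (2, "Bob", 9), (3, "Cy", 2)]   # stored "at MUMBAI"
reserves = [(1, 101), (1, 102), (9, 103), (9, 104)]       # stored "at DELHI"

# Step 1 (MUMBAI): project Sailors onto the join column sid,
# and ship only this (small) projection to DELHI.
sids = {s[0] for s in sailors}

# Step 2 (DELHI): the reduction of Reserves with respect to Sailors is
# the set of Reserves tuples that have a join partner; ship it back.
reduction = [r for r in reserves if r[0] in sids]

# Step 3 (MUMBAI): join the reduction with Sailors as usual.
result = [(s, r) for s in sailors for r in reduction if s[0] == r[0]]

print(len(reduction), len(result))  # 2 2
```

Only 2 of the 4 Reserves tuples cross the network, which is the whole point: the cost of shipping the projection buys a smaller shipment of Reserves.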
  • 66. Semijoin  Semijoin: requesting all join partners in just one step.
  • 67. 5. Distributed Query Processing  Bloomjoin, in 3 steps: 1. At MUMBAI, a bit-vector of (some chosen) size k is computed by hashing each tuple of Sailors into the range 0 to k-1 and setting bit i to 1 if some tuple hashes to i, and 0 otherwise; then ship this bit-vector to DELHI. 2. At DELHI, the reduction of Reserves is computed by hashing each tuple of Reserves (using the sid field) into the range 0 to k-1, using the same hash function used to construct the bit-vector, and discarding tuples whose hash value i corresponds to a 0 bit; ship the reduction of Reserves to MUMBAI. 3. At MUMBAI, compute the join of the reduction of Reserves with Sailors.
  • 68. Bloom Join  Also known as bit-vector join  Avoids transferring all join attribute values to the other node  Instead transfers a bit-vector B[1..n]  Transformation:  choose an appropriate hash function h  apply h to transform attribute values to the range [1..n]  set the corresponding bits in the bit-vector B[1..n] to 1
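A minimal sketch of the bloom-join steps, again with sites as plain lists and made-up tuples. K and the hash function are arbitrary choices; note that the bit-vector can produce false positives (tuples kept that will not join) but never false negatives:

```python
K = 8                     # chosen bit-vector size
h = lambda v: v % K       # the shared hash function, agreed by both sites

sailors = [(1, "Ann"), (2, "Bob"), (11, "Cy")]   # at MUMBAI
reserves = [(1, 101), (9, 102), (4, 103)]        # at DELHI

# Step 1 (MUMBAI): hash each Sailors sid and set its bit; ship K bits
# instead of the full list of sids.
bits = [0] * K
for s in sailors:
    bits[h(s[0])] = 1

# Step 2 (DELHI): keep only Reserves tuples whose bit is set; ship them.
reduction = [r for r in reserves if bits[h(r[0])]]

# Step 3 (MUMBAI): join the reduction with Sailors as usual.
result = [(s, r) for s in sailors for r in reduction if s[0] == r[0]]

print(reduction)  # (9, 102) survives as a false positive: 9 % 8 == 1 % 8
print(result)     # only the genuine match with sid 1 remains
```

Compared with the semijoin, the bit-vector is cheaper to ship than a projection of all sid values, at the price of occasionally shipping a few non-joining tuples; the final join at MUMBAI filters them out.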
  • 69. Bloom join [Figure]
  • 70. Cost-Based Query Optimization  Optimizing queries in a distributed database poses the following additional challenges:  Communication costs must be considered. If we have several copies of a relation, we must also decide which copy to use.  If individual sites are run under the control of different DBMSs, the autonomy of each site must be respected while doing global query planning.
  • 71. Cost-Based Query Optimization  Cost-based approach: consider all plans and pick the cheapest, similar to centralized optimization.  Difference 1: communication costs must be considered.  Difference 2: local site autonomy must be respected.  Difference 3: new distributed join methods.  The query site constructs a global plan, with suggested local plans describing the processing at each site.  If a site can improve its suggested local plan, it is free to do so.
  • 72. 6. DISTRIBUTED TRANSACTION PROCESSING  A given transaction is submitted at some one site, but it can access data at other sites.  When a transaction is submitted at some site, the transaction manager at that site breaks it up into a collection of one or more subtransactions that execute at different sites.
  • 73. 7. Distributed Concurrency Control  Lock management can be distributed across sites in many ways: 1. Centralized: a single site is in charge of handling lock and unlock requests for all objects. 2. Primary copy: one copy of each object is designated as the primary copy; all requests to lock or unlock a copy of this object are handled by the lock manager at the site where the primary copy is stored. 3. Fully distributed: requests to lock or unlock a copy of an object stored at a site are handled by the lock manager at the site where the copy is stored.
• 74. 7. Distributed Concurrency Control
 Distributed deadlock detection:
 Each site maintains a local waits-for graph.
 A global deadlock might exist even if the local graphs contain no cycles.
 Three solutions:
 Centralized: send all local graphs to one site.
 Hierarchical: organize sites into a hierarchy and send local graphs to the parent in the hierarchy.
 Timeout: abort a transaction if it waits too long.
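The centralized scheme, and the key fact that a global cycle can hide across cycle-free local graphs, can be shown in a short sketch (graph representation and helper names are mine, chosen for brevity):

```python
# Centralized distributed deadlock detection, sketched: each site ships
# its local waits-for graph (dict: transaction -> set of transactions it
# waits for) to one site, which unions them and checks for a cycle.

def union_graphs(local_graphs):
    g = {}
    for lg in local_graphs:
        for txn, waits in lg.items():
            g.setdefault(txn, set()).update(waits)
    return g

def has_cycle(g):
    """Standard DFS cycle detection on a directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in g}
    def dfs(t):
        color[t] = GREY
        for u in g.get(t, ()):
            if color.get(u, WHITE) == GREY:
                return True                      # back edge => cycle
            if color.get(u, WHITE) == WHITE and u in g and dfs(u):
                return True
        color[t] = BLACK
        return False
    return any(color[t] == WHITE and dfs(t) for t in list(g))

# Site A: T1 waits for T2.  Site B: T2 waits for T1.
# Neither local graph has a cycle, but the global union does.
site_a = {'T1': {'T2'}}
site_b = {'T2': {'T1'}}
```

Here `has_cycle(site_a)` and `has_cycle(site_b)` are both false, yet `has_cycle(union_graphs([site_a, site_b]))` is true: exactly the situation the slide describes.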
• 75. Phantom Deadlocks
 Delays in propagating local information might cause the deadlock detection algorithm to identify "deadlocks" that do not really exist. Such situations are called phantom deadlocks.
• 76. 7. Distributed Recovery
 Recovery in a distributed DBMS is more complicated than in a centralized DBMS for the following reasons:
 1. New kinds of failure can arise: failure of communication links, and failure of a remote site at which a subtransaction is executing.
 2. Either all subtransactions of a given transaction must commit, or none must commit.
 3. This property must be guaranteed in spite of any combination of site and link failures.
• 77. 7. Distributed Recovery
 Two-Phase Commit (2PC):
 The site at which the transaction (Xact) originates is the coordinator; the other sites at which its subtransactions execute are subordinates.
 When the user decides to commit a transaction, the commit command is sent to the coordinator, which initiates the 2PC protocol:
• 78. 7. Distributed Recovery (2PC)
 1. The coordinator sends a prepare message to each subordinate.
 2. Each subordinate force-writes an abort or prepare log record and then sends a no or yes message to the coordinator.
 3. If the coordinator gets all yes votes, it force-writes a commit log record and sends a commit message to all subordinates; otherwise, it force-writes an abort log record and sends an abort message.
 4. Subordinates force-write an abort/commit log record based on the message they receive, then send an ack message to the coordinator.
 5. The coordinator writes an end log record after getting ack messages from all subordinates.
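The five steps above can be traced from the coordinator's side in a toy sketch. Subordinates are modeled as callables and the log as a plain list; real 2PC needs force-writes to stable storage, timeouts, and recovery handling, none of which is shown here.

```python
# Toy trace of the 2PC message flow from the coordinator's side.
# A subordinate is a callable: it answers 'prepare' with its vote
# ('yes'/'no') and answers the final decision with 'ack'.
# `log` is a list standing in for the coordinator's (force-written) log.

def two_phase_commit(subordinates, log):
    # Phase 1: send prepare to each subordinate, collect votes (steps 1-2).
    votes = [sub('prepare') for sub in subordinates]
    # Step 3: all yes -> log commit; any no -> log abort.
    decision = 'commit' if all(v == 'yes' for v in votes) else 'abort'
    log.append(decision)                     # stands in for a force-write
    # Phase 2: send the decision, wait for acks (step 4).
    acks = [sub(decision) for sub in subordinates]
    if all(a == 'ack' for a in acks):
        log.append('end')                    # step 5: end record after all acks
    return decision

def make_sub(vote):
    """Build a subordinate that votes `vote` and acks any decision."""
    def sub(msg):
        return vote if msg == 'prepare' else 'ack'
    return sub
```

With two 'yes' voters the coordinator's log ends up as ['commit', 'end']; a single 'no' vote turns it into ['abort', 'end'], matching the unanimity rule in step 3.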
• 79. Two-Phase Commit (2PC): commit case (diagram)
• 80. Two-Phase Commit (2PC): abort case (diagram)
• 81. 7. Distributed Recovery (3PC)
 Three-Phase Commit:
 1. A commit protocol called Three-Phase Commit (3PC) can avoid blocking even if the coordinator site fails during recovery.
 2. The basic idea: when the coordinator sends out prepare messages and receives yes votes from all subordinates, it sends all sites a precommit message rather than a commit message.
 3. When a sufficient number of acks have been received (more than the maximum number of failures that must be handled), the coordinator force-writes a commit log record and sends a commit message to all subordinates.
• 82. Distributed Database
 Advantages: reliability, performance, growth (incremental), local control, transparency.
 Disadvantages: complexity of query optimization, concurrency control, recovery, catalog management.
• 83. Thank you. Queries?