Sunday, April 25, 2010

NEW DISTRIBUTED QUERY OPTIMIZATION TECHNIQUES

ABSTRACT:
TOPIC: new query processing and optimization techniques in distributed database
Now-a-days data is distributed across the networks to make total world as a global village. Distributed database management systems (DDBMS) are amongst the most important and successful software developments in this decade. They are enabling computing power and data to be placed within the user environment close to the point of user activities. The performance efficiency of DDBMS is deeply related to the query processing and optimization strategies involving data transmission over different nodes through the network. Most real-world data is not well structured. Today's databases typically contain much non-structured data such as text, images, video, and audio, often distributed across computer networks. This situation demands new query processing, optimization techniques in distributed database environment. These techniques will provide efficient performance in optimization of query processing strategies in a distributed databases environment. this paper mainly focuses distributed query optimization problem
The main contents of this paper are:
• Introduction to Distributed database, Query processing and Query optimization strategies.
• Distributed Query Processing Methodology.
• Distributed Query Optimization.
• New query optimization techniques in distributed database.
• Distributed Query Optimization problems and some solutions.
• Advantages of query optimization techniques in distributed database.
• Conclusion.
Introduction:
Distributed database:
Distributed database is a database that is under the control of a central database management system (DBMS) in which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. Collections of data (e.g. in a database) can be distributed across multiple physical locations.

Query processing :
Query processing is defined as the activities involved in parsing, validating, optimizing and executing a query.
The main aim of query processing is Transform query written in high-level language (e.g. SQL), into correct and efficient execution strategy expressed in low-level language (implementing Relational Algebra) and to find information in one or more databases and deliver it to the user quickly and efficiently.
High level user query -> Query Processor ->low-level data manipulation commands
Query optimization :
Query optimization is defined as the activity of choosing an efficient execution strategy for processing a query. Query optimization is a part of query processing.
The main aims of query optimization are to choose a transformation that minimizes resource usage, Reduce total execution time of query and also reduce response time of query
Distributed Query Processing Methodology:


Distributed query processing contains four stages which are query decomposition, data localization, global optimization and local optimization.
• Query decomposition: in this stage we are giving Calculus Query as an input and we are getting output as Algebraic Query. This stage is again divided in four stages they are Normalization, Analysis, Simplification and Restructuring
Data localization: in this stage Algebraic query on distributed relations is input and fragment query is output. In this stage fragment involvement is determined.
Global optimization: in this stage Fragment Query is input and optimized fragment query is output. Finding best global schedule is done in this stage.
Local optimization: Best global execution schedule is input and localized optimization queries are output in this stage. it contain two sub stages they are Select the best access path ,Use the centralized optimization techniques.
Distributed Query Optimization:
Distributed query optimization is defined as finding efficient execution strategy path in distributed networks.
Query optimization is difficult in distributed environment.
There are three components of distributed query optimization they are Access Method, Join Criteria, and Transmission Costs.
• Access Method: The methods which are used to access data from distributed environment like hashing, indexing etc.
• Join Criteria: In distributed database data is presented in different sites. Join criteria is used to join the different sites to get optimized result.
Transmission Costs: If data from multiple sites must be joined to satisfy a single query, then the cost of transmitting the results from intermediate steps needs to be factored into the equation. At times, it may be more cost effective simply to ship entire tables across the network to enable processing to occur at a single site, thereby reducing overall transmission costs. This component of query optimization is an issue only in a distributed environment.
There are many distributed query optimization issues some of them are types of optimizers, optimization granularity, network topologies and optimization timing.
AN OPTIMIZATION EXAMPLE
In order to understand distributed query optimization more fully, let’s take a look at an example of a query accessing tables in multiple locations. Consider the ramifications of coding a program to simply retrieve a list of all teachers who have taught physics to seniors. Furthermore, assume that the COURSE table and the ENROLLMENT table exist at Site 1; the STUDENT table exists at Site 2.If either all of the tables existed at a single site, or the DBMS supported distributed multi-site requests. However, if the DMBS can not perform (or optimize) distributed multi-site requests, programmatic optimization must be performed. There are at least six different ways to go about optimizing this three-table join.
Option 1: Start with Site 1 and join COURSE and ENROLLMENT, selecting only physics courses. For each qualifying row, move it to Site2 to be joined with STUDENT to see if any are seniors.
Option 2: Start with Site 1 and join COURSE and ENROLLMENT, selecting only physics courses, and move the entire result set to Site 2 to be joined with STUDENT, checking for senior students only.
Option 3: Start with Site 2 and select only seniors from STUDENT. For each of these examine the join of COURSE and ENROLLMENT at Site 1 for physics classes.
Option 4: Start with Site 2 and select only seniors from STUDENT at Site 2, and move the entire result set to Site 1 to be joined with COURSE and ENROLLMENT, checking for physics classes only.
Option 5: Move the COURSE and ENROLLMENT tables to Site 2 and proceed with a local three-table join.
Option 6: Move the STUDENT to Site 1 and proceed with a local three-table join.
Which of these six options will perform the best? Unfortunately, the only correct answer is "It depends." The optimal choice will depend upon:
1. the size of the tables;
2.the size of the result sets — that is, the number of qualifying rows
and their length in bytes; and
3.the efficiency of the network.
New query optimization techniques in distributed database:
• Cost – based query optimization:
Objective of Cost-based query optimization is estimate the cost of different equivalent query expressions and chose the execution plan with the lowest cost.
Cost based query optimization mainly depends on two factors they are solution space and cost function.
Solution space: this is depends on the set of equivalent algebraic expressions.
Cost function: cost function is equivalent to summation of i/o cost, CPU cost and communication cost. it also depends on different distributed environments.
By considering these factors cost based query optimization is processed in distributed environment.
Heuristic – based query optimization :
Heuristic based query optimization process involve following steps :
 Perform Selection operations as early as possible.
 Combine Cartesian product with subsequent selection whose predicate represents join condition into a Join operation.
 Use associatively of binary operations to rearrange leaf nodes so leaf nodes with most restrictive Selection operations executed first.
 Perform Projections operations as early as possible.
 Eliminate duplicate computations.
It is mainly used to minimize cost of selecting sites for multi join operations.
Rank-Aware Query Optimization :
Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. .
Query Optimization problems and some solutions in distributed database:
Stochastic query optimization problem for multiple join:
The model of three joins stored at two sites leads to a nonlinear programming problem, which has an analytical solution. The model with four sites leads to a special kind of nonlinear optimization problem (P).This problem is known as stochastic query optimization problem for multiple join. This problem can not be solved analytically. It is proved that problem (P) has at least one solution and two new methods are presented for solving the problem. An ad hoc constructive model and a new evolutionary technique is used for solving problem (P). Results obtained by the two considered optimization approaches are compared.
• Problem of optimizing queries that involve set operations:
The problem of optimizing queries that involves set operations (set queries) in a distributed relational database system. A particular emphasis is put on the optimization of such queries in horizontally partitioned database systems. A mathematical programming model of the set query problem is developed and its NP-completeness is proved. Solution procedures are proposed and computational results presented. One of the main results of the computational experiments is that, for many queries, the solution procedures are not sensitive to errors in estimating the size of results of set operations.
• Stochastic optimization problem for multiple queries:
Many algorithms have been devised for minimizing the costs associated with obtaining the answer to a single, isolated query in a distributed database system. However, if more than one query may be processed by the system at the same time and if the arrival times of the queries are unknown, the determination of optimal query-processing strategies becomes a stochastic optimization problem. In order to cope with such problems, a theoretical state-transition model is presented that treats the system as one operating under a stochastic load. Query-processing strategies may then be distributed over the processors of a network as probability distributions, in a manner which accommodates many queries over time. It is then shown that the model leads to the determination of optimal query-processing strategies as the solution of mathematical programming problems, and analytical results for several examples are presented. Furthermore, a divide-and-conquer approach is introduced for decomposing stochastic query optimization problems into distinct sub problems for processing queries sequentially and in parallel.
• Sum product optimization problem:
most distributed query optimization problems can be transformed into an optimization problem comprising a set of binary decisions, termed Sum Product Optimization (SPO) problem. We first prove SPO is NP-hard in light of the NP-completeness of a well-known problem, Knapsack (KNAP). Then, using this result as a basis, we prove that five classes of distributed query optimization problems, which cover the majority of distributed query optimization problems previously studied in the literature, are NP-hard by polynomials reducing SPO to each of them. The detail for each problem transformation is derived. We not only prove the conjecture that many prior studies relied upon, but also provide a frame work for future related studies.
Advantages of distributed database:

• Reflects organizational structure — database fragments are located in the departments they relate to.
• Local autonomy — a department can control the data about them (as they are the ones familiar with it.)
• Improved availability — a fault in one database system will only affect one fragment, instead of the entire database.
• Improved performance — data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load on the databases to be balanced among servers. (A high load on one module of the database won't affect other modules of the database in a distributed database.)
• Economics — it costs less to create a network of smaller computers with the power of a single large computer.
• Modularity — systems can be modified, added and removed from the distributed database without affecting other modules (systems).
Advantages of Distributed query optimization:

 Distributed Query optimization techniques provide exact results in distributed environment.
 These techniques provide efficient performance in different distributed networks.
 In internet these techniques helps to search exact information and extract the required one.

Conclusion:

Most real-world data is not well structured. Today's databases typically contain much non-structured data such as text, images, video, and audio, often distributed across computer networks. To process these kinds of data and optimize queries on this data requires these distributed query optimization techniques.

No comments:

Post a Comment