Friday, August 19, 2011

Study on Index Selection Problem






A STUDY ON INDEX SELECTION PROBLEM

 

Abstract:

 

  This paper is an attempt to understand a new methodology for collecting usage statistics at run time and for extending the optimizer to estimate query execution costs for alternative index configurations. The methodology assists the database administrator in designing an index configuration for a relational database system. Defining the workload specification required by existing index design tools can be very complex for a large integrated database system, so the workload statistics are derived automatically and then used to efficiently compute an index configuration. The paper focuses on the implementation of index recommendation and the user interface, and provides measurements of the quality of the recommended indexes.

 

1. Introduction

Relational database management systems (RDBMS) are still the most popular database systems today, and they are likely to dominate the commercial sector for years to come, particularly for business applications. Relational databases use indices to provide fast access to data. The presence of an index reduces the search time for indexed data items but also complicates update operations, since the tuples as well as the indices must be updated.

The performance of queries in an RDBMS has always been very sensitive to the indexes that exist on the tables in a database. The challenge is to select the indexes that would best serve a particular workload of queries. An index may have multiple columns as key columns, and the ordering of those columns is significant. Because a real application can have tens of thousands of tables, each table can have hundreds of columns, and a typical workload can have thousands of queries, the number of possible indexes to consider is staggering. Finding the set of indexes that optimizes a workload of complex, multi-table queries of varying importance, subject to resource constraints, is a daunting combinatorial challenge. Hence there is a tradeoff involved in selecting indices, and indexing every column is seldom a good idea. This tradeoff decision will be referred to as the index selection problem (ISP).

A relational database consists of many stored relations, and each stored relation can have many secondary indices. The index set of a (relational) database is the set of indices that are selected for the database. A cost function estimates the cost of processing a workload for a database with a given index set. The cost of processing a workload depends on many factors, such as storage costs, number of page accesses, processor time, etc. We also assume that the cardinality of the relations remains constant. To be more precise, the frequency of tuple insertions and tuple deletions is such that the total number of tuples of each relation remains constant between two consecutive selections of index sets.

Workload on a relation:

We distinguish four possible operations in the workload on a database: queries, updates, insertions and deletions. Each of these operations consists of one or more steps.

Query

1. Select the relevant tuples from the data pages

2. Output the relevant tuples to user

Update of tuples

1. Select the relevant tuples from the data pages

2. Update the specified attributes and rewrite the data pages

3. Update the pertinent indices

Deletion of tuples

1. Select the relevant tuples from the data pages

2. Remove the tuples and rewrite the data pages

3. Update the relevant indices

Insertion of tuples

1. Select the location(s) where the tuples will be stored

2. Insert the tuples and rewrite the data pages

3. Update the relevant indices

 

We concentrate on the steps that affect index selection. The first step of an operation in the workload is always the selection of the relevant tuple(s). The execution of this step clearly depends on the available set of indexes, so it has to be taken into account. The second step is not influenced by the availability of indexes, so it can be ignored, while the third step, if present, depends only on the presence of indexes.
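The breakdown into steps can be turned into a toy cost model. What follows is only a minimal sketch under stated assumptions, not the cost function of any real optimizer: the page counts, the class `Operation`, and the functions `selection_cost`, `maintenance_cost` and `workload_cost` are hypothetical names and numbers chosen for illustration, and every operation (including an insertion) is assumed to locate its tuples through one attribute.

```python
from dataclasses import dataclass

# Hypothetical constants, chosen only for illustration.
TABLE_PAGES = 1000        # pages read by a full scan of the relation
INDEX_LOOKUP_PAGES = 4    # pages read when an index on the attribute is used
INDEX_UPDATE_PAGES = 3    # pages written to maintain one index


@dataclass(frozen=True)
class Operation:
    kind: str        # "query", "update", "delete" or "insert"
    attribute: str   # attribute used to locate the relevant tuples
    frequency: int   # how often the operation occurs in the workload


def selection_cost(op, index_set):
    """Step 1: locating tuples is cheap only if the attribute is indexed."""
    return INDEX_LOOKUP_PAGES if op.attribute in index_set else TABLE_PAGES


def maintenance_cost(op, index_set):
    """Step 3: every modifying operation maintains every index (simplification)."""
    if op.kind == "query":
        return 0
    return INDEX_UPDATE_PAGES * len(index_set)


def workload_cost(workload, index_set):
    """Estimated page accesses; step 2 is ignored because it is index-independent."""
    return sum(
        op.frequency * (selection_cost(op, index_set) + maintenance_cost(op, index_set))
        for op in workload
    )


workload = [
    Operation("query", "name", frequency=50),
    Operation("query", "dept", frequency=1),
    Operation("update", "id", frequency=200),
    Operation("insert", "id", frequency=300),
]

no_indexes = frozenset()
two_indexes = frozenset({"id", "name"})
three_indexes = frozenset({"id", "name", "dept"})

for index_set in (no_indexes, two_indexes, three_indexes):
    print(sorted(index_set), workload_cost(workload, index_set))
```

With these made-up numbers, adding the rarely useful index on `dept` raises the total cost again, because its maintenance cost on updates and insertions outweighs the single query it speeds up. This is the tradeoff the index selection problem is about.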

Introduction to Indexes:

Index architecture

Index structures can be classified as clustered or unclustered.

UNCLUSTERED INDEX

SQL Server creates a non-clustered index by default. The data is present in arbitrary order, but the logical ordering is specified by the index. The data rows may be randomly scattered throughout the table. The non-clustered index tree contains the index keys in sorted order, with the leaf level of the index containing the pointer to the page and the row number in the data page. In a non-clustered index:

  • The physical order of the rows is not the same as the index order.
  • The index is typically created on columns used in JOIN, WHERE, and ORDER BY clauses.
  • The index is good for tables whose values may be modified frequently.

 

CLUSTERED INDEX

Clustering alters the data block into a certain distinct order to match the index, hence it is an operation on the data storage blocks as well as on the index. An address book ordered by first name resembles a clustered index in its structure and purpose. The exact operation of database systems varies, but because storing the data more than once would be redundant, the row data can only be stored in one order. Therefore, only one clustered index can be created on a given database table. Clustered indexes can greatly increase the overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items is selected.


 

Formula for Number of Possible Indexes: Given a table with $n$ columns, how many different indexes can exist containing $k$ columns, where $k \le n$? There are $n$ choices for the first column in the index. For the second column, there are $n - 1$ remaining choices. As more columns are added, the total number becomes $n(n - 1)(n - 2)\cdots(n - k + 1)$, or $n!/(n - k)!$.

Therefore the total number of indexes that can be created on a table with $n$ columns is

$$\sum_{k=1}^{n} \frac{n!}{(n - k)!}$$
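To get a feeling for how quickly this sum grows, it can be evaluated for a few table widths. The short sketch below uses Python's standard `math.perm`; the function names `indexes_with_k_columns` and `total_possible_indexes` are made up for this illustration.

```python
import math


def indexes_with_k_columns(n, k):
    """Ordered selections of k distinct columns out of n: n! / (n - k)!."""
    return math.perm(n, k)


def total_possible_indexes(n):
    """Sum of n! / (n - k)! for k = 1..n."""
    return sum(indexes_with_k_columns(n, k) for k in range(1, n + 1))


for n in (3, 5, 10):
    print(n, total_possible_indexes(n))
# prints:
# 3 15
# 5 325
# 10 9864100
```

A table with only ten columns already admits almost ten million distinct column orderings, before clustering or physical structure is even considered.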

2. Types of Indices:

The types of indexes are:

1. Primary key index vs secondary index

2. Unique index vs non-unique index

3. Dense index vs sparse index

4. Hash index

5. Function-based index

6. B-tree index

7. Virtual index

8. Bitmap index

 

In general, two types of indices can be distinguished, namely primary and secondary indices. In the case of a primary index, the tuples in the relation are ordered on the indexed attribute. This is not the case for a secondary index; in the following we concentrate on secondary indices.

 

Index Selection Problem (ISP):

The formalization of the index selection problem provides insight into its difficulty, but the results are valid for special cases only, and no theory is presented for determining an index configuration in the general case.

There are also some general problems with analytical approaches to the ISP.

First, substantial simplifications have to be made to derive an analytical solution.

Second, the model becomes obsolete if there are changes to the query processing strategy or to other modeled aspects of the DBMS.

Index Selection Method and Database Relation

In single-index, multiple-relation index selection, the method is based on a set of join methods that is separable. This property reduces the index selection problem to finding a locally optimal index configuration for each relation. The set of join methods is reduced to two because these are the only ones satisfying separability. It is unclear whether the advantage of a better index configuration outweighs the disadvantage of not using efficient join methods which would otherwise be available. The usage input consists of a weighted set of queries. The general problem with this form of usage input is that the "representative" query set might not be representative of the real workload, because it has to be of moderate size for complexity reasons.

The single-index, single-relation approach, by contrast, is unrealistic for real-life databases. In this index selection method the system adapts the current index configuration based on automatically gathered statistics, so that users do not even have to know about the concept of indices. An example of the database usage statistics used is the set of restriction predicates of each query.

 

2. Why is Index Selection hard?

 

Despite a long history of work in the area of index selection, there are no significant commercial products that do automatic index selection and are widely deployed. Several factors make the task of automating physical design extremely hard.

  First, when viewed as a search problem, the space of alternatives for indexes is very large. A database may have many tables, and each table may have many columns that need to be considered for indexing. An index may be clustered or non-clustered. Indexes may have different physical structures, e.g. B+-tree, hash, bitmap. When multi-column indexes are considered, the search space increases even more dramatically, since for a given set of k columns, k! multi-column indexes are possible.

Second, index selection tools of the past have often followed the "textbook solution" of taking semantic information such as uniqueness, referential constraints and rudimentary statistics to produce a database design. Such designs may perform poorly because they ignore valuable workload information.

Third, even when index selection tools have taken the workload into account, they suffer from being disconnected from the query optimizer. These tools adopt an expert-system-like approach, where the knowledge of "good" designs is encoded as rules that are used to come up with a design. This has adverse ramifications for two reasons. First, a selection of indexes is only as good as the optimizer that uses it. In other words, if the optimizer does not consider a particular index for a query, then its presence in the database does not benefit that query. Second, these tools operate on their own model of the query optimizer's index usage. While building an accurate model of the optimizer is itself hard, ensuring consistency between the assumptions made by the tool and the query optimizer is a software-engineering nightmare.
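The point that an index only helps if the optimizer actually chooses it can be checked by asking the optimizer itself rather than modeling it. The sketch below uses SQLite (through Python's standard `sqlite3` module) purely as a convenient stand-in; it is not the system discussed in this article, and the table `orders` and index `idx_orders_customer` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT, total REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")


def plan(sql):
    """Return the optimizer's plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# The predicate matches the indexed column, so the optimizer can use the index.
plan_on_customer = plan("SELECT * FROM orders WHERE customer = 'alice'")
# The predicate is on an unindexed column, so the existing index brings no benefit.
plan_on_status = plan("SELECT * FROM orders WHERE status = 'open'")

print(plan_on_customer)
print(plan_on_status)
```

The exact wording of the plan text differs between SQLite versions, so only the presence or absence of the index name should be relied on. A selection tool that obtains its estimates from the optimizer in this way avoids having to keep a separate model of index usage consistent with it.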

3. Index selection

The index selection problem has been identified as a variation of the knapsack problem, and several approaches to index recommendation are based on optimization rules.
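To make the knapsack analogy concrete, each candidate index can be viewed as an item with a size (storage) and a value (estimated cost saving), and the storage budget as the knapsack capacity. The following is a minimal sketch of the classic 0/1 knapsack dynamic program, assuming integer sizes and, unrealistically, that the benefits of different indexes are independent of each other. The function `select_indexes` and all candidate names and numbers are invented for illustration.

```python
def select_indexes(candidates, budget):
    """0/1 knapsack by dynamic programming.

    candidates: list of (name, size, benefit) with integer sizes.
    Returns (total_benefit, chosen_names) for the given storage budget.
    """
    # best[b] = (benefit, names) achievable with storage budget b
    best = [(0, ())] * (budget + 1)
    for name, size, benefit in candidates:
        # Iterate budgets downwards so each index is chosen at most once.
        for b in range(budget, size - 1, -1):
            with_index = best[b - size][0] + benefit
            if with_index > best[b][0]:
                best[b] = (with_index, best[b - size][1] + (name,))
    return best[budget]


candidates = [
    ("idx_customer", 40, 900),
    ("idx_status", 30, 400),
    ("idx_customer_status", 60, 700),
    ("idx_total", 20, 350),
]
benefit, chosen = select_indexes(candidates, budget=100)
print(benefit, chosen)
```

In a real system the benefit of one index depends on which other indexes exist, which is why the problem is described as a variation of the knapsack problem and not as the knapsack problem itself.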

Solution for Index Selection

   The overall goal of this work is to develop a flexible index selection framework that can be tuned to achieve effective static index selection and online index selection for high-dimensional data under different analysis constraints.

For the static index selection, when no constraints are specified, the goal is to recommend the set of indexes that yields the lowest estimated cost for every query in a workload that can benefit from an index. In cases where a constraint is specified, either as the minimum number of indexes or as a time constraint, we want to recommend a set of indexes within the constraint from which the queries can benefit the most. When there is a time constraint, we need to automatically adapt the analysis parameters to increase the speed of analysis.

For the online index selection, the goal is to develop a system that can recommend an evolving set of indexes for incoming queries over time, such that the benefit of index set changes outweighs the cost of making those changes. Therefore, an online index selection system that differentiates between low-cost and higher-cost index set changes, and that can also make decisions about index set changes based on different cost-benefit thresholds, is desirable.

   While maintaining the original query information for later use in determining the estimated query cost, we apply one abstraction to the query workload to convert each query into the set of attributes referenced in the query. We perform frequent itemset mining over this abstraction and only consider those sets of attributes that meet a certain support to be potential indexes. By varying the support, we affect the speed of index selection and the percentage of queries that are covered by potential indexes. We further prune the analysis space using association rule mining, by eliminating those subsets above a certain confidence threshold. Lowering the confidence threshold improves the analysis time by eliminating some lower-dimensional indexes from consideration, but can result in recommending indexes that cover a strict superset of the queried attributes.

Our technique differs from existing tools in the method that we use to determine the potential set of indexes to evaluate and in the quantization-based technique that we use to estimate query costs. All of the commercial index wizards work at design time. The DBA has to decide when to run the wizard and over which workload. The assumption is that the workload is going to remain static over time, and in case it changes, the DBA would gather the new workload and run the wizard again.
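The candidate generation step can be sketched as follows. This is a brute-force illustration under stated assumptions, not the mining algorithm of the work described here: it enumerates every attribute subset of every query (exponential in query width, where a real implementation would use something like Apriori), and the pruning rule reflects one reading of the text, namely that a set X is dropped when some frequent strict superset Z satisfies support(Z)/support(X) ≥ the confidence threshold. The function names and the example workload are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_attribute_sets(queries, min_support):
    """Keep attribute subsets contained in at least min_support of the queries.

    queries: list of sets of attribute names, one set per query.
    Returns a dict mapping each frequent frozenset to its support (0..1).
    """
    counts = Counter()
    for attrs in queries:
        ordered = sorted(attrs)
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                counts[frozenset(subset)] += 1
    threshold = min_support * len(queries)
    return {s: c / len(queries) for s, c in counts.items() if c >= threshold}


def prune_by_confidence(supports, min_confidence):
    """Drop X if a frequent strict superset Z has support(Z)/support(X) >= min_confidence."""
    kept = {}
    for x, sup_x in supports.items():
        dominated = any(
            x < z and sup_z / sup_x >= min_confidence
            for z, sup_z in supports.items()
        )
        if not dominated:
            kept[x] = sup_x
    return kept


queries = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a"}, {"d"}]
candidates = frequent_attribute_sets(queries, min_support=0.4)
pruned = prune_by_confidence(candidates, min_confidence=0.9)

for attrs, support in sorted(pruned.items(), key=lambda item: sorted(item[0])):
    print(sorted(attrs), support)
```

In this toy workload the set {b} is pruned because every query that references b also references a, so an index on (a, b) is considered instead. That index covers a strict superset of the attributes of a query on b alone, which is the effect the text attributes to a lower confidence threshold.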

Static Index Selection Approach

 The goal of the index selection is to minimize the cost of the queries in the workload, given certain constraints. Given a query workload, a data set, the indexing constraints, and several analysis parameters, our framework produces a set of recommended indexes as an output.

Online Index Selection Approach

The online index selection is motivated by the fact that query patterns can change over time. By monitoring the query workload and detecting when there is a change in the query pattern that generated the existing set of indexes, we are able to maintain good performance as query patterns evolve. In our approach, we use control feedback to monitor the performance of the current set of indexes for incoming queries and to determine when adjustments should be made to the index set. In a typical control feedback system, the output of a system is monitored, and based on some function involving the input and output, the input to the system is readjusted through a control feedback loop.
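A feedback loop of this kind might look like the sketch below. It is a deliberately simplified illustration, not the controller of the described system: each query is reduced to a single attribute, the cost constants are invented, dropping an index is assumed to be free, and the index set is changed only when the estimated saving on the most recent window of queries exceeds a threshold times the cost of building the new indexes. All function names are hypothetical.

```python
from collections import Counter

SCAN_COST = 100   # hypothetical cost of answering a query without an index
INDEX_COST = 1    # hypothetical cost of answering it with an index
BUILD_COST = 500  # hypothetical cost of creating one new index


def window_cost(window, indexes):
    """Observed output: cost of a window of queries under an index set."""
    return sum(INDEX_COST if attr in indexes else SCAN_COST for attr in window)


def propose(window, max_indexes):
    """Candidate set: the most frequently queried attributes in the window."""
    return frozenset(attr for attr, _ in Counter(window).most_common(max_indexes))


def run_feedback_loop(windows, indexes, max_indexes=2, threshold=1.0):
    """Yield the index set in force after each window has been observed."""
    for window in windows:
        candidate = propose(window, max_indexes)
        benefit = window_cost(window, indexes) - window_cost(window, candidate)
        change_cost = BUILD_COST * len(candidate - indexes)
        # Feedback step: adjust the index set only if the change pays for itself.
        if benefit > threshold * change_cost:
            indexes = candidate
        yield indexes


windows = [
    ["a"] * 10 + ["b"] * 8 + ["c"] * 1,   # pattern matches the current indexes
    ["a"] * 10 + ["c"] * 3 + ["b"] * 2,   # small drift: not worth a rebuild
    ["c"] * 12 + ["a"] * 6 + ["b"] * 1,   # clear shift towards attribute c
]
history = list(run_feedback_loop(windows, frozenset({"a", "b"})))
for step, indexes in enumerate(history, start=1):
    print(step, sorted(indexes))
```

Raising `threshold` makes the loop more conservative, which corresponds to the different cost-benefit thresholds mentioned for the online approach.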

 

 Conclusion

A flexible technique for index selection is introduced, which can be tuned to achieve different levels of constraints and analysis complexity. Index creation is quite time-consuming. It is not feasible to perform real-time analysis of incoming queries and generate new indexes when the patterns change. Potential indexes could be generated prior to receiving new queries and, when indicated by the online analysis, migrated to active status.

 


 



