
Partition By Clause In Pl/Sql What Is The Assignment Operator

Would you please explain in detail, with examples, 'assignment operators' like (:=, :=:, =:)? These altogether seem to be a little bit confusing in terms of where to use what and how to use them.

The assignment operator is simply the way PL/SQL sets the value of one variable to a given value. There is only one assignment operator, := . I'm not sure where you saw the others listed or used, but they are invalid.

Assignment operators are different from a regular equal sign, =, in that they are used to assign a specified value to a PL/SQL variable. For instance, if I wanted to give a variable named V_TEMPERATURE an initial value of 98.6, I'd use the following assignment statement:

set serveroutput on
DECLARE
  v_temperature number := 98.6 ;
BEGIN
  dbms_output.put_line('The temperature is ' || v_temperature) ;
END ;
/

I could also change the value of v_temperature in the body of my code as well:

set serveroutput on
DECLARE
  v_temperature number := 98.6 ;
BEGIN
  dbms_output.put_line('The initial temperature is ' || v_temperature) ;
  v_temperature := 100.6 ;
  dbms_output.put_line('The next temperature is ' || v_temperature) ;
  v_temperature := v_temperature - 4 ;
  dbms_output.put_line('The last temperature is ' || v_temperature) ;
END ;
/

Assigning a value is different from just using the equal sign. Whenever you see an = in PL/SQL, it is typically used for comparison purposes:

set serveroutput on
DECLARE
  v_temperature number := 98.6 ;
BEGIN
  IF v_temperature = 99 THEN
    dbms_output.put_line('You have a slight fever.') ;
  END IF ;
END ;
/

Notice how the = is used to compare the value of v_temperature (which is 98.6) to the value 99.

That's the key way to differentiate. If you want to compare a value use an equal sign, but if you want to assign a value to a variable, use the := assignment operator.



21 SQL for Analysis and Reporting

The following topics provide information about how to improve analytical SQL queries in a data warehouse:

Overview of SQL for Analysis and Reporting

Oracle has enhanced SQL's analytical processing capabilities by introducing a new family of analytic SQL functions. These analytic functions enable you to calculate:

  • Rankings and percentiles

  • Moving window calculations

  • Lag/lead analysis

  • First/last analysis

  • Linear regression statistics

Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group.

Other enhancements to SQL include the CASE expression. CASE expressions provide if-then logic useful in many situations.

In Oracle Database 10g, the SQL reporting capability was further enhanced by the introduction of partitioned outer join. Partitioned outer join is an extension to ANSI outer join syntax that allows users to selectively densify certain dimensions while keeping others sparse. This allows reporting tools to selectively densify dimensions, for example, the ones that appear in their cross-tabular reports while keeping others sparse.

To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity.

Analytic functions are classified as described in Table 21-1.

Table 21-1 Analytic Functions and Their Uses

| Type | Used For |
|---|---|
| Ranking | Calculating ranks, percentiles, and n-tiles of the values in a result set. |
| Windowing | Calculating cumulative and moving aggregates. Works with these functions: SUM, AVG, MIN, MAX, COUNT, VARIANCE, STDDEV, FIRST_VALUE, LAST_VALUE, and new statistical functions. Note that the DISTINCT keyword is not supported in windowing functions except for MAX and MIN. |
| Reporting | Calculating shares, for example, market share. Works with these functions: SUM, AVG, MIN, MAX, COUNT (with/without DISTINCT), VARIANCE, STDDEV, RATIO_TO_REPORT, and new statistical functions. Note that the DISTINCT keyword may be used in those reporting functions that support DISTINCT in aggregate mode. |
| LAG/LEAD | Finding a value in a row a specified number of rows from a current row. |
| FIRST/LAST | First or last value in an ordered group. |
| Linear Regression | Calculating linear regression and other statistics (slope, intercept, and so on). |
| Inverse Percentile | The value in a data set that corresponds to a specified percentile. |
| Hypothetical Rank and Distribution | The rank or percentile that a row would have if inserted into a specified data set. |


To perform these operations, the analytic functions add several new elements to SQL processing. These elements build on existing SQL to allow flexible and powerful calculation expressions. With just a few exceptions, the analytic functions have these new elements. The processing flow is represented in Figure 21-1.

The essential concepts used in analytic functions are:

  • Processing order

    Query processing using analytic functions takes place in three stages. First, all joins, WHERE, GROUP BY, and HAVING clauses are performed. Second, the result set is made available to the analytic functions, and all their calculations take place. Third, if the query has an ORDER BY clause at its end, the ORDER BY is processed to allow for precise output ordering. The processing order is shown in Figure 21-1.

  • Result set partitions

    The analytic functions allow users to divide query result sets into groups of rows called partitions. Note that the term partitions used with analytic functions is unrelated to the table partitions feature. Throughout this chapter, the term partitions refers to only the meaning related to analytic functions. Partitions are created after the groups defined with GROUP BY clauses, so they are available to any aggregate results such as sums and averages. Partition divisions may be based upon any desired columns or expressions. A query result set may be partitioned into just one partition holding all the rows, a few large partitions, or many small partitions holding just a few rows each.

  • Window

    For each row in a partition, you can define a sliding window of data. This window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time. The window has a starting row and an ending row. Depending on its definition, the window may move at one or both ends. For instance, a window defined for a cumulative sum function would have its starting row fixed at the first row of its partition, and its ending row would slide from the starting point all the way to the last row of the partition. In contrast, a window defined for a moving average would have both its starting and end points slide so that they maintain a constant physical or logical range.

    A window can be set as large as all the rows in a partition or just a sliding window of one row within a partition. When a window is near a border, the function returns results for only the available rows, rather than warning you that the results are not what you want.

    When using window functions, the current row is included during calculations, so you should only specify (n-1) when you are dealing with n items.

  • Current row

    Each calculation performed with an analytic function is based on a current row within a partition. The current row serves as the reference point determining the start and end of the window. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six preceding rows, and the following six rows. This would create a sliding window of 13 rows, as shown in Figure 21-2.
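The window concepts above can be illustrated outside the database. The following is a minimal Python sketch, not Oracle code: the helper name window_bounds and the 20-row partition are assumptions made for illustration. It computes which rows fall in the window of each current row, both for a cumulative window (start fixed at the first row) and for the centered window of six preceding and six following rows.

```python
def window_bounds(i, n, preceding, following):
    """Return the inclusive (start, end) row indexes of the window for row i.

    preceding / following are physical row counts; None means "unbounded".
    The window is clipped at the partition borders, so rows near a border
    simply get a smaller window.
    """
    start = 0 if preceding is None else max(0, i - preceding)
    end = n - 1 if following is None else min(n - 1, i + following)
    return start, end


n = 20  # hypothetical partition of 20 rows, indexed 0..19

# Cumulative window: start is fixed, end slides with the current row.
cumulative = [window_bounds(i, n, None, 0) for i in range(n)]

# Centered window: both ends slide (6 preceding, current row, 6 following).
centered = [window_bounds(i, n, 6, 6) for i in range(n)]

print(cumulative[0], cumulative[19])   # (0, 0) (0, 19)
print(centered[10])                    # (4, 16): 13 rows
print(centered[0], centered[19])       # (0, 6) (13, 19): only 7 rows at the borders
```

In the middle of the partition the centered window holds 13 rows; at either border it holds only the rows that exist.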

Ranking Functions

A ranking function computes the rank of a record compared to other records in the data set based on the values of a set of measures. The types of ranking function are:

RANK and DENSE_RANK Functions

The RANK and DENSE_RANK functions allow you to rank items in a group, for example, finding the top three products sold in California last year. There are two functions that perform ranking, as shown by the following syntax:

RANK ( ) OVER ( [query_partition_clause] order_by_clause )
DENSE_RANK ( ) OVER ( [query_partition_clause] order_by_clause )

The difference between RANK and DENSE_RANK is that DENSE_RANK leaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using DENSE_RANK and had three people tie for second place, you would say that all three were in second place and that the next person came in third. The RANK function would also give three people in second place, but the next person would be in fifth place.
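The competition scenario can be traced with a small Python sketch. This is an illustration of the two numbering schemes only, not Oracle's implementation; the competitor names and scores are made up.

```python
def rank_and_dense_rank(scores):
    """Return {name: (rank, dense_rank)}, with higher scores ranked first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    result = {}
    rank = dense = 0
    previous = object()  # sentinel that compares unequal to every score
    for position, (name, score) in enumerate(ordered, start=1):
        if score != previous:
            rank = position   # RANK jumps to the row position, leaving gaps after ties
            dense += 1        # DENSE_RANK moves to the next integer, leaving no gaps
            previous = score
        result[name] = (rank, dense)
    return result


# Hypothetical competition: three people tie for second place.
scores = {"Ann": 95, "Bob": 90, "Cy": 90, "Dee": 90, "Eve": 85}
ranks = rank_and_dense_rank(scores)

for name, (r, d) in ranks.items():
    print(name, r, d)
# Ann 1 1
# Bob 2 2
# Cy 2 2
# Dee 2 2
# Eve 5 3
```

The person after the three-way tie is fifth by rank but third by dense rank.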

The following are some relevant points about RANK:

  • Ascending is the default sort order, which you may want to change to descending.

  • The expressions in the optional PARTITION BY clause divide the query result set into groups within which the RANK function operates. That is, RANK gets reset whenever the group changes. In effect, the value expressions of the PARTITION BY clause define the reset boundaries.

  • If the PARTITION BY clause is missing, then ranks are computed over the entire query result set.

  • The ORDER BY clause specifies the measures (<value expression>) on which ranking is done and defines the order in which rows are sorted in each group (or partition). Once the data is sorted within each partition, ranks are given to each row starting from 1.

  • The NULLS FIRST | NULLS LAST clause indicates the position of NULLs in the ordered sequence, either first or last in the sequence. The order of the sequence would make NULLs compare either high or low with respect to non-NULL values. If the sequence were in ascending order, then NULLS FIRST implies that NULLs are smaller than all other non-NULL values and NULLS LAST implies they are larger than non-NULL values. It is the opposite for descending order. See the example in "Treatment of NULLs".

  • If the NULLS FIRST | NULLS LAST clause is omitted, then the ordering of the null values depends on the ASC or DESC arguments. Null values are considered larger than any other values. If the ordering sequence is ASC, then nulls will appear last; nulls will appear first otherwise. Nulls are considered equal to other nulls and, therefore, the order in which nulls are presented is non-deterministic.

Ranking Order

The following example shows how the [ASC | DESC] option changes the ranking order.

Example 21-1 Ranking Order

SELECT channel_desc,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  RANK() OVER (ORDER BY SUM(amount_sold)) AS default_rank,
  RANK() OVER (ORDER BY SUM(amount_sold) DESC NULLS LAST) AS custom_rank
FROM sales, products, customers, times, channels, countries
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND customers.country_id = countries.country_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND country_iso_code='US'
GROUP BY channel_desc;

CHANNEL_DESC         SALES$         DEFAULT_RANK CUSTOM_RANK
-------------------- -------------- ------------ -----------
Direct Sales              1,320,497            3           1
Partners                    800,871            2           2
Internet                    261,278            1           3

While the data in this result is ordered on the measure SALES$, in general, it is not guaranteed by the RANK function that the data will be sorted on the measures. If you want the data to be sorted on SALES$ in your result, you must specify it explicitly with an ORDER BY clause, at the end of the SELECT statement.

Ranking on Multiple Expressions

Ranking functions need to resolve ties between values in the set. If the first expression cannot resolve ties, the second expression is used to resolve ties and so on. For example, here is a query ranking three of the sales channels over two months based on their dollar sales, breaking ties with the unit sales. (Note that the TRUNC function is used here only to create tie values for this query.)

Example 21-2 Ranking On Multiple Expressions

SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold),-5), '9,999,999,999') SALES$,
  TO_CHAR(SUM(quantity_sold), '9,999,999,999') SALES_Count,
  RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -5) DESC,
               SUM(quantity_sold) DESC) AS col_rank
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         SALES_COUNT     COL_RANK
-------------------- -------- -------------- -------------- ---------
Direct Sales         2000-10       1,200,000         12,584         1
Direct Sales         2000-09       1,200,000         11,995         2
Partners             2000-10         600,000          7,508         3
Partners             2000-09         600,000          6,165         4
Internet             2000-09         200,000          1,887         5
Internet             2000-10         200,000          1,450         6

The sales_count column breaks the ties for three pairs of values.

RANK and DENSE_RANK Difference

The difference between the RANK and DENSE_RANK functions is illustrated in Example 21-3.

Example 21-3 RANK and DENSE_RANK

SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold),-4), '9,999,999,999') SALES$,
  RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DESC) AS RANK,
  DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DESC) AS DENSE_RANK
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$              RANK DENSE_RANK
-------------------- -------- -------------- --------- ----------
Direct Sales         2000-09       1,200,000         1          1
Direct Sales         2000-10       1,200,000         1          1
Partners             2000-09         600,000         3          2
Partners             2000-10         600,000         3          2
Internet             2000-09         200,000         5          3
Internet             2000-10         200,000         5          3

Note that, in the case of DENSE_RANK, the largest rank value gives the number of distinct values in the data set.

Per Group Ranking

The RANK function can be made to operate within groups, that is, the rank gets reset whenever the group changes. This is accomplished with the PARTITION BY clause. The group expressions in the PARTITION BY subclause divide the data set into groups within which RANK operates. For example, to rank products within each channel by their dollar sales, you could issue the following statement.

Example 21-4 Per Group Ranking Example 1

SELECT channel_desc, calendar_month_desc,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  RANK() OVER (PARTITION BY channel_desc
               ORDER BY SUM(amount_sold) DESC) AS RANK_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-08', '2000-09', '2000-10', '2000-11')
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc;

A single query block can contain more than one ranking function, each partitioning the data into different groups (that is, reset on different boundaries). The groups can be mutually exclusive. The following query ranks products based on their dollar sales within each month (RANK_WITHIN_MONTH) and within each channel (RANK_WITHIN_CHANNEL).

Example 21-5 Per Group Ranking Example 2

SELECT channel_desc, calendar_month_desc,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  RANK() OVER (PARTITION BY calendar_month_desc
               ORDER BY SUM(amount_sold) DESC) AS RANK_WITHIN_MONTH,
  RANK() OVER (PARTITION BY channel_desc
               ORDER BY SUM(amount_sold) DESC) AS RANK_WITHIN_CHANNEL
FROM sales, products, customers, times, channels, countries
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND customers.country_id = countries.country_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-08', '2000-09', '2000-10', '2000-11')
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC  CALENDAR SALES$    RANK_WITHIN_MONTH RANK_WITHIN_CHANNEL
------------- -------- --------- ----------------- -------------------
Direct Sales  2000-08  1,236,104                 1                   1
Internet      2000-08    215,107                 2                   4
Direct Sales  2000-09  1,217,808                 1                   3
Internet      2000-09    228,241                 2                   3
Direct Sales  2000-10  1,225,584                 1                   2
Internet      2000-10    239,236                 2                   2
Direct Sales  2000-11  1,115,239                 1                   4
Internet      2000-11    284,742                 2                   1
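To see how the rank resets at partition boundaries, the figures printed in the Example 21-5 output can be re-ranked by hand. The following Python sketch is an illustration only (the function name rank_within is an assumption); it treats RANK as one plus the number of rows in the same partition that sort strictly ahead of the current row.

```python
from collections import defaultdict

# (channel_desc, calendar_month_desc, sales) as shown in the Example 21-5 output.
rows = [
    ("Direct Sales", "2000-08", 1236104),
    ("Internet", "2000-08", 215107),
    ("Direct Sales", "2000-09", 1217808),
    ("Internet", "2000-09", 228241),
    ("Direct Sales", "2000-10", 1225584),
    ("Internet", "2000-10", 239236),
    ("Direct Sales", "2000-11", 1115239),
    ("Internet", "2000-11", 284742),
]


def rank_within(rows, partition_index, measure_index=2):
    """Rank rows by the measure, descending, restarting at 1 in each partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_index]].append(row)
    ranks = {}
    for members in partitions.values():
        measures = [m[measure_index] for m in members]
        for row in members:
            # RANK = 1 + number of rows in the partition that sort strictly ahead.
            ranks[row] = 1 + sum(1 for m in measures if m > row[measure_index])
    return ranks


rank_within_month = rank_within(rows, partition_index=1)
rank_within_channel = rank_within(rows, partition_index=0)

for row in rows:
    print(row[0], row[1], rank_within_month[row], rank_within_channel[row])
```

Each ranking uses its own partitioning of the same eight rows, which is why one row can be first within its month and fourth within its channel.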

Per Cube and Rollup Group Ranking

Analytic functions, RANK for example, can be reset based on the groupings provided by a CUBE, ROLLUP, or GROUPING SETS operator. It is useful to assign ranks to the groups created by CUBE, ROLLUP, and GROUPING SETS queries. See Chapter 20, "SQL for Aggregation in Data Warehouses" for further information about the GROUPING function.

A sample CUBE and ROLLUP query is the following:

SELECT channel_desc, country_iso_code,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  RANK() OVER (PARTITION BY GROUPING_ID(channel_desc, country_iso_code)
               ORDER BY SUM(amount_sold) DESC) AS RANK_PER_GROUP
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id
  AND sales.cust_id=customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc='2000-09'
  AND country_iso_code IN ('GB', 'US', 'JP')
GROUP BY CUBE(channel_desc, country_iso_code);

CHANNEL_DESC         CO SALES$         RANK_PER_GROUP
-------------------- -- -------------- --------------
Direct Sales         GB      1,217,808              1
Direct Sales         JP      1,217,808              1
Direct Sales         US      1,217,808              1
Internet             GB        228,241              4
Internet             JP        228,241              4
Internet             US        228,241              4
Direct Sales                 3,653,423              1
Internet                       684,724              2
                     GB      1,446,049              1
                     JP      1,446,049              1
                     US      1,446,049              1
                             4,338,147              1

Treatment of NULLs

NULLs are treated like normal values. Also, for rank computation, a NULL value is assumed to be equal to another NULL value. Depending on the ASC | DESC options provided for measures and the NULLS FIRST | NULLS LAST clause, NULLs will either sort low or high and hence, are given ranks appropriately. The following example shows how NULLs are ranked in different cases:

SELECT times.time_id time, sold,
  RANK() OVER (ORDER BY (sold) DESC NULLS LAST) AS NLAST_DESC,
  RANK() OVER (ORDER BY (sold) DESC NULLS FIRST) AS NFIRST_DESC,
  RANK() OVER (ORDER BY (sold) ASC NULLS FIRST) AS NFIRST,
  RANK() OVER (ORDER BY (sold) ASC NULLS LAST) AS NLAST
FROM
  (
  SELECT time_id, SUM(sales.amount_sold) sold
  FROM sales, products, customers, countries
  WHERE sales.prod_id=products.prod_id
    AND customers.country_id = countries.country_id
    AND sales.cust_id=customers.cust_id
    AND prod_name IN ('Envoy Ambassador', 'Mouse Pad')
    AND country_iso_code ='GB'
  GROUP BY time_id)
  v, times
WHERE v.time_id (+) = times.time_id
  AND calendar_year=1999
  AND calendar_month_number=1
ORDER BY sold DESC NULLS LAST;

TIME            SOLD NLAST_DESC NFIRST_DESC     NFIRST      NLAST
--------- ---------- ---------- ----------- ---------- ----------
25-JAN-99    3097.32          1          18         31         14
17-JAN-99    1791.77          2          19         30         13
30-JAN-99     127.69          3          20         29         12
28-JAN-99     120.34          4          21         28         11
23-JAN-99      86.12          5          22         27         10
20-JAN-99      79.07          6          23         26          9
13-JAN-99       56.1          7          24         25          8
07-JAN-99      42.97          8          25         24          7
08-JAN-99      33.81          9          26         23          6
10-JAN-99      22.76         10          27         21          4
02-JAN-99      22.76         10          27         21          4
26-JAN-99      19.84         12          29         20          3
16-JAN-99      11.27         13          30         19          2
14-JAN-99       9.52         14          31         18          1
09-JAN-99                    15           1          1         15
12-JAN-99                    15           1          1         15
31-JAN-99                    15           1          1         15
11-JAN-99                    15           1          1         15
19-JAN-99                    15           1          1         15
03-JAN-99                    15           1          1         15
15-JAN-99                    15           1          1         15
21-JAN-99                    15           1          1         15
24-JAN-99                    15           1          1         15
04-JAN-99                    15           1          1         15
06-JAN-99                    15           1          1         15
27-JAN-99                    15           1          1         15
18-JAN-99                    15           1          1         15
01-JAN-99                    15           1          1         15
22-JAN-99                    15           1          1         15
29-JAN-99                    15           1          1         15
05-JAN-99                    15           1          1         15

Bottom N Ranking

Bottom N is similar to top N except for the ordering sequence within the rank expression. Using the previous example, you can order ascending instead of descending.

CUME_DIST Function

The CUME_DIST function (defined as the inverse of percentile in some statistical books) computes the position of a specified value relative to a set of values. The order can be ascending or descending. Ascending is the default. The range of values for CUME_DIST is from greater than 0 to 1. To compute the CUME_DIST of a value x in a set S of size N, you use the formula:

CUME_DIST(x) = (number of values in S coming before and including x in the specified order) / N

Its syntax is:

CUME_DIST ( ) OVER ( [query_partition_clause] order_by_clause )

The semantics of various options in the CUME_DIST function are similar to those in the RANK function. The default order is ascending, implying that the lowest value gets the lowest CUME_DIST (as all other values come later than this value in the order). NULLs are treated the same as they are in the RANK function. They are counted toward both the numerator and the denominator as they are treated like non-NULL values. The following example finds cumulative distribution of sales by channel within each month:

SELECT calendar_month_desc AS MONTH, channel_desc,
  TO_CHAR(SUM(amount_sold) , '9,999,999,999') SALES$,
  CUME_DIST() OVER (PARTITION BY calendar_month_desc
                    ORDER BY SUM(amount_sold) ) AS CUME_DIST_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-07','2000-08')
GROUP BY calendar_month_desc, channel_desc;

MONTH    CHANNEL_DESC         SALES$         CUME_DIST_BY_CHANNEL
-------- -------------------- -------------- --------------------
2000-07  Internet                    140,423           .333333333
2000-07  Partners                    611,064           .666666667
2000-07  Direct Sales              1,145,275                    1
2000-08  Internet                    215,107           .333333333
2000-08  Partners                    661,045           .666666667
2000-08  Direct Sales              1,236,104                    1
2000-09  Internet                    228,241           .333333333
2000-09  Partners                    666,172           .666666667
2000-09  Direct Sales              1,217,808                    1
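The CUME_DIST formula can be checked against the 2000-07 partition of the output above. This is a minimal Python sketch of the formula for ascending order within a single partition, not Oracle's implementation; the second data set with tied values is made up to show that ties share the same result.

```python
def cume_dist(values):
    """CUME_DIST for each value in ascending order: count(v <= x) / N."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]


# 2000-07 sales from the output above: Internet, Partners, Direct Sales.
july_sales = [140423, 611064, 1145275]
cd = cume_dist(july_sales)
print([round(x, 9) for x in cd])   # [0.333333333, 0.666666667, 1.0]

# Hypothetical partition with a tie: both 20s count each other.
tied = cume_dist([10, 20, 20, 30])
print(tied)                        # [0.25, 0.75, 0.75, 1.0]
```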

PERCENT_RANK Function

PERCENT_RANK is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator. Therefore, it returns the percent rank of a value relative to a group of values. The function PERCENT_RANK is available in many popular spreadsheets. PERCENT_RANK of a row is calculated as:

(rank of row in its partition - 1) / (number of rows in the partition - 1)

PERCENT_RANK returns values in the range zero to one. The row(s) with a rank of 1 will have a PERCENT_RANK of zero. Its syntax is:

PERCENT_RANK () OVER ([query_partition_clause] order_by_clause)
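As a worked illustration of the PERCENT_RANK formula, the following Python sketch applies it to a small made-up partition in ascending order. It is a sketch of the arithmetic only; returning zero for a single-row partition is a choice made here to avoid dividing by zero.

```python
def percent_rank(values):
    """(rank - 1) / (rows in partition - 1) for each value, ascending order."""
    n = len(values)
    if n == 1:
        return [0.0]  # a lone row has rank 1, so its percent rank is taken as zero
    result = []
    for x in values:
        rank = 1 + sum(1 for v in values if v < x)  # tied values share a rank
        result.append((rank - 1) / (n - 1))
    return result


# Hypothetical partition: ranks are 1, 2, 2, 4, 5.
pr = percent_rank([10, 20, 20, 30, 40])
print(pr)   # [0.0, 0.25, 0.25, 0.75, 1.0]
```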

NTILE Function

NTILE allows easy calculation of tertiles, quartiles, deciles and other common summary statistics. This function divides an ordered partition into a specified number of groups called buckets and assigns a bucket number to each row in the partition. NTILE is a very useful calculation because it lets users divide a data set into fourths, thirds, and other groupings.

The buckets are calculated so that each bucket has exactly the same number of rows assigned to it or at most 1 row more than the others. For instance, if you have 100 rows in a partition and ask for an NTILE function with four buckets, 25 rows will be assigned a value of 1, 25 rows will have value 2, and so on. These buckets are referred to as equiheight buckets.

If the number of rows in the partition does not divide evenly (without a remainder) into the number of buckets, then the number of rows assigned for each bucket will differ by one at most. The extra rows will be distributed one for each bucket starting from the lowest bucket number. For instance, if there are 103 rows in a partition which has an NTILE(5) function, the first 21 rows will be in the first bucket, the next 21 in the second bucket, the next 21 in the third bucket, the next 20 in the fourth bucket and the final 20 in the fifth bucket.
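The bucket arithmetic described above can be sketched in a few lines of Python. This illustrates the distribution rule only (the function name ntile is reused here for readability and is not an Oracle API): every bucket gets the quotient of rows divided by buckets, and the remainder is handed out one row per bucket starting from bucket 1.

```python
from collections import Counter


def ntile(row_count, buckets):
    """Return the 1-based bucket number for each of row_count ordered rows."""
    base, extra = divmod(row_count, buckets)
    assignment = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + 1 if bucket <= extra else base
        assignment.extend([bucket] * size)
    return assignment


sizes_103 = Counter(ntile(103, 5))
print(dict(sizes_103))   # {1: 21, 2: 21, 3: 21, 4: 20, 5: 20}

sizes_100 = Counter(ntile(100, 4))
print(dict(sizes_100))   # {1: 25, 2: 25, 3: 25, 4: 25}
```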

The NTILE function has the following syntax:

NTILE (expr) OVER ([query_partition_clause] order_by_clause)

In this, the N in NTILE(N) can be a constant (for example, 5) or an expression.

This function, like RANK and CUME_DIST, has a PARTITION BY clause for per group computation, an ORDER BY clause for specifying the measures and their sort order, and a NULLS FIRST | NULLS LAST clause for the specific treatment of NULLs. For example, the following query assigns each month's sales total to one of four buckets:

SELECT calendar_month_desc AS MONTH ,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  NTILE(4) OVER (ORDER BY SUM(amount_sold)) AS TILE4
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_year=2000
  AND prod_category= 'Electronics'
GROUP BY calendar_month_desc;

MONTH    SALES$              TILE4
-------- -------------- ----------
2000-02         242,416          1
2000-01         257,286          1
2000-03         280,011          1
2000-06         315,951          2
2000-05         316,824          2
2000-04         318,106          2
2000-07         433,824          3
2000-08         477,833          3
2000-12         553,534          3
2000-10         652,225          4
2000-11         661,147          4
2000-09         691,449          4

NTILE ORDER BY statements must be fully specified to yield reproducible results. Equal values can get distributed across adjacent buckets. To ensure deterministic results, you must order on a unique key.

ROW_NUMBER Function

The ROW_NUMBER function assigns a unique number (sequentially, starting from 1, as defined by ORDER BY) to each row within the partition. It has the following syntax:

ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )

Example 21-6 ROW_NUMBER

SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold), -5), '9,999,999,999') SALES$,
  ROW_NUMBER() OVER (ORDER BY TRUNC(SUM(amount_sold), -6) DESC) AS ROW_NUMBER
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id
  AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id
  AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2001-09', '2001-10')
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         ROW_NUMBER
-------------------- -------- -------------- ----------
Direct Sales         2001-09       1,100,000          1
Direct Sales         2001-10       1,000,000          2
Internet             2001-09         500,000          3
Internet             2001-10         700,000          4
Partners             2001-09         600,000          5
Partners             2001-10         600,000          6

Note that there are three pairs of tie values in these results. Like NTILE, ROW_NUMBER is a non-deterministic function, so each tied value could have its row number switched. To ensure deterministic results, you must order on a unique key. In most cases, that will require adding a new tie breaker column to the query and using it in the ORDER BY specification.

Windowing Aggregate Functions

Windowing functions can be used to compute cumulative, moving, and centered aggregates. They return a value for each row in the table, which depends on other rows in the corresponding window. With windowing aggregate functions, you can calculate moving and cumulative versions of SUM, AVG, COUNT, MAX, MIN, and many more functions. They can be used only in the SELECT and ORDER BY clauses of the query. Windowing aggregate functions include the convenient FIRST_VALUE, which returns the first value in the window; and LAST_VALUE, which returns the last value in the window. These functions provide access to more than one row of a table without a self-join. The syntax of the windowing functions is:

analytic_function([ arguments ])
   OVER (analytic_clause)

 where analytic_clause =
     [ query_partition_clause ]
     [ order_by_clause [ windowing_clause ] ]

 and query_partition_clause =
     PARTITION BY
       { value_expr[, value_expr ]...
       | ( value_expr[, value_expr ]... )
       }

 and windowing_clause =
     { ROWS | RANGE }
     { BETWEEN
       { UNBOUNDED PRECEDING
       | CURRENT ROW
       | value_expr { PRECEDING | FOLLOWING }
       }
       AND
       { UNBOUNDED FOLLOWING
       | CURRENT ROW
       | value_expr { PRECEDING | FOLLOWING }
       }
     | { UNBOUNDED PRECEDING
       | CURRENT ROW
       | value_expr PRECEDING
       }
     }

Note that the DISTINCT keyword is not supported in windowing functions except for MAX and MIN.

Treatment of NULLs as Input to Window Functions

Window functions' NULL semantics match the NULL semantics for SQL aggregate functions. Other semantics can be obtained by user-defined functions, or by using the DECODE or a CASE expression within the window function.

Windowing Functions with Logical Offset

A logical offset can be specified with constants such as RANGE 10 PRECEDING, or an expression that evaluates to a constant, or by an interval specification like RANGE INTERVAL N DAY/MONTH/YEAR PRECEDING or an expression that evaluates to an interval.

With logical offset, there can only be one expression in the ORDER BY expression list in the function, with type compatible to NUMERIC if offset is numeric, or DATE if an interval is specified.

An analytic function that uses the RANGE keyword can use multiple sort keys in its ORDER BY clause if it specifies either of these two windows:

  • RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The short form of this is RANGE UNBOUNDED PRECEDING, which can also be used.

  • RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. The short form of this is RANGE UNBOUNDED FOLLOWING, which can also be used.

Window boundaries that do not meet these conditions can have only one sort key in the analytic function's ORDER BY clause.

Example 21-7 Cumulative Aggregate Function

The following is an example of cumulative amount_sold by customer ID by quarter in 2000:

SELECT c.cust_id, t.calendar_quarter_desc,
  TO_CHAR (SUM(amount_sold), '9,999,999,999.99') AS Q_SALES,
  TO_CHAR(SUM(SUM(amount_sold))
    OVER (PARTITION BY c.cust_id
          ORDER BY c.cust_id, t.calendar_quarter_desc
          ROWS UNBOUNDED PRECEDING), '9,999,999,999.99') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id
  AND s.cust_id=c.cust_id
  AND t.calendar_year=2000
  AND c.cust_id IN (2595, 9646, 11111)
GROUP BY c.cust_id, t.calendar_quarter_desc
ORDER BY c.cust_id, t.calendar_quarter_desc;

   CUST_ID CALENDA Q_SALES           CUM_SALES
---------- ------- ----------------- -----------------
      2595 2000-01            659.92            659.92
      2595 2000-02            224.79            884.71
      2595 2000-03            313.90          1,198.61
      2595 2000-04          6,015.08          7,213.69
      9646 2000-01          1,337.09          1,337.09
      9646 2000-02            185.67          1,522.76
      9646 2000-03            203.86          1,726.62
      9646 2000-04            458.29          2,184.91
     11111 2000-01             43.18             43.18
     11111 2000-02             33.33             76.51
     11111 2000-03            579.73            656.24
     11111 2000-04            307.58            963.82

In this example, the analytic function SUM defines, for each row, a window that starts at the beginning of the partition (UNBOUNDED PRECEDING) and ends, by default, at the current row.

Nested SUMs are needed in this example since we are performing a SUM over a value that is itself a SUM. Nested aggregations are used very often in analytic aggregate functions.
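The running total in the output above can be reproduced by hand. The following Python sketch is an illustration, not Oracle code: it takes the quarterly totals printed for two of the customers (the inner SUM), groups them by customer (the partition), and accumulates within each group (the outer SUM over ROWS UNBOUNDED PRECEDING). Decimal is used so the cents add up exactly.

```python
from decimal import Decimal
from itertools import accumulate, groupby

# (cust_id, quarter, quarterly sales) copied from the output above,
# already sorted by customer and quarter.
quarterly = [
    (2595, "2000-01", Decimal("659.92")),
    (2595, "2000-02", Decimal("224.79")),
    (2595, "2000-03", Decimal("313.90")),
    (2595, "2000-04", Decimal("6015.08")),
    (9646, "2000-01", Decimal("1337.09")),
    (9646, "2000-02", Decimal("185.67")),
    (9646, "2000-03", Decimal("203.86")),
    (9646, "2000-04", Decimal("458.29")),
]

cum_sales = []
for cust_id, group in groupby(quarterly, key=lambda r: r[0]):
    group = list(group)
    # The running total restarts for every customer (partition).
    running = accumulate(r[2] for r in group)
    cum_sales.extend((r[0], r[1], total) for r, total in zip(group, running))

for row in cum_sales:
    print(*row)
```

The last row of each customer carries that customer's total for the year, and the first row of the next customer starts again from its own first quarter.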

Example 21-8 Moving Aggregate Function

This example of a time-based window shows, for one customer, the moving average of sales for the current month and preceding two months:

SELECT c.cust_id, t.calendar_month_desc,
  TO_CHAR (SUM(amount_sold), '9,999,999,999') AS SALES,
  TO_CHAR(AVG(SUM(amount_sold))
    OVER (ORDER BY c.cust_id, t.calendar_month_desc
          ROWS 2 PRECEDING), '9,999,999,999') AS MOVING_3_MONTH_AVG
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id
  AND s.cust_id=c.cust_id
  AND t.calendar_year=1999
  AND c.cust_id IN (6510)
GROUP BY c.cust_id, t.calendar_month_desc
ORDER BY c.cust_id, t.calendar_month_desc;

   CUST_ID CALENDAR SALES          MOVING_3_MONTH
---------- -------- -------------- --------------
      6510 1999-04             125            125
      6510 1999-05           3,395          1,760
      6510 1999-06           4,080          2,533
      6510 1999-07           6,435          4,637
      6510 1999-08           5,105          5,207
      6510 1999-09           4,676          5,405
      6510 1999-10           5,109          4,963
      6510 1999-11             802          3,529

Note that the first two rows for the three month moving average calculation in the output data are based on a smaller interval size than specified because the window calculation cannot reach past the data retrieved by the query. You need to consider the different window sizes found at the borders of result sets. In other words, you may need to modify the query to include exactly what you want.
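The shrinking window at the start of the result set can be verified from the printed figures. The following Python sketch recomputes the moving average from the SALES column of the output above, assuming the displayed (rounded) monthly totals are close enough to the stored values; the function name moving_average is made up for illustration.

```python
# Monthly SALES values for customer 6510, 1999-04 through 1999-11, as displayed.
monthly_sales = [125, 3395, 4080, 6435, 5105, 4676, 5109, 802]


def moving_average(values, preceding):
    """Average of the current row and up to `preceding` rows before it."""
    result = []
    for i in range(len(values)):
        # Near the start there are fewer rows available, so the window is smaller.
        window = values[max(0, i - preceding): i + 1]
        result.append(sum(window) / len(window))
    return result


avgs = [round(a) for a in moving_average(monthly_sales, 2)]
print(avgs)   # [125, 1760, 2533, 4637, 5207, 5405, 4963, 3529]
```

The first value is an average over one month and the second over two months; only from the third row on does the window hold three months.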

Centered Aggregate Function

Calculating windowing aggregate functions centered around the current row is straightforward. This example computes for all customers a centered moving average of sales for one week in late December 1999. It finds an average of the sales total for the one day preceding the current row and one day following the current row including the current row as well.

Example 21-9 Centered Aggregate

SELECT t.time_id,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') AS SALES,
       TO_CHAR(AVG(SUM(amount_sold))
               OVER (ORDER BY t.time_id
                     RANGE BETWEEN INTERVAL '1' DAY PRECEDING
                               AND INTERVAL '1' DAY FOLLOWING),
               '9,999,999,999') AS CENTERED_3_DAY_AVG
FROM sales s, times t
WHERE s.time_id=t.time_id AND t.calendar_week_number IN (51)
  AND calendar_year=1999
GROUP BY t.time_id
ORDER BY t.time_id;

TIME_ID   SALES          CENTERED_3_DAY
--------- -------------- --------------
20-DEC-99        134,337        106,676
21-DEC-99         79,015        102,539
22-DEC-99         94,264         85,342
23-DEC-99         82,746         93,322
24-DEC-99        102,957         82,937
25-DEC-99         63,107         87,062
26-DEC-99         95,123         79,115

The first and last rows of the centered moving average in the output data are based on just two days, since the window calculation cannot reach past the data retrieved by the query. Users need to consider the different window sizes found at the borders of result sets: the query may need to be adjusted.
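As a rough check of the centered window, the Python sketch below recomputes the averages from the SALES column of the output. It is only an illustration of the window logic, and the helper name centered_avg is an assumption of this sketch.

```python
from datetime import date

# Daily sales totals copied from the query output above.
daily_sales = {
    date(1999, 12, 20): 134337,
    date(1999, 12, 21): 79015,
    date(1999, 12, 22): 94264,
    date(1999, 12, 23): 82746,
    date(1999, 12, 24): 102957,
    date(1999, 12, 25): 63107,
    date(1999, 12, 26): 95123,
}

def centered_avg(series, days):
    """Average over every row within `days` of the current date, both directions."""
    out = {}
    for current in series:
        window = [v for d, v in series.items() if abs((d - current).days) <= days]
        out[current] = sum(window) / len(window)
    return out

centered_3_day = [round(v) for v in centered_avg(daily_sales, 1).values()]
print(centered_3_day)
```

The values for 20-DEC-99 and 26-DEC-99 average only two days because no neighbouring day exists on one side of the window.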

Windowing Aggregate Functions in the Presence of Duplicates

The following example illustrates how window aggregate functions compute values when there are duplicates, that is, when multiple rows are returned for a single ordering value. The query retrieves the quantity sold to several customers during a specified time range. (Although we use an inline view to define our base data set, it has no special significance and can be ignored.) The query defines a moving window that runs from the date of the current row to 10 days earlier. Note that the RANGE keyword is used to define the windowing clause of this example. This means that the window can potentially hold many rows for each value in the range. In this case, there are three pairs of rows with duplicate date values.

Example 21-10 Windowing Aggregate Functions with Logical Offsets

SELECT time_id, daily_sum,
       SUM(daily_sum)
         OVER (ORDER BY time_id
               RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW)
         AS current_group_sum
FROM (SELECT time_id, channel_id, SUM(s.quantity_sold) AS daily_sum
      FROM customers c, sales s, countries
      WHERE c.cust_id=s.cust_id
        AND c.country_id = countries.country_id
        AND s.cust_id IN (638, 634, 753, 440 )
        AND s.time_id BETWEEN '01-MAY-00' AND '13-MAY-00'
      GROUP BY time_id, channel_id);

TIME_ID    DAILY_SUM CURRENT_GROUP_SUM
--------- ---------- -----------------
06-MAY-00          7                 7   /* 7 */
10-MAY-00          1                 9   /* 7 + (1+1) */
10-MAY-00          1                 9   /* 7 + (1+1) */
11-MAY-00          2                15   /* 7 + (1+1) + (2+4) */
11-MAY-00          4                15   /* 7 + (1+1) + (2+4) */
12-MAY-00          1                16   /* 7 + (1+1) + (2+4) + 1 */
13-MAY-00          2                23   /* 7 + (1+1) + (2+4) + 1 + (5+2) */
13-MAY-00          5                23   /* 7 + (1+1) + (2+4) + 1 + (5+2) */

In the output of this example, all dates except May 6 and May 12 return two rows. Examine the commented numbers to the right of the output to see how the values are calculated. Note that each group in parentheses represents the values returned for a single day.

Note that this example applies only when you use the RANGE keyword rather than the ROWS keyword. It is also important to remember that with RANGE, you can use only one expression in the analytic function's ORDER BY clause. With the ROWS keyword, you can use multiple expressions in the analytic function's ORDER BY clause.
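The difference between a logical (RANGE) and a physical (ROWS) window on duplicate dates can be sketched in Python. This is an illustration under stated assumptions rather than Oracle's implementation: the rows are copied from the output above, and range_sum and rows_sum are names invented for the sketch. The ROWS variant is a plain running total in the listed order, which is arbitrary among rows sharing a date.

```python
from datetime import date, timedelta

# (time_id, daily_sum) rows copied from the output above, ordered by time_id.
rows = [
    (date(2000, 5, 6), 7),
    (date(2000, 5, 10), 1), (date(2000, 5, 10), 1),
    (date(2000, 5, 11), 2), (date(2000, 5, 11), 4),
    (date(2000, 5, 12), 1),
    (date(2000, 5, 13), 2), (date(2000, 5, 13), 5),
]

def range_sum(rows, days):
    """Logical window: every row whose date lies in [current - days, current]."""
    span = timedelta(days=days)
    return [sum(v for d, v in rows if cur - span <= d <= cur) for cur, _ in rows]

def rows_sum(rows):
    """Physical window (running total): counts rows, so duplicates differ."""
    total, out = 0, []
    for _, value in rows:
        total += value
        out.append(total)
    return out

range_sums = range_sum(rows, 10)
rows_sums = rows_sum(rows)
print(range_sums)
print(rows_sums)
```

With the logical window, both rows for a given date see the same total because each window includes all rows that share the current date. With the physical window, the two rows for a date receive different totals that depend on their relative order.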

Varying Window Size for Each Row

There are situations where it is useful to vary the size of a window for each row, based on a specified condition. For instance, you may want to make the window larger for certain dates and smaller for others. Assume that you want to calculate the moving average of stock price over three working days. If you have an equal number of rows for each day for all working days and no non-working days are stored, then you can use a physical window function. However, if the conditions noted are not met, you can still calculate a moving average by using an expression in the window size parameters.

Expressions in a window size specification can come from several different sources. The expression could be a reference to a column in a table, such as a time table. It could also be a function that returns the appropriate boundary for the window based on values in the current row. The following statement for a hypothetical stock price database uses a user-defined function in its RANGE clause to set window size:

SELECT t_timekey,
       AVG(stock_price)
         OVER (ORDER BY t_timekey RANGE fn(t_timekey) PRECEDING) av_price
FROM stock, time
WHERE st_timekey = t_timekey
ORDER BY t_timekey;

In this statement, t_timekey is a date field. Here, fn could be a PL/SQL function with the following specification:

fn(t_timekey) returns

  • 4 if t_timekey is a Monday or Tuesday

  • 2 otherwise

  • If any of the previous days are holidays, it adjusts the count appropriately.

Note that when a window is specified using a number in a window function with ORDER BY on a date column, the number is interpreted as a number of days. You could also have used the interval literal conversion function, as NUMTODSINTERVAL(fn(t_timekey), 'DAY') instead of just fn(t_timekey), to mean the same thing. You can also write a PL/SQL function that returns an INTERVAL datatype value.
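The idea behind fn can be sketched outside the database. The Python below is a hypothetical stand-in, not the PL/SQL function itself: window_days and moving_avg are names made up here, the prices are invented sample data, and holiday handling is left out.

```python
from datetime import date, timedelta

def window_days(day):
    """Hypothetical stand-in for fn: calendar days to look back from `day`."""
    # date.weekday() returns 0 for Monday and 1 for Tuesday; those two days
    # must reach back over the weekend to cover three working days.
    return 4 if day.weekday() in (0, 1) else 2

def moving_avg(prices):
    """Average of all prices dated within [day - window_days(day), day]."""
    out = {}
    for day in prices:
        start = day - timedelta(days=window_days(day))
        window = [p for d, p in prices.items() if start <= d <= day]
        out[day] = sum(window) / len(window)
    return out

# Invented prices for Thursday 4 January 2024 through Wednesday 10 January 2024.
prices = {
    date(2024, 1, 4): 10.0,
    date(2024, 1, 5): 20.0,
    date(2024, 1, 8): 30.0,
    date(2024, 1, 9): 40.0,
    date(2024, 1, 10): 50.0,
}
averages = moving_avg(prices)
print(averages[date(2024, 1, 8)])
```

Each of the last three dates ends up averaging exactly three working days, even though the calendar span of the window changes from row to row.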

Windowing Aggregate Functions with Physical Offsets

For windows expressed in rows, the ordering expressions should be unique to produce deterministic results. For example, the following query is not deterministic because time_id is not unique in this result set.

Example 21-11 Windowing Aggregate Functions With Physical Offsets

SELECT t.time_id,
       TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE,
       TO_CHAR(SUM(amount_sold)
               OVER (PARTITION BY t.time_id ORDER BY t.time_id
                     ROWS UNBOUNDED PRECEDING),
               '9,999,999,999') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id
  AND t.time_id IN (TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999'))
  AND c.cust_id BETWEEN 6500 AND 6600
ORDER BY t.time_id;

TIME_ID   INDIV_SALE CUM_SALES
--------- ---------- ---------
12-DEC-99         23        23
12-DEC-99          9        32
12-DEC-99         14        46
12-DEC-99         24        70
12-DEC-99         19        89

One way to handle this problem would be to add the prod_id column to the result set and order on both time_id and prod_id.

FIRST_VALUE and LAST_VALUE Functions

The FIRST_VALUE and LAST_VALUE functions allow you to select the first and last rows from a window. These rows are especially valuable because they are often used as the baselines in calculations. For instance, with a partition holding sales data ordered by day, you might ask "How much was each day's sales compared to the first sales day (FIRST_VALUE) of the period?" Or you might wish to know, for a set of rows in increasing sales order, "What was the percentage size of each sale in the region compared to the largest sale (LAST_VALUE) in the region?"

If the IGNORE NULLS option is used with FIRST_VALUE, it will return the first non-null value in the set, or NULL if all values are NULL. If IGNORE NULLS is used with LAST_VALUE, it will return the last non-null value in the set, or NULL if all values are NULL. The IGNORE NULLS option is particularly useful in populating an inventory table properly.
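To make the null handling concrete, here is a small Python sketch that imitates the two functions over a list standing in for one window, with None playing the role of NULL. The function names and the inventory figures are assumptions made for the illustration.

```python
def first_value(window, ignore_nulls=False):
    """First value of the window; with ignore_nulls, the first non-null one."""
    if not ignore_nulls:
        return window[0] if window else None
    return next((v for v in window if v is not None), None)

def last_value(window, ignore_nulls=False):
    """Last value of the window; with ignore_nulls, the last non-null one."""
    return first_value(window[::-1], ignore_nulls)

# Invented inventory counts; None marks days on which nothing was recorded.
quantities = [10, None, None, 7, None]

# For each day, take the last known quantity among all rows up to that day,
# which carries the most recent count forward into the gaps.
filled = [last_value(quantities[: i + 1], ignore_nulls=True)
          for i in range(len(quantities))]
print(filled)
```

Carrying the last known quantity forward in this way is the kind of inventory-table fill that the paragraph above refers to.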

Reporting Aggregate Functions

After a query has been processed, aggregate values like the number of resulting rows or an average value in a column can be easily computed within a partition and made available to other reporting functions. Reporting aggregate functions return the same aggregate value for every row in a partition. Their behavior with respect to NULLs is the same as the SQL aggregate functions. The syntax is:

{SUM | AVG | MAX | MIN | COUNT | STDDEV | VARIANCE ... } ([ALL | DISTINCT] {value expression1 | *}) OVER ([PARTITION BY value expression2[,...]])

In addition, the following conditions apply:

  • An asterisk (*) is only allowed in COUNT(*).

  • DISTINCT is supported only if corresponding aggregate functions allow it.

  • value expression1 and value expression2 can be any valid expression involving column references or aggregates.

  • The PARTITION BY clause defines the groups on which the windowing functions would be computed. If the PARTITION BY clause is absent, then the function is computed over the whole query result set.

Reporting functions can appear only in the SELECT clause or the ORDER BY clause. The major benefit of reporting functions is their ability to do multiple passes of data in a single query block and speed up query performance. Queries such as "Count the number of salesmen with sales more than 10% of city sales" do not require joins between separate query blocks.

For example, consider the question "For each product category, find the region in which it had maximum sales". The equivalent SQL query using the MAX reporting aggregate function would be:

SELECT prod_category, country_region, sales
FROM (SELECT SUBSTR(p.prod_category,1,8) AS prod_category,
             co.country_region,
             SUM(amount_sold) AS sales,
             MAX(SUM(amount_sold)) OVER (PARTITION BY prod_category)
               AS MAX_REG_SALES
      FROM sales s, customers c, countries co, products p
      WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id
        AND s.prod_id=p.prod_id
        AND s.time_id = TO_DATE('11-OCT-2001')
      GROUP BY prod_category, country_region)
WHERE sales = MAX_REG_SALES;

The inner query with the reporting aggregate function returns:

PROD_CAT COUNTRY_REGION            SALES MAX_REG_SALES
-------- -------------------- ---------- -------------
Electron Americas                 581.92        581.92
Hardware Americas                 925.93        925.93
Peripher Americas                3084.48       4290.38
Peripher Asia                    2616.51       4290.38
Peripher Europe                  4290.38       4290.38
Peripher Oceania                  940.43       4290.38
Software Americas                 4445.7        4445.7
Software Asia                    1408.19        4445.7
Software Europe                  3288.83        4445.7
Software Oceania                  890.25        4445.7

The full query results are:

PROD_CAT COUNTRY_REGION            SALES
-------- -------------------- ----------
Electron Americas                 581.92
Hardware Americas                 925.93
Peripher Europe                  4290.38
Software Americas                 4445.7
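The two-step logic (attach the partition maximum to every row, then keep the rows that equal it) can be traced in Python using the inner query's rows shown above. This is only a sketch of the reasoning; variable names such as max_reg_sales are chosen for the illustration.

```python
# (prod_category, country_region, sales) rows copied from the inner query output.
inner_rows = [
    ("Electron", "Americas", 581.92),
    ("Hardware", "Americas", 925.93),
    ("Peripher", "Americas", 3084.48),
    ("Peripher", "Asia", 2616.51),
    ("Peripher", "Europe", 4290.38),
    ("Peripher", "Oceania", 940.43),
    ("Software", "Americas", 4445.7),
    ("Software", "Asia", 1408.19),
    ("Software", "Europe", 3288.83),
    ("Software", "Oceania", 890.25),
]

# Step 1: MAX(...) OVER (PARTITION BY prod_category), one value per category.
max_reg_sales = {}
for category, _, sales in inner_rows:
    max_reg_sales[category] = max(sales, max_reg_sales.get(category, sales))

# Step 2: the outer WHERE sales = MAX_REG_SALES filter.
best_regions = [row for row in inner_rows if row[2] == max_reg_sales[row[0]]]
for row in best_regions:
    print(row)
```

Because the maximum is available on every row of the partition, the filter needs no join back to a second aggregated query block.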

Example 21-12 Reporting Aggregate Example

Reporting aggregates combined with nested queries enable you to answer complex queries efficiently. For example, what if you want to know the best selling products in your most significant product subcategories? The following is a query which finds the 5 top-selling products for each product subcategory that contributes more than 20% of the sales within its product category:

SELECT SUBSTR(prod_category,1,8) AS CATEG, prod_subcategory, prod_id, SALES
FROM (SELECT p.prod_category, p.prod_subcategory, p.prod_id,
             SUM(amount_sold) AS SALES,
             SUM(SUM(amount_sold)) OVER (PARTITION BY p.prod_category)
               AS CAT_SALES,
             SUM(SUM(amount_sold)) OVER (PARTITION BY p.prod_subcategory)
               AS SUBCAT_SALES,
             /* DESC so that rank 1 is the top-selling product */
             RANK() OVER (PARTITION BY p.prod_subcategory
                          ORDER BY SUM(amount_sold) DESC) AS RANK_IN_LINE
      FROM sales s, customers c, countries co, products p
      WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id
        AND s.prod_id=p.prod_id
        AND s.time_id=TO_DATE('11-OCT-2000')
      GROUP BY p.prod_category, p.prod_subcategory, p.prod_id
      ORDER BY prod_category, prod_subcategory)
WHERE SUBCAT_SALES>0.2*CAT_SALES AND RANK_IN_LINE<=5;

RATIO_TO_REPORT Function

The RATIO_TO_REPORT function computes the ratio of a value to the sum of a set of values. If the expression expr evaluates to NULL, RATIO_TO_REPORT also evaluates to NULL, but it is treated as zero for computing the sum of values for the denominator. Its syntax is:

RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )

In this, the following applies:

  • expr can be any valid expression involving column references or aggregates.

  • The PARTITION BY clause defines the groups on which the RATIO_TO_REPORT function is to be computed. If the PARTITION BY clause is absent, then the function is computed over the whole query result set.

Example 21-13 RATIO_TO_REPORT

To calculate RATIO_TO_REPORT of sales for each channel, you might use the following syntax:

SELECT ch.channel_desc,
       TO_CHAR(SUM(amount_sold),'9,999,999') AS SALES,
       TO_CHAR(SUM(SUM(amount_sold)) OVER (), '9,999,999') AS TOTAL_SALES,
       TO_CHAR(RATIO_TO_REPORT(SUM(amount_sold)) OVER (), '9.999')
         AS RATIO_TO_REPORT
FROM sales s, channels ch
WHERE s.channel_id=ch.channel_id
  AND s.time_id=TO_DATE('11-OCT-2000')
GROUP BY ch.channel_desc;

CHANNEL_DESC         SALES      TOTAL_SALE RATIO_
-------------------- ---------- ---------- ------
Direct Sales             14,447     23,183   .623
Internet                    345     23,183   .015
Partners                  8,391     23,183   .362
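The arithmetic behind the RATIO_ column is simple enough to reproduce. The Python sketch below uses the SALES figures from the output; ratio_to_report is a name chosen for the sketch, and None stands in for NULL to show the null rule described earlier.

```python
def ratio_to_report(values):
    """Each value divided by the sum of all values; None stays None
    but counts as zero in the denominator."""
    total = sum(v for v in values if v is not None)
    return [None if v is None else v / total for v in values]

# Channel sales copied from the output above.
channel_sales = {"Direct Sales": 14447, "Internet": 345, "Partners": 8391}

ratios = dict(zip(channel_sales, ratio_to_report(list(channel_sales.values()))))
rounded = {name: round(r, 3) for name, r in ratios.items()}
print(rounded)
```

The three ratios sum to 1 because an empty OVER () clause makes the whole result set the denominator.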

LAG/LEAD Functions

The LAG and LEAD functions are useful for comparing values when the relative positions of rows can be known reliably. They work by specifying the count of rows which separate the target row from the current row. Because the functions provide access to more than one row of a table at the same time without a self-join, they can enhance processing speed. The LAG function provides access to a row at a given offset prior to the current position, and the LEAD function provides access to a row at a given offset after the current position.

LAG/LEAD Syntax

These functions have the following syntax:

{LAG | LEAD} ( value_expr [, offset] [, default] ) OVER ( [query_partition_clause] order_by_clause )

offset is an optional parameter and defaults to 1. default is an optional parameter and is the value returned if offset falls outside the bounds of the table or partition.

Example 21-14 LAG/LEAD

SELECT time_id,
       TO_CHAR(SUM(amount_sold),'9,999,999') AS SALES,
       TO_CHAR(LAG(SUM(amount_sold),1) OVER (ORDER BY time_id),'9,999,999')
         AS LAG1,
       TO_CHAR(LEAD(SUM(amount_sold),1) OVER (ORDER BY time_id),'9,999,999')
         AS LEAD1
FROM sales
WHERE time_id>=TO_DATE('10-OCT-2000') AND time_id<=TO_DATE('14-OCT-2000')
GROUP BY time_id;

TIME_ID   SALES      LAG1       LEAD1
--------- ---------- ---------- ----------
10-OCT-00    238,479                23,183
11-OCT-00     23,183    238,479     24,616
12-OCT-00     24,616     23,183     76,516
13-OCT-00     76,516     24,616     29,795
14-OCT-00     29,795     76,516

See "Data Densification for Reporting" for information showing how to use the LAG/LEAD functions for doing period-to-period comparison queries on sparse data.
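The offset and default parameters behave like a simple shift over the ordered rows. The Python sketch below mimics that over the SALES column of Example 21-14; the functions lag and lead are illustrative names, and None stands in for the blank (NULL) cells in the output.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before each position, or `default` past the start."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """Value `offset` rows after each position, or `default` past the end."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

# Daily sales for 10-OCT-00 through 14-OCT-00, copied from the output above.
sales = [238479, 23183, 24616, 76516, 29795]

lag1 = lag(sales)
lead1 = lead(sales)
print(lag1)
print(lead1)
```

A period-over-period change then becomes a row-local subtraction, for example sales[i] - lag1[i] wherever lag1[i] is not None.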

FIRST/LAST Functions

The FIRST/LAST aggregate functions allow you to rank a data set and work with its top-ranked or bottom-ranked rows. After finding the top or bottom ranked rows, an aggregate function is applied to any desired column. That is, FIRST/LAST lets you rank on column A but return the result of an aggregate applied on the first-ranked or last-ranked rows of column B. This is valuable because it avoids the need for a self-join or subquery, thus improving performance. These functions' syntax begins with a regular aggregate function (MIN, MAX, SUM, AVG, COUNT, VARIANCE, STDDEV) that produces a single return value per group. To specify the ranking used, the FIRST/LAST functions add a new clause starting with the word KEEP.

FIRST/LAST Syntax

These functions have the following syntax:

aggregate_function KEEP ( DENSE_RANK { FIRST | LAST } ORDER BY expr [ DESC | ASC ] [NULLS { FIRST | LAST }] [, expr [ DESC | ASC ] [NULLS { FIRST | LAST }]]...) [OVER query_partitioning_clause]

Note that the ORDER BY clause can take multiple expressions.

FIRST/LAST As Regular Aggregates

You can use the FIRST/LAST family of aggregates as regular aggregate functions.

Example 21-15 FIRST/LAST Example 1

The following query lets us compare minimum price and list price of our products. For each product subcategory within the Electronics category, it returns the following:

  • List price of the product with the lowest minimum price

  • Lowest minimum price

  • List price of the product with the highest minimum price

  • Highest minimum price

SELECT prod_subcategory,
       MIN(prod_list_price)
         KEEP (DENSE_RANK FIRST ORDER BY (prod_min_price)) AS LP_OF_LO_MINP,
       MIN(prod_min_price) AS LO_MINP,
       MAX(prod_list_price)
         KEEP (DENSE_RANK LAST ORDER BY (prod_min_price)) AS LP_OF_HI_MINP,
       MAX(prod_min_price) AS HI_MINP
FROM products
WHERE prod_category='Electronics'
GROUP BY prod_subcategory;

PROD_SUBCATEGORY  LP_OF_LO_MINP LO_MINP LP_OF_HI_MINP    HI_MINP
----------------- ------------- ------- ------------- ----------
Game Consoles            299.99  299.99        299.99     299.99
Home Audio               499.99  499.99        599.99     599.99
Y Box Accessories          7.99    7.99         20.99      20.99
Y Box Games                7.99    7.99         29.99      29.99
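The KEEP clause can be read as "restrict the group to the rows tied for first (or last) place on the ordering column, then aggregate another column over just those rows". A minimal Python sketch of that reading follows; keep_dense_rank and the three product rows are invented for the illustration and are not taken from the sample schema.

```python
def keep_dense_rank(rows, aggregate, value, order_by, last=False):
    """aggregate(value) over the rows ranked first (or last) by order_by."""
    keys = [order_by(r) for r in rows]
    target = max(keys) if last else min(keys)
    # All rows tied on the target ordering value take part in the aggregate.
    return aggregate(value(r) for r in rows if order_by(r) == target)

# Invented rows for a single subcategory.
products = [
    {"list_price": 30.0, "min_price": 10.0},
    {"list_price": 25.0, "min_price": 10.0},
    {"list_price": 50.0, "min_price": 40.0},
]

def list_price(row):
    return row["list_price"]

def min_price(row):
    return row["min_price"]

lp_of_lo_minp = keep_dense_rank(products, min, list_price, min_price)
lp_of_hi_minp = keep_dense_rank(products, max, list_price, min_price, last=True)
print(lp_of_lo_minp, lp_of_hi_minp)
```

Two products tie for the lowest minimum price here, so the MIN aggregate decides which list price is reported. That tie-breaking role is why an aggregate function is required in front of KEEP.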

FIRST/LAST As Reporting Aggregates

You can also use the FIRST/LAST family of aggregates as reporting aggregate functions. An example is calculating which months had the greatest and least increase in head count throughout the year. The syntax for these functions is similar to the syntax for any other reporting aggregate.

Consider the query in Example 21-15 for FIRST/LAST. What if we wanted to find the list prices of individual products and compare them to the list prices of the products in their subcategory that had the highest and lowest minimum prices?

The following query lets us find that information for the Documentation subcategory by using FIRST/LAST as reporting aggregates.

Example 21-16 FIRST/LAST Example 2

SELECT prod_id, prod_list_price,
       MIN(prod_list_price)
         KEEP (DENSE_RANK FIRST ORDER BY (prod_min_price))
         OVER (PARTITION BY (prod_subcategory)) AS LP_OF_LO_MINP,
       MAX(prod_list_price)
         KEEP (DENSE_RANK LAST ORDER BY (prod_min_price))
         OVER (PARTITION BY (prod_subcategory)) AS LP_OF_HI_MINP
FROM products
WHERE prod_subcategory = 'Documentation';

   PROD_ID PROD_LIST_PRICE LP_OF_LO_MINP LP_OF_HI_MINP
---------- --------------- ------------- -------------
        40           44.99         44.99         44.99
        41           44.99         44.99         44.99
        42           44.99         44.99         44.99
        43           44.99         44.99         44.99
        44           44.99         44.99         44.99
        45           44.99         44.99         44.99

Using the FIRST and LAST functions as reporting aggregates makes it easy to include the results in calculations such as "Salary as a percent of the highest salary."

Inverse Percentile Functions

Using the CUME_DIST function, you can find the cumulative distribution (percentile) of a set of values. However, the inverse operation (finding what value computes to a certain percentile) is neither easy to do nor efficiently computed. To overcome this difficulty, the PERCENTILE_CONT and PERCENTILE_DISC functions were introduced. These can be used both as window reporting functions and as normal aggregate functions.

These functions need a sort specification and a parameter that takes a percentile value between 0 and 1. The sort specification is handled by using an ORDER BY clause with one expression. When used as a normal aggregate function, it returns a single value for each ordered set.

The two functions are PERCENTILE_CONT, which is a continuous function computed by interpolation, and PERCENTILE_DISC, which is a step function that assumes discrete values. Like other aggregates, PERCENTILE_CONT and PERCENTILE_DISC operate on a group of rows in a grouped query, but with the following differences:

  • They require a parameter between 0 and 1 (inclusive). A parameter specified out of this range will result in error. This parameter should be specified as an expression that evaluates to a constant.

  • They require a sort specification. This sort specification is an ORDER BY clause with a single expression. Multiple expressions are not allowed.

Normal Aggregate Syntax

[PERCENTILE_CONT | PERCENTILE_DISC]( constant expression ) WITHIN GROUP ( ORDER BY single order by expression [ASC|DESC] [NULLS FIRST| NULLS LAST])

Inverse Percentile Example Basis

We use the following query to return the 23 rows of data used in the examples of this section:

SELECT cust_id, cust_credit_limit,
       CUME_DIST() OVER (ORDER BY cust_credit_limit) AS CUME_DIST
FROM customers
WHERE cust_city='Marshal';

   CUST_ID CUST_CREDIT_LIMIT  CUME_DIST
---------- ----------------- ----------
     28344              1500 .173913043
      8962              1500 .173913043
     36651              1500 .173913043
     32497              1500 .173913043
     15192              3000 .347826087
    102077              3000 .347826087
    102343              3000 .347826087
      8270              3000 .347826087
     21380              5000  .52173913
     13808              5000  .52173913
    101784              5000  .52173913
     30420              5000  .52173913
     10346              7000 .652173913
     31112              7000 .652173913
     35266              7000 .652173913
      3424              9000 .739130435
    100977              9000 .739130435
    103066             10000 .782608696
     35225             11000 .956521739
     14459             11000 .956521739
     17268             11000 .956521739
    100421             11000 .956521739
     41496             15000          1

PERCENTILE_DISC(x) is computed by scanning up the CUME_DIST values in each group until you find the first one greater than or equal to x, where x is the specified percentile value. For the example query where PERCENTILE_DISC(0.5), the result is 5,000, as the following illustrates:

SELECT PERCENTILE_DISC(0.5) WITHIN GROUP
         (ORDER BY cust_credit_limit) AS perc_disc,
       PERCENTILE_CONT(0.5) WITHIN GROUP
         (ORDER BY cust_credit_limit) AS perc_cont
FROM customers
WHERE cust_city='Marshal';

PERC_DISC PERC_CONT
--------- ---------
     5000      5000

The result of PERCENTILE_CONT is computed by linear interpolation between rows after ordering them. To compute PERCENTILE_CONT(x), we first compute the row number RN = (1 + x*(n-1)), where n is the number of rows in the group and x is the specified percentile value. The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEIL(RN) and FRN = FLOOR(RN).

The final result will be: PERCENTILE_CONT(x) = if (CRN = FRN = RN), then (value of expression from row at RN), else (CRN - RN) * (value of expression for row at FRN) + (RN - FRN) * (value of expression for row at CRN).

Consider the previous example query, where we compute PERCENTILE_CONT(0.5). Here n is 23, the number of rows in the output above. The row number RN = (1 + 0.5*(n-1)) = 12. Putting this into the formula (FRN = CRN = 12), we return the value from row 12, which is 5,000, as the result.

As another example, suppose you want to compute PERCENTILE_CONT(0.66). The computed row number RN = (1 + 0.66*(n-1)) = (1 + 0.66*22) = 15.52. PERCENTILE_CONT(0.66) = (16-15.52)*(value of row 15) + (15.52-15)*(value of row 16) = 0.48*7,000 + 0.52*9,000 = 8,040. These results are:

SELECT PERCENTILE_DISC(0.66) WITHIN GROUP
         (ORDER BY cust_credit_limit) AS perc_disc,
       PERCENTILE_CONT(0.66) WITHIN GROUP
         (ORDER BY cust_credit_limit) AS perc_cont
FROM customers
WHERE cust_city='Marshal';

 PERC_DISC  PERC_CONT
---------- ----------
      9000       8040
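Both calculations can be replayed over the 23 credit limits listed in the basis query output. The Python below is a sketch of the two rules as described in this section, not Oracle's implementation; percentile_disc and percentile_cont are names chosen for the sketch.

```python
from math import ceil, floor

# The 23 credit limits for cust_city = 'Marshal', from the basis query output.
limits = ([1500] * 4 + [3000] * 4 + [5000] * 4 + [7000] * 3
          + [9000] * 2 + [10000] + [11000] * 4 + [15000])

def percentile_disc(values, x):
    """First value, in sorted order, whose cumulative distribution is >= x."""
    ordered = sorted(values)
    n = len(ordered)
    for v in ordered:
        cume_dist = sum(1 for w in ordered if w <= v) / n
        if cume_dist >= x:
            return v

def percentile_cont(values, x):
    """Linear interpolation at row number RN = 1 + x * (n - 1), 1-based."""
    ordered = sorted(values)
    rn = 1 + x * (len(ordered) - 1)
    frn, crn = floor(rn), ceil(rn)
    if frn == crn:
        return ordered[frn - 1]
    return (crn - rn) * ordered[frn - 1] + (rn - frn) * ordered[crn - 1]

print(percentile_disc(limits, 0.5), percentile_cont(limits, 0.5))
print(percentile_disc(limits, 0.66), percentile_cont(limits, 0.66))
```

For x = 0.66 the row number is 15.52, so the result interpolates between row 15 (7,000) and row 16 (9,000). This is consistent with the PERC_CONT value of 8040 in the output.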

Inverse percentile aggregate functions can appear in the HAVING clause of a query like other existing aggregate functions.

As Reporting Aggregates

You can also use the aggregate functions PERCENTILE_CONT and PERCENTILE_DISC as reporting aggregate functions. When used as reporting aggregate functions, the syntax is similar to that of other reporting aggregates.

[PERCENTILE_CONT | PERCENTILE_DISC](constant expression) WITHIN GROUP ( ORDER BY single order by expression [ASC|DESC] [NULLS FIRST| NULLS LAST]) OVER ( [PARTITION BY value expression [,...]] )

This query computes the same thing (the median credit limit for customers in this result set), but reports the result for every row in the result set, as shown in the following output:

SELECT cust_id, cust_credit_limit,
       PERCENTILE_DISC(0.5) WITHIN GROUP
         (ORDER BY cust_credit_limit) OVER () AS perc_disc,
       PERCENTILE_CONT(0.5) WITHIN GROUP
         (ORDER BY cust_credit_limit) OVER () AS perc_cont
FROM customers
WHERE cust_city='Marshal';

   CUST_ID CUST_CREDIT_LIMIT  PERC_DISC  PERC_CONT
---------- ----------------- ---------- ----------
     28344              1500       5000       5000
      8962              1500       5000       5000
     36651              1500       5000       5000
     32497              1500       5000       5000
     15192              3000       5000       5000
    102077              3000       5000       5000
    102343              3000       5000       5000
      8270              3000       5000       5000
     21380              5000       5000       5000
     13808              5000       5000       5000
    101784              5000       5000       5000
     30420              5000       5000       5000
     10346              7000       5000       5000
     31112              7000       5000       5000
     35266              7000       5000       5000
      3424              9000       5000       5000
    100977              9000       5000       5000
    103066             10000       5000       5000
     35225             11000       5000       5000
     14459             11000       5000       5000
     17268             11000       5000       5000
    100421             11000       5000       5000
     41496             15000       5000       5000

Inverse Percentile Restrictions

For PERCENTILE_DISC, the expression in the ORDER BY clause can be of any data type that you can sort (numeric, string, date, and so on). For PERCENTILE_CONT, however, the expression in the ORDER BY clause must be a numeric or datetime type (including intervals) because linear interpolation is used to evaluate PERCENTILE_CONT. If the expression is of type DATE, the interpolated result is rounded to the smallest unit for the type. For a DATE type, the interpolated value will be rounded to the nearest second; for interval types, it will be rounded to the nearest second (INTERVAL DAY TO SECOND) or to the month (INTERVAL YEAR TO MONTH).

Like other aggregates, the inverse percentile functions ignore NULLs in evaluating the result. For example, when you want to find the median value in a set, Oracle Database ignores the NULLs and finds the median among the non-null values. You can use the NULLS FIRST/NULLS LAST option in the ORDER BY clause, but it will have no effect because NULLs are ignored.

Hypothetical Rank and Distribution Functions

These functions provide functionality useful for what-if analysis. As an example, what would be the rank of a row, if the row was hypothetically inserted into a set of other rows?

This family of aggregates takes one or more arguments of a hypothetical row and an ordered group of rows, returning the RANK, DENSE_RANK, PERCENT_RANK, or CUME_DIST of the row as if it was hypothetically inserted into the group.

Hypothetical Rank and Distribution Syntax

[RANK | DENSE_RANK | PERCENT_RANK | CUME_DIST]( constant expression [, ...] ) WITHIN GROUP ( ORDER BY order by expression [ASC|DESC] [NULLS FIRST|NULLS LAST][, ...] )

Here, constant expression refers to an expression that evaluates to a constant, and more than one such expression may be passed as arguments to the function. The ORDER BY clause can contain one or more expressions that define the sorting order on which the ranking will be based. The ASC, DESC, NULLS FIRST, and NULLS LAST options are available for each expression in the ORDER BY clause.

Example 21-17 Hypothetical Rank and Distribution Example 1

Using the credit limit data from the customers table, you can calculate the RANK, PERCENT_RANK, and CUME_DIST of a hypothetical customer with a credit limit of 6,000, showing how that customer would fit within each city whose name begins with "Fo". The query and results are:

SELECT cust_city,
       RANK(6000) WITHIN GROUP
         (ORDER BY CUST_CREDIT_LIMIT DESC) AS HRANK,
       TO_CHAR(PERCENT_RANK(6000) WITHIN GROUP
         (ORDER BY cust_credit_limit),'9.999') AS HPERC_RANK,
       TO_CHAR(CUME_DIST (6000) WITHIN GROUP
         (ORDER BY cust_credit_limit),'9.999') AS HCUME_DIST
FROM customers
WHERE cust_city LIKE 'Fo%'
GROUP BY cust_city;

CUST_CITY                           HRANK HPERC_ HCUME_
------------------------------ ---------- ------ ------
Fondettes                              13   .455   .478
Fords Prairie                          18   .320   .346
Forest City                            47   .370   .378
Forest Heights                         38   .456   .464
Forestville                            58   .412   .418
Forrestcity                            51   .438   .444
Fort Klamath                           59   .356   .363
Fort William                           30   .500   .508
Foxborough                             52   .414   .420

Unlike the inverse percentile aggregates, the ORDER BY clause in the sort specification for hypothetical rank and distribution functions may take multiple expressions. The number of arguments must match the number of expressions in the ORDER BY clause, and each argument must be a constant expression of the same type as, or a type compatible with, the corresponding ORDER BY expression. The following is an example using two arguments in several hypothetical ranking functions.

Example 21-18 Hypothetical Rank and Distribution Example 2

SELECT prod_subcategory,
       RANK(10,8) WITHIN GROUP
         (ORDER BY prod_list_price DESC, prod_min_price) AS HRANK,
       TO_CHAR(PERCENT_RANK(10,8) WITHIN GROUP
         (ORDER BY prod_list_price, prod_min_price),'9.999') AS HPERC_RANK,
       TO_CHAR(CUME_DIST (10,8) WITHIN GROUP
         (ORDER BY prod_list_price, prod_min_price),'9.999') AS HCUME_DIST
FROM products
WHERE prod_subcategory LIKE 'Recordable%'
GROUP BY prod_subcategory;

PROD_SUBCATEGORY     HRANK HPERC_ HCUME_
-------------------- ----- ------ ------
Recordable CDs           4   .571   .625
Recordable DVD Discs     5   .200   .333

These functions can appear in the HAVING clause of a query just like other aggregate functions. They cannot be used as either reporting aggregate functions or windowing aggregate functions.
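The "as if inserted" idea can be made concrete for the single-argument case. The Python sketch below is not Oracle code: it assumes the usual definitions (rank is one plus the number of rows sorting ahead of the hypothetical value, PERCENT_RANK divides rank minus one by the number of existing rows, and CUME_DIST counts the hypothetical row in both numerator and denominator). It reuses the 23 Marshal credit limits from the inverse percentile section as sample data.

```python
# The 23 credit limits for cust_city = 'Marshal' shown earlier in this chapter.
limits = ([1500] * 4 + [3000] * 4 + [5000] * 4 + [7000] * 3
          + [9000] * 2 + [10000] + [11000] * 4 + [15000])

def hypothetical_rank_desc(values, v):
    """Rank of v in descending order: one more than the rows that beat it."""
    return 1 + sum(1 for x in values if x > v)

def hypothetical_percent_rank(values, v):
    """(ascending rank - 1) / number of existing rows in the group."""
    return sum(1 for x in values if x < v) / len(values)

def hypothetical_cume_dist(values, v):
    """Rows at or below v, counting the hypothetical row itself, over n + 1."""
    return (sum(1 for x in values if x <= v) + 1) / (len(values) + 1)

hrank = hypothetical_rank_desc(limits, 6000)
hperc_rank = hypothetical_percent_rank(limits, 6000)
hcume_dist = hypothetical_cume_dist(limits, 6000)
print(hrank, round(hperc_rank, 3), round(hcume_dist, 3))
```

The same definitions are consistent with the Fondettes row of Example 21-17: a group of 22 customers with 12 above and 10 below 6,000 gives a rank of 13, a percent rank of 10/22 = .455, and a cumulative distribution of 11/23 = .478.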

Linear Regression Functions

The regression functions support the fitting of an ordinary-least-squares regression line to a set of number pairs. You can use them as aggregate functions, windowing functions, or reporting functions.

The functions are REGR_COUNT, REGR_AVGX, REGR_AVGY, REGR_SLOPE, REGR_INTERCEPT, REGR_R2, REGR_SXX, REGR_SYY, and REGR_SXY.

Oracle applies the function to the set of (e1, e2) pairs after eliminating all pairs for which either of e1 or e2 is null. e1 is interpreted as a value of the dependent variable (a "y value"), and e2 is interpreted as a value of the independent variable (an "x value"). Both expressions must be numbers.

The regression functions are all computed simultaneously during a single pass through the data. They are frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.

REGR_COUNT Function

REGR_COUNT returns the number of non-null number pairs used to fit the regression line. If applied to an empty set (or if there are no (e1, e2) pairs where neither of e1 or e2 is null), the function returns 0.

REGR_AVGY and REGR_AVGX Functions

REGR_AVGY and REGR_AVGX compute the averages of the dependent variable and the independent variable of the regression line, respectively. REGR_AVGY computes the average of its first argument (e1) after eliminating (e1, e2) pairs where either of e1 or e2 is null. Similarly, REGR_AVGX computes the average of its second argument (e2) after null elimination. Both functions return NULL if applied to an empty set.

REGR_SLOPE and REGR_INTERCEPT Functions

The REGR_SLOPE function computes the slope of the regression line fitted to non-null (e1, e2) pairs.

The REGR_INTERCEPT function computes the y-intercept of the regression line. REGR_INTERCEPT returns NULL whenever slope or the regression averages are NULL.

REGR_R2 Function

The REGR_R2 function computes the coefficient of determination (usually called "R-squared" or "goodness of fit") for the regression line.

REGR_R2 returns values between 0 and 1 when the regression line is defined (slope of the line is not null), and it returns NULL otherwise. The closer the value is to 1, the better the regression line fits the data.

REGR_SXX, REGR_SYY, and REGR_SXY Functions

The REGR_SXX, REGR_SYY, and REGR_SXY functions are used in computing various diagnostic statistics for regression analysis. After eliminating (e1, e2) pairs where either of e1 or e2 is null, these functions make the following computations:

REGR_SXX: REGR_COUNT(e1,e2) * VAR_POP(e2)
REGR_SYY: REGR_COUNT(e1,e2) * VAR_POP(e1)
REGR_SXY: REGR_COUNT(e1,e2) * COVAR_POP(e1, e2)
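The quantities described in the preceding subsections can be computed together in one pass over the pairs. The Python below is a sketch of that arithmetic, not of Oracle's internals; regr_stats is an invented name, None stands in for NULL, and the sample pairs are made up. The R-squared branch for a constant dependent variable is an assumption of the sketch, chosen to agree with the RSQR value of 1 in the sample output later in this section.

```python
def regr_stats(pairs):
    """Least-squares statistics for (y, x) pairs, skipping pairs with a None."""
    clean = [(y, x) for y, x in pairs if y is not None and x is not None]
    n = len(clean)
    if n == 0:
        return {"count": 0}
    avg_y = sum(y for y, _ in clean) / n
    avg_x = sum(x for _, x in clean) / n
    # n * VAR_POP and n * COVAR_POP reduce to plain sums of squared deviations.
    sxx = sum((x - avg_x) ** 2 for _, x in clean)
    syy = sum((y - avg_y) ** 2 for y, _ in clean)
    sxy = sum((x - avg_x) * (y - avg_y) for y, x in clean)
    slope = sxy / sxx if sxx != 0 else None
    intercept = avg_y - slope * avg_x if slope is not None else None
    if sxx == 0:
        r2 = None            # the line is undefined when x never varies
    elif syy == 0:
        r2 = 1.0             # a constant y is fitted exactly by a flat line
    else:
        r2 = sxy ** 2 / (sxx * syy)
    return {"count": n, "avg_x": avg_x, "avg_y": avg_y, "sxx": sxx,
            "syy": syy, "sxy": sxy, "slope": slope,
            "intercept": intercept, "r2": r2}

# Invented (y, x) pairs; the pair containing None is dropped before fitting.
stats = regr_stats([(2, 1), (4, 2), (None, 3), (6, 3)])
print(stats["count"], stats["slope"], stats["intercept"], stats["r2"])
```

With the three surviving pairs lying exactly on y = 2x, the sketch reports a slope of 2, an intercept of 0, and an R-squared of 1.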

Linear Regression Statistics Examples

Some common diagnostic statistics that accompany linear regression analysis are given in Table 21-2, "Common Diagnostic Statistics and Their Expressions ". Note that this release's new functions allow you to calculate all of these.

Table 21-2 Common Diagnostic Statistics and Their Expressions 

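Type of Statistic             Expression
----------------------------  ------------------------------------------------------------
Adjusted R2                   1 - ((1 - REGR_R2) * ((REGR_COUNT - 1) / (REGR_COUNT - 2)))
Standard error                SQRT((REGR_SYY - (POWER(REGR_SXY,2) / REGR_SXX)) / (REGR_COUNT - 2))
Total sum of squares          REGR_SYY
Regression sum of squares     POWER(REGR_SXY,2) / REGR_SXX
Residual sum of squares       REGR_SYY - (POWER(REGR_SXY,2) / REGR_SXX)
t statistic for slope         REGR_SLOPE * SQRT(REGR_SXX) / (Standard error)
t statistic for y-intercept   REGR_INTERCEPT / ((Standard error) * SQRT((1 / REGR_COUNT) + (POWER(REGR_AVGX,2) / REGR_SXX)))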
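As a cross-check of the diagnostic statistics named in Table 21-2, the Python sketch below computes them with the standard textbook formulas for simple linear regression. Those formulas are an assumption of this sketch, stated in terms of the sums of squares that REGR_SXX, REGR_SYY, and REGR_SXY return. The data points and the function name regression_diagnostics are invented.

```python
from math import sqrt

def regression_diagnostics(xs, ys):
    """Common diagnostics for a one-variable least-squares fit (needs n > 2)."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - avg_x) ** 2 for x in xs)
    syy = sum((y - avg_y) ** 2 for y in ys)
    sxy = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = avg_y - slope * avg_x
    r2 = sxy ** 2 / (sxx * syy)
    regression_ss = sxy ** 2 / sxx
    residual_ss = syy - regression_ss
    std_error = sqrt(residual_ss / (n - 2))
    return {
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - 2),
        "std_error": std_error,
        "total_ss": syy,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "t_slope": slope * sqrt(sxx) / std_error,
        "t_intercept": intercept / (std_error * sqrt(1 / n + avg_x ** 2 / sxx)),
    }

# Invented sample: five (x, y) points with some scatter around a rising line.
diag = regression_diagnostics([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
print(round(diag["adjusted_r2"], 2), round(diag["std_error"], 4))
```

The total sum of squares always equals the regression sum of squares plus the residual sum of squares, which is a quick sanity check on any implementation of these expressions.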


Sample Linear Regression Calculation

In this example, we compute an ordinary-least-squares regression line that expresses the quantity sold of a product as a linear function of the product's list price. The calculations are grouped by sales channel. The values SLOPE, INTCPT, and RSQR are the slope, intercept, and coefficient of determination of the regression line, respectively. The (integer) value COUNT is the number of products in each channel for which both quantity sold and list price data are available.

SELECT s.channel_id,
       REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE,
       REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT,
       REGR_R2(s.quantity_sold, p.prod_list_price) RSQR,
       REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT,
       REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP,
       REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD
FROM sales s, products p
WHERE s.prod_id=p.prod_id
  AND p.prod_category='Electronics'
  AND s.time_id=TO_DATE('10-OCT-2000')
GROUP BY s.channel_id;

CHANNEL_ID      SLOPE     INTCPT       RSQR      COUNT   AVGLISTP   AVGQSOLD
---------- ---------- ---------- ---------- ---------- ---------- ----------
         2          0          1          1         39 466.656667          1
         3          0          1          1         60     459.99          1
         4          0          1          1         19 526.305789          1

Linear Algebra

Linear algebra is a branch of mathematics with a wide range of practical applications. Many areas have tasks that can be expressed using linear algebra, and here are some examples from several fields: statistics (multiple linear regression and principal components analysis), data mining (clustering and classification), bioinformatics (analysis of microarray data), operations research (supply chain and other optimization problems), econometrics (analysis of consumer demand data), and finance (asset allocation problems). Various libraries for linear algebra are freely available for anyone to use. Oracle's UTL_NLA package exposes matrix PL/SQL data types and wrapper PL/SQL subprograms for two of the most popular and robust of these libraries, BLAS and LAPACK.

Linear algebra depends on matrix manipulation. Performing matrix manipulation in PL/SQL in the past required inventing a matrix representation based on PL/SQL's native data types and then writing matrix manipulation routines from scratch. This required substantial programming effort and the performance of the resulting implementation was limited. If developers chose to send data to external packages for processing rather than create their own routines, data transfer back and forth could be time consuming. Using the UTL_NLA package lets data stay within Oracle, removes the programming effort, and delivers a fast implementation.

Example 21-19 Linear Algebra

Here is an example of how Oracle's linear algebra support could be used for business analysis. It invokes a multiple linear regression application built using the package. The multiple regression application is implemented in an object called . Note that sample files for the OLS Regression object can be found in .

Consider the scenario of a retailer analyzing the effectiveness of its marketing program. Each of its stores allocates its marketing budget over the following possible programs: media advertisements (media), promotions (promo), discount coupons (disct), and direct mailers (dmail). The regression analysis builds a linear relationship between the amount of sales that an average store has in a given year (sales) and the spending on the four components of the marketing program. Suppose that the marketing data is stored in the following table:

CREATE TABLE sales_marketing_data (
  /* Store information */
  store_no NUMBER,
  year     NUMBER,
  /* Sales revenue (in dollars) */
  sales    NUMBER,  /* sales amount */
  /* Marketing expenses (in dollars) */
  media    NUMBER,  /* media advertisements */
  promo    NUMBER,  /* promotions */
  disct    NUMBER,  /* discount coupons */
  dmail    NUMBER   /* direct mailers */
);

Then you can build the following sales-marketing linear model using coefficients:

Sales Revenue = a + b × Media Advertisements + c × Promotions + d × Discount Coupons + e × Direct Mailers

This model can be implemented as the following view, which refers to the OLS regression object:

CREATE OR REPLACE VIEW sales_marketing_model (year, ols) AS
SELECT year,
       OLS_Regression(
         /* mean_y => */
         AVG(sales),
         /* variance_y => */
         var_pop(sales),
         /* MV mean vector => */
         UTL_NLA_ARRAY_DBL (AVG(media), AVG(promo),
                            AVG(disct), AVG(dmail)),
         /* VCM variance covariance matrix => */
         UTL_NLA_ARRAY_DBL (var_pop(media), covar_pop(media,promo),
                            covar_pop(media,disct), covar_pop(media,dmail),
                            var_pop(promo), covar_pop(promo,disct),
                            covar_pop(promo,dmail), var_pop(disct),
                            covar_pop(disct,dmail), var_pop(dmail)),
         /* CV covariance vector => */
         UTL_NLA_ARRAY_DBL (covar_pop(sales,media), covar_pop(sales,promo),
                            covar_pop(sales,disct), covar_pop(sales,dmail)))
FROM sales_marketing_data
GROUP BY year;
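To see why these five inputs are enough for a regression, it may help to write out the textbook least-squares solution in terms of moments. The symbols below are introduced here for illustration: x is the vector (media, promo, disct, dmail), y is sales, and the other quantities map onto the arguments named in the view's comments. Whether OLS_Regression computes its results in exactly this way is an assumption; the source does not show the object's implementation.

```latex
% Notation (assumed for this sketch):
%   \bar{y}        = mean_y      = AVG(sales)
%   \sigma_y^2     = variance_y  = VAR_POP(sales)
%   \mu_x          = MV          (4-element mean vector)
%   \Sigma_{xx}    = VCM         (symmetric 4x4 variance-covariance matrix)
%   \sigma_{xy}    = CV          (4-element covariance vector with sales)
\begin{aligned}
(b, c, d, e)^{\mathsf T} &= \Sigma_{xx}^{-1}\,\sigma_{xy} \\
a &= \bar{y} - \mu_x^{\mathsf T}\,\Sigma_{xx}^{-1}\,\sigma_{xy} \\
R^2 &= \frac{\sigma_{xy}^{\mathsf T}\,\Sigma_{xx}^{-1}\,\sigma_{xy}}{\sigma_y^2}
\end{aligned}
```

Because the variance-covariance matrix is symmetric, the view passes only its ten upper-triangle entries, listed row by row: four for media, three for promo, two for disct, and one for dmail.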

Using this view, a marketing program manager can perform an analysis such as "Is this sales-marketing model reasonable for year 2004 data? That is, is the multiple-correlation greater than some acceptable value, say, 0.9?" The SQL for such a query might be as follows:

SELECT model.ols.getCorrelation(1) AS "Applicability of Linear Model"
FROM sales_marketing_model model
WHERE year = 2004;

You could also solve questions such as "What is the expected base-line sales revenue of a store without any marketing programs in 2003?" or "Which component of the marketing program was the most effective in 2004? That is, a dollar increase in which program produced the greatest expected increase in sales?"
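As a rough cross-check on questions like these, the built-in single-predictor regression aggregates can be run directly against the base table. This is a simplified sketch, not the OLS_Regression approach: each slope below comes from regressing sales on one program at a time, so it ignores any correlation between the programs and will generally differ from the coefficients of the multiple regression model.

```sql
-- One-predictor-at-a-time slopes for 2004 (a simplification of the
-- multiple regression model; not a substitute for it).
-- In the REGR_* functions the first argument is the dependent variable.
SELECT REGR_SLOPE(sales, media) AS media_slope,
       REGR_SLOPE(sales, promo) AS promo_slope,
       REGR_SLOPE(sales, disct) AS disct_slope,
       REGR_SLOPE(sales, dmail) AS dmail_slope
FROM   sales_marketing_data
WHERE  year = 2004;
```

The largest slope suggests which program, taken in isolation, is associated with the greatest increase in sales per dollar spent.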

See Oracle Database PL/SQL Packages and Types Reference for further information regarding the use of the package and linear algebra.

Frequent Itemsets

Instead of counting how often a given event occurs (for example, how often someone has purchased milk at the grocery), you may find it useful to count how often multiple events occur together (for example, how often someone has purchased both milk and cereal together at the grocery store). You can count these multiple events using what is called a frequent itemset, which is, as the name implies, a set of items. Some examples of itemsets could be all of the products that a given customer purchased in a single trip to the grocery store (commonly called a market basket), the web pages that a user accessed in a single session, or the financial services that a given customer utilizes.

The practical motivation for using a frequent itemset is to find those itemsets that occur most often. If you analyze a grocery store's point-of-sale data, you might, for example, discover that milk and bananas are the most commonly bought pair of items. Frequent itemsets have thus been used in business intelligence environments for many years, most commonly for market basket analysis in the retail industry. Frequent itemset calculations are integrated with the database, operating on top of relational tables and accessed through SQL. This integration provides the following key benefits:

  • Applications that previously relied on frequent itemset operations now benefit from significantly improved performance as well as simpler implementation.

  • SQL-based applications that did not previously use frequent itemsets can now be easily extended to take advantage of this functionality.

Frequent itemset analysis is performed with a PL/SQL package. See Oracle Database PL/SQL Packages and Types Reference for more information. In addition, there is an example of frequent itemset usage in "Frequent itemsets".
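The idea of counting items that occur together can be illustrated in plain SQL with a self-join. This is only a sketch of the concept for two-item sets, not the PL/SQL package mentioned above. The table basket_items and its columns basket_id and item are hypothetical names introduced for the example.

```sql
-- Hypothetical table: basket_items(basket_id, item), one row per item
-- bought in a basket. Counts how many baskets contain each pair of items.
SELECT a.item                      AS item_1,
       b.item                      AS item_2,
       COUNT(DISTINCT a.basket_id) AS basket_count
FROM   basket_items a
JOIN   basket_items b
  ON   a.basket_id = b.basket_id
 AND   a.item < b.item             -- keep each pair once, skip self-pairs
GROUP  BY a.item, b.item
ORDER  BY basket_count DESC;
```

The condition a.item < b.item ensures that milk and bananas are reported as one pair rather than two. Extending this approach to larger itemsets requires one more self-join per additional item, which is one reason a dedicated facility is more practical.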

Other Statistical Functions

Oracle introduces a set of SQL statistical functions and a statistics package. This section lists some of the new functions along with basic syntax.

See Oracle Database PL/SQL Packages and Types Reference for detailed information about the package and Oracle Database SQL Reference for syntax and semantics.

Descriptive Statistics

You can calculate the following descriptive statistics:

  • Median of a Data Set

    MEDIAN (expr) [OVER (query_partition_clause)]

  • Mode of a Data Set

    STATS_MODE (expr)
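As a short illustration, both functions can be used as ordinary aggregates. The sketch below assumes the same sales table, with channel_id and quantity_sold columns, that is used in the regression example earlier in this document.

```sql
-- Median and most frequent quantity sold, per sales channel.
SELECT channel_id,
       MEDIAN(quantity_sold)     AS median_qty,
       STATS_MODE(quantity_sold) AS mode_qty
FROM   sales
GROUP  BY channel_id
ORDER  BY channel_id;

-- MEDIAN as an analytic function: each row keeps its own detail
-- and also carries the median of its channel.
SELECT channel_id,
       quantity_sold,
       MEDIAN(quantity_sold) OVER (PARTITION BY channel_id) AS channel_median
FROM   sales;
```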

Hypothesis Testing - Parametric Tests

You can perform the following parametric tests:

  • One-Sample T-Test

    STATS_T_TEST_ONE (expr1, expr2 (a constant) [, return_value])
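A minimal sketch of the one-sample test follows. It assumes the products table and prod_list_price column used in the regression example earlier in this document. The hypothesized mean of 60 is an arbitrary value chosen for illustration.

```sql
-- Tests whether the mean list price differs from a hypothesized mean of 60.
-- 'STATISTIC' returns the observed t value; when the third argument is
-- omitted, the function returns the two-sided significance.
SELECT AVG(prod_list_price)                               AS group_mean,
       STATS_T_TEST_ONE(prod_list_price, 60, 'STATISTIC') AS t_observed,
       STATS_T_TEST_ONE(prod_list_price, 60)              AS two_sided_p_value
FROM   products;
```

A small two-sided significance value suggests that the sample mean differs from the hypothesized mean by more than chance alone would explain.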
