PostgreSQL 9.5: what's new? Part 2. TABLESAMPLE
We continue our review of the new features in PostgreSQL 9.5.
Part 1. INSERT ... ON CONFLICT DO NOTHING/UPDATE and ROW LEVEL SECURITY.
Part 3. GROUPING SETS, CUBE, ROLLUP
My apologies for the delay in publishing the second part. I originally planned to release it a week after the first, but a heavy workload got in the way. So I have decided to publish smaller portions more often, rather than large articles.
Sometimes you face a task where you need to select a number of random records from a table. People used to write sophisticated queries for this (getting truly random data takes a lot of sweat). With the release of PostgreSQL 9.5 this task becomes easier.
Using the TABLESAMPLE keyword, you can select not all of the data in a table but only a part of it, a sample.
The syntax looks like this:
SELECT ... FROM TABLE_NAME ... TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
sampling_method is the sampling method to use; PostgreSQL 9.5 ships with two of them, SYSTEM and BERNOULLI. As an argument they take a floating-point number (or any valid expression that evaluates to a number), which is interpreted as the percentage of the table to sample, from 0 to 100.
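The argument doesn't have to be a numeric literal. A minimal sketch, assuming a hypothetical table some_table (the expression is evaluated once for the query):
SELECT * FROM some_table TABLESAMPLE SYSTEM (10.0 / 2)  -- a 5% sample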
Let's look at some examples of how sampling works in PostgreSQL 9.5.
Suppose we have a table transactions that stores a transaction id, the transaction amount, and the timestamp at which the transaction completed. Let's insert 100,000 records into it:
CREATE TABLE transactions (
id SERIAL PRIMARY KEY,
amount NUMERIC(15,2),
ending_time TIMESTAMP
);
INSERT INTO transactions (amount, ending_time)
SELECT
(round(CAST(random() * 100000 AS NUMERIC), 2)),
now() - random() * CAST('1 day' AS INTERVAL)
FROM generate_series(1, 100000);
Let's try to take a sample of 0.1% of the original table (100 records):
SELECT * FROM transactions TABLESAMPLE SYSTEM (0.1)
Total query runtime: 213 ms.
157 rows retrieved.
Why did we get 157 records rather than 100? The reason is that PostgreSQL stores table data in pages of 8 KB each (by default; this parameter can be changed when building the server from source), and the SYSTEM sampling method simply takes the appropriate number of random pages for the given percentage and returns them "as is". In our case one page holds 157 records. If you request a sample twice as large, the data will be taken from 2 pages:
SELECT * FROM transactions TABLESAMPLE SYSTEM (0.2)
Total query runtime: 21 ms.
314 rows retrieved.
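You can check this page arithmetic yourself. A sketch (relpages and reltuples in pg_class are estimates maintained by VACUUM/ANALYZE, so run ANALYZE transactions first to make them fresh):
SHOW block_size;  -- page size in bytes, 8192 by default
SELECT relpages,                          -- pages occupied by the table
       reltuples,                         -- estimated total rows
       reltuples / relpages AS rows_per_page
FROM pg_class
WHERE relname = 'transactions';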
Keep in mind that different pages can hold different numbers of records, so the number of rows returned may vary from query to query.
To get an exact number of records you can use LIMIT, but bear in mind that you will almost certainly get all those records from a single page. So if the values in a record depend on the order in which the records were inserted, or are chronological in nature (as ending_time is in our case), such samples are likely to give you meaningless results. A minimal sketch of the LIMIT variant:
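SELECT * FROM transactions TABLESAMPLE SYSTEM (0.1) LIMIT 100  -- likely 100 rows from one page
For example, if we use a sample to find the maximum date on which a transaction took place, running the same query several times gives completely different results: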
SELECT MAX(ending_time) FROM transactions TABLESAMPLE SYSTEM(0.1)
max
---------------------------
2014-11-08 22:30:32.720855
SELECT MAX(ending_time) FROM transactions TABLESAMPLE SYSTEM(0.1)
max
---------------------------
2014-12-02 11:42:32.720855
SELECT MAX(ending_time) FROM transactions TABLESAMPLE SYSTEM(0.1)
max
---------------------------
2014-10-21 09:40:32.720855
While the real value is:
SELECT MAX(ending_time) FROM transactions
max
---------------------------
2014-12-07 04:04:32.720855
To get a more evenly distributed sample, you can use the BERNOULLI method, which scans the entire table (in effect, "flipping a coin" for every record) and selects random records:
SELECT MAX(ending_time) FROM transactions TABLESAMPLE BERNOULLI(0.1)
max
---------------------------
2014-12-07 00:06:32.720855
Now let's look at performance. We'll compute the average transaction amount in three different ways and compare the results:
1) Without sampling:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions;
"Aggregate (cost=1887.01..1887.00 rows=1 width=8) (actual time=25.795 25.795..rows=1 loops=1)"
" -> Seq Scan on transactions (cost=0.00..1637.00 rows=100000 width=8) (actual time=0.005..12.438 rows=100000 loops=1)"
"Planning time: 0.055 ms"
"Execution time: 25.816 ms"
SELECT AVG(amount) FROM transactions;
50028.8742828
2) Sampling with the SYSTEM method:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM(0.1)
"Aggregate (cost=1.25..1.26 rows=1 width=8) (actual time=0.088..0.088 rows=1 loops=1)"
" -> Sample Scan (system) on transactions (cost=0.00..1.00 rows=100 width=8) (actual time=0.017..0.048 rows=157 loops=1)"
"Planning time: 0.068 ms"
"Execution time: 0.120 ms"
SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM(0.1)
53628.223694267516
3) Sampling with the BERNOULLI method:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI(0.1)
"Aggregate (cost=638.25 638.26..rows=1 width=8) (actual time=2.847 2.847..rows=1 loops=1)"
"Planning time: 0.145 ms"
"Execution time: 2.872 ms"
SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI(0.1)
50285.863240740741
We can see that SYSTEM sampling is faster but less accurate, while BERNOULLI sampling is slower but more accurate. You can pick your own compromise between speed and accuracy. Also note that sampling uses a new scan type: Sample Scan.
Let's add more records to the table, bringing it up to 20 million:
INSERT INTO transactions (amount, ending_time)
SELECT
(round(CAST(random() * 100000 AS DECIMAL), 2)),
now() - INTERVAL '1 year' + (i * INTERVAL '1 minute' )
FROM generate_series(100001, 20000000) i;
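After a bulk load this big, it's worth refreshing the table statistics so the planner (and the pg_class numbers we peeked at earlier) stay accurate. A one-line sketch:
VACUUM ANALYZE transactions;  -- update relpages, reltuples and column statistics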
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions;
"Aggregate (cost=377372.70 377372.71..rows=1 width=8) (actual time=..4604.297 4604.297 rows=1 loops=1)"
" -> Seq Scan on transactions (cost=0.00..327375.96 rows=19998696 width=8) (actual time=0.027..2043.846 rows=20000000 loops=1)"
"Planning time: 0.063 ms"
"Execution time: 4604.325 ms"
SELECT AVG(amount) FROM transactions;
50002.888681451
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (0.1)
"Aggregate (cost=757.99..758.00 rows=1 width=8) (actual time=7.309 7.309..rows=1 loops=1)"
" -> Sample Scan (system) on transactions (cost=0.00..707.99 rows=19999 width=8) (actual time=0.057..4.588 rows=20096 loops=1)"
"Planning time: 0.073 ms"
"Execution time: 7.340 ms"
SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (0.1)
50323.198322551752
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1)
"Aggregate (cost=127638.99 127639.00..rows=1 width=8) (actual time=..751.831 751.832 rows=1 loops=1)"
" -> Sample Scan (bernoulli) on transactions (cost=0.00..127588.99 rows=19999 width=8) (actual time=0.260..747.682 rows=19899 loops=1)"
"Planning time: 0.055 ms"
"Execution time: 751.879 ms"
SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1)
50043.386386377336
We can see that as the number of records grows, BERNOULLI loses far more in performance. This is because it does, in effect, a full table scan, whereas SYSTEM returns just a few pages.
Now let's try increasing the sample percentage:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (1)
"Aggregate (cost=7591.84 7591.85..rows=1 width=8) (actual time=65.055 65.055..rows=1 loops=1)"
" -> Sample Scan (system) on transactions (cost=0.00..7091.87 rows=199987 width=8) (actual time=0.043..37.939 rows=200018 loops=1)"
"Planning time: 0.053 ms"
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (1)
"Aggregate (cost=129888.84 129888.85..rows=1 width=8) (actual time=..799.826 799.826 rows=1 loops=1)"
" -> Sample Scan (bernoulli) on transactions (cost=0.00..129388.87 rows=199987 width=8) (actual time=0.035..769.899 rows=199682 loops=1)"
"Planning time: 0.063 ms"
"Execution time: 799.859 ms"
As you can see, it is SYSTEM that loses more performance as the sample percentage grows. This is logical: BERNOULLI did a full scan before and still does, while SYSTEM now has to return 10 times as many pages.
In summary, at a small sample percentage the SYSTEM method is much faster than BERNOULLI, but it gives a less random sample of records. As the percentage increases, this advantage is lost.
With the optional REPEATABLE keyword we can set the seed for the random values. If two queries use the same sampling method, sample percentage, and seed, they will select the same sample:
SELECT MAX(amount) FROM transactions TABLESAMPLE SYSTEM (0.1) REPEATABLE (50)
99997.91
SELECT MAX(amount) FROM transactions TABLESAMPLE SYSTEM (0.1) REPEATABLE (300)
99999.15
SELECT MAX(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1) REPEATABLE (50)
99995.9
SELECT MAX(amount) FROM transactions TABLESAMPLE SYSTEM (0.1) REPEATABLE (50)
99997.91
As we saw above, if the REPEATABLE keyword is not specified, each sample will differ from the previous one.
It's worth noting that sampling happens BEFORE the WHERE condition is applied, so you can't use it to sample rows that match a condition. Here the sample is taken first, and only then is the WHERE condition applied to it; since the probability of hitting records with id < 100 in a 20,000,000-row table is very small, the sample turns out empty:
SELECT * FROM transactions TABLESAMPLE SYSTEM (1) WHERE id<100
Total query runtime: 31 ms.
0 rows retrieved.
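If you do need random rows that satisfy a condition, one workaround (a sketch that doesn't use TABLESAMPLE at all, since TABLESAMPLE applies to whole tables) is the classic ORDER BY random():
-- Filter first, then shuffle the matching rows and take 10.
-- This reads every matching row, so it's only cheap when the
-- WHERE clause is selective, as id < 100 is here.
SELECT * FROM transactions WHERE id < 100 ORDER BY random() LIMIT 10;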
SYSTEM and BERNOULLI are not the only possible sampling methods: if you wish, you can write your own. The documentation for this is here. Custom sampling methods may take more than one argument, or none at all, and they may also ignore the REPEATABLE keyword.
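In fact, PostgreSQL 9.5 ships one such method in contrib: the tsm_system_rows extension, whose SYSTEM_ROWS method takes the desired number of rows instead of a percentage (and does not accept REPEATABLE). A quick sketch:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM transactions TABLESAMPLE SYSTEM_ROWS(100);  -- up to exactly 100 rows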
That concludes my short story about sampling in PostgreSQL 9.5. Thank you for your attention!
P.S. You can estimate the number of records in a table using a sample :)
SELECT COUNT(*)*100.0 FROM transactions TABLESAMPLE SYSTEM (1);
20001800
P.P.S. Don't actually do what is written above; it's a joke and doesn't always work as it should.
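If you just need a cheap estimate of the row count, the planner statistics are the usual place to look. A sketch (reltuples is only as fresh as the last VACUUM/ANALYZE):
SELECT reltuples::BIGINT AS estimated_rows
FROM pg_class
WHERE relname = 'transactions';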
In the following parts: GROUPING SETS, ROLLUP, CUBE.