I am currently trying to solve a problem where I need to get a randomized sample from a data table using SQL in Vertica. I need to get a randomized sample without replacement of 60,000 rows over a span of 3 years (2021 to 2024) from a data table. Each row has an OrderID as well as an Item, Price, and Date column. If an OrderID is in the randomized sample, then all instances of that OrderID from 2021 to 2024 in the data table must also be in the randomized sample. I also need this to be as efficient as possible. Is it possible to do this in SQL?
The data looks approximately like this:
OrderID | Item | Price | Date |
---|---|---|---|
1234 | Apple | 0.99 | 2024-03-25 |
2573 | Tomato | 0.99 | 2023-01-16 |
1234 | Cereal | 6.99 | 2024-03-25 |
1234 | Milk | 7.99 | 2024-03-25 |
8473 | Salmon | 15.99 | 2021-11-19 |
7392 | Plum | 0.99 | 2022-05-02 |
I am currently trying to solve a problem where I need to get a randomized sample from a data table using SQL in Vertica. I need to get a randomized sample without replacement of 60,000 rows over a span of 3 years (2021 to 2024) from a data table. Each row has an OrderID as well as an Item, Price, and Date column. If an OrderID is in the randomized sample, then all instances of that OrderID from 2021 to 2024 in the data table must also be in the randomized sample. I also need this to be as efficient as possible. Is it possible to do this in SQL?
The data looks approximately like this:
OrderID | Item | Price | Date |
---|---|---|---|
1234 | Apple | 0.99 | 2024-03-25 |
2573 | Tomato | 0.99 | 2023-01-16 |
1234 | Cereal | 6.99 | 2024-03-25 |
1234 | Milk | 7.99 | 2024-03-25 |
8473 | Salmon | 15.99 | 2021-11-19 |
7392 | Plum | 0.99 | 2022-05-02 |
Using my example above, If OrderID 1234 is selected for the randomized sample, then I want all instances of OrderID 1234 to appear in the sample along with other OrderIDs, like this:
Sample
OrderID | Item | Price | Date |
---|---|---|---|
1234 | Apple | 0.99 | 2024-03-25 |
1234 | Cereal | 6.99 | 2024-03-25 |
1234 | Milk | 7.99 | 2024-03-25 |
7392 | Plum | 0.99 | 2022-05-02 |
I've looked into various SQL functions for getting a randomized sample with replacement efficiently, but I haven't been able to figure out how to also get it so that if an OrderID is pulled into the sample, then all instances (rows) with that OrderID from those 3 years are also pulled into the sample. This is so far outside my current skill level in SQL that I currently feel very stuck on how to tackle this. Any help would be greatly appreciated!
I've tried this bit of code because I found something similar elsewhere, but even that isn't working (and to be honest, I'm not even sure if it's in the kind of SQL that Vertica would understand) and is giving me this error:
SQL ERROR [4856][42601][Vertica]VJDBC ERROR: Syntax error at or near "."
SELECT t.*
FROM table t
WHERE t.OrderID in (SELECT top(60000) t2.OrderID
FROM (SELECT DISTINCT t2.OrderID
FROM table t2
) t2
ORDER BY newid()
);
You say you want 60000 sample rows, but for each order ID in the sample you want all related rows. This is a tad too vague, because these two rules can be contradictory.
To illustrate this: Say, you want 3 sample rows and there are only six rows in the table for order IDs 1, 1, 2, 2, 3, 3. You see you cannot pick three sample rows and still obey the rule to have all of each order's rows in the sample. You will either have to pick two orders with a total of six rows or one order with a total of two rows, you can't pick orders that get you a total of three sample rows.
And 2021 to 2024 are four years, not three. :-)
Anyway, one approach to get the desired sample is to sort the orders randomly and have a running total of their transactions in the given years. Then just take all orders with a running total up to the 60000. Thus you may end up with exactly the 60000 rows or you may get several rows less than that, depending on the picked orders and their numbers of transactions in the table.
WITH
transactions AS
(
SELECT *
FROM mytable
WHERE date >= DATE '2021-01-01'
AND date < DATE '2025-01-01'
),
orders AS
(
SELECT orderid, SUM(COUNT(*)) OVER (ORDER BY NEWID()) AS cnt
FROM transactions
GROUP BY orderid
)
SELECT orderid, item, price, date
FROM transactions
WHERE orderid IN (SELECT orderid FROM orders WHERE cnt <= 60000)
ORDER BY orderid, date;
Demo (in PostgreSQL for the lack of Vertica): https://dbfiddle.uk/aSlvN1bB In this demo there are twelve rows in the table and I want a sample of up to five rows. Sometimes the query results in two rows only, sometimes in four rows, depending on which orders get picked.
Sampling without replacement is the same as reordering your data set with a random number. So step 1 is to get a dataset of your Orderids with counts, then assign each a random number and calculate a running sum of the counts based on the random order. Then finally keep every row whose running sum is under 60,000 and join it back to your base tables.
Select t.*
From table t Inner Join (
Select * From (
Select *, random() as sorter,
sum(recs) Over (Order By sorter) as cumrecs
From (
Select orderid, count(*) as recs
From table
Where date Between '2021-01-01' And '2024-12-31'
Group By orderid
)
Where cumrecs < 60000
) ss On t.orderid=ss.orderid
If you want to include the first orderid that takes you over 60,000 then after the Order By sorter
part add the following windowing spec Rows Between Unbounded Preceding And 1 Preceding
.
Your '3 years' from 2021 to 2024 is actually 4 years, as noted elsewhere, so adjust your date range accordingly.
tId
for orders and count order items itemsQty
for each order.tId
See example
OrderID | Item | Price | Date |
---|---|---|---|
1234 | Apple | 0.99 | 2024-03-25 |
2573 | Tomato | 0.99 | 2023-01-16 |
1234 | Cereal | 6.99 | 2024-03-25 |
1234 | Milk | 7.99 | 2024-03-25 |
8473 | Salmon | 15.99 | 2021-11-19 |
7392 | Plum | 0.99 | 2022-05-02 |
select o.*, b.*
from(
select *
,sum(itemQty)over(order by tId)rSum
from(
select OrderId
,count(*) itemQty
,newid() tId
from orders
group by OrderId
)a
)b
inner join orders o on o.orderId=b.orderId
where (rSum-itemQty)<5
order by tId
OrderID | Item | Price | Date | OrderId | itemQty | tId | rSum |
---|---|---|---|---|---|---|---|
8473 | Salmon | 15.99 | 2021-11-19 | 8473 | 1 | c645e1c8-bfed-474f-be38-2252e049a1bc | 1 |
7392 | Plum | 0.99 | 2022-05-02 | 7392 | 1 | 3a39d431-e487-4296-8c14-242003268302 | 2 |
1234 | Apple | 0.99 | 2024-03-25 | 1234 | 3 | 0ebcab72-4866-4671-a03e-5872dd36fc7f | 5 |
1234 | Cereal | 6.99 | 2024-03-25 | 1234 | 3 | 0ebcab72-4866-4671-a03e-5872dd36fc7f | 5 |
1234 | Milk | 7.99 | 2024-03-25 | 1234 | 3 | 0ebcab72-4866-4671-a03e-5872dd36fc7f | 5 |
fiddle
If you need 60,000 rows; and these rows must be from the year 2024 and the two preceding years - makes 3 years, and the years 2022,2023 and 2024; and get 60,000 rows, with the additional request that all items of the same randomly picked order from those 3 years are included, then you will probably need a multi-step approach.
And you will not get exactly 60,000 rows, but an approximation of that number.
Step 1: Determine the average number of items per order in those 3 years:
CREATE TABLE avgc(qty) AS (
WITH
itemsperorder(qty) AS (
SELECT
orderid
, COUNT(*)
FROM orders
WHERE date BETWEEN '2022-01-01' AND '2024-12-31'
)
SELECT AVG(qty)::INT FROM itemsperorder
;
Then, build a table with order ID-s
CREATE TABLE order_ids(orderid) AS
SELECT DISTINCT orderid FROM orders;
Then, pick a random extract of that order ID table into a common table expression that will return roughly 60,000 rows out of the orders
table when joined to it, thanks to your average calculation:
CREATE TABLE sample_orders AS
WITH
random_orderids AS (
SELECT
*
FROM order_ids
WHERE RANDOM() < (
SELECT 60000 / (SELECT qty FROM avgc) / (SELECT COUNT(*) FROM orders)
)
)
SELECT
o.*
FROM orders o
JOIN order_ids USING(orderid)
WHERE date BETWEEN '2022-01-01' AND '2024-12-31'
;
As Items are selected you need to record them in the sample table. You can then join on the sample table so they are not selected again.
--Select a random sample from the remaining items
--Join on the Sample table table to exclude previously selected orderIDs.
INSERT INTO SAMPLE
SELECT t.*
FROM Example t
WHERE t.OrderID in (
--Select just 1 random record
SELECT t2.OrderID
FROM (
--Select all records from Example that do not exist in sample
SELECT t2.OrderID
FROM Example t2
LEFT JOIN Sample s ON s.OrderID = t2.OrderID
WHERE s.OrderID IS NULL
) t2
ORDER BY random() --instead of newid()
LIMIT 1 --instead of TOP 1
);
SELECT * FROM Sample
Here is a Postgres fiddle that might be compatible with vertica