random - How do I take a randomized sample without replacement where if an ID is drawn, then all instances of that ID are added

admin2025-04-15  0

I am currently trying to solve a problem where I need to get a randomized sample from a data table using SQL in Vertica. I need to get a randomized sample without replacement of 60,000 rows over a span of 3 years (2021 to 2024) from a data table. Each row has an OrderID as well as an Item, Price, and Date column. If an OrderID is in the randomized sample, then all instances of that OrderID from 2021 to 2024 in the data table must also be in the randomized sample. I also need this to be as efficient as possible. Is it possible to do this in SQL?

The data looks approximately like this:

OrderID Item Price Date
1234 Apple 0.99 2024-03-25
2573 Tomato 0.99 2023-01-16
1234 Cereal 6.99 2024-03-25
1234 Milk 7.99 2024-03-25
8473 Salmon 15.99 2021-11-19
7392 Plum 0.99 2022-05-02

I am currently trying to solve a problem where I need to get a randomized sample from a data table using SQL in Vertica. I need to get a randomized sample without replacement of 60,000 rows over a span of 3 years (2021 to 2024) from a data table. Each row has an OrderID as well as an Item, Price, and Date column. If an OrderID is in the randomized sample, then all instances of that OrderID from 2021 to 2024 in the data table must also be in the randomized sample. I also need this to be as efficient as possible. Is it possible to do this in SQL?

The data looks approximately like this:

OrderID Item Price Date
1234 Apple 0.99 2024-03-25
2573 Tomato 0.99 2023-01-16
1234 Cereal 6.99 2024-03-25
1234 Milk 7.99 2024-03-25
8473 Salmon 15.99 2021-11-19
7392 Plum 0.99 2022-05-02

Using my example above, If OrderID 1234 is selected for the randomized sample, then I want all instances of OrderID 1234 to appear in the sample along with other OrderIDs, like this:

Sample

OrderID Item Price Date
1234 Apple 0.99 2024-03-25
1234 Cereal 6.99 2024-03-25
1234 Milk 7.99 2024-03-25
7392 Plum 0.99 2022-05-02

I've looked into various SQL functions for getting a randomized sample with replacement efficiently, but I haven't been able to figure out how to also get it so that if an OrderID is pulled into the sample, then all instances (rows) with that OrderID from those 3 years are also pulled into the sample. This is so far outside my current skill level in SQL that I currently feel very stuck on how to tackle this. Any help would be greatly appreciated!

I've tried this bit of code because I found something similar elsewhere, but even that isn't working (and to be honest, I'm not even sure if it's in the kind of SQL that Vertica would understand) and is giving me this error:

SQL ERROR [4856][42601][Vertica]VJDBC ERROR: Syntax error at or near "."

SELECT t.*
FROM table t
WHERE t.OrderID in (SELECT top(60000) t2.OrderID
    FROM (SELECT DISTINCT t2.OrderID
        FROM table t2
        ) t2
    ORDER BY newid()
    );
Share Improve this question asked Feb 4 at 12:56 KikiMcGeeKikiMcGee 91 bronze badge 1
  • Vertica PL/vSQL is largely compatible with PostgreSQL PL/pgSQL – Bart McEndree Commented Feb 4 at 13:51
Add a comment  | 

5 Answers 5

Reset to default 0

You say you want 60000 sample rows, but for each order ID in the sample you want all related rows. This is a tad too vague, because these two rules can be contradictory.

To illustrate this: Say, you want 3 sample rows and there are only six rows in the table for order IDs 1, 1, 2, 2, 3, 3. You see you cannot pick three sample rows and still obey the rule to have all of each order's rows in the sample. You will either have to pick two orders with a total of six rows or one order with a total of two rows, you can't pick orders that get you a total of three sample rows.

And 2021 to 2024 are four years, not three. :-)

Anyway, one approach to get the desired sample is to sort the orders randomly and have a running total of their transactions in the given years. Then just take all orders with a running total up to the 60000. Thus you may end up with exactly the 60000 rows or you may get several rows less than that, depending on the picked orders and their numbers of transactions in the table.

WITH
  transactions AS
  (
    SELECT *
    FROM mytable
    WHERE date >= DATE '2021-01-01'
      AND date <  DATE '2025-01-01'
  ),
  orders AS
  (
    SELECT orderid, SUM(COUNT(*)) OVER (ORDER BY NEWID()) AS cnt
    FROM transactions
    GROUP BY orderid
  )
SELECT orderid, item, price, date
FROM transactions
WHERE orderid IN (SELECT orderid FROM orders WHERE cnt <= 60000)
ORDER BY orderid, date;

Demo (in PostgreSQL for the lack of Vertica): https://dbfiddle.uk/aSlvN1bB In this demo there are twelve rows in the table and I want a sample of up to five rows. Sometimes the query results in two rows only, sometimes in four rows, depending on which orders get picked.

Sampling without replacement is the same as reordering your data set with a random number. So step 1 is to get a dataset of your Orderids with counts, then assign each a random number and calculate a running sum of the counts based on the random order. Then finally keep every row whose running sum is under 60,000 and join it back to your base tables.

Select t.*
From table t Inner Join (
    Select * From (
        Select *, random() as sorter,
           sum(recs) Over (Order By sorter) as cumrecs
        From (
            Select orderid, count(*) as recs
            From table
            Where date Between '2021-01-01' And '2024-12-31'
            Group By orderid
        )
    Where cumrecs < 60000
) ss On t.orderid=ss.orderid

If you want to include the first orderid that takes you over 60,000 then after the Order By sorter part add the following windowing spec Rows Between Unbounded Preceding And 1 Preceding.

Your '3 years' from 2021 to 2024 is actually 4 years, as noted elsewhere, so adjust your date range accordingly.

  1. Assign random id tId for orders and count order items itemsQty for each order.
  2. Calculate running total for orders with order by tId
  3. Take first orders, where running total is least then target row count. This condition is used, so that the entire item's of each order is included in the selection, even if the total number is more than the assigned number.
  4. Join selected orderId set with orders.

See example

OrderID Item Price Date
1234 Apple 0.99 2024-03-25
2573 Tomato 0.99 2023-01-16
1234 Cereal 6.99 2024-03-25
1234 Milk 7.99 2024-03-25
8473 Salmon 15.99 2021-11-19
7392 Plum 0.99 2022-05-02
select o.*, b.*
from(
select *
  ,sum(itemQty)over(order by tId)rSum
from(
select OrderId
  ,count(*) itemQty
  ,newid() tId
from orders
group by OrderId
)a
)b
inner join orders o on o.orderId=b.orderId
where  (rSum-itemQty)<5
order by tId
OrderID Item Price Date OrderId itemQty tId rSum
8473 Salmon 15.99 2021-11-19 8473 1 c645e1c8-bfed-474f-be38-2252e049a1bc 1
7392 Plum 0.99 2022-05-02 7392 1 3a39d431-e487-4296-8c14-242003268302 2
1234 Apple 0.99 2024-03-25 1234 3 0ebcab72-4866-4671-a03e-5872dd36fc7f 5
1234 Cereal 6.99 2024-03-25 1234 3 0ebcab72-4866-4671-a03e-5872dd36fc7f 5
1234 Milk 7.99 2024-03-25 1234 3 0ebcab72-4866-4671-a03e-5872dd36fc7f 5

fiddle

If you need 60,000 rows; and these rows must be from the year 2024 and the two preceding years - makes 3 years, and the years 2022,2023 and 2024; and get 60,000 rows, with the additional request that all items of the same randomly picked order from those 3 years are included, then you will probably need a multi-step approach.

And you will not get exactly 60,000 rows, but an approximation of that number.

Step 1: Determine the average number of items per order in those 3 years:

CREATE TABLE avgc(qty) AS (
  WITH
  itemsperorder(qty) AS (
    SELECT
      orderid
    , COUNT(*)
    FROM orders
    WHERE date BETWEEN '2022-01-01' AND '2024-12-31'
  )
  SELECT AVG(qty)::INT FROM itemsperorder
;

Then, build a table with order ID-s

CREATE TABLE order_ids(orderid) AS 
SELECT DISTINCT orderid FROM orders;

Then, pick a random extract of that order ID table into a common table expression that will return roughly 60,000 rows out of the orders table when joined to it, thanks to your average calculation:

CREATE TABLE sample_orders AS
WITH
random_orderids AS (
  SELECT
    *
  FROM order_ids
  WHERE RANDOM() < (
    SELECT 60000 / (SELECT qty FROM avgc) / (SELECT COUNT(*) FROM orders)
  )
)
SELECT
  o.*
FROM orders o
JOIN order_ids USING(orderid)
WHERE date BETWEEN '2022-01-01' AND '2024-12-31'
;

As Items are selected you need to record them in the sample table. You can then join on the sample table so they are not selected again.

--Select a random sample from the remaining items
--Join on the Sample table table to exclude previously selected orderIDs.
INSERT  INTO SAMPLE
SELECT t.*
FROM Example t
WHERE t.OrderID in (
                    --Select just 1 random record    
                    SELECT t2.OrderID
                    FROM (
                          --Select all records from Example that do not exist in sample 
                          SELECT  t2.OrderID
                          FROM Example t2
                          LEFT JOIN Sample s ON s.OrderID = t2.OrderID   
                          WHERE s.OrderID IS NULL 
                         ) t2
                    ORDER BY random() --instead of newid()
                    LIMIT 1 --instead of TOP 1
                   );
     
SELECT * FROM Sample

Here is a Postgres fiddle that might be compatible with vertica

转载请注明原文地址:http://www.anycun.com/QandA/1744717769a86651.html