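Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either very large or unknown. Typically, n is large enough that the list does not fit into main memory. For example, a list of search queries in Google and Facebook.
So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function that randomly selects k numbers, where 1 <= k <= n. Let the input array be stream[].
A simple solution is to create an array reservoir[] of maximum size k. One by one, randomly select an item from stream[0..n-1]. If the selected item was not previously selected, put it in reservoir[]. To check whether an item was previously selected, we need to search for it in reservoir[]. Each of the k selections therefore costs an O(k) search, so the time complexity of this algorithm is O(k^2). This can be costly if k is big. It is also not efficient if the input is in the form of a stream, because it needs random access to all n items.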
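The simple solution described above can be sketched as follows. This is only an illustration under stated assumptions: the function name select_k_naive and the rng parameter are hypothetical, and the sketch remembers the chosen indexes rather than the chosen values so that duplicate values in the input are handled.

```python
import random

def select_k_naive(stream, k, rng=random):
    """Naive selection: draw random indexes and reject any index already chosen."""
    n = len(stream)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    chosen = []     # indexes picked so far, searched linearly as in the text
    reservoir = []  # the selected items
    while len(reservoir) < k:
        idx = rng.randrange(n)   # needs random access and a known n
        if idx not in chosen:    # O(k) linear search per draw
            chosen.append(idx)
            reservoir.append(stream[idx])
    return reservoir

print(select_k_naive([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 5))
```

Note that the sketch calls len(stream) and indexes into stream, which is exactly what a true stream does not allow.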
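The problem can be solved in O(n) time. This solution also works well for input in the form of a stream. Following are the steps.
1) Create an array reservoir[0..k-1] and copy the first k items of stream[] to it.
2) Now consider, one by one, all items from the (k+1)th item to the nth item.
… a) Generate a random number from 0 to i, where i is the index of the current item in stream[]. Let the generated random number be j.
… b) If j is in the range 0 to k-1, replace reservoir[j] with stream[i].
Following is the implementation of the above algorithm.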
C++
// An efficient program to randomly select
// k items from a stream of items
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;
// A utility function to print an array
void printArray(int stream[], int n)
{
    for (int i = 0; i < n; i++)
        cout << stream[i] << " ";
    cout << endl;
}
// A function to randomly select
// k items from stream[0..n-1].
void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]
    // reservoir is the output array. Initialize
    // it with first k elements from stream[].
    // (std::vector is used because variable-length
    // arrays are not standard C++.)
    vector<int> reservoir(k);
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];
    // Use a different seed value so that we don't get
    // same result each time we run this program
    srand(time(NULL));
    // Iterate from the (k+1)th element to nth element
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = rand() % (i + 1);
        // If the randomly picked index is smaller than k,
        // then replace the element present at the index
        // with new element from stream
        if (j < k)
            reservoir[j] = stream[i];
    }
    cout << "Following are k randomly selected items \n";
    printArray(reservoir.data(), k);
}
// Driver Code
int main()
{
    int stream[] = {1, 2, 3, 4, 5, 6,
                    7, 8, 9, 10, 11, 12};
    int n = sizeof(stream) / sizeof(stream[0]);
    int k = 5;
    selectKItems(stream, n, k);
    return 0;
}
// This code is contributed by rathbhupendra
C
// An efficient program to randomly select k items from a stream of items
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
// A utility function to print an array
void printArray(int stream[], int n)
{
    for (int i = 0; i < n; i++)
        printf("%d ", stream[i]);
    printf("\n");
}
// A function to randomly select k items from stream[0..n-1].
void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]
    // reservoir[] is the output array. Initialize it with
    // first k elements from stream[]
    int reservoir[k];
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];
    // Use a different seed value so that we don't get
    // same result each time we run this program
    srand(time(NULL));
    // Iterate from the (k+1)th element to nth element
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = rand() % (i + 1);
        // If the randomly picked index is smaller than k, then replace
        // the element present at the index with new element from stream
        if (j < k)
            reservoir[j] = stream[i];
    }
    printf("Following are k randomly selected items \n");
    printArray(reservoir, k);
}
// Driver program to test above function.
int main()
{
    int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int n = sizeof(stream) / sizeof(stream[0]);
    int k = 5;
    selectKItems(stream, n, k);
    return 0;
}
Java
// An efficient Java program to randomly
// select k items from a stream of items
import java.util.Arrays;
import java.util.Random;
public class ReservoirSampling {
    // A function to randomly select k items from
    // stream[0..n-1].
    static void selectKItems(int stream[], int n, int k)
    {
        int i; // index for elements in stream[]
        // reservoir[] is the output array. Initialize it
        // with first k elements from stream[]
        int reservoir[] = new int[k];
        for (i = 0; i < k; i++)
            reservoir[i] = stream[i];
        Random r = new Random();
        // Iterate from the (k+1)th element to nth element
        for (; i < n; i++) {
            // Pick a random index from 0 to i.
            int j = r.nextInt(i + 1);
            // If the randomly picked index is smaller than
            // k, then replace the element present at the
            // index with new element from stream
            if (j < k)
                reservoir[j] = stream[i];
        }
        System.out.println(
            "Following are k randomly selected items");
        System.out.println(Arrays.toString(reservoir));
    }
    // Driver Program to test above method
    public static void main(String[] args)
    {
        int stream[]
            = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
        int n = stream.length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
// This code is contributed by Sumit Ghosh
Python3
# An efficient Python3 program
# to randomly select k items
# from a stream of items
import random
# A utility function
# to print an array
def printArray(stream, n):
    for i in range(n):
        print(stream[i], end=" ")
    print()
# A function to randomly select
# k items from stream[0..n-1].
def selectKItems(stream, n, k):
    # reservoir[] is the output
    # array. Initialize it with
    # first k elements from stream[]
    reservoir = [0] * k
    for i in range(k):
        reservoir[i] = stream[i]
    # Iterate from the (k+1)th
    # element to nth element,
    # i.e. indexes k to n-1.
    # (Starting at index k matters:
    # re-processing index k-1 would
    # copy stream[k-1] into a second slot.)
    for i in range(k, n):
        # Pick a random index
        # from 0 to i.
        j = random.randrange(i + 1)
        # If the randomly picked
        # index is smaller than k,
        # then replace the element
        # present at the index
        # with new element from stream
        if j < k:
            reservoir[j] = stream[i]
    print("Following are k randomly selected items")
    printArray(reservoir, k)
# Driver Code
if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    n = len(stream)
    k = 5
    selectKItems(stream, n, k)
# This code is contributed by mits
C#
// An efficient C# program to randomly
// select k items from a stream of items
using System;
using System.Collections;
public class ReservoirSampling
{
    // A function to randomly select k
    // items from stream[0..n-1].
    static void selectKItems(int []stream,
                             int n, int k)
    {
        // index for elements in stream[]
        int i;
        // reservoir[] is the output array.
        // Initialize it with first k
        // elements from stream[]
        int[] reservoir = new int[k];
        for (i = 0; i < k; i++)
            reservoir[i] = stream[i];
        Random r = new Random();
        // Iterate from the (k+1)th
        // element to nth element
        for (; i < n; i++)
        {
            // Pick a random index from 0 to i.
            int j = r.Next(i + 1);
            // If the randomly picked index
            // is smaller than k, then replace
            // the element present at the index
            // with new element from stream
            if (j < k)
                reservoir[j] = stream[i];
        }
        Console.WriteLine("Following are k " +
                          "randomly selected items");
        for (i = 0; i < k; i++)
            Console.Write(reservoir[i] + " ");
    }
    // Driver code
    static void Main()
    {
        int []stream = {1, 2, 3, 4, 5, 6, 7,
                        8, 9, 10, 11, 12};
        int n = stream.Length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
// This code is contributed by mits
Output:
Following are k randomly selected items
6 2 11 8 12
Note: Output will differ every time as it selects and prints random elements
Time Complexity: O(n)
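The implementations above take an array whose length n is known in advance. Since the motivation is a stream where n may be unknown, the same steps can be sketched for an arbitrary iterable. This is a minimal sketch, not a definitive implementation: the function name reservoir_sample and the rng parameter are hypothetical, and returning fewer than k items when the stream is shorter than k is an assumption made here.

```python
import random
from itertools import islice

def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    it = iter(items)
    # Step 1: fill the reservoir with the first k items.
    reservoir = list(islice(it, k))
    # Step 2: i is the 0-based index of each remaining item in the stream.
    for i, item in enumerate(it, start=k):
        j = rng.randrange(i + 1)   # uniform integer in [0, i]
        if j < k:
            reservoir[j] = item    # the new item replaces slot j
    return reservoir

# The generator below is consumed once and never held in memory as a whole.
print(reservoir_sample((x * x for x in range(1000)), 5))
```

Only the k reservoir slots and the running index i are kept in memory, so the extra space is O(k) regardless of the stream length.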
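How does this work?
To prove that this solution is correct, we must show that for every item stream[i] with 0 <= i < n, the probability that it is in the final reservoir[] is k/n. We consider two cases.
Case 1: For the last n-k stream items, i.e., for stream[i] where k <= i < n
For every such stream item stream[i], we pick a random index from 0 to i, and if the picked index is one of the first k indexes, we replace the element at the picked index with stream[i].
To simplify the proof, let us first consider the last item. The probability that the last item is in the final reservoir = the probability that one of the first k indexes is picked for the last item = k/n (the probability of picking one of k items from a list of size n).
Let us now consider the second-to-last item. The probability that the second-to-last item is in the final reservoir[] = [probability that one of the first k indexes is picked in the iteration for stream[n-2]] x [probability that the index picked in the iteration for stream[n-1] is not the same as the index picked for stream[n-2]] = [k/(n-1)] x [(n-1)/n] = k/n.
Similarly, we can consider the other items, from stream[n-1] down to stream[k], and generalize the proof.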
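The generalization can be written out explicitly. As a sketch, take any index i with k <= i < n (the running index m is introduced here only for illustration). The item stream[i] enters the reservoir with probability k/(i+1), and in each later iteration m it is overwritten only if the index drawn for stream[m] equals its slot, which happens with probability 1/(m+1). The product telescopes:

```latex
\begin{align*}
P\bigl(\text{stream}[i] \in \text{reservoir}\bigr)
  &= \frac{k}{i+1} \prod_{m=i+1}^{n-1} \left(1 - \frac{1}{m+1}\right) \\
  &= \frac{k}{i+1} \cdot \frac{i+1}{i+2} \cdot \frac{i+2}{i+3} \cdots \frac{n-1}{n} \\
  &= \frac{k}{n}.
\end{align*}
```

For i = n-1 the product is empty and the result is k/n directly; for i = n-2 it reduces to [k/(n-1)] x [(n-1)/n], matching the two special cases worked out above.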
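Case 2: For the first k stream items, i.e., for stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be removed later, in the iterations for stream[k] to stream[n-1].
The probability that an item from stream[0..k-1] is in the final array = the probability that the item is not picked for replacement when the items stream[k], stream[k+1], …, stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n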
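The arithmetic in both cases can be checked exactly with rational numbers. The following is a small sketch; the function name inclusion_probability is hypothetical, and it simply multiplies the factors used in the proof rather than running the sampling algorithm.

```python
from fractions import Fraction

def inclusion_probability(i, n, k):
    """Exact probability that stream[i] is in the final reservoir."""
    # Entering the reservoir: certain for the first k items, and k/(i+1)
    # for a later item (the index drawn from 0..i must be below k).
    p = Fraction(1) if i < k else Fraction(k, i + 1)
    # Surviving each later iteration m: the item for stream[m] overwrites
    # one specific slot with probability 1/(m+1).
    for m in range(max(i + 1, k), n):
        p *= Fraction(m, m + 1)
    return p

n, k = 12, 5
probs = [inclusion_probability(i, n, k) for i in range(n)]
print(probs[0], probs[-1])  # prints "5/12 5/12"
```

With n = 12 and k = 5, as in the driver code above, every index yields 5/12, which is k/n.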