📜  Reservoir Sampling

📅  Last modified: 2021-05-20 08:56:00             🧑  Author: Mango

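Reservoir sampling is a family of randomized algorithms for choosing k samples at random from a list of n items, where n is either very large or unknown. Typically n is large enough that the list does not fit into main memory, for example a list of search queries at Google or Facebook.
So we are given a large array (or stream) of numbers (numbers, to keep things simple), and we need to write an efficient function that randomly selects k of them, where 1 <= k <= n. Let the input array be stream[].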

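A simple solution is to create an array reservoir[] of maximum size k. Pick items at random from stream[0..n-1], one at a time. If the picked item was not selected before, put it in reservoir[]. To check whether an item was selected before, we need to search for it in reservoir[]. The time complexity of this algorithm is O(k^2), which can be costly if k is large. It is also not efficient when the input arrives as a stream.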
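The simple approach described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `select_k_naive` and the `rng` parameter are hypothetical, the whole input is assumed to fit in a list, and previously chosen positions are tracked by index (rather than by value) so that repeated values in the input are handled.

```python
import random


def select_k_naive(stream, k, rng=random):
    """Pick k distinct positions of `stream` by repeated random draws.

    Hypothetical helper for illustration. It needs random access to the
    whole input, so it cannot be used on a true stream.
    """
    n = len(stream)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    chosen = []  # indices picked so far
    while len(chosen) < k:
        idx = rng.randrange(n)   # random position in stream[0..n-1]
        if idx not in chosen:    # linear search through earlier picks
            chosen.append(idx)
    return [stream[i] for i in chosen]
```

Each membership check scans up to k earlier picks, which is where the quadratic cost in k comes from.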

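The problem can be solved in O(n) time, and the solution also works well for input that arrives as a stream. The steps are as follows.
1) Create an array reservoir[0..k-1] and copy the first k items of stream[] into it.
2) Now consider the items one by one, from the (k+1)th item to the nth item.
a) Generate a random number from 0 to i, where i is the index of the current item in stream[]. Let the generated random number be j.
b) If j is in the range 0 to k-1, replace reservoir[j] with stream[i].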
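Because the steps only ever look at the current item and its index, they can also be applied to an input whose length is not known in advance. The following is a minimal sketch of that idea, assuming the input is any Python iterable; the name `reservoir_sample` and the `rng` parameter are illustrative and not part of the original article.

```python
import random
from itertools import islice


def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length."""
    it = iter(items)
    reservoir = list(islice(it, k))          # step 1: take the first k items
    if len(reservoir) < k:
        raise ValueError("stream has fewer than k items")
    for i, item in enumerate(it, start=k):   # step 2: i is the 0-based index
        j = rng.randrange(i + 1)             # 2a: random j in 0..i inclusive
        if j < k:                            # 2b: happens with prob. k/(i+1)
            reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample((x * x for x in range(1000)), 5)` draws five squares from a generator without ever materializing the full sequence.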

Following is an implementation of the above algorithm.

C++
// An efficient program to randomly select
// k items from a stream of items
#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;
 
// A utility function to print an array
void printArray(int stream[], int n)
{
    for (int i = 0; i < n; i++)
        cout << stream[i] << " ";
    cout << endl;
}
 
// A function to randomly select
// k items from stream[0..n-1].
void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]
 
    // reservoir[] is the output array. Initialize
    // it with first k elements from stream[]
    int reservoir[k];
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];
 
    // Use a different seed value so that we don't get
    // same result each time we run this program
    srand(time(NULL));
 
    // Iterate from the (k+1)th element to nth element
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = rand() % (i + 1);
 
        // If the randomly picked index is smaller than k,
        // then replace the element present at the index
        // with new element from stream
        if (j < k)
            reservoir[j] = stream[i];
    }
 
    cout << "Following are k randomly selected items \n";
    printArray(reservoir, k);
}
 
// Driver Code
int main()
{
    int stream[] = {1, 2, 3, 4, 5, 6,
                    7, 8, 9, 10, 11, 12};
    int n = sizeof(stream)/sizeof(stream[0]);
    int k = 5;
    selectKItems(stream, n, k);
    return 0;
}
 
// This code is contributed by rathbhupendra


C
// An efficient program to randomly select k items from a stream of items
 
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
 
// A utility function to print an array
void printArray(int stream[], int n)
{
    for (int i = 0; i < n; i++)
        printf("%d ", stream[i]);
    printf("\n");
}
 
// A function to randomly select k items from stream[0..n-1].
void selectKItems(int stream[], int n, int k)
{
    int i;  // index for elements in stream[]
 
    // reservoir[] is the output array. Initialize it with
    // first k elements from stream[]
    int reservoir[k];
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];
 
    // Use a different seed value so that we don't get
    // same result each time we run this program
    srand(time(NULL));
 
    // Iterate from the (k+1)th element to nth element
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = rand() % (i+1);
 
        // If the randomly  picked index is smaller than k, then replace
        // the element present at the index with new element from stream
        if (j < k)
          reservoir[j] = stream[i];
    }
 
    printf("Following are k randomly selected items \n");
    printArray(reservoir, k);
}
 
// Driver program to test above function.
int main()
{
    int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int n = sizeof(stream)/sizeof(stream[0]);
    int k = 5;
    selectKItems(stream, n, k);
    return 0;
}


Java
// An efficient Java program to randomly
// select k items from a stream of items
import java.util.Arrays;
import java.util.Random;
public class ReservoirSampling {
   
    // A function to randomly select k items from
    // stream[0..n-1].
    static void selectKItems(int stream[], int n, int k)
    {
        int i; // index for elements in stream[]
 
        // reservoir[] is the output array. Initialize it
        // with first k elements from stream[]
        int reservoir[] = new int[k];
        for (i = 0; i < k; i++)
            reservoir[i] = stream[i];
 
        Random r = new Random();
 
        // Iterate from the (k+1)th element to nth element
        for (; i < n; i++) {
            // Pick a random index from 0 to i.
            int j = r.nextInt(i + 1);
 
            // If the randomly  picked index is smaller than
            // k, then replace the element present at the
            // index with new element from stream
            if (j < k)
                reservoir[j] = stream[i];
        }
 
        System.out.println(
            "Following are k randomly selected items");
        System.out.println(Arrays.toString(reservoir));
    }
 
    // Driver Program to test above method
    public static void main(String[] args)
    {
        int stream[]
            = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
        int n = stream.length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
// This code is contributed by Sumit Ghosh


Python3
# An efficient Python3 program to randomly select
# k items from a stream of items
import random


# A utility function to print an array
def printArray(stream, n):
    for i in range(n):
        print(stream[i], end=" ")
    print()


# A function to randomly select k items from stream[0..n-1].
def selectKItems(stream, n, k):
    # reservoir[] is the output array. Initialize it with
    # the first k elements from stream[]
    reservoir = [0] * k
    for i in range(k):
        reservoir[i] = stream[i]

    # Iterate from the (k+1)th element (index k)
    # to the nth element (index n-1)
    i = k
    while i < n:
        # Pick a random index from 0 to i.
        j = random.randrange(i + 1)

        # If the randomly picked index is smaller than k,
        # then replace the element present at the index
        # with the new element from the stream
        if j < k:
            reservoir[j] = stream[i]
        i += 1

    print("Following are k randomly selected items")
    printArray(reservoir, k)


# Driver Code
if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    n = len(stream)
    k = 5
    selectKItems(stream, n, k)

# This code is contributed by mits


C#
// An efficient C# program to randomly
// select k items from a stream of items
using System;
using System.Collections;
 
public class ReservoirSampling
{
    // A function to randomly select k
    // items from stream[0..n-1].
    static void selectKItems(int []stream,
                            int n, int k)
    {
        // index for elements in stream[]
        int i;
         
        // reservoir[] is the output array.
        // Initialize it with first k
        //  elements from stream[]
        int[] reservoir = new int[k];
        for (i = 0; i < k; i++)
            reservoir[i] = stream[i];
         
        Random r = new Random();
         
        // Iterate from the (k+1)th
        // element to nth element
        for (; i < n; i++)
        {
            // Pick a random index from 0 to i.
            int j = r.Next(i + 1);
             
            // If the randomly picked index
            // is smaller than k, then replace
            // the element present at the index
            // with new element from stream
            if(j < k)
                reservoir[j] = stream[i];        
        }
         
        Console.WriteLine("Following are k " +
                    "randomly selected items");
        for (i = 0; i < k; i++)
            Console.Write(reservoir[i] + " ");
    }
     
    //Driver code
    static void Main()
    {
        int []stream = {1, 2, 3, 4, 5, 6, 7,
                        8, 9, 10, 11, 12};
        int n = stream.Length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
 
// This code is contributed by mits


Output:

Following are k randomly selected items
6 2 11 8 12
Note: Output will differ every time as it selects and prints random elements

Time Complexity: O(n)

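How does this work?
To prove that this solution works perfectly, we must prove that for any item stream[i] where 0 <= i < n, the probability that it is in the final reservoir[] is k/n. Let us divide the proof into two cases, since the first k items are treated differently.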
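Before going through the proof, the k/n claim can be checked empirically. The sketch below is an illustration only: the helper names `sample` and `inclusion_frequencies`, the trial count, and the fixed seed are assumptions made here, not part of the original article. It runs the algorithm many times and records how often each item ends up in the reservoir.

```python
import random
from collections import Counter


def sample(stream, k, rng):
    """Run the reservoir algorithm once on a list and return the reservoir."""
    reservoir = list(stream[:k])
    for i in range(k, len(stream)):
        j = rng.randrange(i + 1)
        if j < k:
            reservoir[j] = stream[i]
    return reservoir


def inclusion_frequencies(n, k, trials, seed=0):
    """Fraction of trials in which each of the n items was selected."""
    rng = random.Random(seed)
    stream = list(range(n))
    counts = Counter()
    for _ in range(trials):
        counts.update(sample(stream, k, rng))
    return [counts[x] / trials for x in stream]
```

With n = 12 and k = 5, every returned frequency should be close to 5/12 (about 0.417) once the number of trials is large.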

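Case 1: For the last n-k stream items, i.e., for stream[i] where k <= i < n
For every such stream item stream[i], we pick a random index from 0 to i, and if the picked index is one of the first k indexes, we replace the element at the picked index with stream[i].
To simplify the proof, let us first consider the last item. The probability that the last item is in the final reservoir = the probability that one of the first k indexes is picked for the last item = k/n (the probability of picking one of k items from a list of size n).
Now consider the second-to-last item. The probability that the second-to-last item is in the final reservoir[] = [probability that one of the first k indexes is picked in the iteration for stream[n-2]] x [probability that the index picked in the iteration for stream[n-1] is not the same as the index picked for stream[n-2]] = [k/(n-1)] x [(n-1)/n] = k/n.
Similarly, we can consider the other items, for all stream items from stream[n-1] down to stream[k], and generalize the proof.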
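The generalization can be written out explicitly. The following is a sketch of the argument for an arbitrary item stream[i] with k <= i <= n-1, using m as an assumed name for the index of a later item: stream[i] enters the reservoir with probability k/(i+1), and each later item stream[m] overwrites one particular slot with probability 1/(m+1), so it leaves that slot untouched with probability m/(m+1).

```latex
\begin{aligned}
P(\text{stream}[i] \text{ is in the final reservoir})
  &= \frac{k}{i+1} \prod_{m=i+1}^{n-1} \left(1 - \frac{1}{m+1}\right) \\
  &= \frac{k}{i+1} \cdot \frac{i+1}{i+2} \cdot \frac{i+2}{i+3} \cdots \frac{n-1}{n} \\
  &= \frac{k}{n}
\end{aligned}
```

Setting i = n-1 gives an empty product and the value k/n, and setting i = n-2 gives [k/(n-1)] x [(n-1)/n], matching the two special cases above.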

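Case 2: For the first k stream items, i.e., for stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be removed later, in the iterations for stream[k] to stream[n-1].
The probability that an item from stream[0..k-1] is in the final array = the probability that the item is not replaced when items stream[k], stream[k+1], ..., stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x ... x [(n-1)/n] = k/n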
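The telescoping products in both cases can be verified with exact rational arithmetic. This is a small sketch using Python's `fractions` module; the function name `survival_probability` is hypothetical and chosen here only for illustration.

```python
from fractions import Fraction


def survival_probability(i, n, k):
    """Exact probability that stream[i] is in the final reservoir."""
    # Probability of entering the reservoir: 1 for the first k items
    # (they are copied in directly), k/(i+1) for later items.
    p = Fraction(1) if i < k else Fraction(k, i + 1)
    # Each later item stream[m] leaves a given slot untouched
    # with probability m/(m+1).
    for m in range(max(i + 1, k), n):
        p *= Fraction(m, m + 1)
    return p
```

For n = 12 and k = 5, calling `survival_probability(i, 12, 5)` for every i from 0 to 11 yields the same value, `Fraction(5, 12)`, covering both Case 1 and Case 2.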