Microsoft Research Community

Error occurs when constructing an LDA model with a simple corpus


Top 500 Contributor
Posts 2
xgear Posted: 02-21-2009 4:16 AM


Model settings:

Number of documents in the corpus: 12

Number of topics: 3

Number of distinct words (terms) in the corpus: 12

For simplicity, suppose that each document consists of 2 words.

 


The whole corpus is shown in the table below, with each line representing a document.

Original corpus          After indexing
university test          0, 3
teacher student          0, 1
teacher university       0, 2
university student       1, 2
economy bank             4, 7
economy money            4, 5
stock economy            4, 6
money stock              5, 6
government policy        8, 11
government president     8, 9
government military      8, 10
president policy         9, 10

 

After running the program, the console window shows:

Compile model.....compilation failed.

Then a "transform chain" window shows the message "can only indexed by loop variables,not index0". The error seems to occur in or near the two nested using blocks in the source code.

By the way, can a jagged array provide an array of arrays in which the length of the inner arrays is not fixed, so that I can remove the restriction that each document consists of 2 words?

 

 

Your help is appreciated!


 

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

        static void Main(string[] args)
        {
            int M = 12;  // number of documents in corpus
            int K = 3;   // number of topics
            int V = 12;  // number of words (terms) in corpus
            int Nm = 2;  // suppose that each document is composed of 2 words

            Range CorpusSize = new Range(M);
            Range TopicsNum = new Range(K);
            Range WordsNum = new Range(V);
            Range DocSize = new Range(Nm);

            double[] alpha = { 0.5, 0.5, 0.5 };
            double[] beta = { 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 };

            VariableArray<Vector> theta = Variable.Array<Vector>(CorpusSize);
            VariableArray<Vector> phi = Variable.Array<Vector>(TopicsNum);
            theta[CorpusSize] = Variable.Dirichlet(alpha).ForEach(CorpusSize);
            phi[TopicsNum] = Variable.Dirichlet(beta).ForEach(TopicsNum);
            VariableArray2D<int> W = Variable.Array<int>(CorpusSize, DocSize);
            VariableArray2D<int> Z = Variable.Array<int>(CorpusSize, DocSize);
            using (Variable.ForEach(CorpusSize))
            {
                using (Variable.ForEach(DocSize))
                {
                    Z[CorpusSize, DocSize] = Variable.Discrete(theta[CorpusSize]);
                    W[CorpusSize, DocSize] = Variable.Discrete(phi[Z[CorpusSize, DocSize]]);
                }
            }
            W = Variable.Observed(new int[,] { { 0, 3 }, { 0, 1 }, { 0, 2 }, { 1, 2 }, { 4, 7 }, { 4, 5 }, { 4, 6 }, { 5, 6 }, { 8, 11 }, { 8, 9 }, { 8, 10 }, { 9, 10 } }, CorpusSize, DocSize);
            InferenceEngine engine = new InferenceEngine();
            Console.WriteLine(engine.Infer(Z));
        }

 

Top 25 Contributor
Posts 37

Hi,

 

Since W depends on the value chosen for Z, you have to add a gate (Variable.Switch).

Furthermore, you have to set the ValueRange attribute on Z, so that Infer.NET knows which values the gate ranges over.

With the following code your model compiles.

                    Z[CorpusSize, DocSize] = Variable.Discrete(theta[CorpusSize]).Attrib(new ValueRange (TopicsNum));

                    using(Variable.Switch(Z[CorpusSize, DocSize]))
                    {
                        W[CorpusSize, DocSize] = Variable.Discrete(phi[Z[CorpusSize, DocSize]]);
                    }
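
For reference, the model under discussion can be written out as follows. This is a sketch in my own notation (m indexes documents, n word positions, k topics), not something taken from the Infer.NET documentation:

```latex
\begin{align*}
\theta_m &\sim \mathrm{Dir}(\alpha), & m &= 1,\dots,M \\
\phi_k &\sim \mathrm{Dir}(\beta), & k &= 1,\dots,K \\
z_{m,n} \mid \theta_m &\sim \mathrm{Discrete}(\theta_m) \\
w_{m,n} \mid z_{m,n}, \phi &\sim \mathrm{Discrete}(\phi_{z_{m,n}}) \\
p(w_{m,n} \mid z_{m,n}, \phi) &= \prod_{k=1}^{K} \mathrm{Discrete}(w_{m,n};\, \phi_k)^{[z_{m,n} = k]}
\end{align*}
```

The bracket [z = k] is 1 when z equals k and 0 otherwise, so each topic contributes a factor that is switched on by exactly one value of Z. As far as I understand it, this is the structure that the Variable.Switch block expresses, and the ValueRange attribute tells the compiler that k runs over TopicsNum.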

 

 

 

Laura

 

Top 25 Contributor
Posts 37

I just came across a flaw in your code.

In your example, you first create a data structure for W and wire it into the model. Then you reassign W to a new observed variable, which is not linked to the model. Since the data is not linked, Infer() returns results based only on the prior.

You have to define W as an observed variable up front.

Instead of

            VariableArray2D<int> W = Variable.Array<int>(CorpusSize, DocSize).Named("W");

use the following line (and omit the later assignment to W):
            VariableArray2D<int> W = Variable.Observed(new int[,] { { 0, 3 }, { 0, 1 }, { 0, 2 }, { 1, 2 }, { 4, 7 }, { 4, 5 }, { 4, 6 }, { 5, 6 }, { 8, 11 }, { 8, 9 }, { 8, 10 }, { 9, 10 } }, CorpusSize, DocSize);

 

 

Another thing is that you have to break symmetry, otherwise all phis will be identical.

To break symmetry slightly, create a dense Dirichlet (denseBeta). Draw K samples from it using dirich.Sample(), build a Dirichlet from each sample, convert the result to an Infer.NET distribution array, and call phi.InitialiseTo():

            double[] denseBeta = new double[V];
            for (int v = 0; v < V; v++) denseBeta[v] = 10.0;

            Dirichlet[] initPhi = new Dirichlet[K];
            Dirichlet dirich = (new Dirichlet(denseBeta));
            for (int k = 0; k < K; k++)
            {
                initPhi[k] = new Dirichlet(dirich.Sample());
            }
            phi.InitialiseTo(Distribution<Vector>.Array(initPhi));
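
A rough way to see why this matters (a sketch in my own notation, assuming a symmetric alpha as in this example and a deterministic inference algorithm whose updates do not depend on the topic labels): the joint distribution does not change when the topic labels are permuted, so if all phi marginals start out equal, nothing in the updates can ever make them differ.

```latex
\begin{align*}
&\text{Label symmetry: for any permutation } \pi \text{ of } \{1,\dots,K\}, \\
&\qquad p(w, z, \theta, \phi) = p\bigl(w, \pi(z), \pi(\theta), \pi(\phi)\bigr). \\
&\text{If } q^{(t)}(\phi_1) = \dots = q^{(t)}(\phi_K) \text{ and } q^{(t)}(\theta_m) \text{ is symmetric in } k, \text{ then} \\
&\qquad q^{(t)}(z_{m,n} = k) = \tfrac{1}{K} \quad \text{for all } k, \\
&\text{so every topic receives the same evidence from the data, and} \\
&\qquad q^{(t+1)}(\phi_1) = \dots = q^{(t+1)}(\phi_K).
\end{align*}
```

By induction the phis stay identical at every iteration. Starting each phi from a different random sample puts the algorithm off this symmetric fixed point so the topics can specialise.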

 

Laura

 

Top 25 Contributor
Posts 37

To answer your final question: yes, with jagged arrays documents can have different lengths. If you need an example, see John Guiver's post in the Bernoulli thread (http://community.research.microsoft.com/forums/p/2779/4511.aspx#4511 ), where "e" is a jagged random variable array. Note that "sRange" is a variable range depending on "uRange".

 

Laura

Top 10 Contributor
Posts 56

Just to summarise everything Laura has noted (many thanks Laura), including the jagged array stuff, here is a modified version of your C# code that will compile and run:

static void Main(string[] args)
{
    int K = 3;   // number of topics
    int V = 12;  // number of words (terms) in corpus

    // Documents of variable length
    int[][] docs = {
        new int[] { 0, 3, 4 },
        new int[] { 0, 1 },
        new int[] { 0, 2, 4, 5 },
        new int[] { 1, 2 },
        new int[] { 4, 7 },
        new int[] { 4, 5 },
        new int[] { 4, 6 },
        new int[] { 5, 6 },
        new int[] { 8, 11 },
        new int[] { 8, 9 },
        new int[] { 8, 10 },
        new int[] { 9, 10 }};

    // Put the sizes into an array
    int M = docs.Length;
    int[] sizes = new int[M];
    for (int i = 0; i < M; i++)
        sizes[i] = docs[i].Length;

    // Set up the ranges
    Range CorpusSize = new Range(M);
    Range TopicsNum = new Range(K);
    Range WordsNum = new Range(V);
    VariableArray<int> docSizeVar = Variable.Observed(sizes, CorpusSize);
    Range DocSize = new Range(docSizeVar[CorpusSize]);

    double[] alpha = { 0.5, 0.5, 0.5 };
    double[] beta = { 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 };
    VariableArray<Vector> theta = Variable.Array<Vector>(CorpusSize);
    VariableArray<Vector> phi = Variable.Array<Vector>(TopicsNum);
    theta[CorpusSize] = Variable.Dirichlet(alpha).ForEach(CorpusSize);
    phi[TopicsNum] = Variable.Dirichlet(beta).ForEach(TopicsNum);

    // Break symmetry by initialising phi marginals
    Vector denseBeta = new Vector(V, 10.0);
    Dirichlet[] initPhi = new Dirichlet[K];
    Dirichlet dirich = new Dirichlet(denseBeta);
    for (int k = 0; k < K; k++)
        initPhi[k] = new Dirichlet(dirich.Sample());
    phi.InitialiseTo(Distribution<Vector>.Array(initPhi));

    var Z = Variable.Array(Variable.Array<int>(DocSize), CorpusSize);
    var W = Variable.Array(Variable.Array<int>(DocSize), CorpusSize);
    W.ObservedValue = docs;

    using (Variable.ForEach(CorpusSize))
    {
        using (Variable.ForEach(DocSize))
        {
            Z[CorpusSize][DocSize] = Variable.Discrete(theta[CorpusSize]).Attrib(new ValueRange(TopicsNum));
            using (Variable.Switch(Z[CorpusSize][DocSize]))
            {
                W[CorpusSize][DocSize] = Variable.Discrete(phi[Z[CorpusSize][DocSize]]);
            }
        }
    }

    InferenceEngine engine = new InferenceEngine();
    Console.WriteLine(engine.Infer(Z));
}

Top 500 Contributor
Posts 2

thanks

Top 150 Contributor
Posts 5

What should the type of the value returned by engine.Infer(Z) in the last line be? I want to store the posterior distribution of Z in a local variable for future use. I tried several types, but none seemed to work.

Top 150 Contributor
Posts 5

Oh, it seems to work if I use a variable of type DistributionArray<DistributionRefArray<Discrete, int>>.

Thanks all

Top 10 Contributor
Posts 56

Although what you have is correct in this case, DistributionArray, DistributionRefArray, and the other distribution array classes are not designed to be used in the API. Infer.NET may use any one of a number of classes to represent distribution arrays internally, choosing the most efficient representation for the model. However, they can all be referenced via the IDistribution<> interface.

We encourage you to use either one of the following two approaches, depending on what you want to do with the posterior. 

 IDistribution<int[][]> ZPostAsDistribution = engine.Infer<IDistribution<int[][]>>(Z);

Discrete[][] ZPostAsArray = Distribution.ToArray<Discrete[][]>(engine.Infer(Z));

We are looking at possibly making the second case more succinct in a future release by just allowing Discrete[][] to be a type parameter for the Infer method.

John G.

Top 25 Contributor
Posts 37


Hi John,

 

Hiding the Ref/Struct arrays is a cool thing I wasn't yet aware of.

 

Unfortunately I can not make it work in F#. I tried the following, but the compiler complains "The field, constructor or member 'ToArray' is not defined. " This is particularly funny since I can select the method from the member list of the Distribution class.

 

 

        let infResult = inferenceEngine.Infer<IDistribution<Beta[]>>(epsilon)

        let infResultObj = inferenceEngine.Infer<obj>(epsilon)

        let epsilonPostAsArray = Distribution.ToArray<Beta[]>(infResultObj)       

 

Is there anything special about this method?

 

Laura

Top 10 Contributor
Posts 56

I think that in F# you currently need to use Distribution< >.ToArray rather than Distribution.ToArray. This is an F# bug that has been logged - it occurs when you have a generic and non-generic version of the same class name, and the non-generic version (Distribution in our case) has a generic method (ToArray in our case)

John

Top 25 Contributor
Posts 37

I tried the following as well, still get the same error. I tried to rebuild all, just in case. Still no success.

 

let epsilonPostAsArray = Distribution<_>.ToArray<Beta[]>(infResultObj)       

let epsilonPostAsArray = Distribution<Beta>.ToArray<Beta[]>(infResultObj)       

// just in case I was referencing the wrong class

let epsilonPostAsArray = MicrosoftResearch.Infer.Distributions.Distribution<_>.ToArray<Beta[]>(infResultObj)       

 

I find it strange that the following expression does not give compile errors.

let x = Distribution.Equals(infResult, infResultObj)

That is why I wonder what might be so special about the ToArray method.

 

Laura

Top 10 Contributor
Posts 56

You must have a space rather than an underscore in Distribution< >

John

Top 25 Contributor
Posts 37

Thanks, John!

Top 50 Contributor
Posts 14

Hi,

May I know the mathematical reason for breaking symmetry? What's so bad about all phis being identical? If we supply the data, the model learns and adapts accordingly, so I am not sure why we have to break symmetry.
