How does an LSTM process sequences longer than its memory?







Terminology:





  • Cell: the LSTM unit containing the input, forget, and output gates, the hidden state hT, and the cell state cT.

  • Hidden units/memory: how far back in time the LSTM is "unrolled". A hidden unit is an instance of the cell at a particular time.

  • A hidden unit is parameterized by [wT, cT, hT-1]: the gate weights for the current hidden unit, the current cell state, and the previous hidden unit's output, where wT represents the input, output, and forget gate weights.




An LSTM maintains separate gate weights wT for each hidden unit. This way it can treat different points in time of a sequence differently.



Let's say an LSTM has 3 hidden units, so it has gate weights w1, w2, w3, one set for each of them. Then a sequence x1, x2, ..., xN comes through. I am illustrating the cell as it transitions over time:



@ t=1
xN....x3 x2 x1
[w1, c1, h0]
(c2, h1)

@ t=2
xN....x4 x3 x2 x1
[w2, c2, h1]
(c3, h2)

@ t=3
xN....x5 x4 x3 x2 x1
[w3, c3, h2]
(c4, h3)


But what happens at t=4? The LSTM only has memory, and therefore gate weights, for 3 steps:



@ t=4
xN....x6 x5 x4 x3 x2 x1
[w?, c3, h2]
(c4, h3)


What weights are used for x4 and all the following inputs? In essence, how are sequences that are longer than an LSTM cell's memory treated? Do the gate weights reset back to w1, or do they remain static at their latest value wT?





Edit: My question is not a duplicate of the LSTM inference question. That question asks about multi-step prediction from inputs, whereas I am asking what weights are used over time for sequences that are longer than the internal hidden/cell states. The question of weights is not addressed in that answer.










neural-networks lstm rnn

asked Apr 3 at 16:30 by hazrmard, edited Apr 3 at 16:58 by Sycorax

  • @Sycorax not a duplicate. That question is asking about multi-step prediction and how inputs relate to it. My question is about the internal mechanics i.e. what weights are used for sequences longer than memory.
    – hazrmard, Apr 3 at 16:44

  • My question is not about prediction in the first place. It is about the use of weights. This is not addressed in the other question.
    – hazrmard, Apr 3 at 16:47

  • Ah, I see. Withdrawn.
    – Sycorax, Apr 3 at 16:53

2 Answers

The gates are a function of the weights, the cell state and the hidden state.
The weights are fixed.



Consider the equation for the forget gate $f_t$:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
The forget gate uses the new data $x_t$ and the hidden state $h_{t-1}$, but $W_f$ and $b_f$ are fixed. This is why the LSTM only needs to keep the previous $h$ and the previous $c$.



More information:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
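
To make the weight sharing concrete, here is a minimal NumPy sketch (my own illustration, not part of the original answer; the names and sizes are arbitrary) of one LSTM cell with a single, fixed set of gate weights being stepped through a sequence of arbitrary length. Only $h$ and $c$ change from step to step; the weights never do.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = 4, 8                       # arbitrary input / hidden sizes

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One fixed set of gate weights, shared across ALL time steps.
    W_f, W_i, W_o, W_c = (rng.standard_normal((n_hid, n_in + n_hid)) * 0.1
                          for _ in range(4))
    b_f = b_i = b_o = b_c = np.zeros(n_hid)

    def lstm_step(x_t, h_prev, c_prev):
        z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
        f = sigmoid(W_f @ z + b_f)           # forget gate
        i = sigmoid(W_i @ z + b_i)           # input gate
        o = sigmoid(W_o @ z + b_o)           # output gate
        c_tilde = np.tanh(W_c @ z + b_c)     # candidate cell state
        c = f * c_prev + i * c_tilde         # new cell state
        h = o * np.tanh(c)                   # new hidden state
        return h, c

    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for x_t in rng.standard_normal((1000, n_in)):   # a sequence of any length
        h, c = lstm_step(x_t, h, c)                 # same W_*, b_* every step

There is no extra weight matrix waiting for t=4; the loop simply reuses the same four matrices for x4, x5, and every later input.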






– Sycorax, answered Apr 3 at 16:53, edited Apr 3 at 17:02

  • So my premise that separate weights are maintained for each hidden state is incorrect? Only a single set of weights is learned no matter how far back in time the network is rolled?
    – hazrmard, Apr 3 at 17:07

  • Yes, that was why I was initially confused. There's a single set of weights and biases. The weights, the previous $h$ and previous $c$, and the new input $x$ are used to update the gates. The gates are used to (1) produce the prediction for the next step and (2) update $h$ and $c$.
    – Sycorax, Apr 3 at 17:10
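
Following up on that comment, a quick (hypothetical) way to see this in a framework is that an LSTM cell's parameter count depends only on the input and hidden sizes, never on how many steps you run it for, e.g. in PyTorch:

    import torch

    cell = torch.nn.LSTMCell(input_size=8, hidden_size=16)
    h = torch.zeros(1, 16)
    c = torch.zeros(1, 16)

    # The same cell object (same weights and biases) is applied at every step,
    # however long the sequence is.
    for x_t in torch.randn(1000, 1, 8):      # a 1000-step sequence
        h, c = cell(x_t, (h, c))

    # Parameter count is fixed by input_size and hidden_size alone;
    # it does not grow with the number of steps in the loop above.
    print(sum(p.numel() for p in cell.parameters()))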






















$begingroup$

Expanding a little bit on what Sycorax said, the basic recurrent cell is something like

$$ h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$

It is a function of the previous hidden state $h_{t-1}$ and the current input $x_t$, and it returns the current hidden state $h_t$. The same applies to the LSTM cell, which is a special kind of RNN.

So it does not "look" directly at any input in the past; information from the past is passed on only through the hidden states $h_t$. It follows that, in theory, all of the history contributes to $h_t$ to some extent. If you have multiple such cells, they do not look directly at different points in time; they just use different weights. Of course, it can be the case that some hidden states carry more information from points further back in time than others do, but what information is carried is learned from the data rather than forced.
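
As a rough sketch of that recurrence (again, the names and sizes are my own), a plain RNN applied to an arbitrarily long sequence looks like the loop below; the only place past inputs can influence the present is through $h$:

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hid = 3, 5                      # arbitrary sizes

    W_hh = rng.standard_normal((n_hid, n_hid)) * 0.1
    W_xh = rng.standard_normal((n_hid, n_in)) * 0.1

    h = np.zeros(n_hid)                     # all history is summarized here
    for x_t in rng.standard_normal((10000, n_in)):
        # The cell never sees x_{t-1}, x_{t-2}, ... directly;
        # their influence reaches h_t only through the previous h.
        h = np.tanh(W_hh @ h + W_xh @ x_t)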






– Tim, answered Apr 3 at 17:24



