Training vs validation set

I have a data set with which I am trying to find correlations.



I split the data into a training set and a validation set. I also have a solver I built which finds the "best coefficients" to give me the best results on the training set.



After solving for the training set, the validation set shows completely different results which do not support the results on the training set.



I then made my solver output all of its results rather than only the best one. Some candidate solutions give positive, very similar results on both the training and validation sets, while others give results that differ between the two sets.



Is it okay to choose, by hand, the candidates whose results are most similar and most positive across both the training and validation sets, or does this defeat the purpose of the validation set and invalidate the results?










  • Maybe move to Data Science? – Paul Childs, Dec 10 '18 at 7:45










  • Your best bet here is to use cross-validation, if you want to improve the performance of your model on the validation set. As indicated elsewhere, choosing the best results by hand is a BIG no-no, I would agree. – Adrian Keister, Dec 10 '18 at 14:57
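To make the cross-validation suggestion concrete, here is a minimal k-fold sketch in plain NumPy. The data, function name, and fold count are all illustrative, not taken from the question: each candidate is fit on k-1 folds and scored on the held-out fold, so no fold is ever used both to fit and to judge the same coefficients.

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Average held-out MSE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit coefficients on the training folds only ...
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # ... and score them on the fold that was held out.
        pred = X[val] @ coef
        mses.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(mses))

# Hypothetical data: y depends linearly on one feature, plus noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)
print(kfold_cv_mse(X, y))  # average validation error across the 5 folds
```

The averaged held-out error is what you would compare between candidate models, instead of eyeballing agreement between a single train/validation pair.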
Tags: data-analysis
asked Dec 10 '18 at 7:26 – Frank
1 Answer
This absolutely will invalidate the results. Data science is meant to be scientific, free from bias. It's fine to make a hypothesis and test it; if the results aren't what you want, you revise the hypothesis and try again. That's what good science does. But manual "data wrangling" is a big no-no.
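The failure mode this answer warns about can be simulated (synthetic data, purely for illustration): even when features and targets are pure noise, cherry-picking the best of many candidate coefficient vectors by training-set fit produces an impressive training score that evaporates on the held-out set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pure noise: there is no real relationship to find.
X_tr, y_tr = rng.normal(size=(50, 10)), rng.normal(size=50)
X_va, y_va = rng.normal(size=(50, 10)), rng.normal(size=50)

# Many random "solver outputs" (candidate coefficient vectors).
cands = rng.normal(size=(2000, 10))
tr_corr = [np.corrcoef(X_tr @ c, y_tr)[0, 1] for c in cands]
best = cands[int(np.argmax(tr_corr))]  # cherry-picked on the training set

print(max(tr_corr))                           # looks impressive
print(np.corrcoef(X_va @ best, y_va)[0, 1])   # near zero: it was noise
```

Hand-picking among candidates using the validation set is the same mechanism one level up: the validation set becomes part of the selection, and its score stops being an honest estimate.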
  • Thanks for your answer. I worry that my solver is just finding outliers in the data, and focusing on them. The results that match both might be the third or fourth best results that the solver found. It is still technically found by the solver. Does this make any difference? – Frank, Dec 10 '18 at 7:55










  • A least-mean-squares regression, for example, is biased towards outliers, due to the nonlinearity of the square. There are other estimators, biased differently, as well as techniques for filtering out noise, but these shouldn't be applied by hand; they should be based on knowledge of the expected error. – Paul Childs, Dec 10 '18 at 23:22
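A tiny illustration of that point about squared loss and outliers (hypothetical numbers): the constant minimizing squared error is the mean, which a single outlier drags a long way; the constant minimizing absolute error is the median, which barely moves.

```python
import numpy as np

# Four points near 1.0 plus one outlier at 10.0.
data = np.array([1.0, 1.1, 0.9, 1.0, 10.0])

mean_fit = data.mean()        # minimizes sum of squared residuals
median_fit = np.median(data)  # minimizes sum of absolute residuals

print(mean_fit)    # 2.8 -- pulled hard toward the outlier
print(median_fit)  # 1.0 -- essentially unaffected
```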
answered Dec 10 '18 at 7:44 – Paul Childs