diff --git a/00_Introduction.qmd b/00_Introduction.qmd index 34bd408e8c71528e05a7f9dcd6fe809214d97b94..f84cce30bef499ccf33ba2c7e5e6f7365d0869a6 100644 --- a/00_Introduction.qmd +++ b/00_Introduction.qmd @@ -153,7 +153,7 @@ data bases - Apply basic data engineering techniques using Python in Jupyter Notebooks - Understand and apply basic data analytics techniques to create business value -- Understand data fellacies and know how to avoid them +- Understand data fallacies and know how to avoid them - Reflect on data ethics in a personal and business context ## Agenda @@ -162,7 +162,7 @@ business value 2. Data Insights Process 3. Data Engineering 4. Data Analysis -5. Data Fellacies +5. Data Fallacies 6. Data Storytelling 7. Data Ethics diff --git a/05_data_fellacies.qmd b/05_data_fellacies.qmd index 04c8bb4db94ebec59d9186342228cd3379f74070..c071c4a04fb09011be01ad772b133b922d3ebe4a 100644 --- a/05_data_fellacies.qmd +++ b/05_data_fellacies.qmd @@ -1,6 +1,6 @@ --- title: "Data Literacy" -subtitle: "Chapter 5: Data Fellacies" +subtitle: "Chapter 5: Data Fallacies" author: Prof. Dr. Michael Bücker number-offset: [6,0] bibliography: references.bib @@ -15,7 +15,7 @@ bibliography: references.bib -# Data fellacies {background-color="#0014a0"} +# Data fallacies {background-color="#0014a0"} ::: footer ::: @@ -23,7 +23,25 @@ bibliography: references.bib ## Definition -# Exercise {background-color="#0014a0"} +- Data fallacies are misleading conclusions drawn from data due to incorrect reasoning or flawed analysis methods. +- They can result from improper data collection, misinterpretation of statistical significance, or failure to account for variables. +- Common types include sampling bias, false causality, and overlooking Simpson's Paradox. +- Such fallacies can lead to poor decision-making and misinformed policies. +- Awareness and rigorous statistical testing are key to avoiding data fallacies. 
+ + +## Guidelines + +- Ensure rigorous statistical testing and data collection methods. +- Be aware of common data fallacies and actively check for them in analysis. +- Utilize randomized controlled trials to establish causation where possible. +- Collect data that is representative of the entire population to avoid sampling bias. +- Apply critical thinking and seek peer review to challenge and validate findings. +- Visualize data to identify patterns or anomalies that summary statistics may miss. +- Include context and consider external factors when interpreting ratios and percentages. +- Remain skeptical of correlations until causation has been reliably demonstrated. + + # Examples {background-color="#0014a0"} @@ -36,6 +54,11 @@ bibliography: references.bib :::: {.columns} ::: {.column width="47.5%"} +- Survivorship bias is a logical error focusing on aspects that support surviving a process and overlooking those that did not due to lack of visibility. +- Common in finance where funds with better performance are more visible than those performing poorly or those that have ceased to exist. +- It can lead to overly optimistic beliefs because failures are ignored, such as in wartime machinery damage analysis where only returning aircraft were examined. +- Often causes false conclusions by examining only "success" stories and not considering the unseen failures behind data points. +- To avoid survivorship bias, it's important to look at the whole picture, including those elements or data sets that might have been excluded over time. ::: @@ -51,12 +74,18 @@ bibliography: references.bib -## False causality +## False causality and spurious correlation :::: {.columns} ::: {.column width="47.5%"} -[https://www.tylervigen.com/spurious-correlations](https://www.tylervigen.com/spurious-correlations) + +- False causality refers to incorrectly assuming that a correlation between two variables implies one causes the other. 
+- Coincidental timing can lead to assuming a cause-and-effect relationship where none exists. +- It overlooks potential underlying factors or confounding variables that may influence the correlation. +- Critical analysis with rigorous statistical methods is needed to establish true causality. + + ::: ::: {.column width="5%"} @@ -66,22 +95,108 @@ bibliography: references.bib ::: {.column width="47.5%"} )](img/fellacies_falsecausality.png){#fig-falsecausality} +See also: [https://www.tylervigen.com/spurious-correlations](https://www.tylervigen.com/spurious-correlations) + +::: +:::: + + +## Danger of summary statistics + +:::: {.columns} + +::: {.column width="47.5%"} + +- Anscombe's Quartet shows four datasets with identical summary statistics but different distributions. +- Reliance on summary stats can be misleading as they may not capture data nuances. +- Outliers and data distribution patterns are not evident in summary statistics. +- Data visualization is essential alongside numerical summary to understand data fully. + + + +::: + +::: {.column width="5%"} + +::: + +::: {.column width="47.5%"} +)](img/fellacies_summarymetrics.png){#fig-fellacies_summarymetrics} + +see also: [https://www.research.autodesk.com/publications/same-stats-different-graphs/](https://www.research.autodesk.com/publications/same-stats-different-graphs/) + +::: +:::: + + +## Simpson's Paradox + +:::: {.columns} + +::: {.column width="47.5%"} + +- Simpson's Paradox occurs when a trend appears in different groups of data but disappears or reverses when these groups are combined. +- This paradox warns against combining data from multiple sources without accounting for potential lurking variables. +- It demonstrates how aggregated data can mask or misrepresent the actual nature of the relationships within the data. +- Careful examination of data subsets is necessary to ensure accurate interpretation of statistical results. 
+ + + + +::: + +::: {.column width="5%"} + +::: + +::: {.column width="47.5%"} +)](img/fellacies_simpsonsparadox.png){#fig-fellacies_simpsonsparadox} + ::: :::: -## Exercise -::: callout-caution -## Exercise +## Sampling bias + +:::: {.columns} + +::: {.column width="47.5%"} + +- Sampling bias occurs when a sample is not representative of the population as a whole. +- It results in certain groups being overrepresented or underrepresented in the sample. +- Common causes include non-random sampling, pre-selection criteria, and inaccessible segments of the population. +- The bias skews the statistical analysis, leading to inaccurate conclusions. +- To avoid sampling bias, random sampling and thorough design of the sampling process are crucial. -Please analyze the following use case ::: +::: {.column width="5%"} + +::: + +::: {.column width="47.5%"} +)](img/samplingbias.png){#fig-samplingbias} + +::: +:::: + + +## Data fallacies when using ratios/percentages instead of absolute values + +- **Misleading Magnitudes**: Ratios and percentages can exaggerate the importance of changes in small data sets. +- **Base Rate Neglect**: Interpretations may be skewed if the original quantities (base rates) involved are not considered. +- **False Equivalence**: Ratios might imply a misleading equivalence between groups of vastly different sizes. +- **Percentage Change Oversights**: Large percentage increases can occur from small absolute changes, and vice versa. +- **Omitted Numerator/Denominator Context**: Without context on what's being divided, ratios can be meaningless or deceptive. +- **Aggregation Fallacies**: Percentages of subgroups may not accurately represent the whole if the subgroup sizes are disproportionate. 
+ + + # References {.unnumbered .scrollable} diff --git a/img/fellacies_simpsonsparadox.png b/img/fellacies_simpsonsparadox.png new file mode 100644 index 0000000000000000000000000000000000000000..d7f3843f13a0feca7b207952b60a20d1b2888e2b Binary files /dev/null and b/img/fellacies_simpsonsparadox.png differ diff --git a/img/fellacies_summarymetrics.png b/img/fellacies_summarymetrics.png new file mode 100644 index 0000000000000000000000000000000000000000..55d46d26f67fd24e6d6690e7f416f9fa2c0221b8 Binary files /dev/null and b/img/fellacies_summarymetrics.png differ diff --git a/img/samplingbias.png b/img/samplingbias.png new file mode 100644 index 0000000000000000000000000000000000000000..d5865d5f947ff2977edb0b2554b8c6b73b01ec9f Binary files /dev/null and b/img/samplingbias.png differ diff --git a/output/00_Introduction.html b/output/00_Introduction.html index f887020d8b920acf60629bd8efd64636b94f18a6..6b948a7c1dcc2065e7476fe5ee87894c0967d193 100644 --- a/output/00_Introduction.html +++ b/output/00_Introduction.html @@ -587,7 +587,7 @@ Chief Economist at Google</p> <li>Have an overview of data bases technologies and implementation of data bases</li> <li>Apply basic data engineering techniques using Python in Jupyter Notebooks</li> <li>Understand and apply basic data analytics techniques to create business value</li> -<li>Understand data fellacies and know how to avoid them</li> +<li>Understand data fallacies and know how to avoid them</li> <li>Reflect on data ethics in a personal and business context</li> </ul> </section> @@ -598,7 +598,7 @@ Chief Economist at Google</p> <li>Data Insights Process</li> <li>Data Engineering</li> <li>Data Analysis</li> -<li>Data Fellacies</li> +<li>Data Fallacies</li> <li>Data Storytelling</li> <li>Data Ethics</li> </ol> diff --git a/output/05_data_fellacies.html b/output/05_data_fellacies.html index 45fa12b51449a735a669947160982edef6711155..b2c901fdff2739874e19ef23a290cf2eb3e83cc7 100644 --- a/output/05_data_fellacies.html +++ 
b/output/05_data_fellacies.html @@ -347,7 +347,7 @@ <section id="title-slide" data-background-image="img/title.png" data-background-size="cover" class="quarto-title-block center"> <h1 class="title">Data Literacy</h1> - <p class="subtitle">Chapter 6: Data Storytelling</p> + <p class="subtitle">Chapter 5: Data Fallacies</p> <div class="quarto-title-authors"> <div class="quarto-title-author"> @@ -361,9 +361,8 @@ Prof. Dr. Michael Bücker <nav role="doc-toc"> <h2 id="toc-title">Table of contents</h2> <ul> -<li><a href="#/data-storytelling" id="/toc-data-storytelling"><span class="header-section-number">6.1</span> Data storytelling</a></li> -<li><a href="#/exercise" id="/toc-exercise"><span class="header-section-number">6.2</span> Exercise</a></li> -<li><a href="#/examples" id="/toc-examples"><span class="header-section-number">6.3</span> Examples</a></li> +<li><a href="#/data-fallacies" id="/toc-data-fallacies"><span class="header-section-number">6.1</span> Data fallacies</a></li> +<li><a href="#/examples" id="/toc-examples"><span class="header-section-number">6.2</span> Examples</a></li> <li><a href="#/references" id="/toc-references">References</a></li> </ul> </nav> @@ -371,29 +370,169 @@ Prof. Dr. 
Michael Bücker <section id="where-are-we" class="slide level3 unnumbered"> <h3>Where are we?</h3> -<img data-src="img/data_storytelling.png" class="r-stretch"></section> +<img data-src="img/data_analysis.png" class="r-stretch"></section> <section> -<section id="data-storytelling" class="title-slide slide level2 center" data-background-color="#0014a0" data-number="6.1"> -<h2><span class="header-section-number">6.1</span> Data storytelling</h2> +<section id="data-fallacies" class="title-slide slide level2 center" data-background-color="#0014a0" data-number="6.1"> +<h2><span class="header-section-number">6.1</span> Data fallacies</h2> <div class="footer"> </div> </section> <section id="definition" class="slide level3" data-number="6.1.1"> <h3><span class="header-section-number">6.1.1</span> Definition</h3> -</section></section> -<section id="exercise" class="title-slide slide level2 center" data-background-color="#0014a0" data-number="6.2"> -<h2><span class="header-section-number">6.2</span> Exercise</h2> - +<ul> +<li>Data fallacies are misleading conclusions drawn from data due to incorrect reasoning or flawed analysis methods.</li> +<li>They can result from improper data collection, misinterpretation of statistical significance, or failure to account for variables.</li> +<li>Common types include sampling bias, false causality, and overlooking Simpson’s Paradox.</li> +<li>Such fallacies can lead to poor decision-making and misinformed policies.</li> +<li>Awareness and rigorous statistical testing are key to avoiding data fallacies.</li> +</ul> </section> - -<section id="examples" class="title-slide slide level2 center" data-background-color="#0014a0" data-number="6.3"> -<h2><span class="header-section-number">6.3</span> Examples</h2> +<section id="guidelines" class="slide level3" data-number="6.1.2"> +<h3><span class="header-section-number">6.1.2</span> Guidelines</h3> +<ul> +<li>Ensure rigorous statistical testing and data collection methods.</li> +<li>Be aware of 
common data fallacies and actively check for them in analysis.</li> +<li>Utilize randomized controlled trials to establish causation where possible.</li> +<li>Collect data that is representative of the entire population to avoid sampling bias.</li> +<li>Apply critical thinking and seek peer review to challenge and validate findings.</li> +<li>Visualize data to identify patterns or anomalies that summary statistics may miss.</li> +<li>Include context and consider external factors when interpreting ratios and percentages.</li> +<li>Remain skeptical of correlations until causation has been reliably demonstrated.</li> +</ul> +</section></section> +<section> +<section id="examples" class="title-slide slide level2 center" data-background-color="#0014a0" data-number="6.2"> +<h2><span class="header-section-number">6.2</span> Examples</h2> <div class="footer"> </div> </section> - +<section id="survivorship-bias" class="slide level3" data-number="6.2.1"> +<h3><span class="header-section-number">6.2.1</span> Survivorship bias</h3> +<div class="columns"> +<div class="column" style="width:47.5%;"> +<ul> +<li>Survivorship bias is a logical error focusing on aspects that support surviving a process and overlooking those that did not due to lack of visibility.</li> +<li>Common in finance where funds with better performance are more visible than those performing poorly or those that have ceased to exist.</li> +<li>It can lead to overly optimistic beliefs because failures are ignored, such as in wartime machinery damage analysis where only returning aircraft were examined.</li> +<li>Often causes false conclusions by examining only “success” stories and not considering the unseen failures behind data points.</li> +<li>To avoid survivorship bias, it’s important to look at the whole picture, including those elements or data sets that might have been excluded over time.</li> +</ul> +</div><div class="column" style="width:5%;"> + +</div><div class="column" style="width:47.5%;"> +<div
id="fig-survivorshipbias" class="quarto-figure quarto-figure-center"> +<figure> +<p><img data-src="img/fellacies_survivorshipbias.png"></p> +<figcaption>Figure 6.1: Visualization of the survivorship bias (<em>Source:</em> <a href="https://www.geckoboard.com/best-practice/statistical-fallacies/">https://www.geckoboard.com/best-practice/statistical-fallacies/</a>)</figcaption> +</figure> +</div> +</div> +</div> +</section> +<section id="false-causality-and-spurious-correlation" class="slide level3" data-number="6.2.2"> +<h3><span class="header-section-number">6.2.2</span> False causality and spurious correlation</h3> +<div class="columns"> +<div class="column" style="width:47.5%;"> +<ul> +<li>False causality refers to incorrectly assuming that a correlation between two variables implies one causes the other.</li> +<li>Coincidental timing can lead to assuming a cause-and-effect relationship where none exists.</li> +<li>It overlooks potential underlying factors or confounding variables that may influence the correlation.</li> +<li>Critical analysis with rigorous statistical methods is needed to establish true causality.</li> +</ul> +</div><div class="column" style="width:5%;"> + +</div><div class="column" style="width:47.5%;"> +<div id="fig-falsecausality" class="quarto-figure quarto-figure-center"> +<figure> +<p><img data-src="img/fellacies_falsecausality.png"></p> +<figcaption>Figure 6.2: Visualization of false causality (<em>Source:</em> <a href="https://www.geckoboard.com/best-practice/statistical-fallacies/">https://www.geckoboard.com/best-practice/statistical-fallacies/</a>)</figcaption> +</figure> +</div> +<p>See also: <a href="https://www.tylervigen.com/spurious-correlations">https://www.tylervigen.com/spurious-correlations</a></p> +</div> +</div> +</section> +<section id="danger-of-summary-statistics" class="slide level3" data-number="6.2.3"> +<h3><span class="header-section-number">6.2.3</span> Danger of summary statistics</h3> +<div class="columns"> +<div 
class="column" style="width:47.5%;"> +<ul> +<li>Anscombe’s Quartet shows four datasets with identical summary statistics but different distributions.</li> +<li>Reliance on summary stats can be misleading as they may not capture data nuances.</li> +<li>Outliers and data distribution patterns are not evident in summary statistics.</li> +<li>Data visualization is essential alongside numerical summary to understand data fully.</li> +</ul> +</div><div class="column" style="width:5%;"> + +</div><div class="column" style="width:47.5%;"> +<div id="fig-fellacies_summarymetrics" class="quarto-figure quarto-figure-center"> +<figure> +<p><img data-src="img/fellacies_summarymetrics.png"></p> +<figcaption>Figure 6.3: Visualization of the danger of summary statistics (<em>Source:</em> <a href="https://www.geckoboard.com/best-practice/statistical-fallacies/">https://www.geckoboard.com/best-practice/statistical-fallacies/</a>)</figcaption> +</figure> +</div> +<p>see also: <a href="https://www.research.autodesk.com/publications/same-stats-different-graphs/">https://www.research.autodesk.com/publications/same-stats-different-graphs/</a></p> +</div> +</div> +</section> +<section id="simpsons-paradox" class="slide level3" data-number="6.2.4"> +<h3><span class="header-section-number">6.2.4</span> Simpson’s Paradox</h3> +<div class="columns"> +<div class="column" style="width:47.5%;"> +<ul> +<li>Simpson’s Paradox occurs when a trend appears in different groups of data but disappears or reverses when these groups are combined.</li> +<li>This paradox warns against combining data from multiple sources without accounting for potential lurking variables.</li> +<li>It demonstrates how aggregated data can mask or misrepresent the actual nature of the relationships within the data.</li> +<li>Careful examination of data subsets is necessary to ensure accurate interpretation of statistical results.</li> +</ul> +</div><div class="column" style="width:5%;"> + +</div><div class="column" 
style="width:47.5%;"> +<div id="fig-fellacies_simpsonsparadox" class="quarto-figure quarto-figure-center"> +<figure> +<p><img data-src="img/fellacies_simpsonsparadox.png"></p> +<figcaption>Figure 6.4: Visualization of Simpson’s Paradox (<em>Source:</em> <a href="https://www.geckoboard.com/best-practice/statistical-fallacies/">https://www.geckoboard.com/best-practice/statistical-fallacies/</a>)</figcaption> +</figure> +</div> +</div> +</div> +</section> +<section id="sampling-bias" class="slide level3" data-number="6.2.5"> +<h3><span class="header-section-number">6.2.5</span> Sampling bias</h3> +<div class="columns"> +<div class="column" style="width:47.5%;"> +<ul> +<li>Sampling bias occurs when a sample is not representative of the population as a whole.</li> +<li>It results in certain groups being overrepresented or underrepresented in the sample.</li> +<li>Common causes include non-random sampling, pre-selection criteria, and inaccessible segments of the population.</li> +<li>The bias skews the statistical analysis, leading to inaccurate conclusions.</li> +<li>To avoid sampling bias, random sampling and thorough design of the sampling process are crucial.</li> +</ul> +</div><div class="column" style="width:5%;"> + +</div><div class="column" style="width:47.5%;"> +<div id="fig-samplingbias" class="quarto-figure quarto-figure-center"> +<figure> +<p><img data-src="img/samplingbias.png"></p> +<figcaption>Figure 6.5: Visualization of a sampling bias (<em>Source:</em> <a href="https://www.geckoboard.com/best-practice/statistical-fallacies/">https://www.geckoboard.com/best-practice/statistical-fallacies/</a>)</figcaption> +</figure> +</div> +</div> +</div> +</section> +<section id="data-fallacies-when-using-ratiospercentages-instead-of-absolute-values" class="slide level3" data-number="6.2.6"> +<h3><span class="header-section-number">6.2.6</span> Data fallacies when using ratios/percentages instead of absolute values</h3> +<ul> +<li><strong>Misleading 
Magnitudes</strong>: Ratios and percentages can exaggerate the importance of changes in small data sets.</li> +<li><strong>Base Rate Neglect</strong>: Interpretations may be skewed if the original quantities (base rates) involved are not considered.</li> +<li><strong>False Equivalence</strong>: Ratios might imply a misleading equivalence between groups of vastly different sizes.</li> +<li><strong>Percentage Change Oversights</strong>: Large percentage increases can occur from small absolute changes, and vice versa.</li> +<li><strong>Omitted Numerator/Denominator Context</strong>: Without context on what’s being divided, ratios can be meaningless or deceptive.</li> +<li><strong>Aggregation Fallacies</strong>: Percentages of subgroups may not accurately represent the whole if the subgroup sizes are disproportionate.</li> +</ul> +</section></section> <section id="references" class="title-slide slide level2 unnumbered scrollable smaller"> <h2>References</h2> <div id="refs" role="list"> diff --git a/output/img/fellacies_simpsonsparadox.png b/output/img/fellacies_simpsonsparadox.png new file mode 100644 index 0000000000000000000000000000000000000000..d7f3843f13a0feca7b207952b60a20d1b2888e2b Binary files /dev/null and b/output/img/fellacies_simpsonsparadox.png differ diff --git a/output/img/fellacies_summarymetrics.png b/output/img/fellacies_summarymetrics.png new file mode 100644 index 0000000000000000000000000000000000000000..55d46d26f67fd24e6d6690e7f416f9fa2c0221b8 Binary files /dev/null and b/output/img/fellacies_summarymetrics.png differ diff --git a/output/img/samplingbias.png b/output/img/samplingbias.png new file mode 100644 index 0000000000000000000000000000000000000000..d5865d5f947ff2977edb0b2554b8c6b73b01ec9f Binary files /dev/null and b/output/img/samplingbias.png differ
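
The Simpson's Paradox slide added in this patch describes a trend that holds within each group but reverses once the groups are combined. A minimal Python sketch of that reversal could look like the following. The option names `A` and `B`, the `easy`/`hard` groups, the helper names, and all counts are assumptions invented purely for illustration; none of them come from the course material.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
# Option A is mostly applied to hard cases, option B mostly to easy ones.
counts = {
    "A": {"easy": (90, 100), "hard": (300, 500)},
    "B": {"easy": (425, 500), "hard": (50, 100)},
}


def group_rates(data):
    """Success rate of each option within each group."""
    return {
        option: {group: s / t for group, (s, t) in groups.items()}
        for option, groups in data.items()
    }


def overall_rates(data):
    """Success rate of each option after pooling all groups."""
    rates = {}
    for option, groups in data.items():
        successes = sum(s for s, _ in groups.values())
        trials = sum(t for _, t in groups.values())
        rates[option] = successes / trials
    return rates


by_group = group_rates(counts)
overall = overall_rates(counts)

for option in counts:
    print(option, by_group[option], "overall:", round(overall[option], 3))
```

With these made-up counts, option A has the higher success rate in both the easy and the hard group, yet option B has the higher pooled rate, because the two options are applied to very different mixes of easy and hard cases. That uneven group size is the lurking variable the slide warns about.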