---
title: "Data Literacy"
subtitle: "Chapter 1: Data and Data Bases"
author: Prof. Dr. Michael Bücker
number-offset: [1,0]
bibliography: references.bib
---



# Data {background-color="#0014a0"}

::: footer
:::


## What is data?

:::: {.columns}

::: {.column width="47.5%"}
- **Data** represents information (i.e. details of facts and processes) based on known or assumed agreements in a form that can be processed by machine. 
- **Digital** data is represented by characters. A character (or: symbol) is an element from a finite set of different elements agreed upon to represent information, the so-called character set (or: alphabet).
- **Analog** data is represented by continuous functions. The analog representation is based on a physical quantity that changes continuously according to the facts or processes to be represented. Examples: thermometer, slide rule.

:::

::: {.column width="5%"}

:::

::: {.column width="47.5%"}
- The most important property of analog data: it is stepless (continuous)
- Digitization of analog data: the analog signal is measured at short time intervals, and a digital value is determined for each measured value
- The quality of this conversion depends on the length of the time interval between two measurements and on the accuracy of each measurement

![An example for quantization of an analog signal](https://upload.wikimedia.org/wikipedia/commons/7/70/Quantized.signal.svg){#fig-quantized}

:::
::::
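The digitization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sample_and_quantize`, the sine-wave input, and the step size of 0.5 are assumptions chosen for this example.

```python
import math

def sample_and_quantize(signal, sample_times, step):
    """Measure signal(t) at each sample time and round it to a multiple of step."""
    digital_values = []
    for t in sample_times:
        analog_value = signal(t)                            # continuous measurement
        digital_value = round(analog_value / step) * step   # nearest allowed level
        digital_values.append(digital_value)
    return digital_values

# Hypothetical analog signal: one period of a sine wave
def sine_wave(t):
    return math.sin(2 * math.pi * t)

sample_times = [k * 0.25 for k in range(5)]   # one measurement every 0.25 time units
quantized = sample_and_quantize(sine_wave, sample_times, step=0.5)
print(quantized)
```

A shorter interval between measurements or a smaller step size gives a digital representation that follows the analog signal more closely.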

## Storage of information on computer systems

- The elementary components on the lowest layer of a computer include **transistors**, which act as switches that are turned on or off by electrical impulses
- The states of a switch can be used to **store information**. The switches' states are the elementary form of information representation.
- A **binary character or bit** (binary digit) is a character from a character set of two characters. Any two distinct characters can be used to represent the bits; we commonly use the characters 0 and 1.
- All data and programs are represented by **sequences of bits** during internal computer processing.
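
As a small illustration of the points above: n bits can distinguish 2^n different states, and a character such as "A" is stored as a sequence of bits. This is only a sketch, and the width of 8 bits is an assumption made for the example.

```python
# n switches (bits) can take 2 ** n different combined states
states_per_width = {n_bits: 2 ** n_bits for n_bits in (1, 2, 8)}
print(states_per_width)

# The character "A" has the code point 65; written here with 8 bits
bits_of_a = format(ord("A"), "08b")
print(bits_of_a)
```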

## Coding of information

#### Central question: How can information be stored using binary characters?
- A **code** defines how information is represented by a given set of characters.
- The **dual system**, also called **binary system**, is a number system that uses a character set of only two different digits, namely 0 and 1, to represent numbers.
- When numbers are represented in the **dual system**, the digits are written one after the other without separators, just as in the familiar decimal system. The place value of each digit, however, is the power of two that corresponds to its position, not the power of ten.
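
The place-value rule can be made concrete with a short Python sketch. The helper `binary_to_decimal` is a hypothetical name introduced for illustration; `int(..., 2)` and `bin()` are Python built-ins that perform the same conversions.

```python
def binary_to_decimal(bits):
    """Add up the powers of two for every position that holds a 1."""
    value = 0
    for position, digit in enumerate(reversed(bits)):
        value += int(digit) * 2 ** position   # the rightmost digit has position 0
    return value

decimal_value = binary_to_decimal("1011")   # 1*8 + 0*4 + 1*2 + 1*1
print(decimal_value)

# Python's built-ins perform the same conversions
print(int("1011", 2))   # binary string to integer
print(bin(11))          # integer to binary string (prefixed with "0b")
```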

## Binary coding of information

::: callout-caution
## Homework

Please watch the following video:

{{< video https://youtu.be/1GSjbWt0c9M?si=xc9oYYQOmvT4VCif  width="1800" height="800">}}  

:::



## Data types

![Python data types](https://i.imgur.com/6cg2E9Q.png){#fig-pythondatatypes}

## Data types in Python

:::: {.columns}

::: {.column width="47.5%"}
- Integer
```{python}
#| echo: true

# Integer
i = 1
print(i)
type(i)
```

- Float
```{python}
#| echo: true
# Float 
f = 1.1
print(f)
type(f)
```


:::

::: {.column width="5%"}

:::

::: {.column width="47.5%"}

- Boolean
```{python}
#| echo: true
# Boolean
b = True
print(b)
type(b)
```

- String
```{python}
#| echo: true
# String
s = "Text"
print(s)
type(s)
```
:::
::::



## From data to wisdom (1/4) {#sec-datawisdom}

![The data pyramid (part 1)](img/pyramid1.png){#fig-datapyramid1}


## [-@sec-datawisdom] From data to wisdom (2/4) {.unnumbered}

![The data pyramid (part 2)](img/pyramid2.png){#fig-datapyramid2}
 
## [-@sec-datawisdom] From data to wisdom (3/4) {.unnumbered}

![The data pyramid (part 3)](img/pyramid3.png){#fig-datapyramid3}


## [-@sec-datawisdom] From data to wisdom (4/4)  {.unnumbered}

![The data pyramid (part 4)](img/pyramid4.png){#fig-datapyramid4}



## Data characteristics

![Types of data characteristics](img/datacharacteristics.png){#fig-datchar}

## Data types



# Databases {background-color="#0014a0"}

::: footer
:::

## Motivation

:::: {.columns}

::: {.column width="47.5%"}

- **Structured Storage**: Organizes data in a defined manner, so that relationships between different kinds of data can be established.
- **Data Integrity and Accuracy**: Ensures data remains accurate and consistent through integrity constraints and validation mechanisms.
- **Ease of Data Retrieval**: Facilitates data extraction through sophisticated querying and reporting capabilities.
- **Data Security**: Provides robust protection features to safeguard sensitive data through access controls.
- **Concurrency Control**: Supports simultaneous data access by multiple users while maintaining data consistency.


:::

::: {.column width="5%"}

:::

::: {.column width="47.5%"}
- **Data Backup and Recovery**: Offers built-in features to protect against data loss and enables data restoration.
- **Scalability and Performance**: Efficiently handles growing data and transactions, ensuring application responsiveness.
- **Compliance and Auditing**: Supports regulatory compliance and provides auditing tools for tracking data access.
- **Cost Efficiency**: Reduces the total cost of ownership through consolidated data management and automation.
- **Data Analysis and Decision-Making**: Enables data mining and analysis for informed decision-making and insights.

:::
::::


## Relational data models

- **Definition**: A relational data model organizes data into tables (or relations) where each table represents a different entity, and each row in a table represents a unique instance of that entity. Columns within the tables represent attributes of the entities.

- **Normalization**: A technique used to minimize data redundancy and to avoid insertion, update, and deletion anomalies. Data is split into tables so that repeating groups are eliminated and every attribute depends on the key of its table.

- **ACID Properties**:
  - **Atomicity**: Ensures that all parts of a transaction are completed successfully or none of them are applied at all.
  - **Consistency**: Ensures that a transaction takes the database from one consistent state to another, so that all integrity constraints hold before and after the transaction.
  - **Isolation**: Ensures that transactions running at the same time do not interfere with each other; each one behaves as if it were processed on its own.
  - **Durability**: Ensures that the effects of a committed transaction are permanent and can withstand system failures.


- **Schema**: Defines the structure of the relational database including tables, fields, and the relationships between them. The schema acts as a blueprint for how data is organized and how relationships between data are handled.
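
Atomicity can be illustrated with Python's built-in `sqlite3` module. The following is a minimal sketch under stated assumptions: the `account` table, its `CHECK` constraint, and the transferred amount are hypothetical and serve only to provoke a failure in the middle of a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary in-memory database
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    with conn:   # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE account SET balance = balance + 200 WHERE id = 2")
        # The next statement would make the balance negative and violates the CHECK
        conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 1")
except sqlite3.IntegrityError:
    print("Transfer failed, the whole transaction was rolled back")

balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(balances)   # both accounts still hold their original balance
```

Although the first `UPDATE` succeeded on its own, it is undone together with the failed second one, so the database never shows a half-finished transfer.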




## Relational data schemas

The relational model is a database model based on the concept of relations, which correspond to tables of data. In a relation, data is organized into tuples (rows) and attributes (columns).

:::: {.columns}

::: {.column width="47.5%"}


#### 1. Relations (Tables)
- A **Relation** is a set of tuples.
- Each **Tuple** represents a single item.
- Each **Attribute** in a tuple has a specific data type.


#### 2. Relationship cardinalities
- **One-to-One (1:1):** Each item in one relation is linked to exactly one item in another relation, and vice versa.
- **One-to-Many (1:M):** One item in a relation can be linked to many items in another relation, while each of those items is linked to only one item in the first relation.
- **Many-to-Many (M:M):** Each item in one relation can be linked to many items in another relation, and vice versa.

:::

::: {.column width="5%"}

:::

::: {.column width="47.5%"}


#### 3. Keys
- **Primary Key:** A unique identifier for each tuple within a relation.
- **Foreign Key:** A field in one relation that refers to the primary key in another relation.

#### 4. Integrity Constraints
- **Entity Integrity:** E.g., no primary key value can be null.
- **Referential Integrity:** Ensures that every foreign key value refers to an existing tuple in the referenced relation, so that links between relations never point to missing data.
- ...
:::
::::
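The concepts on this slide can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions, not a recommended schema: the tables `department` and `employee` and their contents are hypothetical, and they model a one-to-many relationship (one department, many employees).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,                                -- primary key
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department (dept_id)    -- foreign key
    )
""")
conn.execute("INSERT INTO department VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (10, 'Ada', 1)")
conn.execute("INSERT INTO employee VALUES (11, 'Grace', 1)")

# Referential integrity: department 99 does not exist, so this insert is rejected
try:
    conn.execute("INSERT INTO employee VALUES (12, 'Alan', 99)")
    violation_detected = False
except sqlite3.IntegrityError:
    violation_detected = True
print(violation_detected)

# Follow the foreign key to count the employees of each department
rows = conn.execute("""
    SELECT d.name, COUNT(*)
    FROM department AS d
    JOIN employee AS e ON e.dept_id = d.dept_id
    GROUP BY d.name
""").fetchall()
print(rows)
```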




## Visualization of relational data models


:::: {.columns}

::: {.column width="47.5%"}


![Example of the visualization of a relational data model](https://dev.mysql.com/doc/employee/en/images/employees-schema.png){#fig-relmod}


:::

::: {.column width="5%"}

:::

::: {.column width="47.5%"}

- In a visualization of relational data models, each **table** is represented by a box with the table's name on top and the list of **columns/attributes** below
- Special columns like **primary and foreign keys** are marked
- **Relationships** are represented by connections between the tables with respective notations for the **cardinalities** (see [@fig-cardinalities])

![Notation of relationship cardinalities](https://d2slcw3kip6qmk.cloudfront.net/marketing/pages/chart/erd-symbols/ERD-Notation.PNG){#fig-cardinalities}
:::
::::

## Accessing data bases

## Working with data bases - SQL

## Other types of data bases

Traditional Relational Database Management Systems (RDBMS) have been the standard for data storage and management. However, with the advent of big data and real-time applications, other database models have emerged to address specific needs.

:::: {.columns}

::: {.column width="47.5%"}
#### 1. NoSQL Databases
- **Key-Value Stores:** Simple and highly scalable, e.g., Redis, DynamoDB.
- **Document Stores:** Store, retrieve, and manage document-oriented information, e.g., MongoDB, CouchDB.
- **Column-family Stores:** Group related columns into families and are well suited to very large data sets, e.g., Cassandra, HBase.
- **Graph Databases:** Excellent for managing interconnected data, e.g., Neo4j, Amazon Neptune.

#### 2. NewSQL Databases
- Aim to provide the scalability of NoSQL databases while maintaining the ACID properties of relational databases, e.g. Google Spanner, CockroachDB.
:::

::: {.column width="5%"}

:::

::: {.column width="47.5%"}


#### 3. In-Memory Databases (IMDBs)
- Store data in the main memory (instead of disk) for faster data access, e.g., Redis, SAP HANA.

#### 4. Time Series Databases (TSDBs)
- Optimized for handling time-series data, e.g., InfluxDB, Prometheus.

#### 5. Multi-model Databases
- Support multiple data models within a single, integrated backend, e.g., ArangoDB, OrientDB.

:::
::::
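To give a feeling for how some of these data models differ, the following sketch shapes similar information as a key-value pair, a document, and a graph. It uses plain Python data structures only; it does not use the APIs of any of the products named above, and all keys and values are made up for illustration.

```python
import json

# Key-value store: an opaque value is looked up by its key
key_value_store = {"session:42": "user=ada;theme=dark"}
print(key_value_store["session:42"])

# Document store: each record is a self-describing, possibly nested document
document = {
    "_id": 1,
    "name": "Ada",
    "skills": ["math", "programming"],
    "address": {"city": "London"},
}
document_json = json.dumps(document)   # documents are often exchanged as JSON
print(document_json)

# Graph database: nodes are connected by labeled edges
edges = [("Ada", "KNOWS", "Charles"), ("Ada", "LIVES_IN", "London")]
neighbors_of_ada = [target for source, _, target in edges if source == "Ada"]
print(neighbors_of_ada)
```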





# References {.unnumbered .scrollable}

::: {#refs}
:::