Joyful Little Bubbles London Skyline
Hero Image

Charity Overview - Databases

posted by Nothingislost @ Thursday, 25 February 2021

With databases we want to find a way to best represent the data. A simple example might be a questionnaire with people's views on a particular topic. These questions may have some ticks and number boxes for age, gender, income bracket, married status, city, and some textual responses to the views.

# Basic Representation

The first four are represented in a straight forward way - we could have a table with the following rows

Table: Personal
| id | age | gender | status | income | city

and we might say the colums will hold this sort of data type:

  • id : primary key
  • age : number
  • gender : number
  • status : number
  • income : number
  • city : text

We have abstracted gender into a number, each number representing a selection on gender. We have abstracted status into a number where each number represents a selection on status; we have abstracted income bracket into a number where each number represent a selection on income bracket; we have kept age and city roughly as you would expect. The id is the primary key and will be a unique identifier for a particular record. This will usually be a number that increments everytime a new record is added to the database.

# Normalising

We could go a step further and normalise, which means we're designing it so data is only needed once where possible and then we relate to that data. so we might have two tables.

Table: Personal
| id | age | gender | status | income | foreign_key_city |

Table: City
| id | city |

The data types are the same. However, here any given city is added to the database once. We can also give it a unique primary key to keep it consistent. Now, when we add a record to Personal, we can add a relationship to a record in table City. If the city doesn't exist, we just add it to table City. The foreign key in table Personal is the primary key of the given record in table City. This is a one-to-one relationship; one city to one person. We can do the same with income bracket as well.

# Representing Questions

For the textual aspect, lets say there are 3 questions. Each record only relates to a single Personal so we have a foreign key pointing to a record in Personal. We could have a table like so:

Table: Questions
| id | q1 | q2 | q3 | foreign_key_personal |

  • id : primary key
  • q1 : text
  • q2 : text
  • q3 : text

This will link to the questions table. It's easier to think about. Databases are optimised to handle large text but that is just easier to logically break the structure up.

# Data Extractions

We have got a way to store the questions in the database. It is quite bruteforce to query that textual data, there is a lot of extra language that we might not necessarily have to trawl through on our searches. So we could create a more accessible representation and link it to the questions, and this could come in the form of keywords. These words can be established based on frequency of use, for instance, so we process all the textual data before inserting into the database, using it as a corpus for keyword analysis and extraction.

At this point, we should redesign table Questions. The foreign key remains the same, but each question number has the one-to-one relationship, instead of each grouping of questions.

Table: Question:
| id | question_number | question_response | foreign_key_personal |

And design the keyword table.

Table: Keyword
| id | keyword |

Now we start bringing the keywords in. Since this is a many-to-many relationship we need to make, what I call (probably got another name), a linking table. Each record creates a link between a question and a particular keyword.

| foreign_key_question | foreign_key_keyword |

so this allows us to represent a many-to-many relationship; each record is a unique combination of question and keyword.

# Round up

We take the data we have and find ways to usefully represent the data. We can then query that data in different ways to find relationships and further build statistical analysis. For instance we can already start building an idea of how many times a particular keyword for a particular income bracket will appear.

The different ways we can represent the data can help guide us in how we approach to certain questions. For instance, we are wanting to understand a particular aspect, we can gear some questions towards a certain kind of analysis later on.