Here you can find tips on management, processing and analysis of data.

Databases

Data manipulation using Excel, R or Python as described in section 1 usually depends on dataframes that fit into RAM and can be handled in memory. This functionality can be extended to data that does not fit into memory (e.g. using the dplyr package for R). But for big data that is analyzed in many different ways (over a long period and/or by multiple users), databases are an alternative approach. Databases store data permanently and flexibly, and data is retrieved from storage (e.g. a hard disk drive or solid state drive) using queries. Typically, databases are relational and can be managed using the language SQL. One of the most popular open-source relational database management systems (RDBMS) is MySQL; a more advanced alternative is PostgreSQL.

Setting up and using a database is much easier than one might guess. It does not require additional hardware such as a separate computer (host) serving as a dedicated server. Everything works with a single personal computer! In the context of databases, the word “server” usually refers to the server software that runs in the background of most database systems. During the first setup you will come across a lot of technical terms that stem from network administration, but if you follow the instructions closely, you don’t have to know anything about these terms. You can skip the following background information and proceed with step 1 (see below).

Databases like MySQL or PostgreSQL are usually implemented as a server–client system that uses the TCP/IP communication protocol, and the server can be set up to be accessible from a network or the internet. The server software manages the access of users to the database(s). The client software is the frontend (SQL editor) that database administrators and users employ to interact with the database on the server. Managing user access and user privileges is typically required: persons with database-administrator privileges have to consider system and database integrity as well as security concerns and restrict the privileges of other database users accordingly.

MySQL and DBeaver plugin for Bio7 / Eclipse

One relational database management system (RDBMS) is MySQL. The Windows installer installs “MySQL Server” with its MySQL command line client, the GUI client software called “MySQL Workbench”, the MySQL Router, as well as some connectors that are needed to interact with the databases using alternative clients and programming languages.

Setting up MySQL and DBeaver and creating a database

It is as easy as following the steps below:

1) Install MySQL (for Windows: Use the MySQL installer).

During the installation of MySQL, follow the wizard to configure the MySQL Server instance.

i) Choose “Standalone MySQL Server / Classic MySQL Replication”. Click “Next”.

ii) Choose the config type “Development Computer”. “TCP/IP” should be selected. Click “Next”.

iii) Choose “Use Legacy Authentication Method (Retain MySQL 5.x compatibility)”.

iv) Enter a root password (click “Check” if required). Click “Next”.

v) Configure MySQL Server as a Windows Service. Choose “Standard System Account”. Click “Next”.

vi) Click “Execute”.

2) After the installation has completed, start MySQL Workbench and click on the entry below “Connections” to connect to the local instance of the MySQL Server. Then enter the password for the connection’s user (root) and click “OK”.

3) In the bottom left, right-click “Schemas” and left-click “Create New Schema”. This creates a new schema (= database). Enter a name, e.g. “Database1”.

4) Under “Users”, add the following users and select their roles and passwords:

User name     host        role
Admin                     DB-Admin

5) Open Bio7 / Eclipse and check if the DBeaver plugin is installed with all bundled extensions (connectors etc.), then create a new project using File -> New -> Project -> DBeaver

6) Left-click Perspective -> “DBeaver”. At the left side of the DBeaver perspective, left-click “Database Navigator”.

7) Right-click on the project and choose “Set Active” from the context menu that appears.

8) Select “New Connection”. Enter a connection name. Under “Database”, enter the name of the previously created schema (Database1). Enter the user name of the user you want to log in as (e.g. Admin).

9) When connecting, enter the same user name as specified in step 8) and the respective password.

10) In the Database Navigator of the DBeaver perspective, double-click “Users” and then left-click on the button “Create New User” (or: in the Database Navigator, right-click “Users” and in the context menu left-click “Create New User”). Add the following users:

User name     host        role
AdminLocal    localhost   DB-Admin
UserAdmin                 User-Admin
User1                     Custom*
User1Local    localhost   Custom*
Viewer1                   Custom2**

* Custom role: select the following properties: File, Process, Create tablespace, RESOURCE_GROUP_USER. Under “Schema privileges” select the desired schema / catalog and tables. Select the desired privileges, e.g. Create, Create view, Insert, References, Select, Show view, Create routine, Execute.

** Custom2 role: select the following property: RESOURCE_GROUP_USER. Under “Schema privileges” select the desired schema / catalog and tables. Select the desired privileges, e.g. Create view, References, Select, Show view, Create routine, Execute.

Create tables, columns and add data

1) In the Database Navigator of the DBeaver perspective, expand the desired database (e.g. Database1) and double-click “Tables”. Left-click on the button “Create new table” (or: in the Database Navigator, right-click “Tables” and in the context menu left-click “Create new table”). Choose a lowercase name without spaces (e.g. t1_table1), set Auto Increment to 1. Select the engine InnoDB (default) or CONNECT (for connecting to files outside the database).

2) Select a table (e.g. t1_table1) and

2a) Left-click on the button “Create New Column” (or: in the Database Navigator, right-click “Columns” and in the context menu left-click “Create New Column”). Choose a lowercase name without spaces (e.g. t1ca_id), set the data type to INTEGER (int). Select “Not Null” and “Auto Increment”.

2b) Left-click on the desired table. Left-click “constraints”. Left-click “Create New Constraint”. Choose the desired column from the dropdown menu, enter the same name as the column, select PRIMARY KEY from the dropdown menu. Left-click “OK”. Left-click the table and left-click “Save”. In the wizard, left-click “Persist”.

2c) Select the same table and left-click on the button “Create New Column”. Choose a lowercase name without spaces (e.g. t1cb_name), set the data type to varchar(80) for strings with a maximum length of 80 characters. Set a default value, e.g. NA or NULL (= missing value / not applicable). Select “Auto Increment” only if appropriate. Left-click “OK”. Left-click the table and left-click “Save”. In the wizard, left-click “Persist”.

2d) Repeat step 2c) for more columns, replacing the column name. Select the data type INTEGER for integers, DOUBLE for floating point numbers, or BOOL for Boolean values. Set the default value either to NULL or to 0 as desired.

3) Select a table and left-click “Data”. Use the provided buttons to add rows, then select cells to edit data or to delete rows. For the column with the PRIMARY KEY, enter 1 only in the first cell and leave the following rows unchanged so that the auto increment is used. Left-click “Save”. In the wizard, left-click “Persist”. (A scripted equivalent of these steps is sketched below.)
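For reference, roughly the same structure can also be created with plain SQL instead of the DBeaver dialogs, for example via the PyMySQL driver that is introduced in the Python section further below. A minimal sketch, assuming a local server and the example names from the steps above (Database1, t1_table1, t1ca_id, t1cb_name), could look like this:

# Minimal sketch: create the example table and insert two rows via PyMySQL.
import pymysql

connection = pymysql.connect(host="localhost", user="Admin",
                             password="<password>", database="Database1")
try:
    with connection.cursor() as cursor:
        # Same structure as created via the DBeaver dialogs above
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS t1_table1 (
                t1ca_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                t1cb_name VARCHAR(80) DEFAULT NULL
            ) ENGINE=InnoDB
        """)
        # The AUTO_INCREMENT primary key is filled in automatically
        cursor.execute("INSERT INTO t1_table1 (t1cb_name) VALUES (%s), (%s)",
                       ("sample1", "sample2"))
    connection.commit()  # persist the changes
finally:
    connection.close()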

Data retrieval from a database: queries using SQL

1) To filter database content and show the result, perform a database query by left-clicking the tool “SQL Editor” in the DBeaver toolbar at the top. The following query (script) returns the columns cu, cv, … for all rows of table tx that meet the condition(s) <condition(s)>:

SELECT cu, cv, … FROM tx WHERE <condition(s)>

2) To save a script like the query from step 1), right-click into the script and from the context menu choose File -> Rename. Right-click the script again and from the context menu select File -> Save script.

3) Left-click the button “Execute script”.

4) The results of a query can be a) copied to the clipboard as ASCII text with comma-separated values, b) exported to a new text file with comma-separated values (.csv), or c) dismissed, in which case a new query can create a new table inside the database that contains the query results.

4a) To copy the results of a query, right-click into the result output and in the context menu left-click Advanced copy -> Advanced copy …

4b) To save the results in a new text file, right-click into the result output and in the context menu left-click Export resultset… -> CSV -> Next -> Next -> Next. Specify the target directory and the file name. Left-click Next. (See the MySQL Reference Manual, section 13.2.10.1 “SELECT … INTO Syntax”.)

4c) To create a new table ty in the database with the results of a query at the time of that script’s execution, execute the following query:

CREATE TABLE ty
SELECT cu, cv, … FROM tx WHERE <condition(s)>

Processing of data from a MySQL database with the MySQL ODBC Connector using R and the packages odbc, dbplyr and dplyr

Run the R editor of your choice (e.g. Deducer, or Bio7 / Eclipse with StatET). Install the R package RMySQL or odbc. Both are database drivers for R and rely on DBI (the R database interface), which is installed automatically on the fly. RMySQL connects directly via the MySQL client library, but it is licensed under the terms of the viral GPL license. So I recommend installing the MySQL ODBC connector and, on Microsoft Windows systems, using the Microsoft Windows ODBC Data Source Administrator (with the bitness matching your R installation) to create a new user data source name, e.g. MySQLodbdc64bitdsn1Database1Unicode, as described in the documentation. Then copy and modify the following R script: https://gitlab.com/DataAnalysisByDerAndere/DatabaseInteractionByDerAndere/RdatabaseInteraction/blob/master/src/odbc_MySQL_databaseinteraction.R

In the linked script, add or remove the desired columns and replace the placeholders (<…>) with the desired code (or remove the WHERE <condition(s)> statement) in the dbSendQuery() command used to perform the SQL query.

The package dplyr provides an efficient abstraction layer with fast functions for data manipulation because it was written with the package Rcpp, which makes it possible to combine code written in R and C/C++. First, install the package dplyr. The dependencies (e.g. knitr, rmarkdown, ggplot2, RPostgreSQL, RSQLite, DBI, colorspace, utf8 and many others) are installed on the fly automatically. Most importantly, the package dbplyr, the MySQL driver RMySQL and the R database interface DBI are also installed automatically together with dplyr and serve as backends for the communication with databases. Tutorials are available online. The following script does the same as the one above, but uses dplyr:

https://gitlab.com/DataAnalysisByDerAndere/DatabaseInteractionByDerAndere/RdatabaseInteraction/blob/master/src/dplyr_databaseinteraction.R

In the script above, add or remove the wanted columns and replace the placeholders (<…>) with the desired code (or remove the %>% filter() statement) in the tbl() command used to perform the SQL query.

Processing of data from a MySQL database using Python with PyMySQL or SQLAlchemy

The mysqlclient Python package can be used as a driver for the MySQL database, but it is licensed under the terms of the GPL. I therefore recommend the PyMySQL package.

In addition, the Python package SQLAlchemy serves as a high-level ORM. To install these packages on Microsoft Windows, right-click cmd.exe and in the context menu left-click “Run as administrator”, then enter pip install PyMySQL and press Enter, followed by pip install SQLAlchemy and Enter again. Run the Python editor of your choice (e.g. Bio7 / Eclipse with the plug-in PyDev). The connection to a database on a running MySQL server is established via PyMySQL by executing a Python script like the following:

https://gitlab.com/DataAnalysisByDerAndere/DatabaseInteractionByDerAndere/PyDatabaseInteraction/blob/master/src/py_database_interaction_pack1/PyMySQL-demo1.py

In that script, add or remove the desired columns and replace the placeholders (<…>) with the desired code (e.g. remove the WHERE <condition(s)> statement) in the execute() command used to perform the SQL query.
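The overall shape of such a script is roughly the following minimal sketch, assuming a local server and the example database and table names from above (Database1, t1_table1, t1ca_id, t1cb_name); adapt user, password and names to your setup:

# Minimal sketch of a SELECT query via PyMySQL, using the example schema from above.
import pymysql

connection = pymysql.connect(host="localhost", user="Viewer1",
                             password="<password>", database="Database1",
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        # Values are passed separately (%s placeholders) to avoid SQL injection
        cursor.execute(
            "SELECT t1ca_id, t1cb_name FROM t1_table1 WHERE t1cb_name = %s",
            ("sample1",))
        for row in cursor.fetchall():
            print(row)
finally:
    connection.close()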

SQLAlchemy is an object relational mapper (ORM) that can be used to interact with databases through a high-level abstraction layer. The SQLAlchemy Core contains the SQL Expression Language, which makes Python scripts portable across several SQL databases. A tutorial for the SQL Expression Language of SQLAlchemy Core is available online. The connect string for mysqlclient is the same as for the original MySQL-Python. Here is the Python script above rewritten to take advantage of the SQL Expression Language from the SQLAlchemy Core: https://gitlab.com/DataAnalysisByDerAndere/DatabaseInteractionByDerAndere/PyDatabaseInteraction/blob/master/src/py_database_interaction_pack1/sqlalchemy-pymysql-sqlexpressionlanguage-demo.py

In that script, add or remove the desired columns and replace the placeholders (<…>) with the desired code, or adjust the .where() part of the select().where() call (read about method chaining in the SQLAlchemy manual) that builds the query passed to the execute() command used to perform the SQL query.
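As a minimal sketch, assuming SQLAlchemy 1.4 or newer and the example names from above, the same query could be expressed with the SQL Expression Language like this:

# Minimal sketch using the SQLAlchemy Core SQL Expression Language.
from sqlalchemy import create_engine, MetaData, Table, select

# Connection URL with the PyMySQL driver; user, password and database are example values.
engine = create_engine("mysql+pymysql://Viewer1:<password>@localhost/Database1")

metadata = MetaData()
# Reflect the existing table instead of declaring its columns again
t1 = Table("t1_table1", metadata, autoload_with=engine)

# Equivalent of: SELECT t1ca_id, t1cb_name FROM t1_table1 WHERE t1cb_name = 'sample1'
query = select(t1.c.t1ca_id, t1.c.t1cb_name).where(t1.c.t1cb_name == "sample1")

with engine.connect() as connection:
    for row in connection.execute(query):
        print(row)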

If you want to use the object relational mapper (ORM) abstraction level of SQLAlchemy, the same is done with the following Python script:

https://gitlab.com/DataAnalysisByDerAndere/DatabaseInteractionByDerAndere/PyDatabaseInteraction/blob/master/src/py_database_interaction_pack1/sqlalchemy-pymysql-demo1.py

In that script, we take advantage of deferred reflection. The class attribute __table_args__ holds table parameters, and "autoload": True makes it possible to reflect the respective table in the database automatically without specifying the columns (fields) and their names again. The session.query() command is used as described in the section “Loading Columns” of the SQLAlchemy manual. Fully automated database reflection using the automap functionality is showcased in the comments. If selected data from different databases has to be reflected automatically, metadata can be predefined using MetaData() and the metadata.reflect() method and passed as the metadata argument of the automap_base() function, as shown in the last code example that is commented out.
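For orientation, fully automated reflection with automap boils down to something like the following minimal sketch, again assuming SQLAlchemy 1.4 or newer and the example database from above (the mapped class is generated from the reflected table name):

# Minimal sketch of fully automated reflection with the SQLAlchemy automap extension.
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("mysql+pymysql://Viewer1:<password>@localhost/Database1")

Base = automap_base()
Base.prepare(autoload_with=engine)   # reflect all tables and generate mapped classes

T1Table1 = Base.classes.t1_table1    # class generated from the reflected table

with Session(engine) as session:
    # Equivalent of: SELECT * FROM t1_table1 WHERE t1cb_name = 'sample1'
    for row in session.query(T1Table1).filter(T1Table1.t1cb_name == "sample1"):
        print(row.t1ca_id, row.t1cb_name)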

Further Reading

If real-time processing of data streams is required, have a look at this article to see how Celery works in conjunction with RabbitMQ and SQLAlchemy. An alternative is Confluent Open Source, which builds upon Apache Kafka.

https://www.toptal.com/python/python-design-patterns


Copyright 2018 - 2021 DerAndere