Concurrency safe IOPS efficient MySQL with PHP PDO
The problem
I've experience using many solutions for MySQL like Amazon RDS, Google Cloud SQL, Rackspace, Azure, and Openshift. Not all offered the same configuration flexibility but the one bottleneck that all shared was IOPS.
In a startup or a company designing an architecture for a new product with MySQL, they all aim for scalability which is all well and good.
For a developer 'brain' scalability means these things;
- Read Replica/s
- Sharding or multi-master setup
- Cluster db's or data separation
- Normalising data in preparation of ETL
I've used many frameworks that many if not most PHP products are based on, Laravel, CI, Symfony, Zend, Wordpress, Drupal, Doctrine, the list goes on.
The one thing they all share is the lack of IOPS awareness, worse, you cannot work with them to gain in IOPS considered solutions either.
No developer for any of these frameworks has built in IOPS optimisations because i believe IOPS itself is a mystery to them and more of an Engineer's place to figure out.
In an ecosystem where hardware is software developers must write IOPS considered code
I am not university educated, just passionate. So why is it that the most renown frameworks had no developers or engineers identify IOPS issues and build considered solutions?
So what is IOPS
When buying an SSD you may have seen the IOPS benchmarks and though a bigger number is better. That is like saying putting more fuel in your car makes it drive faster. Now I have your attention let me explain what parts of the term IOPS pertains to developers code.
IOPS consists of;
- Physical disk operations, read, write, delete (also a write operation), and seek. The later applies only to magnetic disks. I'll reference only SSD for this post.
- Queue of operations to perform, to be referred as "Queue Depth"
- The controller's ability to manage the operations and optimise them by understanding the queue.
For this article i'll not elaborate much on the third point, other than saying a larger queue depth allows the controller to make optimisations and reach the highest IOPS rates you see in the benchmark figures. So basically at very large queue depths you can reach the IOPS figures on the box, which is like saying when you check in to Git 3 weeks worth of work in 1 commit and tell the customer you did it in a day because that's what the progress report appears to be showing, you had a good day. Well its the same for IOPS, you'll perform at approx 15% of the figure on the box 99% of the time and only reach the peak figures when you are so far in the red that you know you have bigger problems than your hardware to upgrade.
IOPS in developer terms
When you run a SELECT statement to find a record and nothing else is going on the SSD controller will likely perform a Sequential Read, that is, Just go fetch what you asked because there isn't anything else in queue. Throw a few more SELECTs, and some UPDATE, INSERT, DELETE statements too and run them the SSD will likely random read while write operations are performed providing performance benefits.
Now script to run that 1000 times asynchronously, you now have huge queue depth and what is likely to happen on a busy site. The SSD controller will likely read using Tagged or Native command queue (TCQ or NCQ) logic to keep up with demand. But for an application this leads to observations of delayed reads as the write are prioritized and much slower than a read on an SSD.
Write operations are prioritized and much slower than a read on an SSD
Many developers consider using a Read Replica DB at this point. If you're not familiar what a "readrep" is, it's simply all your main database data replicated and in-sync to a copy that you read from. It is very fast, implementing many binary logs, referred to as binlog, to do this by making only the most efficient writes to the database which is usually in memory then to disk.
A readrep allows you to perform all of your database SELECT statements (read operations) fast and without INSERTs being prioritised which delays the return of even the simplest of SELECT statements.
However if you're making many changes to the main DB and the binlog grows, the readrep's queue depth will grow and your read operations are once again delayed. You're back at square one but this time you'll encounter many cases where expectations of sequential write then read in your application are inconsistent and not as was expected by the application. You're likely handling much more capacity now so having a readrep was a valuable decision but what's next? Your options are to look at sharding, clustering, or upgrading hardware because SSD performance for IOPS greatly differs from one model to model.
But are they really what you should do?
Rearchitecture and hardware upgrading doesn't mean you are capable of scalability!
Instead of throwing money at the problem look at your code first. Identify that you are doing some crazy things like batch processes doing many UPDATEs that merely change only a timestamp for last access time, or expecting changes but the only data changed upon comparison is a modification time. Look at what is going on with fresh eyes and only move on from a potential redundant SQL UPDATE if it is necessary for a business case.
You may also find that an API is INSERTing duplicate records, and you may initially identify the client application or App is responsible for sending duplicated network calls to your APIs. Do you blame them and ask them to solve your duplicate data concern? No, consider there are likely network issues, offline, or data loss concerns built into the client side and it is a feature not a bug - Your backend code and SQL are really at fault here and need to be addressing this type of concurrency issue correctly. If 2 inserts come in that are duplicates in terms of a business rule, then it is your backend not meeting the business rule.
Ask; is a modification date needed if no other data was changed?
Ask questions about the relevancy of each write operation and use the below recommendations to mitigate redundancies.
Quick Wins
MySQL engine MyISAM tables are read optimised whereas InnoDB will priorities writes and data integrity features such as key constraints, T-SQL, and PL-SQL. If the features of InnoDB are not utilised on a table and you are writing data less than a ~10 times a second (which is infrequent for a database), then MyISAM would be a far better choice for your table.
MyISAM tables usually contain read heavy information that is rarely changed such as user account details information, and you would then have frequently updated pieces of data stored on a InnoDB table such as latest activity timestamp and read from it only when required.
Another quick win would be establishing a read replica database, the benefits are stated well enough above.
Code that mitigates IOPS
If you're establishing a new project the following considerations are going to be vital to a successful scalable application in PHP and MySQL. For existing codebases it isn't an insurmountable task but in most cases it requires a high level of business requirements knowledge to refactor generally speaking and changing SQL is no exception.
Mitigate all IOPS heavy INSERT and UPDATEs
When executing an INSERT from a user or API requester as the source of the data you should always consider that the data may already exist.
Generally as a developer you would choose InnoDB and use a unique constraint on a key or set of columns to ensure the database isn't polluted with duplicate values. If the unique constraint check fails it can cause unhandled exceptions on the client or server at worst, a bad UX at best.
Unique constraints are not the answer
To ensure we have data integrity with my solution we don't need InnoDB or unique constraints, these are the dependencies of using ON DUPLICATE KEY UPDATE
functionality. It's nice to not need InnoDB and unique constraints because we can choose to utilise the read optimised MyISAM table and still handle duplicate inserts nicely with the added bonus of IOPS mitigation.
Handling the INSERT correctly
Here is an example of an INSERT with MySQL that is;
- MyISAM and InnoDB compatible
- Concurrency safe
- IOPS efficient
You may be familiar with it without knowing it's benefits!
This will handle the "double post" issue most code struggles with for a relatable use case.
Pseudo code;
INSERT INTO {$table} ({$columns})
SELECT {$values} FROM DUAL
WHERE NOT EXISTS(
SELECT 1 FROM {$table} WHERE {$uniqueKeys} LIMIT 1
);
A executable example;
INSERT INTO users (Email, Name, City, Active)
SELECT 'chris@example.com', 'Chris', 'Melbourne', 1 FROM DUAL
WHERE NOT EXISTS(
SELECT 1 FROM users WHERE Email='chris@example.com' LIMIT 1
);
Result for inserting;
Query OK, 1 rows affected (0.00 sec)
Result when the data exists and no insert was executed;
+-----------+----------------+---------------------+
| Email | Name | City | Active
+-----------+----------------+---------------------+
| chris@ghost.io | Chris | Melbourne | 1
+-----------+----------------+---------------------+
1 row in set (0.00 sec)
So what is happening here? The engine will first do an inexpensive read using values you intend to insert, and if the data is not already in there it will INSERT them immediately.
You may think that this is 2x IOPS and less efficient, you would be correct to assume that too. Before dismissing the idea try giving this more context in terms of data that you may want to UPDATE because you had already done a SELECT earlier and you found the data. You have a delay between those 2 queries and a DELETE may have arrived in between and the UPDATE will fail due to nothing more than the I/O of your PHP code talking to MySQL using 2 SQL statements instead of 1.
I hope you have followed that well, if not, I'll give a code example of what i briefly talked about above.
Handling the UPDATE correctly
For UPDATEs, you basically should never update when the only data change is a mod date or last accessed date unless it is the intended functionality. A business case for updating only a modify or access timestamp would be an audit trail or online presence status (Chris was online 2 mins ago), mostly anything else falls into the category of not updating if the only change is a modify or access timestamp.
Using PDO fetch with the result form the INSERT where EXISTS method above we actually get the values stored in the database if the unique data exists, which saved us concurrency concerns already but now we can use if for a comparison before we even decide if an UPDATE needs to be executed without any more than 1 SQL statement being executed so far.
I'm not going to demonstrate an UPDATE statement, nor will I provide examples of PHP dong value comparison as it is fairly straight forward.
The key considerations for your comparison code are;
- Skipping any column data you've hardcoded for your INSERT, such as NULL and UTC_TIMESTAMP as examples
- You might as well also skip comparing the unique column because we ended up at this point because we found a record matching
- Omit completely any columns when you know or trust that the INSERTed value is correct or static
All of these considerations are designed to speed up the time it takes to get to the UPDATE execution as soon as possible as the SELECT returned.
(you may choose to lock the record if using InnoDB but i advise against that mainly due to bad UX and exception implied handling on clients).
Wrapping this up
We've made a lot of considerations and worked through some examples, you're head is probably spinning so lets take a look at what this might look like when its all put together in a nice reusable single function call in PHP7.
mysql_put: my take on Concurrency safe IOPS efficient MySQL with PHP PDO
/**
* @param $table string
* @param $mods array keys; "col": required, "bind": false for hardcode, "val": value, "unique": used for exists check and update where clause
* @return array
*/
public static function mysql_put(string $table, array $mods): array {
$dbh = $PDOEndpoint; /* @var $dbh \PDO */
$columns = implode(',', array_map(function($data) { return $data['col']; }, $mods));
$params = [];
$select = [];
$update = [];
$unique = [];
$exists = [];
array_map(function($data) use (&$exists, &$unique) {
if ($data['unique'] === true && $data['bind'] !== false) {
$exists[] = $data['col'].' = :'.$data['col'];
$unique[] = $data['col'];
}
}, $mods);
array_map(function($data) use (&$params, &$select, &$unique, &$update) {
if ($data['bind'] !== false) {
$bindParam = ':'.$data['col'];
$params[$bindParam] = $data['val'];
$select[] = $bindParam;
if (!in_array($data['col'], $unique)) {
$update[] = $data['col'].' = '.$bindParam;
}
} else {
$select[] = $data['val'];
if (!in_array($data['col'], $unique)) {
$update[] = $data['col'].' = '.$data['val'];
}
}
}, $mods);
$selectSQL = 'SELECT '.implode(',',$select);
$updateSQL = implode(',', $update);
$existsSQL = implode(' AND ', $exists);
$insertQuery =
"INSERT INTO {$table} ({$columns})
{$selectSQL}
FROM DUAL
WHERE NOT EXISTS(SELECT 1 FROM {$table} WHERE {$existsSQL} LIMIT 1);";
$updateQuery =
"UPDATE {$table} SET
{$updateSQL}
WHERE {$existsSQL}
LIMIT 1;";
try {
$insertSTMT = $dbh->prepare($insertQuery);
foreach ($params as $key => $value) {
$dataType = \PDO::PARAM_STR;
if (is_null($value)) {
$dataType = \PDO::PARAM_NULL;
} elseif (is_int($value)) {
$dataType = \PDO::PARAM_INT;
}
$insertSTMT->bindValue($key, $value, $dataType);
}
$insertSTMT->execute();
$lastId = $dbh->lastInsertId();
if (!is_numeric($lastId)) {
unset($lastId);
$dbResult = $insertSTMT->fetchAll(\PDO::FETCH_ASSOC);
$needsUpdate = false;
foreach ($dbResult[0]??[] as $key => $value) {
if (isset($params[$key]) && $value != $params[$key]) {
$needsUpdate = true;
break;
}
}
if ($needsUpdate) {
$updateSTMT = $dbh->prepare($updateQuery);
foreach ($params as $key => $value) {
$dataType = \PDO::PARAM_STR;
if (is_null($value)) {
$dataType = \PDO::PARAM_NULL;
} elseif (is_int($value)) {
$dataType = \PDO::PARAM_INT;
}
$updateSTMT->bindValue($key, $value, $dataType);
}
$updateSTMT->execute();
$rowCount = $updateSTMT->rowCount();
}
}
} catch(\PDOException $e) {
print('[mysql_put] Error ('.$e->getCode().') '.$e->getMessage());
}
return [
'insert' => $lastId ?? false,
'needsUpdate' => $needsUpdate ?? null,
'update' => $rowCount ?? false
];
}
Calling this bad boy looks like this;
mysql_put('users', [
['col'=>'userName', 'val' => $userName, 'unique' => true],
['col'=>'deviceId', 'val' => $session, 'unique' => true],
['col'=>'removed', 'val' => $removed],
['col'=>'modDate', 'bind'=>false, 'val' => 'UTC_TIMESTAMP()']
]);
Results
The expected IOPS for certain scenarios;
Inserting data without checking for uniqueness:
Before: 1 write.
After: 1 read, 1 write.
Inserting data without checking for uniqueness, double post concurrency issue:
Before: 2 write.
After: 2 read, 1 write.
Inserting data with a successful check for uniqueness, double post concurrency issue:
Before: 2 reads, 1 write.
After: 2 read, 1 write.
Inserting new data on a unique key;
Before: 1 read, 1 write.
After: 1 read, 1 write.
If NOT EXISTS
fails, update data no change;
Before: 1 read, 1 write.
After: 1 read.
If NOT EXISTS
fails, update data;
Before: 1 read, 1 write.
After: 1 read, 1 write.
Conclusion
We've learned that we can keep data integrity all without using InnoDB constraints or hurting IOPS.
Changing read heavy tables to MyISAM we gain huge performance benefits directly int he MySQL engine.
A read replica database is a certain option for scalability, whereas for clustering and sharding, you'd best explore simpler enhancements first before doing a rearchitecture to handle the fundamental change.
And finally, utilising some commonly misunderstood SQL techniques, For all of the positive use cases we see IOPS mitigation improvements.
I hope this article helps you and saves you time - please spread the knowledge!
Member discussion