Tag Archives: Developer tips

Getting Email Sending Settings Right

Post Syndicated from Bozho original https://techblog.bozho.net/getting-email-sending-settings-right/

Email. The most common means of internet communication, which has been around for ages and which nobody has been able to replace. And yet, it is so hard to configure correctly so that messages don’t end up in spam.

Setting up your own email server is a nightmare, of course, but even cloud email providers don’t let you skip any of the steps – you have to know them and you have to do them. And if you miss something, business will suffer, as a portion of your important emails will end up in spam. The sad part is that even if you get all of them right, emails can still end up in spam, but this is on the spam filters and there’s little you can do about it. (I dread the moment when an outgoing server will run the email through standard spam filters with standard sets of configurations to see if it would get flagged as spam or not.)

This won’t be the first “how to configure email” overview article that mentions all the pitfalls, but most of the ones I see omit some important details. If you’ve seen one, you are probably familiar with what has to be done – configure SPF, DKIM and DMARC. Doing that in practice is trickier, though, as this meme (by a friend of mine) implies:

So, you have an organization that wants to send email from its “example.com” domain. Most tutorials assume that you want to send email from one server, which is almost never the case. You need it for the corporate emails (which would in many cases be a hosted or cloud MS Exchange, or Google Suite), you need it for system emails from your applications (one or more of them), which use either another internal server or a cloud email provider, e.g. Amazon SES, you also need it for your website, which uses the hosting provider email server, and you need it for your email campaigns, e.g. via Mailchimp or Sendgrid.

All of the email providers mentioned above have some parts of the picture in their documentation, but the parts don’t work in combination, as most providers wrongly assume they are the only one. Their examples assume that and their automated verifications assume that – e.g. Microsoft checks if your SPF record matches exactly what they provide, rather than checking if their servers are allowed by your more complex SPF record which includes all of the above providers.
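The more forgiving check described above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, amazonses.com is used as an example second provider, and the check looks only at the top-level record (it does no DNS lookups and does not follow nested includes).

```python
def spf_allows_include(record: str, domain: str) -> bool:
    """Return True if an SPF record contains a passing include for `domain`.

    Hypothetical helper: it inspects the record text only, without
    resolving DNS or following nested include mechanisms.
    """
    terms = record.split()
    # A valid SPF record starts with the version term
    if not terms or terms[0].lower() != "v=spf1":
        return False
    wanted = "include:" + domain.lower()
    for term in terms[1:]:
        # "+" is the default (pass) qualifier; "-", "~" and "?" are not a pass
        if term.lstrip("+").lower() == wanted:
            return True
    return False


# A record that combines two providers still allows each of them
record = "v=spf1 include:spf.protection.outlook.com include:amazonses.com ~all"
print(spf_allows_include(record, "spf.protection.outlook.com"))
```

A verification written this way accepts any record that authorizes the provider, instead of demanding an exact string match.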

So let’s get to the individual items you have to configure. Most of them are DNS records, which explains why a technical person in each organization has to do it manually, rather than each service pushing it there automatically after some API authentication:

  • SPF (Sender Policy Framework) – a DNS record that lists the permitted senders (IP addresses) and an instruction flag on what to do with those that don’t match. In the typical scenario you need to include multiple senders’ policies rather than listing IP addresses, as they can change. E.g. in order to use Office365, you have to add include:spf.protection.outlook.com. Note that this should be a TXT record, but some DNS providers support a special record type – SPF. Some older software may expect an SPF-type record, in which case you should publish both records with identical values (the dedicated SPF type has since been deprecated by RFC 7208, so the TXT record is the one that matters). The syntax is straightforward but sometimes tricky, so you can use a tool to generate and validate it.
  • DKIM (DomainKeys Identified Mail) – a DNS record that lets email senders sign their emails. The DNS record includes the public key used to verify the signature. Why is it needed if there’s SPF? Among other things (like non-repudiation), because with SPF the From header can still be spoofed. Not that DKIM always helps with that, but in combination with DMARC it does. How does DKIM work in a multi-sender scenario? You have multiple DKIM selectors, which means multiple TXT records. Usually every provider will recommend its own selector (e.g. selector1._domainkey.example.com). Some may insist on being default._domainkey, and if two of them insist on that, you should contact support (the message contains the selector, so verification will fail if it does not match). Email providers would prefer CNAME instead of TXT records, as that allows them to rotate the keys without you having to change your DNS records.
  • DMARC (Domain-based Message Authentication, Reporting and Conformance) – this DNS record contains the policy according to which your emails should be validated – it enforces SPF and DKIM and tells the receiving side what to do if they fail. You can have one DMARC policy (again as a TXT record). The syntax is not exactly human readable, so use a tool to generate and validate it. An important aspect of DMARC is that you can receive reports in case of failures – you can specify an email where reports are sent and you can analyze them. There are services (like ReportURI) that can aggregate and analyze these reports. I prefer setting multiple report emails – one administrative and one for ReportURI.
  • PTR (pointer) – this is used for reverse DNS lookups – it maps an IP address to a domain name (as opposed to A records, which map domain names to IP addresses). Spam filters use it to check incoming email. The PTR records should be there for the servers that send the email, e.g. mail.example.com. External providers are likely to already have that record, so there’s no need to worry about it. And in many cases you won’t be able to set it anyway if you don’t own/control the network.
  • Service-specific settings – you may configure your DNS records properly but the service sending the emails (e.g. Office365, Mailchimp) might still be missing some configuration. In some cases you have to manually confirm your records in order to enable DKIM signing. With MS Exchange, for example, you have to execute a few PowerShell commands to generate and then confirm the DKIM records.
  • Blacklists – if you haven’t had everything set up correctly, or if one of your sending servers/services has been compromised, your domain and/or servers may be present in some blacklist. You have to check that. There are tools that aggregate blacklists and check against them, e.g. this one.
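To make the record formats above a bit more concrete, here is a minimal Python sketch that assembles a multi-provider SPF value, a DMARC value with two report addresses, and the reverse DNS name under which a PTR record lives. The helper names, the amazonses.com include, the report addresses and the IP address are illustrative assumptions; only spf.protection.outlook.com comes from the text above.

```python
import ipaddress


def build_spf(includes, policy="~all"):
    """Assemble an SPF TXT value that includes several providers' policies."""
    if policy not in {"-all", "~all", "?all", "+all"}:
        raise ValueError("unknown SPF policy: " + policy)
    parts = ["v=spf1"] + ["include:" + domain for domain in includes] + [policy]
    return " ".join(parts)


def build_dmarc(policy, report_addresses=()):
    """Assemble a DMARC TXT value (published at _dmarc.<your domain>)."""
    if policy not in {"none", "quarantine", "reject"}:
        raise ValueError("unknown DMARC policy: " + policy)
    tags = ["v=DMARC1", "p=" + policy]
    if report_addresses:
        # Aggregate reports can go to several mailboxes, comma-separated
        tags.append("rua=" + ",".join("mailto:" + a for a in report_addresses))
    return "; ".join(tags)


spf = build_spf(["spf.protection.outlook.com", "amazonses.com"])
dmarc = build_dmarc("quarantine", ["dmarc@example.com", "reports@example.net"])
# The DNS name that holds the PTR record for a sending server's IP address
ptr_name = ipaddress.ip_address("192.0.2.10").reverse_pointer

print(spf)
print(dmarc)
print(ptr_name)
```

Keep in mind that SPF evaluation is limited to 10 DNS lookups, so a long list of includes can break an otherwise valid record. A validation tool will catch that.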

After everything is done, you can run a spam test using some of the available online tools: e.g. this, this or this. And speaking of tools, MXToolbox has many useful tools to verify all aspects of email configuration.

By the way, this is not everything you can configure about your email. There are also SMTP over TLS (SMTPS), MTA-STS, TLS-RPT and DANE. I’ve omitted them because they are about encrypted communication and not about spam, but you should review them for proper email configuration.

By now (even if you already knew most of the things above) you are probably wondering “how did we get here?”. Why do we have to do so many things just to send a simple email? Well, first, it’s not that simple to have a universal messaging protocol. It looks simple to use and that’s the great part of it, but it does hide some complexities. The second reason is that the SMTP protocol was not designed with security in mind. Spam and phishing were maybe not seen as such a big issue and so the protocol does not have built-in guarantees for anything. It doesn’t have encryption, authentication, non-repudiation, anything.

That’s why this set of instruments evolved over time to add these security features to email (I haven’t talked about encryption, as it’s handled differently). Why did it have to be DNS-based? It’s the most logical solution, as it guarantees the ownership of the domain, which is what matters even visually to the recipient in the end. But it makes administration more complicated, as you are limited to one-line, semicolon- or space-separated formats. I think it would be helpful to have a way to delegate all of that to external services, e.g. by a single authenticating DNS record which points to a URL that provides all these policies. For example, an EML record could point to https://example.com/email-policies, which would publish them in a prettier and more readable (e.g. JSON) format, and in a single place rather than in multiple generated records. Maybe that has its own cons, like having the policy server compromised.

But if anything is obvious it is that everything should be designed with security in mind. And every malicious scenario should be taken into account. Because adding security later makes things even more complicated.

The post Getting Email Sending Settings Right appeared first on Bozho's tech blog.

A Disk-Backed ArrayList

Post Syndicated from Bozho original https://techblog.bozho.net/a-disk-backed-arraylist/

It sometimes happens that your list can become too big to fit in memory and you have to do something in order to avoid running out of memory.

The proper way to do that is streaming – instead of fitting everything in memory, you should stream data from the source and discard the entries that are already processed.

However, there are cases when code that’s outside of your control requires a List and you can’t use streaming. These cases are rather rare but in case you hit them, you have to find a workaround. One is to re-implement the code to work with streaming, but depending on the way the library is written, it may not be possible. So the other option is to use a disk-backed list – one that works as a list, but underneath stores and loads elements from disk.

Searching for existing solutions results in several repos that are 3+ years old, like this one and this one and this one.

And then there’s MapDB, which is great and supported. It’s mostly about maps, but it does support a List as well, as shown here.

And finally, you have the option to implement something simpler yourself, in case you need just iteration and almost nothing else. I’ve done it here – DiskBackedArrayList.java. It doesn’t support many things (not all unsupported methods are overridden to throw an exception, though they should be). Most importantly, it doesn’t support random adding and random getting, or toArray(). It’s purely “fill the collection” and then “iterate the collection”. It relies on ObjectOutputStream, which is not terribly efficient, but is simple to use. Note that I’ve allowed a short in-memory prependList in case small amounts of data need to be prepended to the list.

The list gets filled in memory until a specified threshold and then gets flushed to disk, clearing the memory, which starts getting filled again. This too can be more efficient – with background flushing in another thread that doesn’t interfere with adding elements to the list – but optimizations complicate things and in this case the total running time was not an issue. Most importantly, the iterator() method is overridden to return a custom iterator that first streams the prepended list, then reads everything from disk and finally iterates over the latest batch, which is still in memory. And finally, the clear() method should be called in the end in order to close the underlying stream. An output stream could be opened and closed on each flush, but ObjectOutputStream can’t be used in append mode, because it writes a stream header each time it is created.
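The same “fill, flush, then iterate” idea can be sketched in Python. This is not a port of DiskBackedArrayList.java: the class name, the threshold default and the use of pickle and a temporary file are assumptions made for the illustration, and the prepend list is left out.

```python
import os
import pickle
import tempfile


class DiskBackedList:
    """Append-and-iterate list that spills full batches to a temp file."""

    def __init__(self, threshold=1000):
        self._threshold = threshold
        self._buffer = []  # latest batch, still in memory
        self._size = 0
        self._file = tempfile.NamedTemporaryFile(delete=False)

    def append(self, item):
        self._buffer.append(item)
        self._size += 1
        if len(self._buffer) >= self._threshold:
            self._flush()

    def _flush(self):
        # Each batch is pickled as one object, appended to the same file
        pickle.dump(self._buffer, self._file)
        self._file.flush()
        self._buffer = []

    def __len__(self):
        return self._size

    def __iter__(self):
        # First stream the batches from disk, in the order they were written
        with open(self._file.name, "rb") as source:
            while True:
                try:
                    batch = pickle.load(source)
                except EOFError:
                    break
                yield from batch
        # Then the batch that has not been flushed yet
        yield from self._buffer

    def clear(self):
        """Close and delete the backing file; the list is unusable afterwards."""
        self._file.close()
        os.remove(self._file.name)
        self._buffer = []
        self._size = 0


items = DiskBackedList(threshold=3)
for number in range(10):
    items.append(number)
print(list(items))
items.clear()
```

Unlike ObjectOutputStream, pickle has no problem with several dumps appended to one file, so the sketch does not need to keep a single stream open to avoid the header issue.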

So basically we hide the streaming approach underneath a List interface – it’s still streaming elements and discarding them when not needed. Ideally this should be done at the source of the data (e.g. a database, message queue, etc.) rather than using the disk as overflow space, but there are cases where using the disk is fine. This implementation is a starting point, as it’s not tested in production, but illustrates that you can adapt existing classes to use different data access patterns if needed.

The post A Disk-Backed ArrayList appeared first on Bozho's tech blog.

Near Real-Time Indexing With ElasticSearch

Post Syndicated from Bozho original https://techblog.bozho.net/near-real-time-indexing-with-elasticsearch/

Choosing your indexing strategy is hard. The Elasticsearch documentation does have some general recommendations, and there are some tips from other companies, but it also depends on the particular use case. In the typical scenario you have a database as the source of truth, and you have an index that makes things searchable. And you can have the following strategies:

  • Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient.
  • Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as it has to be precisely fetched with (from, to) criteria from the database, and your index lags behind the actual data by the number of seconds (or minutes) between scheduled job runs.
  • Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour.
  • Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you have to have some sort of healthcheck based on the data in the database
  • Hybrid – do a combination of the above; for example if you need to enrich the raw data and update the index at a later stage, you can queue items in memory and then use “store in database, index with scheduled job” to update the index and fill in any missing item. Or you can index as some parts of the data come, and use another strategy for the more active types of data

We have recently decided to implement the “queue in memory” approach (in combination with another one, as we have to do some scheduled post-processing anyway). And the first attempt was to use a class provided by the Elasticsearch client – the BulkProcessor. The logic is clear – accumulate index requests in memory and flush them to Elasticsearch in batches either if a certain limit is reached, or at a fixed time interval. So at most every X seconds and at most at every Y records there will be a batch index request. That achieves near real-time indexing without putting too much stress on Elasticsearch. It also allows multiple bulk indexing requests at the same time, as per Elasticsearch recommendations.
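The “flush at a size limit or at a time interval” logic can be sketched in Python as follows. This is a single-threaded illustration and not the BulkProcessor API: the class name, the parameters and the injected clock are all assumptions, and a real implementation would also call maybe_flush() from a timer so that a quiet period still triggers the time-based flush.

```python
import time


class BulkBuffer:
    """Collects documents and hands them to flush_fn in batches."""

    def __init__(self, flush_fn, max_items=500, max_seconds=5.0,
                 clock=time.monotonic):
        self._flush_fn = flush_fn      # e.g. a function that sends a bulk request
        self._max_items = max_items    # flush at most every Y records
        self._max_seconds = max_seconds  # flush at most every X seconds
        self._clock = clock
        self._items = []
        self._last_flush = clock()

    def add(self, doc):
        self._items.append(doc)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if the size limit is reached or the interval has elapsed."""
        due = self._clock() - self._last_flush >= self._max_seconds
        if self._items and (len(self._items) >= self._max_items or due):
            batch, self._items = self._items, []
            self._last_flush = self._clock()
            self._flush_fn(batch)


sent_batches = []
buffer = BulkBuffer(sent_batches.append, max_items=3, max_seconds=5.0)
for doc in ["a", "b", "c", "d"]:
    buffer.add(doc)
print(sent_batches)
```

With these settings the first three documents go out as one batch as soon as the third is added, and the fourth waits for the next size or time trigger.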

However, we are using the REST API (via Jest), which is not supported by the BulkProcessor. We tried to plug in REST indexing logic in place of the current native one, and although it almost worked, in the process we noticed something worrying – the internalAdd method, which gets invoked every time an index request is added to the bulk, is synchronized. That means threads will block, waiting for each other to add stuff to the bulk. This sounded suboptimal and risky for production environments, so we went for a separate implementation. It can be seen here – ESBulkProcessor.

It allows for multiple threads to flush to Elasticsearch simultaneously, but only one thread (using a lock) to consume from the queue in order to form the batches. Since this is a fast operation, it’s fine to have it serialized. And not because the concurrent queue can’t handle multiple threads reading from it – it can; but reaching the condition for forming the bulk by multiple threads at the same time will result in several small batches rather than one big one, hence the need for only one consumer at a time. Small batches are not a huge problem, so the lock could be removed. But it’s important to note that the lock is not blocking: a thread that can’t acquire it doesn’t wait for it.
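Here is a rough Python sketch of that “one batch former at a time, without blocking” idea. It is not the ESBulkProcessor code: the class and method names are invented for the illustration, and the batch is formed on every call rather than only when a size or time condition is met.

```python
import queue
import threading


class BatchFormer:
    """Many producers add documents; one thread at a time forms a batch."""

    def __init__(self, max_items=500):
        self._queue = queue.Queue()
        self._lock = threading.Lock()
        self._max_items = max_items

    def add(self, doc):
        # Adding never takes the batch lock, so producers don't wait on each other
        self._queue.put(doc)

    def try_form_batch(self):
        """Return a batch, or None if another thread is forming one or the queue is empty."""
        # Non-blocking attempt: a thread that loses the race just moves on
        if not self._lock.acquire(blocking=False):
            return None
        try:
            batch = []
            while len(batch) < self._max_items:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
        finally:
            self._lock.release()
        return batch or None


former = BatchFormer(max_items=3)
for number in range(5):
    former.add(number)
print(former.try_form_batch())
print(former.try_form_batch())
```

The caller sends the returned batch after the lock has been released, so several bulk requests can be in flight at the same time while batch forming stays serialized.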

This has been in production for a while now and doesn’t seem to have any issues. I will report any changes if there are such due to increased load or edge cases.

It’s important to reiterate the issue if this is the only indexing logic – your application node may fail and you may end up with missing data in Elasticsearch. We are not in that scenario, and I’m not sure which is the best approach to remedy it – be it a partial reindex of recent data in case of a failed server, or a batch process that checks for mismatches between the database and the index. Of course, we should also say that you may not always have a database – sometimes Elasticsearch is all you have for data storage, and in that case some sort of queue persistence is needed.

The ultimate goal is to have a near real-time indexing as users will expect to see their data as soon as possible, while at the same time not overwhelming the Elasticsearch cluster.

The topic of “what’s the best way to index data” is huge and I hope I’ve clarified it at least a little bit and that our contribution makes sense for other scenarios as well.

The post Near Real-Time Indexing With ElasticSearch appeared first on Bozho's tech blog.

A Technical Guide to CCPA

Post Syndicated from Bozho original https://techblog.bozho.net/a-technical-guide-to-ccpa/

CCPA, or the California Consumer Privacy Act, is the upcoming “small GDPR” that applies to all companies that have users from California (i.e. it has extraterritorial application). It is not as massive as GDPR, but you may want to follow its general recommendations.

A few years ago I wrote a technical GDPR guide. Now I’d like to do the same with CCPA. GDPR is much more prescriptive on the fact that you should protect users’ data, whereas CCPA seems to be mainly concerned with the rights of the users – to be informed, to opt out of having their data sold, and to be forgotten. That focus is mainly because other laws in California and the US have provisions about protecting confidentiality of data and data breaches; in that regard GDPR is a more holistic piece of legislation, whereas CCPA covers mostly the aspect of users’ rights (or “consumers”, which is the term used in CCPA). I’ll use “user” as it’s the term more often used in technical discussions.

I’ll list below some important points from CCPA – this is not an exhaustive list of requirements to a software system, but aims to highlight some important bits. And, obviously, I’m not a lawyer, but I’ve been doing data protection consultations and products (like SentinelDB) for the past several years, so I’m qualified to talk about the technical side of privacy regulations.

  • Right of access – you should be able to export (in a human-readable format, and preferably in a machine-readable one as well) all the data that you have collected about an individual: their account details, their orders, their preferences, their posts and comments, etc.
  • Deletion – you should delete any data you hold about the user. Exceptions apply, of course, including data used for prevention of fraud, other legal reasons, needed for debugging, necessary to complete the business requirement, or anything that the user can reasonably expect. From a technical perspective, this means you most likely have to delete what’s in your database, but other places where you have personal data, like logs or analytics, can be skipped (provided you don’t use it to reconstruct user profiles, of course)
  • Notify 3rd party providers that received data from you – when data deletion is requested, you have to somehow send notifications to wherever you’ve sent personal data. This can be a SaaS like Mailchimp, Salesforce or Hubspot, or it can be someone you sold the data to (apparently that’s a major thing in CCPA). So ideally you should know where data has been sent and invoke APIs for forgetting it. Fortunately, most of these companies are already compliant with GDPR anyway, so they have these endpoints exposed. You just have to add the logic. If your company sells data by posting dumps to S3 or sending Excel sheets via email, you have a bigger problem, as you have to keep track of those activities and send unstructured requests (e.g. emails).
  • Data lineage – this is not spelled out as a requirement, but it follows from multiple articles, including the one for deletion as well as the one for disclosing who data was sent to and where data came from in your system (in order to know if you can re-sell it, among other things). In order to avoid buying expensive data lineage solutions, you can either have a spreadsheet (in case of simpler processes), or come up with a meaningful way to tag your data. For example, using a separate table with columns (ID, table, sourceType, sourceId, sourceDetails), where ID and table identify a record of personal data in your database, sourceType is the way you have ingested the data (e.g. API call, S3, email) and sourceId is the identifier that you can use to track how it came into your system – API key, S3 bucket name, email “from”, or even company registration ID (data might still be sent around on flash drives, I guess). A similar table can be used for the outgoing data (with targetType and targetId). It’s a simplified implementation but it might work in cases where a spreadsheet would be too cumbersome to take care of.
  • Age restriction – if you’ve had the opportunity to know the age of a person whose data you have, you should check it. That means not ignoring the age or date of birth field when you import data from 3rd parties, and also politely asking users about their age. You can’t sell minors’ data by default, so you need to know which records are automatically opted out. If you never ever sell data, well, it’s still a good idea to keep it (per GDPR).
  • Don’t discriminate if users have used their privacy rights – that’s more of a business requirement, but as technical people we should know that we are not allowed to have logic based on users having used their CCPA (or GDPR) rights. From a data organization perspective, I’d put rights requests in a separate database from the actual data, to make it harder to implement such discriminatory logic. You can’t just do a SQL query to check if someone should get a better price; you have to do cross-system integration, and that might dissuade product owners from breaking the law. Furthermore, it will be a good sign in case of audits.
  • “Do Not Sell My Personal Information” – this should be on the homepage if you have to comply with CCPA. It’s a bit of a harsh requirement, but it should take users to a form where they can opt out of having their data sold. As mentioned in a previous point, this could be a different system to hold users’ CCPA preferences. It might be easier to just have a set of columns in the users’ table, of course.
  • Identifying users is an important aspect. CCPA speaks about “verifiable requests”. So if someone drops you an email “I want my data deleted”, you should be able to confirm it’s really them. In an online system that can be a button in the user profile (for opting out, for deletion, or for data access) – if they know the password, it’s fairly certain it’s them. However, in some cases, users don’t have accounts in the system. In that case there should be other ways to identify them. SSN sounds like one, and although it’s a terrible thing to use for authentication, with the lack of universal digital identity, especially in the US, it’s hard not to use it at least as part of the identifying information. But it can’t be the only thing – it’s not a password, it’s an identifier. So users sharing their SSN (if you have it), their phone or address, passport or driving license might be some data points to collect for identifying them. Note that once you collect that data, you can’t use it for other purposes, even if you are tempted to. CCPA also requires toll-free phone support, which is hardly applicable to non-US companies even though they have customers in California, but it raises the question of identifying people online based on real-world data rather than account credentials. And please don’t ask users about their passwords over the phone; just initiate a request on their behalf in the system and direct them to log in and confirm it. There should be additional guidelines for identifying users as per 1798.185(a)(7).
  • Deidentification and aggregate consumer information – aggregated information, e.g. statistics, is not personal data, unless you are able to extract personal data based on it (e.g. if the statistics are split per town and age and you have only two users in a given town, you can easily see who is who). Aggregated data is different from deidentified data, which is data that has its identifiers removed. Simply removing identifiers, though, might again not be sufficient to deidentify data – based on several other data points, like IP address (+ logs), physical address (+ snail mail history), phone (+ phone book), one can be uniquely identified. If you can’t reasonably identify a person based on a set of data, it can be considered deidentified. Do make the mental exercise of thinking how to deidentify your data, as then it’s much easier to share it with (or sell it to) third parties. Probably nobody minds being part of aggregated statistics sold to someone, or an anonymized account used for trend analysis.
  • Pseudonymization is a measure to be taken in many scenarios to protect data. CCPA mentions it particularly in a research context, but I’d support a generic pseudonymization functionality. That means replacing the identifying information with a pseudonym that’s not reversible unless a secret piece of data is used. Think of it (and you can do that quite literally) as encrypting the identifier(s) with a secret key to form the pseudonym. You can then give that data to third parties to work with (e.g. to do market segmentation) and have them give it back to you. You can then decrypt the pseudonyms and fill the obtained market segment(s) into your own database. The 3rd party doesn’t get personal information, but you still get the relevant data.
  • Audit trail is not explicitly stated as a requirement, but since you have the obligation to handle users’ requests and track the use of their data in and outside of your system, it’s a good idea to have a form of audit trail – who did what with which data; who handled a particular user request; how was the user identified in order to perform the request, etc.
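The data lineage tables from the list above, together with the “whom do I have to notify on deletion” lookup, might look roughly like this in Python with SQLite. The table names, column names and sample values are illustrative assumptions; the “table” column is called record_table here because TABLE is a reserved word in SQL.

```python
import sqlite3

SCHEMA = """
CREATE TABLE data_source (
    record_id      TEXT NOT NULL,  -- ID of the personal data record
    record_table   TEXT NOT NULL,  -- table that holds the record
    source_type    TEXT NOT NULL,  -- e.g. api, s3, email
    source_id      TEXT NOT NULL,  -- e.g. API key, bucket name, sender
    source_details TEXT
);
CREATE TABLE data_target (
    record_id    TEXT NOT NULL,
    record_table TEXT NOT NULL,
    target_type  TEXT NOT NULL,    -- e.g. saas, s3, email
    target_id    TEXT NOT NULL     -- e.g. service name, bucket, recipient
);
"""


def open_lineage_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_source(conn, record_id, record_table, source_type, source_id,
                  details=None):
    conn.execute("INSERT INTO data_source VALUES (?, ?, ?, ?, ?)",
                 (record_id, record_table, source_type, source_id, details))


def record_target(conn, record_id, record_table, target_type, target_id):
    conn.execute("INSERT INTO data_target VALUES (?, ?, ?, ?)",
                 (record_id, record_table, target_type, target_id))


def targets_to_notify(conn, record_id, record_table):
    """List every place a record was sent to, for deletion notifications."""
    return conn.execute(
        "SELECT DISTINCT target_type, target_id FROM data_target "
        "WHERE record_id = ? AND record_table = ? "
        "ORDER BY target_type, target_id",
        (record_id, record_table)).fetchall()


conn = open_lineage_db()
record_source(conn, "42", "users", "api", "partner-api-key")
record_target(conn, "42", "users", "saas", "mailchimp")
record_target(conn, "42", "users", "s3", "partner-bucket")
print(targets_to_notify(conn, "42", "users"))
```

When a deletion request comes in, the result of targets_to_notify() is the list of integrations (or people) that have to receive a “forget this user” request.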

As CCPA is not concerned with data confidentiality requirements, I won’t repeat my GDPR advice about using encryption whenever possible (notably, for backups), or about internal security measures for authentication.

CCPA is focused on the rights of your users and you should be able to handle them (and track how you handled them). You can have manual and spreadsheet based processes if you are not too big, and you should definitely check with your legal team if and to what extent CCPA applies to your company. But if you have implemented the GDPR data subject rights, it’s likely that you are already compliant with CCPA in terms of the overall system architecture, except for a few minor details.

The post A Technical Guide to CCPA appeared first on Bozho's tech blog.

Restoring Cassandra Priam Backup With sstableloader

Post Syndicated from Bozho original https://techblog.bozho.net/restoring-cassandra-priam-backup-with-sstableloader/

I’ve previously written about setting up Cassandra and Priam for backup and cluster management. The example that I gave for backup restore there, however, is not applicable in every situation – it may not work on a completely separate cluster, for example. Or in case of partial restore to just one table, rather than the whole database.

In such cases you may choose to do a restore using the sstableloader utility. It has a straightforward syntax:

sudo sstableloader -d <node1>,<node2> -ts /etc/cassandra/conf/truststore.jks \
   -ks /etc/cassandra/conf/node.jks -f /etc/cassandra/conf/cassandra.yaml  \
   <path/to/keyspace/table>

If you look at your Priam-generated backup, it looks like you can just copy the files (e.g. via aws s3 cp on AWS) for the particular tables and import them with sstableloader. There’s a catch, however. In order to save space, Priam uses Snappy to compress all of the files. So if you try to feed them to any Cassandra utility, it will complain that they are corrupted.

So you have to decompress them before using sstableloader or anything else. But how? Well, Priam offers a service for that – you call it by passing the absolute path to a compressed file and the absolute path to where the uncompressed should be placed and it does the simple job of streaming the original through a decompressor. For decompressing an entire backup, I’ve written a python script. It assumes a certain structure, but you can parameterize it to make it more flexible. Here’s the code (excuse my non-idiomatic Python, I’m only using it for simple scripting):

#!/usr/bin/env python
# Python script used to pass each backup file through the decompression facility of Priam (using Snappy)
# so that it can be used with sstableloader for restore
import os
import requests

rootdir = '/home/ec2-user/backup'
target = '/home/ec2-user/keyspace'

for subdir, dirs, files in os.walk(rootdir):
    for filename in files:
        fullpath = os.path.join(subdir, filename)
        # the name of the directory that contains the file is used as the table name
        table = os.path.basename(os.path.abspath(subdir))
        targetdir = target + "/" + table + "/"
        # create the per-table target directory on first use
        if not os.path.exists(targetdir):
            os.makedirs(targetdir)

        url = 'http://localhost:8080/Priam/REST/v1/cassadmin/decompress?in=' + fullpath + '&out=' + targetdir + filename
        # ask Priam to decompress the file (assumption: the endpoint accepts a plain GET request)
        response = requests.get(url)
        response.raise_for_status()

Now you have decompressed backup files that you can restore using sstableloader. It may take some time if you have a lot of data, and you should not run the restore at the same time a snapshot backup is performed, as it may fail (as the documentation warns).
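If you have many tables, the sstableloader calls can be generated per table directory. The following Python sketch only builds the command lines. The function name and the host addresses are assumptions, the flags and file paths are the ones from the command shown earlier, and the directory layout is the one produced by the decompression script.

```python
import os


def sstableloader_commands(keyspace_dir, hosts,
                           truststore="/etc/cassandra/conf/truststore.jks",
                           keystore="/etc/cassandra/conf/node.jks",
                           conf="/etc/cassandra/conf/cassandra.yaml"):
    """Build one sstableloader command per table directory in keyspace_dir."""
    commands = []
    for table in sorted(os.listdir(keyspace_dir)):
        table_dir = os.path.join(keyspace_dir, table)
        # Only directories are tables; stray files are skipped
        if not os.path.isdir(table_dir):
            continue
        commands.append([
            "sstableloader",
            "-d", ",".join(hosts),  # initial hosts of the target cluster
            "-ts", truststore,
            "-ks", keystore,
            "-f", conf,
            table_dir,              # path that ends in <keyspace>/<table>
        ])
    return commands


if os.path.isdir("/home/ec2-user/keyspace"):
    # 10.0.0.1 and 10.0.0.2 are placeholder addresses
    for command in sstableloader_commands("/home/ec2-user/keyspace",
                                          ["10.0.0.1", "10.0.0.2"]):
        print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) once you are happy with what gets printed.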

As a general note here, it’s very important to have backups but it’s much more important to be able to restore from them. A backup is useless if you don’t have a restore procedure. And simply having the tools available (e.g. Priam) doesn’t mean you have a restore procedure ready to execute. You should be doing test restores on active staging data as well as full restores on an empty, newly formed cluster, as there are different restore scenarios.

The post Restoring Cassandra Priam Backup With sstableloader appeared first on Bozho's tech blog.

The Personal Data Store Pattern

Post Syndicated from Bozho original https://techblog.bozho.net/the-personal-data-store-pattern/

With the recent trend towards data protection and privacy, as well as the requirements of data protection regulations like GDPR and CCPA, some organizations are trying to reorganize their personal data so that it has a higher level of protection.

One path that I’ve seen organizations take is to apply the (what I call) “Personal data store” pattern. That is, to extract all personal data from existing systems and store it in a single place, where it’s accessible via APIs (or in some cases directly through the database). The personal data store is well guarded, audited, has proper audit trail and anomaly detection, and offers privacy-preserving features.

It makes sense to focus one’s data protection efforts predominantly in one place rather than scatter it across dozens of systems. Of course it’s far from trivial to migrate so much data from legacy systems to a new module and then upgrade them to still be able to request and use it when needed. That’s why in some cases the pattern is applied only to sensitive data – medical, biometric, credit cards, etc.

For the sake of completeness, there’s something else called “personal data stores” and it means an architecture where the users themselves store their own data in order to be in control. While this is nice in theory, in practice very few users have the capacity to do so, and while I admire the Solid project, for example, I don’t think it is a viable pattern for many organizations, as in many cases users don’t directly interact with the company, but the company still processes large amounts of their personal data.

So, the personal data store pattern is an architectural approach to personal data protection. It can be implemented as a “personal data microservice” with CRUD operations on predefined data entities, as an external service (e.g. SentinelDB, a project of mine), or just as a centralized database that has some proxy in front of it to control the access patterns. You can imagine it as externalizing your application’s “users” table and its related tables.

It sounds a little bit like a data warehouse for personal data, but the major difference is that it’s used for operational data, rather than (just) analysis and reporting. All (or most) of your other applications/microservices interact constantly with the personal data store whenever they need to access or update (or “forget”) personal data.
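To make the “externalized users table” idea more concrete, here is a minimal sketch of what such a store could look like from the point of view of a calling service. The names (PersonalDataStore, InMemoryPersonalDataStore, the caller parameter) are hypothetical and not taken from any real product; a real store would sit behind an API, with encryption and access control, rather than keep records in a map.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical contract that other services call instead of reading their own "users" table. */
interface PersonalDataStore {
    void put(String caller, String personId, Map<String, String> attributes);
    Optional<Map<String, String>> get(String caller, String personId);
    boolean forget(String caller, String personId);
}

/** In-memory stand-in, for illustration only. */
final class InMemoryPersonalDataStore implements PersonalDataStore {
    private final Map<String, Map<String, String>> records = new HashMap<>();
    private final List<String> auditTrail = new ArrayList<>();

    @Override
    public void put(String caller, String personId, Map<String, String> attributes) {
        audit(caller, "PUT", personId);
        records.put(personId, new HashMap<>(attributes));
    }

    @Override
    public Optional<Map<String, String>> get(String caller, String personId) {
        audit(caller, "GET", personId);
        Map<String, String> found = records.get(personId);
        // Hand out a read-only copy so callers cannot modify data without going through put()
        return found == null
                ? Optional.empty()
                : Optional.of(Collections.unmodifiableMap(new HashMap<>(found)));
    }

    @Override
    public boolean forget(String caller, String personId) {
        audit(caller, "FORGET", personId);
        return records.remove(personId) != null;
    }

    /** Every operation, including reads and failed lookups, leaves a trace. */
    List<String> auditTrail() {
        return Collections.unmodifiableList(new ArrayList<>(auditTrail));
    }

    private void audit(String caller, String operation, String personId) {
        auditTrail.add(Instant.now() + " " + caller + " " + operation + " " + personId);
    }
}
```

The point of the sketch is that access, update and “forget” all go through one narrow interface, so the audit trail cannot be bypassed by the integrating systems.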

Some of the main features of such a personal data store, the combination of which protects against data breaches, in my view, include:

  • Easy to use interface (e.g. RESTful web services or simply SQL) – systems that integrate with the personal data store should be built in a way that a simple DAO layer implementation gets swapped and then data that was previously accessed from a local database is now obtained from the personal data store. This is not always easy, as ORM technologies add a layer of complexity.
  • High level of general security – servers protected with 2FA, access control, segregated networks, restricted physical access, firewalls, intrusion prevention systems, etc. The good thing is that it’s easier to apply all the best practices to a single system than to apply them (and keep them applied) to every system.
  • Encryption – but not just “data at rest” encryption; especially sensitive data can and should be encrypted with well protected and rotated keys. That way the “honest but curious” admin won’t be able to extract anything from the underlying database
  • Audit trail – all infosec and data protection standards and regulations focus on accountability and traceability. There should not be a way to extract or modify personal data without leaving a trace (and ideally, that trace should be protected as well)
  • Anomaly detection – checking if there is something strange/anomalous in the data access patterns. Such strange access patterns can mean a data breach is happening, and the personal data store can actively block it. There is a lot of software out there that does anomaly detection on network traffic, but it’s much better if the rules (or machine learning) are domain-specific. “Monitor for increased traffic to those servers” is one thing, but it’s much better to be able to say “monitor for out-of-the-ordinary accesses to personal data of such and such kind”
  • Pseudonymization – many systems that need the personal data don’t actually need to know who it is about. That includes marketing, including outsourcing to 3rd parties, reporting functionalities, etc. So the personal data store can return data that does not allow a person to be identified, but a pseudo-ID instead. That way, when updates are made back to the personal data store, they can still refer to a particular person, via the pseudonymous ID, but the application that extracted the data in the first place doesn’t get to know who the data was about. This is useful in scenarios where data has to be (temporarily or not) stored in a database that lies outside the personal data store.
  • Authentication – if the company offers user authentication, this can be done via the personal data store. Passwords, two-factor authentication secrets and other means of authentication are personal data, and an important one as well. An organization may use a single-sign-on internally (e.g. Active Directory), but it doesn’t make sense to put customers there, too, so they are usually stored in a database. During authentication, the personal data store accepts all necessary credentials (username, password, 2FA code), and returns a token to be used for subsequent calls or to be used as a session cookie token.
  • GDPR (or CCPA or similar) functionalities – e.g. export of all data about a person, forgetting a person. That’s an often overlooked problem, but “give me all data about me that you have” is an enormous issue with large companies that have dozens of systems. It’s next to impossible to extract the data in a sensible way from all the systems. Tracking all these requests is itself a requirement, so the personal data store can keep track of them to present to auditors if needed.
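The pseudonymization feature from the list above can be sketched as follows. This is only an illustration under assumptions of my choosing: the pseudo-ID is derived with HMAC-SHA256 from a secret that never leaves the store, and it is made specific to each consuming system so that two consumers cannot join their datasets. The class and method names (Pseudonymizer, pseudoIdFor, resolve) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class Pseudonymizer {
    private final SecretKeySpec key;
    // pseudo-ID -> real ID; kept only inside the store so that updates can be routed back
    private final Map<String, String> reverse = new HashMap<>();

    Pseudonymizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Returns a stable pseudo-ID for this person, different for every consuming system. */
    String pseudoIdFor(String consumer, String personId) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] digest = mac.doFinal((consumer + "|" + personId).getBytes(StandardCharsets.UTF_8));
            String pseudoId = Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
            reverse.put(pseudoId, personId);
            return pseudoId;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not compute HMAC", e);
        }
    }

    /** Only the store itself can map a pseudo-ID back to the person; returns null if unknown. */
    String resolve(String pseudoId) {
        return reverse.get(pseudoId);
    }
}
```

A marketing system would receive only the value of pseudoIdFor("marketing", ...) together with the attributes it needs, and would send the same pseudo-ID back when it has an update for that person.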

That’s all easier said than done. In organizations that already have many systems working side by side and processing personal data, migration can be costly. So it’s a good idea to introduce it as early as possible, and have a plan (even if it lasts for years) to move at least sensitive personal data to the well protected silo. This silo is a data engineering effort, a system refactoring effort and an organizational effort. The benefits, though, are reduced long-term cost and reduced risks for data breaches and non-compliance.

The post The Personal Data Store Pattern appeared first on Bozho's tech blog.

Remote Log Collection on Windows

Post Syndicated from Bozho original https://techblog.bozho.net/remote-log-collection-on-windows/

Every organization needs to collect logs from multiple sources in order to put them in either a log collector or SIEM (or a dedicated audit trail solution). And there are two options for that – using an agent and agentless.

Using an agent is easy – you install a piece of software on each machine that generates logs and it forwards them wherever needed. This is however not preferred by many organizations as it complicates things – upgrading to new versions, keeping track of dozens of configurations, and potentially impacting performance of the target machines.

So some organizations prefer to collect logs remotely, or use standard tooling, already present on the target machine. For Linux that’s typically syslog, where forwarding is configured. Logs can also be read remotely via SCP/SSH.

However, on Windows things are less straightforward. You need to access the Windows Event Log facility remotely, but there is barely a single place that describes all the required steps. This blogpost comes close, but I’d like to provide the full steps, as there are many, many things that one may miss. It is a best practice to use a non-admin service account for that, and you have to grant it multiple permissions to allow reading the event logs remotely.

There are also multiple ways to read the logs remotely:

  • Through the Event Viewer UI – it’s the simplest to get right, as only one domain group is required for access
  • Through Win32 native API calls (and DCOM) – i.e. EvtOpenSession and the related methods
  • Through PowerShell Get-WinEvent (Get-EventLog is a legacy cmdlet that doesn’t support remoting)
  • Through WMI directly (e.g. this or this). To be honest, I don’t know whether the native calls and the PowerShell commands use WMI and/or CIM underneath as well – probably.
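For the PowerShell option, a collector written in Java could shell out to Get-WinEvent. The sketch below only builds the command line; it assumes the collector itself runs on Windows under the service account (so integrated authentication applies) and that the output is wanted as JSON. The class name RemoteEventLogCommand is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class RemoteEventLogCommand {

    /** Builds a powershell.exe invocation that reads the newest events of one log from a remote machine. */
    static List<String> build(String computerName, String logName, int maxEvents) {
        // Reject anything that could break out of the quoted PowerShell arguments
        if (!computerName.matches("[A-Za-z0-9._-]+")) {
            throw new IllegalArgumentException("Unexpected computer name: " + computerName);
        }
        if (!logName.matches("[A-Za-z0-9 ._/-]+")) {
            throw new IllegalArgumentException("Unexpected log name: " + logName);
        }
        if (maxEvents <= 0) {
            throw new IllegalArgumentException("maxEvents must be positive");
        }
        String script = "Get-WinEvent -ComputerName '" + computerName + "'"
                + " -LogName '" + logName + "'"
                + " -MaxEvents " + maxEvents
                + " | ConvertTo-Json";
        return Arrays.asList("powershell.exe", "-NoProfile", "-NonInteractive", "-Command", script);
    }
}
```

On the collector machine the result can be passed to new ProcessBuilder(...).start() and the JSON read from the process output. Whether the call succeeds still depends on the permissions granted to the account on the target machine.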

So, in order to get these options running, the following configurations have to be done:

  1. Allow the necessary network connections to the target machines (through network rules and firewall rules, if applicable)
  2. Go to Windows Firewall -> Inbound rules and enable the rules regarding “Remote log management”
  3. Create a service account and configure it in the remote collector. The other option is to have an account on the collector machine that is given the proper access, so that you can use the integrated AD authentication
  4. Add the account to the following domain groups: Event log readers, Distributed COM users. The linked article above mentions “Remote management users” as well, but that’s optional if you just want to read the logs
  5. Give the “Manage auditing and security log” privilege to the service account through group policies (GPO) or via “local security policy”. Find it under User Rights Assignment > Manage auditing and security log
  6. Give WMI access – open “wmimgmt.msc” -> right click -> Properties -> Security -> Advanced and allow the service account to “Execute Methods”, “Provider Write”, “Enable Account”, “Remote Enable”. To be honest, I’m not sure exactly which folder that should be applied to, and applying it to the root may be too wide, so you’d have to experiment
  7. Give registry permissions: Regedit -> Local machine -> System\CurrentControlSet\Services\eventlog\Security -> right click -> permissions and add the service account. According to the linked post you also have to modify a particular registry entry, but that’s not required just for reading the log. This step is probably the most bizarre and unexpected one.
  8. Make sure you have DCOM rights. This comes automatically with the DCOM group, but double check via DCOMCnfg -> right click -> COM security
  9. Grant permissions for the service account on c:\windows\system32\winevt. This step is not required for “simple” reading of the logs, but I’ve seen it in various places, so in some scenarios you might need to check it
  10. Make sure the application or service that is reading the logs remotely has sufficient permissions – it can usually run with admin privileges, because it’s on a separate, dedicated machine.
  11. Restart services – that is optional, but can be done just in case: Restart “Windows Remote Management (WS-Management)” and “Windows Event Log” on the target machine

As you can see, there are many things that you can miss, and there isn’t a single place in any documentation to list those steps (though there are good guides like this that go in a slightly different direction).

I can’t help but make a high-level observation here – the need to do everything above is an example of how security measures can “explode” and become really hard to manage. There are many services, groups, privileges, policies, inbound rules and whatnot, instead of just “Allow remote log reading for this user”. I know it’s inherently complex, but maybe security products should make things simpler by providing recipes for typical scenarios. Following guides in some blog is definitely worse than running a predefined set of commands. And running the “Allow remote access to event log” recipe would do just what you need. Of course, knowing which recipe to run and how to parameterize it would require specific knowledge, but you can’t do security without trained experts.

The post Remote Log Collection on Windows appeared first on Bozho's tech blog.

Protecting JavaScript Files (From Magecart-Style Attacks)

Post Syndicated from Bozho original https://techblog.bozho.net/protecting-javascript-files-from-magecart-attacks/

Most web pages now consist of multiple JavaScript files that are included in various ways (via <script> or in some more dynamic fashion, bundled and minified or not). But since these scripts interact with everything on the page, they can be a security risk.

And Magecart showcased that risk – the group attacked multiple websites, including British Airways and Ticketmaster, and stole a few hundred thousand credit card numbers.

It is a simple attack where an attacker inserts a malicious JavaScript snippet into a trusted JavaScript file, collects credit card details entered into payment forms and sends them to an attacker-owned website. Obviously, the easy part is writing the malicious JavaScript; the hard part is getting it on the target website.

Many websites rely on externally hosted assets (including scripts) – be it a CDN, or a dedicated asset server (as in the case of British Airways). These externally hosted assets may be vulnerable in several ways:

  • Asset servers may be less protected than the actual server, because they are just static assets, what could go wrong?
  • Credentials to access CDN configuration may be leaked which can lead to an attacker replacing the original source scripts with their own
  • Man-in-the-middle attacks are possible if the asset server is misconfigured (e.g. allowing TLS downgrade attack)
  • The external service (e.g. CDN) that was previously trusted can go rogue – that’s unlikely with big providers, but smaller and cheaper ones are less predictable

Once the attackers have replaced the script, they silently collect data until they are caught. And this can take a long time.

So how do you protect against those attacks? The typical advice is to introduce a content security policy, which will not allow scripts from untrusted domains to be executed. This is a good idea, but doesn’t help in the scenario where a trusted domain is compromised. There are several main approaches, and I’ll summarize them below:

  • Subresource integrity – this is a browser feature that lets you specify the hash of a script file and validates that hash when the page loads. If it doesn’t match the hash of the actually loaded script, the script is blocked. This sounds great, but has several practical implications. First, it means you need to complicate your build pipeline so that it calculates the hashes of minified and bundled resources and injects those hashes in the page templates. It’s a tedious process, but it’s doable. Then there are the dynamically loaded scripts where you can’t use this feature, and there are the browsers that don’t support it fully (Edge, IE and Safari on mobile). And finally, if you don’t have a good build pipeline (which many small websites don’t), a very small legitimate change in the script can break your entire website.
  • Don’t use external services – that sounds straightforward but it isn’t always. CDNs exist for a reason and optimize your site loading speeds and therefore ranking, internal policies may require using a dedicated asset server, sometimes plugins (e.g. for WordPress) may fetch external resources. An exception to this rule is allowed if you somehow sandbox the third party script (e.g. via iframe as explained in the link above)
  • Secure all external servers properly – if you can do that, that’s great – upgrade the supported cipher suites, monitor for 0days, use only highly trusted CDNs. Regardless of anything, you should obviously always strive to do that. But it requires expertise and resources, which may not be available to every company and every team.
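The build step needed for subresource integrity can be sketched like this: hash the final (minified, bundled) bytes of the script and emit the value for the integrity attribute. This is a minimal sketch, assuming SHA-384 as the hash (the integrity attribute also accepts sha256 and sha512); the class name SubresourceIntegrity and the helper scriptTag are hypothetical and not part of any build tool.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class SubresourceIntegrity {

    /** Computes the value of the integrity attribute for the exact bytes that will be served. */
    static String integrityValue(byte[] scriptBytes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-384").digest(scriptBytes);
            // The attribute uses standard (not URL-safe) Base64
            return "sha384-" + Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-384 is not available", e);
        }
    }

    /** Renders a script tag that the browser will refuse to execute if the content changes. */
    static String scriptTag(String src, byte[] scriptBytes) {
        return "<script src=\"" + src + "\""
                + " integrity=\"" + integrityValue(scriptBytes) + "\""
                + " crossorigin=\"anonymous\"></script>";
    }
}
```

Because the hash covers every byte, it has to be recomputed after each legitimate change to the script, which is exactly the pipeline burden described in the list above.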

There is one more scenario that may sound strange – if an attacker hacks into your main application server(s), they can replace the scripts with whatever they want. It sounds strange at first, because if they have access to the server, it’s game over anyway. But it’s not always full access with RCE – might be a limited access. Credit card numbers are usually not stored in plain text in the database, so having access to the application server may not mean you have access to the credit card numbers. And changing the custom backend code to collect the data is much more unpredictable and time-consuming than just replacing the scripts with malicious ones. None of the options above protect against that (as in this case the attacker may be able to change the expected hash for the subresource integrity check)

Because of the limitations of the above approaches, at my company we decided to provide a tool to monitor your website for such attacks. It’s called Scriptinel.com (short for Script Sentinel) and is currently in early beta. It’s mainly targeted at small website owners who can’t get any of the above 3 points, but can be used for sophisticated websites as well.

What it does is straightforward – it scans a given URL, extracts all scripts from it (even the dynamic ones), and starts monitoring them for changes with periodic requests. If it discovers a change, it notifies the website owner so that they can react.
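The core of such monitoring can be sketched in a few lines. This is not Scriptinel’s actual implementation, only an illustration of the idea under my own assumptions: the first version seen of each script becomes the baseline, any later difference is reported, and the baseline moves only when the owner accepts the new version. Fetching the scripts periodically and sending the notification are left out, and the class name ScriptChangeDetector is hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class ScriptChangeDetector {
    // script URL -> SHA-256 of the last version accepted as legitimate
    private final Map<String, String> baselineHashes = new HashMap<>();

    /** Returns true if the content differs from the baseline; the first sighting only records it. */
    boolean hasChanged(String scriptUrl, byte[] currentContent) {
        String currentHash = sha256Hex(currentContent);
        String baseline = baselineHashes.putIfAbsent(scriptUrl, currentHash);
        return baseline != null && !baseline.equals(currentHash);
    }

    /** Called once the website owner has confirmed that a change was legitimate. */
    void acceptAsBaseline(String scriptUrl, byte[] content) {
        baselineHashes.put(scriptUrl, sha256Hex(content));
    }

    private static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

Keeping the old baseline until the change is accepted means a malicious replacement keeps being reported on every scan rather than only once.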

This means that the attacker may have a few minutes to collect data, but time is an important factor here – this is not a “SELECT *” data breach; it relies on customers using the website. So a few minutes minimizes the damage. And it doesn’t break your website (I guess we can have a script to include that blocks the page if scriptinel has found discrepancies). It also doesn’t require changes in the build process to include hashes. Of course, such a reactive approach is not perfect, especially if there is nobody to react, but monitoring is a good idea regardless of whether other approaches are used.

There is the issue of protected pages and pages that are not directly accessible via a GET request – e.g. a payment page. For that you can enter your javascript files individually, rather than having the tool scan the page. We can add a more sophisticated user journey scan, with specifying credentials and steps to reach the protected pages, but for now that seems unnecessary.

How does it solve the “main server compromised” problem? Well, nothing solves that perfectly, as the attacker can make changes that serve the legitimate version of the script to your monitoring servers (identifying them by IP) and the modified scripts to everyone else. This can be done on the compromised external asset servers as well (though not with leaked CDN credentials). However, this implies the attacker knows that Scriptinel is used, knows the IP addresses of our scanners, and has gained sufficient control to serve different versions based on IP. This raises the bar significantly, and can even be made impossible to pull off if we regularly change the IP addresses within a sufficiently large IP range.

Such functionality may be available in some enterprise security suites, though I’m not aware of it (if it exists somewhere, please let me know).

Overall, the problem is niche, but tough, and not solving it can lead to serious data breaches even if your database is perfectly protected. Scriptinel is a simple to use, good enough solution (and one that’s arguably better than the other options).

Good information security is the right combination of knowledge, implementation of best practices and tools to help you with that. And maybe Scriptinel is one such tool.

The post Protecting JavaScript Files (From Magecart-Style Attacks) appeared first on Bozho's tech blog.

Let’s Annotate Our Methods With The Features They Implement

Post Syndicated from Bozho original https://techblog.bozho.net/lets-annotate-our-methods-with-the-features-they-implement/

Writing software consists of very little actual “writing”, and much more thinking, designing, reading, “digging”, analyzing, debugging, refactoring, aligning and meeting others.

The reading and digging part is where you try to understand what has been implemented before, why it has been implemented, and how it works. In larger projects it becomes increasingly hard to find what is happening and why – there are so many classes that interact, and so many methods participate in implementing a particular feature.

That’s probably because there is a mismatch between the programming units (classes, methods) and the business logic units (features). Product owners want a “password reset” feature, and they don’t care if it’s done using framework configuration, custom code split in three classes, or one monolithic controller method that does that job.

This mismatch is partially addressed by the so called BDD (behaviour driven development), as business people can define scenarios in a formalized language (although they rarely do, it’s still up to the QAs or developers to write the tests). But having your tests organized around features and behaviours doesn’t mean the code is, and BDD doesn’t help in making your way through the codebase in search of why and how something is implemented.

Another issue is linking a piece of code to the issue tracking system. Source control conventions and hooks allow for setting the issue tracker number as part of the commit, and then when browsing the code, you can annotate the file and see the issue number. However, due to the many changes, even a very strict team will end up with methods that are related to multiple issues and you can’t easily tell which is the proper one.

Yet another issue with the lack of a “feature” unit in programming languages is that you can’t trivially reuse existing projects to start a new one. We’ve all been there – you have a similar project and you want to get a skeleton to get things running faster. And while there are many tools to help with that (Spring Boot, Spring Roo, and other scaffolding utilities), they can rarely deliver what you need – you always have to tweak something, delete something, customize some configuration, as defaults are almost never practical.

And I have a simple proposal that will help with the issues above. As with any complex problem, simple ideas don’t solve everything, but are at least a step forward.

The proposal is in the title – let’s annotate our methods with the features they implement. Let’s have @Feature(name = "Forgotten password", issueTrackerCode="PROJ-123"). A method can implement multiple features, but that is generally discouraged by best practices (e.g. the single responsibility principle). The granularity of “feature” is something that has to be determined by each team and is the tricky part – sometimes an epic describes a feature, sometimes individual stories or even subtasks do. A definition of a feature should be agreed upon and every new team member should be told what to do and how to interpret it.
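A minimal sketch of what this could look like is shown below. The annotation mirrors the @Feature(name, issueTrackerCode) form from the text; everything else (AccountService, its methods, the issue code PROJ-101 and the FeatureIndex helper) is made up for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Marks a method (or class) with the feature it implements. */
@Retention(RetentionPolicy.RUNTIME) // kept at runtime so that tools can read it via reflection
@Target({ElementType.METHOD, ElementType.TYPE})
@interface Feature {
    String name();
    String issueTrackerCode() default "";
}

/** Hypothetical service, used only to have something to annotate. */
class AccountService {
    @Feature(name = "Forgotten password", issueTrackerCode = "PROJ-123")
    void sendResetLink(String email) { }

    @Feature(name = "Forgotten password", issueTrackerCode = "PROJ-123")
    void resetPassword(String token, String newPassword) { }

    @Feature(name = "Account creation", issueTrackerCode = "PROJ-101")
    void register(String email, String password) { }
}

/** Tiny example of a tool: lists, per feature, the methods that implement it. */
final class FeatureIndex {
    static Map<String, List<String>> methodsByFeature(Class<?>... classes) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Class<?> type : classes) {
            for (Method method : type.getDeclaredMethods()) {
                Feature feature = method.getAnnotation(Feature.class);
                if (feature != null) {
                    index.computeIfAbsent(feature.name(), k -> new ArrayList<>())
                            .add(type.getSimpleName() + "#" + method.getName());
                }
            }
        }
        return index;
    }
}
```

Calling FeatureIndex.methodsByFeature(AccountService.class) gives a map from feature name to the methods that implement it, which is the kind of view that is otherwise hard to get from the code alone. Supporting several features on one method would need a repeatable annotation, which is left out here.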

There is of course a lot of complexity, e.g. for generic methods like DAO methods, utility methods, or methods that are reused in too many places. But they also represent features, it’s just that these features are horizontal. “Data access layer” is a feature – a more technical one indeed, but it counts, and maybe deserves a story in the issue tracker.

Your features can actually be listed in one or several enums, grouped by type – business, horizontal, performance, etc. That way you can even compose features – e.g. account creation contains the logic itself, database access, a security layer.

How does such a proposal help?

  • Raises awareness of the single responsibility of methods and of the need for code to be readable
  • Provides a rationale for the existence of each method. Even if a proper comment is missing, the annotation will put a method (or a class) in context
  • Helps navigating code and fixing issues (if you can see all places where a feature is implemented, you are more likely to spot an issue)
  • Allows tools to analyze your features – amount, complexity, how chaotically a feature is spread across the code base, test coverage per feature, etc.
  • Allows tools to use existing projects for scaffolding for new ones – you specify the features you want to have, and they are automatically copied

At this point I’m supposed to give a link to a GitHub project for a feature annotation library. But it doesn’t make sense to have a single-annotation project. It can easily be part of Guava or something similar, or it can be manually created in each project. The complex part – the tools that will do the scanning and analysis – deserves separate projects, but unfortunately I don’t have time to write one.

But even without the tools, the concept of annotating methods with their high-level features is, I think, a useful one. Instead of trying to deduce why a method is there and what requirements it has to implement (and whether all necessary tests were written at the time), such an annotation can come in handy.

The post Let’s Annotate Our Methods With The Features They Implement appeared first on Bozho's tech blog.

JKS: Extending a Self-Signed Certificate

Post Syndicated from Bozho original https://techblog.bozho.net/jks-extending-a-self-signed-certificate/

Sometimes you don’t have a PKI in place but you still need a key and a corresponding certificate to sign stuff (outside of the TLS context). And after the certificate in the initially generated JKS file expires, you have a few options – either generate an entirely new keypair, or somehow “extend” the existing certificate. This is useful mostly for testing and internal systems, but still worth mentioning.

Extending certificates is generally not possible – once they expire, they’re done. However, you can have a new certificate with the same private key and a longer validity period. This sounds like something that should be easy to do, but it turns out it isn’t that easy with keytool. Even with my favourite tool, keystore explorer, it’s not immediately possible.

In order to reuse the private key to have a new, longer certificate, you need to do the following:

  1. Export the private key (with keytool & openssl or through the keystore-explorer UI, which is much simpler)
  2. Make a certificate signing request (with keytool or through the keystore-explorer UI)
  3. Sign the request with the private key (i.e. self-signed)
  4. Import the certificate in the store to replace the old (expired) one

The last two steps do not seem to be straightforward with keytool or keystore explorer. If you try to sign the request with your existing keystore keypair, the current certificate is used as the root of the chain (and you don’t want that). And you can’t remove the certificate and generate a new one.

So you need to use OpenSSL:

openssl x509 -req -days 3650 -in req.csr -signkey private.key -sha256 -extfile extfile.cnf -out result.crt

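The extfile.cnf is optional and is used if you want to specify extensions. E.g. for timestamping, the extension file has to mark the certificate for that purpose through the extended key usage extension, with a line such as extendedKeyUsage = critical,timeStamping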


After that “simply” create a new keystore and import the private key and the newly generated certificate. This is straightforward through the keystore-explorer UI, and much less easy through the command line.

You’ve noticed my preference for keystore-explorer. It is a great tool that makes working with keys and keystores easy and predictable, as opposed to command-line tools like keytool and openssl, which I’m sure nobody is able to use without googling every single command. Of course, if you have to do very specific or odd stuff, you’ll have to resort to the command line, but for most operations the UI is sufficient (unless you have to automate it, in which case, obviously, use the CLI).

You’d rarely need to do what I’ve shown above, but in case you have to, I hope the hints above were useful.

The post JKS: Extending a Self-Signed Certificate appeared first on Bozho's tech blog.

Multiple Cache Configurations with Caffeine and Spring Boot

Post Syndicated from Bozho original https://techblog.bozho.net/multiple-cache-configurations-with-caffeine-and-spring-boot/

Caching is key for performance of nearly every application. Distributed caching is sometimes needed, but not always. In many cases a local cache would work just fine and there’s no need for the overhead and complexity of the distributed cache.

So, in many applications, including plain Spring and Spring Boot, you can use @Cacheable on any method and its result will be cached so that the next time the method is invoked, the cached result is returned.

Spring has some default cache manager implementations, but external libraries are usually better and more flexible than simple implementations. Caffeine, for example, is a high-performance Java cache library. And Spring Boot comes with a CaffeineCacheManager. So, ideally, that’s all you need – you just create a cache manager bean and you have caching for your @Cacheable-annotated methods.

However, the provided cache manager allows you to configure just one cache specification. Cache specifications include the expiry time, initial capacity, max size, etc. So all of your caches under this cache manager will be created with a single cache spec. The cache manager supports a list of predefined caches as well as dynamically created caches, but in both cases a single cache spec is used. And that’s rarely useful for production. Built-in cache managers are something you have to be careful with, as a general rule.

There are a few blogposts that tell you how to define custom caches with custom specs. However, these options do not support the dynamic, default cache spec usecase that the built-in manager supports. Ideally, you should be able to use any name in @Cacheable and automatically a cache should be created with some default spec, but you should also have the option to override that for specific caches.

That’s why I decided to use an approach that is simpler than defining all caches in code and that allows for greater flexibility. It extends the CaffeineCacheManager to provide that functionality:

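import java.util.HashMap;
import java.util.Map;

import org.springframework.beans.factory.InitializingBean;
import org.springframework.cache.caffeine.CaffeineCacheManager;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.CacheLoader;
import com.github.benmanes.caffeine.cache.Caffeine;

/**
 * Extending Caffeine cache manager to allow flexible per-cache configuration
 */
public class FlexibleCaffeineCacheManager extends CaffeineCacheManager implements InitializingBean {
    // cache name -> Caffeine spec string, e.g. "expireAfterWrite=1h"
    private Map<String, String> cacheSpecs = new HashMap<>();

    // cache name -> builder parsed from the spec above
    private Map<String, Caffeine<Object, Object>> builders = new HashMap<>();

    private CacheLoader cacheLoader;

    @Override
    public void afterPropertiesSet() throws Exception {
        for (Map.Entry<String, String> cacheSpecEntry : cacheSpecs.entrySet()) {
            builders.put(cacheSpecEntry.getKey(), Caffeine.from(cacheSpecEntry.getValue()));
        }
    }

    @Override
    @SuppressWarnings("unchecked")
    protected Cache<Object, Object> createNativeCaffeineCache(String name) {
        Caffeine<Object, Object> builder = builders.get(name);
        if (builder == null) {
            // no dedicated spec for this cache: fall back to the manager-wide default
            return super.createNativeCaffeineCache(name);
        }

        if (this.cacheLoader != null) {
            return builder.build(this.cacheLoader);
        } else {
            return builder.build();
        }
    }

    public Map<String, String> getCacheSpecs() {
        return cacheSpecs;
    }

    public void setCacheSpecs(Map<String, String> cacheSpecs) {
        this.cacheSpecs = cacheSpecs;
    }

    public void setCacheLoader(CacheLoader cacheLoader) {
        this.cacheLoader = cacheLoader;
    }
}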
In short, it creates one Caffeine builder per spec and uses that instead of the default builder when a new cache is needed.

Then a sample XML configuration would look like this:

<bean id="cacheManager" class="net.bozho.util.FlexibleCaffeineCacheManager">
    <property name="cacheSpecification" value="expireAfterWrite=10m"/>
    <property name="cacheSpecs">
        <map>
            <entry key="statistics" value="expireAfterWrite=1h"/>
        </map>
    </property>
</bean>

With Java config it’s pretty straightforward – you just set the cacheSpecs map.

While Spring has already turned into a huge framework that provides all kinds of features, it hasn’t abandoned the design principles of extensibility.

Extending built-in framework classes is something that happens quite often, and it should be in everyone’s toolbox. These classes are created with extension in mind – you’ll notice that many methods in the CaffeineCacheManager are protected. So we should make use of that whenever needed.

The post Multiple Cache Configurations with Caffeine and Spring Boot appeared first on Bozho's tech blog.

Command-line SQL Client for IBM i 7.x (AS/400)

Post Syndicated from Bozho original https://techblog.bozho.net/command-line-sql-client-for-ibm-i-7-x-as-400/

In the category of “niche blogposts”, this is probably the “nichest”. But it might be useful, so I’ll share it.

Recently I had to interface an IBM i system (think “mainframe”, though not exactly) from a command-line-only Linux environment. The interesting part is that you can access the IBM i files through SQL syntax, as they are tabular in nature. However, I failed to find a proper tool for that. There are UI tools by IBM but they don’t work in a headless environment.

So I decided to write my own simple tool – a command line SQL client for IBM i 7.x. It relies on an IBM-provided library (which, by the way, fails in some cases in headless environments as well, as it tries to prompt for certain data using awt/swing). The tool can be used after building it with maven and by simply executing SQL queries after connecting:

java -jar as400-sql-client.jar <connectionString> <username> <password>

It can be found here. Apart from being useful in navigating the IBM system, it relies on the jline project, which allows you to create rich command line tools that support autocomplete, history, coloring, etc.

I hope that nobody will need this tool, but in the rare case that someone does need it, I hope to save them hours of struggle.

The post Command-line SQL Client for IBM i 7.x (AS/400) appeared first on Bozho's tech blog.

7 Questions To Ask Yourself About Your Code

Post Syndicated from Bozho original https://techblog.bozho.net/7-questions-to-ask-yourself-about-your-code/

The other day I was thinking – why is writing good code so hard? Why has the industry still not got to producing quality software, despite years of effort, best practices, methodologies and tools? The answer to those questions is anything but simple. It involves economic incentives, market realities, deadlines, formal education, industry standards, an insufficient number of developers on the market, etc.

As an organization, in order to produce quality software, you have to do a lot. Setup processes, get your recruitment right, be able to charge the overhead of quality to your customers, and actually care about that.

But even with all the measures taken, you can’t guarantee quality code. First, because that’s subjective, but second, because it always comes down to the individual developers. And not simply whether they are capable of writing quality software, but also whether they are actually doing it.

And as a developer, you may fit the process and still produce mediocre code. This is why my thoughts took me to the code from the eyes of the developer, but in the context of software as a whole. Tools can automatically catch code style issues, cyclomatic complexity, large methods, too many method parameters, circular dependencies, etc. But even if you cover those, you are still not guaranteed to have produced quality software.

So I came up with seven questions that we as developers should ask ourselves each time we commit code.

  1. Is it correct? – does the code implement the specification? If there is no clear specification, did you make a sufficient effort to find out the expected behaviour? And is that behaviour tested somehow – by automated tests preferably, or at least by manual testing?
  2. Is it complete? – does it take care of all the edge cases, regardless of whether they are defined in the spec or not. Many edge cases are technical (broken connections, insufficient memory, changing interfaces, etc.).
  3. Is it secure? – does it prevent misuse, does it follow best security practices, does it validate its input, does it prevent injections, etc. Is it tested to prove that it is secure against these known attacks. Security is much more than code, but the code itself can introduce a lot of vulnerabilities.
  4. Is it readable and maintainable? – does it allow other people to easily read it, follow it and understand it? Does it have proper comments, describing how a certain piece of code fits into the big picture, does it break down code in small, readable units.
  5. Is it extensible? – does it allow being extended with additional use cases, does it use the appropriate design patterns that allow extensibility, is it parameterizable and configurable, does it allow writing new functionality without breaking old one, does it cover a sufficient percentage of the existing functionality with tests so that changes are not “scary”.
  6. Is it efficient? – does it work well under high load, does it care about algorithmic complexity (without being prematurely optimized), does it use batch processing, does it avoid loading big chunks of data in memory at once, does it make proper use of asynchronous processing.
  7. Is it something to be proud of? – does it represent every good practice that your experience has taught you? Not every piece of code is glorious, as most perform mundane tasks, but is the code something to be proud of or something you’d hope nobody sees? Would you be okay to put it on GitHub?

I think we can internalize those questions. Will asking them constantly make a difference? I think so. Would we magically get quality software if every developer asked themselves these questions about their code? Certainly not. But we’d have better code, when combined with existing tools, processes and practices.

Quality software depends on many factors, but developers are one of the most important ones. Bad software is too often our fault, and by asking ourselves the right questions, we can contribute to good software as well.

The post 7 Questions To Ask Yourself About Your Code appeared first on Bozho's tech blog.

How to create secure software? Don’t blink! [talk]

Post Syndicated from Bozho original https://techblog.bozho.net/how-to-create-secure-software-dont-blink-talk/

Last week Acronis (famous for their TrueImage) organized a conference in Sofia about cybersecurity for developers and I was invited to give a talk.

It’s always hard to pick a topic for a talk at a developer conference that is not too narrowly focused – if you choose something too high level, you can be useless to the audience and seen as a “bullshitter”; if you pick something too specific, half of the audience may be bored because it is not their area of work.

So I chose the middle ground – an overview talk with as many specifics as possible in it. I tried to tell interesting stories of security vulnerabilities to illustrate my points. The talk is split in several parts:

  • Purpose of attacks
  • Front-end vulnerabilities (examples and best practices)
  • Back-end vulnerabilities (examples and best practices)
  • Infrastructure vulnerabilities (examples and best practices)
  • Human factor vulnerabilities (examples and best practices)
  • Thoughts on how this fits into the bigger picture of software security

You can watch the 30-minute video here:

If you would like to download my slides, click here, or view them on SlideShare:

The point is – security is hard and there are a million things to watch for and a million things that can go wrong. You should minimize risk by knowing and following as many best practices as possible. And you should not assume you are secure, as even the best companies make rookie mistakes.

The security mindset, which is partly formalized by secure coding practices, is at the core of having secure software. Asking yourself constantly “what could go wrong” will make software more secure. How to make all software more secure, not just the software we are creating, is a whole other topic, but it is less technical and goes through the topics of public policy, financial incentives to invest in security and so on.

For technical people it’s important to know how to make a focused effort to make a system more secure. And to apply what we know, because we may know a lot and postpone it for “some other sprint”.

And as a person from the audience asked me — is not blinking really the way? Of course not, that effort won’t be justified. But if we cover as much of the risks as possible, that will give us some time to blink.

The post How to create secure software? Don’t blink! [talk] appeared first on Bozho's tech blog.

Avoid Lists in Cassandra

Post Syndicated from Bozho original https://techblog.bozho.net/avoid-lists-in-cassandra/

Apache Cassandra is a fast and scalable database which over the years became almost as easy to use as a traditional SQL database. At least on the surface.

You can use SQL-like queries, but they have a lot of limitations; you have a schema, but it’s not as flexible to modify as in a SQL database; you have the same tabular structure with a primary key, but it’s more complicated due to the differentiation between partition key and sorting key. And there are a lot of underlying details that seem irrelevant at first, but are crucial for performance and data consistency, like tombstones, SSTable compaction and so on.

But I want to discuss the “list” column type, as recently we’ve had a very elusive issue with it. We are in the business of guaranteeing data integrity, and that’s why our records are not updated, ever. This is a good fit for Cassandra, as updates are tricky to get right. But on one of our deployments we noticed something strange – very rarely, the hash of the data in a particular entry out of millions would not match upon comparison with the indexed data. Upon investigation, we noticed that a column of type “list” got duplicate values. It was not an issue with the code, because in this particular case the code was always using Collections.singletonList(..)

It appears that Cassandra is trying to be clever and when it sees identical entries in a batch insert, instead of overriding one with the other, it tries to merge them, resulting in a list with duplicate entries. Accounts of the issue are reported here and here.

Now, batches are a difficult topic and one of those things that look straightforward but aren’t. In most cases, batches are an anti-pattern. There are cases where batches are useful, but they are rarer than expected. That’s because of the distributed nature of Cassandra. Another complication comes from whether you are using a token-aware or token-unaware client policy, i.e. whether your client knows where each record belongs in order to send the request to it. I won’t go into details about batches, as they are well explained in the two linked articles.

Back to lists – since in our case we don’t have identical records in a batch, the issue was probably manifested because of a network timeout where the client didn’t receive confirmation of the write and re-attempted sending the same statement again. Whether being in a batch or not affects it, I can’t be sure. But it’s probably safer to assume that it might happen with or without a batch. I.e. lists can be merged in unexpected situations.
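To see why a replayed write is harmless for a set but not for a list, here is a minimal in-memory sketch in plain Java. It does not talk to Cassandra at all; the `RetryDemo` class and its method names are made up for illustration, and the "retry" is simulated by applying the same write several times, as a client might do after a timeout.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RetryDemo {
    // Appending is not idempotent: replaying the same write changes the result.
    static List<String> appendWithRetry(String value, int attempts) {
        List<String> column = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            column.add(value);
        }
        return column;
    }

    // Adding to a set is idempotent: replaying the write still leaves one element.
    static Set<String> addToSetWithRetry(String value, int attempts) {
        Set<String> column = new LinkedHashSet<>();
        for (int i = 0; i < attempts; i++) {
            column.add(value);
        }
        return column;
    }
}
```

With two attempts the list ends up with two copies of the value, while the set still holds one.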

This is a serious reason for not using lists at all. Additional arguments are given by Walmart:

Sets should be preferred to Lists as Sets (and Maps) avoid read-before-write pattern for updates and deletes

And this is just for a small number of items. Using collections for a large number of items (e.g. thousands) is another issue, as you can’t load the items in portions – they are all read at once.

In a Java application, for example, you can easily substitute the List with a Set even if the underlying column is of type List, and that would help temporarily avoid the issues – data may still be duplicated in the database, but at least the application will work with unique values. Keep in mind, though, that ordering is not guaranteed by the Java Set, so if it matters for your logic, make sure you order by some well-defined comparison criteria.
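As a rough illustration of that workaround, the sketch below collapses duplicates read from a list column and returns them in a well-defined order. The `ListColumnReader` name is hypothetical and natural string ordering is only an assumption; use whatever comparison your logic actually needs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class ListColumnReader {
    // Drop duplicates and return the values in a deterministic (natural) order.
    static List<String> uniqueSorted(List<String> rawColumn) {
        TreeSet<String> unique = new TreeSet<>(Comparator.naturalOrder());
        unique.addAll(rawColumn);
        return new ArrayList<>(unique);
    }
}
```

A `TreeSet` is used here instead of a `HashSet` precisely because it makes the ordering explicit rather than accidental.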

The general advice of “avoid lists” (and “avoid batches”) paints an accurate picture of Cassandra. It looks straightforward to use, but once you get to production, you may realize there were some suboptimal design decisions.

The post Avoid Lists in Cassandra appeared first on Bozho's tech blog.

Certificate Transparency Verification in Java

Post Syndicated from Bozho original https://techblog.bozho.net/certificate-transparency-verification-in-java/

So I had this naive idea that it would be easy to do certificate transparency verification as part of each request in addition to certificate validity checks (in Java).

With half of the weekend sacrificed, I can attest it’s not that trivial. But what is certificate transparency? In short – it’s a publicly available log of all TLS certificates in the world (which are still called SSL certificates even though SSL is obsolete). You can check if a certificate is published in that log and if it’s not, then something is suspicious, as CAs have to push all of their issued certificates to the log. There are other use-cases, for example registering for notifications for new certificates for your domains to detect potentially hijacked DNS admin panels or CAs (Facebook offers such a tool for free).

What I wanted to do is the former – make each request from a Java application verify the other side’s certificate in the certificate transparency log. It seems that this is not available out of the box (if it is, I couldn’t find it). In one discussion about JEP 244 it seems that the TLS extension related to certificate transparency was discussed, but I couldn’t find whether it’s supported in the end.

I started by thinking you could simply get the certificate, and check its inclusion in the log by the fingerprint of the certificate. That would’ve been too easy – the logs do allow for checking by hash; however, it’s not the fingerprint of a certificate, but instead a signed certificate timestamp – a signature issued by the log prior to inclusion. To quote the CT RFC:

The SCT (signed certificate timestamp) is the log’s promise to incorporate the certificate in the Merkle Tree

A merkle tree is a very cool data structure that allows external actors to be convinced that something is within the log by providing an “inclusion proof” which is much shorter than the whole log (thus saving a lot of bandwidth). In fact the coolness of merkle trees is why I was interested in certificate transparency in the first place (as we use merkle trees in my current log-oriented company)
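To make the “inclusion proof” idea concrete, here is a small sketch of verifying a Merkle audit path in plain Java. It uses the RFC 6962 hashing convention (SHA-256 with a 0x00 prefix for leaves and 0x01 for interior nodes) and walks the path the way the CT specifications describe. The `MerkleInclusion` class name is made up, and this is not the code of the Certificate Transparency Java library.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class MerkleInclusion {
    private static byte[] sha256(byte prefix, byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(prefix);
            for (byte[] part : parts) {
                md.update(part);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available on every Java platform", e);
        }
    }

    // Domain separation: 0x00 prefix for leaves, 0x01 for interior nodes.
    static byte[] leafHash(byte[] entry) {
        return sha256((byte) 0x00, entry);
    }

    static byte[] nodeHash(byte[] left, byte[] right) {
        return sha256((byte) 0x01, left, right);
    }

    // Recomputes the root from a leaf hash and its audit path, then compares it to the known root.
    static boolean verify(long leafIndex, long treeSize, byte[] leafHash, List<byte[]> path, byte[] root) {
        if (leafIndex < 0 || leafIndex >= treeSize) {
            return false;
        }
        long fn = leafIndex;
        long sn = treeSize - 1;
        byte[] r = leafHash;
        for (byte[] p : path) {
            if (sn == 0) {
                return false; // path is longer than the tree is deep
            }
            if ((fn & 1) == 1 || fn == sn) {
                r = nodeHash(p, r); // sibling is on the left
                // Skip levels where this node has no sibling (unbalanced right edge).
                while ((fn & 1) == 0 && fn != 0) {
                    fn >>= 1;
                    sn >>= 1;
                }
            } else {
                r = nodeHash(r, p); // sibling is on the right
            }
            fn >>= 1;
            sn >>= 1;
        }
        return sn == 0 && Arrays.equals(r, root);
    }
}
```

For a log with n entries the path holds roughly log2(n) hashes, which is why the proof is so much shorter than the log itself.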

So, in order to check for inclusion, you have to somehow obtain the SCT. I initially thought it would be possible with the Certificate Transparency Java library, but you can’t. Once you have it, you can use the client to check it in the log, but obtaining it is harder. (Note: for server-side verification it’s fine to query the log via HTTP; browsers, however, use DNS queries in order to preserve the anonymity of users).

Obtaining the SCT can be done in three ways, depending on what the server and/or log and/or CA have chosen to support: the SCT can be included in the certificate, or it can be provided as a TLS extension during the TLS handshake, or it can be included in the OCSP stapling response, again during the handshake. Unfortunately, the few certificates that I checked didn’t have the SCT stored within them, so I had to go to a lower level and debug the TLS handshake.

I enabled TLS handshake verbose output, and lo and behold – there was nothing there. Google does include SCTs as a TLS extension (according to Qualys), but the Java output didn’t say anything about it.

Fortunately (?) Google has released Conscrypt – a Java security provider based on Google’s fork of OpenSSL. Things started to get messy… but I went for it, included Conscrypt and registered it as a security provider. I had to make a connection using the Conscrypt TrustManager (initialized with all the trusted certs in the JDK):

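// Assumption: ctx is a plain JSSE SSLContext; logStore is the custom CTLogStore implementation discussed below
SSLContext ctx = SSLContext.getInstance("TLS");
KeyStore trustStore = KeyStore.getInstance("JKS");
// Load the CA certificates bundled with the JDK ("changeit" is the default keystore password)
try (InputStream in = new FileInputStream(System.getenv("JAVA_HOME") + "/lib/security/cacerts")) {
    trustStore.load(in, "changeit".toCharArray());
}
ctx.init(null, new TrustManager[] {new TrustManagerImpl(trustStore,
    null, null, null, logStore, null,
    new StrictCTPolicy())}, new SecureRandom());

URL url = new URL("https://google.com");
HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
// Make this connection use the certificate-transparency-aware trust manager
conn.setSSLSocketFactory(ctx.getSocketFactory());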

And of course it didn’t work initially, because Conscrypt doesn’t provide implementations of some core interfaces needed – the CTLogStore and CTPolicy classes. The CTLogStore actually is the important bit that holds information about all the known logs (I still find it odd to call a “log provider” simply “log”, but that’s the accepted terminology). There is a list of known logs, in JSON form, which is cool, except it took me a while to figure out (with external help) what exactly those public keys are. What are they – RSA, ECC? How are they encoded? You can’t find that in the RFC, nor in the documentation. It can be seen here that it’s the “DER encoding of the SubjectPublicKeyInfo ASN.1 structure”. Ugh.

BouncyCastle to the rescue. My relationship with BouncyCastle is a love-hate one. I hate how unintuitive it is and how convoluted its APIs are, but I love that it has (almost) everything cryptography-related that you may ever need. After some time wasted with trying to figure how exactly to get that public key converted to a PublicKey object, I found that using PublicKeyFactory.createKey(Base64.getDecoder().decode(base64Key)); gives you the parameters of whatever algorithm is used – it can return Elliptic curve key parameters or RSA key parameters. You just have to then wrap them in another class and pass them to another factory (typical BouncyCastle), and hurray, you have the public key.

Of course now Google’s Conscrypt didn’t work again, because after the transformations the publicKey’s encoded version was not identical to the original bytes, and so the log ID calculation was wrong. But I fixed that by some reflection, and finally, it worked – the certificate transparency log was queried and the certificate was shown to be valid and properly included in the log.
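For comparison, the same two steps can be sketched with the JDK alone, without BouncyCastle: a DER-encoded SubjectPublicKeyInfo is exactly what `X509EncodedKeySpec` expects, so one can simply try the EC and RSA key factories in turn. The log ID, per RFC 6962, is the SHA-256 hash of those DER bytes, so hashing the original decoded bytes sidesteps the re-encoding mismatch described above. This is a swapped-in technique, not the code from my experiment, and the `CtLogKeys` name is made up.

```java
import java.security.KeyFactory;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.PublicKey;
import java.security.spec.InvalidKeySpecException;
import java.security.spec.X509EncodedKeySpec;
import java.util.Base64;

final class CtLogKeys {
    // Try each algorithm until one accepts the DER-encoded SubjectPublicKeyInfo.
    static PublicKey parse(String base64Key) throws NoSuchAlgorithmException {
        byte[] der = Base64.getDecoder().decode(base64Key);
        for (String algorithm : new String[] {"EC", "RSA"}) {
            try {
                return KeyFactory.getInstance(algorithm).generatePublic(new X509EncodedKeySpec(der));
            } catch (InvalidKeySpecException e) {
                // Not a key of this algorithm; try the next one.
            }
        }
        throw new IllegalArgumentException("Unsupported public key type");
    }

    // Log ID: SHA-256 over the original DER bytes, not over a re-encoded key object.
    static byte[] logId(String base64Key) throws NoSuchAlgorithmException {
        byte[] der = Base64.getDecoder().decode(base64Key);
        return MessageDigest.getInstance("SHA-256").digest(der);
    }
}
```

Calling `getAlgorithm()` on the returned key then tells you whether a given log uses elliptic curves or RSA.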

The whole code can be found here. And yes, it uses several security providers, some odd BouncyCastle APIs and some simple implementations that are missing in Google’s provider. Known certificates may be cached so that repeated calls to the log are not performed, but that’s beyond the scope of my experiment.

Certificate transparency seems like a thing that’s core to the internet nowadays. And yet, it’s so obscure and hard to work with.

Why is the type of public key in the list not documented? (They should at least put an OID next to the public key, because as it turns out, not all logs use elliptic curves – two of them use RSA.) Probably there’s a good explanation, but why include the SCT in the log rather than the fingerprint of the certificate? Why not then mandate inclusion of the SCT in the certificate, which would require no additional configuration of the servers and clients, as opposed to including it in the TLS handshake, which does require upgrades?

As far as I know, the certificate transparency initiative is now facing scalability issues because of the millions of Let’s Encrypt certificates out there. Every log (provider) should serve the whole log to everyone that requests it. It is not a trivial thing to solve, and efforts are being put in that direction, but no obvious solution is available at the moment.

And finally, if Java doesn’t have an easy way to do that, with all the crypto libraries available, I wonder what’s the case for other languages. Do they support certificate transparency or do they need upgrades?

And maybe we’re all good because browsers support it, but browsers are not the only thing that makes HTTP requests. API calls are a massive use-case and if they can be hijacked, the damage can be even bigger than individual users being phished. So I think more effort should be put in two things: 1. improving the RFC and 2. improving the programming ecosystem. I hope this post contributes at least a little bit.

The post Certificate Transparency Verification in Java appeared first on Bozho's tech blog.

Integrating Applications As Heroku Add-Ons

Post Syndicated from Bozho original https://techblog.bozho.net/integrating-applications-as-heroku-add-ons/

Heroku is a popular Platform-as-a-Service provider and it offers vendors the option to be provided as add-ons. Add-ons can be used by Heroku customers in different ways, but a typical scenario would be “Start a database”, “Start an MQ”, or “Start a logging solution”. After you add the add-on to your account, you can connect to the chosen database, MQ, logging solution or whatever.

Integrating as a Heroku add-on is allegedly simple, and Heroku provides good documentation on how to do it. However, there are some pitfalls and so I’d like to share my experience in providing our services (Sentinel Trails and SentinelDB) as Heroku add-ons.

Both are SaaS (one is a logging solution, the other one – a cloud datastore), and so when a Heroku customer wants to add it to their account, we have to just create an account for them on our end.

In order to integrate with Heroku, you need to implement several endpoints:

  • provisioning – the initial creation of the resources (= account)
  • plan change – since Heroku supports multiple subscription plans, this should also be reflected on your end
  • deprovisioning – if a user stops using your service, you may want to free some resources
  • SSO – allows users to log in to your service by clicking an icon in the Heroku console.

Implementing these endpoints following the tutorial should be straightforward, but it isn’t exactly. Hence I’m sharing our Spring MVC controller that handles it – you can check it here.

A few important bits:

  • You may choose not to obtain a token if you don’t plan to interact with the Heroku API further.
  • We are registering the user with a fake email in the form of <resourceId>@heroku.com. However, you may choose to use the token to fetch the emails of team members and collaborators, as described here.
  • The most important piece of data is the resource_id – store it in your users (or organizations) table and consider adding an index to be able to retrieve records by it quickly.
  • Return your keys and secrets as part of the provisioning request. They will be set as environment variables in Heroku.
  • All of the requests are made from the Heroku servers to your server directly, except the SSO call. It is invoked in the browser and so you should set the session cookie/token in the response. That way the user will be logged in to your service.
  • When you generate your add-on manifest, make sure you update the endpoint URLs.

After you’re done, the alpha version appears in the marketplace (e.g. here and here). You should then have some alpha users to test the add-on before it can be visible in the marketplace.
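The bookkeeping behind the provisioning, plan change and deprovisioning endpoints can be sketched in a few lines of plain Java. This is only an in-memory illustration keyed by the resource ID, not our Spring MVC controller and not Heroku’s request or response format; the `AddonAccounts` class and the `EXAMPLE_ADDON_API_KEY` variable name are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class AddonAccounts {
    private static final class Account {
        final String apiKey;
        volatile String plan;

        Account(String plan, String apiKey) {
            this.plan = plan;
            this.apiKey = apiKey;
        }
    }

    // In a real service this would be the users (or organizations) table, indexed by resource ID.
    private final Map<String, Account> byResourceId = new ConcurrentHashMap<>();

    // Provisioning: create the account once and return the values to expose as environment variables.
    Map<String, String> provision(String resourceId, String plan) {
        Account account = byResourceId.computeIfAbsent(
                resourceId, id -> new Account(plan, UUID.randomUUID().toString()));
        return Collections.singletonMap("EXAMPLE_ADDON_API_KEY", account.apiKey);
    }

    // Plan change: returns false if the resource is unknown.
    boolean changePlan(String resourceId, String plan) {
        return byResourceId.computeIfPresent(resourceId, (id, account) -> {
            account.plan = plan;
            return account;
        }) != null;
    }

    // Deprovisioning: free the resources tied to this resource ID.
    boolean deprovision(String resourceId) {
        return byResourceId.remove(resourceId) != null;
    }

    String planOf(String resourceId) {
        Account account = byResourceId.get(resourceId);
        return account == null ? null : account.plan;
    }
}
```

Using `computeIfAbsent` makes a repeated provisioning call for the same resource ID return the same key instead of creating a second account, which seems like the safer default if a request is ever retried.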

Integrating SaaS solutions with existing cloud providers is a good thing, and I’m happy that Heroku provides an automated way to do that. AWS, for example, also has a marketplace, but integration there feels a bit strange and unpolished (I’ve hit some issues that were manually resolved by the AWS team).

Since many companies are choosing IaaS or PaaS for their services, having the ability to easily integrate an add-on service is very useful. I’d even go further and propose some level of standardization for cloud add-ons, but I guess time will tell if we really need it, or we can spare a few days per provider.

The post Integrating Applications As Heroku Add-Ons appeared first on Bozho's tech blog.

Types of Data Breaches and How To Prevent Them

Post Syndicated from Bozho original https://techblog.bozho.net/types-of-data-breaches-and-how-to-prevent-them/

Data breaches happen practically every day. Personal data, including financial and medical data, leaks to cyber criminals as well as intelligence agencies. Some notable breaches include the Equifax breach, where dozens of personal data fields were leaked, and the recent Marriott breach, where passports, credit cards and locations of people at a given time were breached.

I’ve been doing some data protection consultancy as well as working on a data protection product and decided to classify the types of data breaches and give recommendations on how they can be addressed. We don’t always get to know how exactly the breaches happen, but from what is published in news articles and post-mortems, we can have a good overview of the breach landscape.

Control over target server – if an attacker is able to connect to a target server and gains full or partial control over it, they can do anything, including running SELECT * FROM ... , copying files, etc. How do attackers gain such control? In many ways, most notably RCE (remote code execution) vulnerabilities and weak admin authentication.

How to prevent it? Follow best security practices – regularly update libraries and software to get security patches, do not run native commands from within the application layer, open only necessary ports (80 and 443) to the outside world, configure 2-factor authentication for administrator login. Aim at having an intrusion detection / prevention system. Encrypt your data, and make the encryption as granular as possible for the most sensitive data (e.g. for SentinelDB we utilize per-record encryption) to avoid SELECT * breaches.
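As a rough illustration of granular encryption, the sketch below encrypts each record with its own AES-GCM key, so that reading the whole table yields only ciphertext and compromising one key exposes one record. It is not how SentinelDB implements per-record encryption; the `RecordCrypto` name is made up, and where the per-record keys are stored and protected (for example wrapped by a master key) is deliberately left out.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class RecordCrypto {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    // One fresh key per record.
    static SecretKey newRecordKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        return generator.generateKey();
    }

    // Returns IV || ciphertext so the IV travels with the record.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] stored = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, stored, 0, iv.length);
        System.arraycopy(ciphertext, 0, stored, iv.length, ciphertext.length);
        return stored;
    }

    // Throws if the key is wrong or the stored bytes were tampered with.
    static byte[] decrypt(SecretKey key, byte[] stored) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
        return cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
    }
}
```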

SQL injections – this is a rookie mistake that unfortunately still happens. It allows attackers to manipulate your SQL queries and inject custom bits in them that allow them to extract more data than they are supposed to.

How to prevent it? Use prepared statements for your queries. Never ever concatenate user input in order to construct queries. Run regular code reviews and use code inspection tools to catch such instances.
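The difference is easy to show with plain JDBC. In the sketch below the `users` table, its columns and the `UserQueries` class are assumptions made up for the example; the first method shows what concatenation does to the SQL text, and the second binds the same input as a parameter.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class UserQueries {
    static final String FIND_BY_EMAIL = "SELECT id FROM users WHERE email = ?";

    // Vulnerable: user input becomes part of the SQL text itself.
    static String concatenated(String email) {
        return "SELECT id FROM users WHERE email = '" + email + "'";
    }

    // Safe: the SQL text is constant and the input is bound as a parameter.
    static Long findIdByEmail(Connection connection, String email) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(FIND_BY_EMAIL)) {
            statement.setString(1, email);
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next() ? rows.getLong(1) : null;
            }
        }
    }
}
```

Passing `' OR '1'='1` to `concatenated` produces a query whose WHERE clause is always true, while `findIdByEmail` treats the same string as nothing more than an odd-looking email address.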

Unencrypted backups – the main system may be well protected, but attackers are usually after the weak spots. Storing backups might be such – if you store unencrypted backups that are accessible via weak authentication (e.g. over FTP via username/password), then someone may try to attack this weaker spot. Even if the backup is encrypted, the key can be placed alongside it, which makes the encryption practically useless.

How to prevent it? Encrypt your backups, store them in a way that’s as strongly protected as your servers (e.g. 2FA, internal-network/VPN only), and have your decryption key in a hardware security module (or equivalent, e.g. AWS KMS).

Personal data in logs – another weak spot other than the backups may be your logs. They usually lie on separate servers, and are not as well guarded. That’s usually okay, since logs don’t contain personal information, but sometimes they do. I recently stumbled upon a large company’s website that had their directory structure unprotected and they kept their access log files alongside their static resources. In addition to that, they passed personal information as GET parameters, so you could get a lot of information by just getting the access logs. Needless to say, I did a responsible disclosure and the issue was fixed, but it was a potential breach.

How to prevent it? Don’t store personal information in logs. Avoid submitting forms with a GET method. Regularly review the code to check that personal data is not logged. Make sure your logs are stored in a way as protected as your production servers and your backups. It could be a cloud service, it could be a local installation of an open source package, but don’t overlook the security of the log collection system.
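Code reviews can be backed by a safety net that scrubs obvious personal data before a message reaches the log. The sketch below masks email addresses only; the `LogRedactor` name is made up, the pattern is deliberately simple rather than a complete address matcher, and it is a complement to not logging personal data in the first place, not a replacement for it.

```java
import java.util.regex.Pattern;

final class LogRedactor {
    // Deliberately simple pattern; it does not cover every valid email address.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Replace anything that looks like an email address with a fixed marker.
    static String redact(String message) {
        return EMAIL.matcher(message).replaceAll("[redacted-email]");
    }
}
```

A helper like this would typically be called from a logging filter or wrapper, so that individual log statements don’t have to remember to use it.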

Data pushed to unprotected storage – a recent Alteryx/Experian leak was just that – data placed on a (somewhat) public S3 bucket was breached. If you place personal data in weakly protected public stores (AWS S3, file sharing services, FTPs), then you are waiting for trouble to happen.

How to prevent it? Don’t store personal data in publicly accessible places. To keep that from happening, always review your S3 bucket and FTP server policies. Have internal procedures that disallow sharing personal data without protecting it with at least a password shared by a side-channel (messenger/sms).

Unrestricted API calls – that’s what caused the Facebook-Cambridge Analytica issue. No matter how secure your servers are, if you expose the data through your API without access restriction, rate-limiting, fraud-detection, audit trail, then your security is no use – someone will “scrape” your data through the API.

How to prevent it? Do not expose too much personal data over public or easily accessible APIs. Vet API users and inform your users whenever their data is being shared with third parties, via API or otherwise.

Internal actor – all of the woes above can happen due to poor security or due to internal actors. Even if your network is well guarded, an admin can go rogue and leak the data. For many reasons, including financial ones. A privileged internal actor has access to perform SELECT *, can decrypt the backups, can pretend to be a trusted API partner.

How to prevent it? Good operational security. A single sentence like that may sound easy, but it’s not. I don’t have a full list of things that have to be in place to guard against internal breaches – there are technical, organizational and legal measures to be taken. Have an unmodifiable audit trail. Have your intrusion prevention system (or logging solution) also detect anomalous internal behaviour. Have procedures that require two admins to work together in order to log in (e.g. split key) to the most sensitive systems. If the data is sensitive, do background checks on the privileged admins. And many more things that fall under the “operational security” umbrella.

Man-in-the-middle attacks – MITM can be used to extract data from active users only. It works on websites without HTTPS, or in case the attacker has somehow installed a root certificate on the target machine (and before you say that’s too unlikely – it happens way too often to be ignored). In case of a successful MITM attack, the attacker can extract all data that’s being transferred.
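One simple reading of “split key” is a two-way XOR split: the secret is divided into two shares, each admin holds one, and neither share reveals anything on its own. The sketch below shows only that arithmetic; the `SplitKey` name is made up, and schemes that tolerate a lost share (such as Shamir’s secret sharing) are out of scope here.

```java
import java.security.SecureRandom;

final class SplitKey {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Split a secret into two shares; each share alone is indistinguishable from random bytes.
    static byte[][] split(byte[] secret) {
        byte[] shareA = new byte[secret.length];
        RANDOM.nextBytes(shareA);
        byte[] shareB = xor(secret, shareA);
        return new byte[][] {shareA, shareB};
    }

    // Both shares are needed to recover the secret.
    static byte[] combine(byte[] shareA, byte[] shareB) {
        return xor(shareA, shareB);
    }

    private static byte[] xor(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Shares must have the same length");
        }
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }
}
```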

How to prevent it? First – use HTTPS. Always. Redirect HTTP to HTTPS. Use HSTS. Use certificate pinning if you control the updates of the application (e.g. through an app store). The root certificate attack unfortunately cannot be circumvented. Sorry, just hope that your users haven’t installed such shitty software. Fortunately, this won’t lead to massive breaches, only data of active users that are being targeted may leak.

JavaScript injection / XSS – if somehow an attacker can inject JavaScript into your website, they can collect data being entered. This is what happened in the recent British Airways breach. I remember a potential attack on the NSW (Australia) elections, where the Piwik analytics script was loaded from an external server that was vulnerable to a TLS downgrade attack, which allowed an attacker to replace the script and thus interfere with the election registration website.

How to prevent it? Follow the XSS protection cheat sheet by OWASP. Don’t include scripts from dodgy third party domains. Make sure third party domains, including CDNs, have a good security level (e.g. run Qualys SSL test).
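The most basic rule on that cheat sheet is to encode untrusted data before placing it in HTML. A minimal sketch of such encoding is below; the `HtmlEscaper` name is made up, it covers only the HTML element content context (attributes, JavaScript and URLs each need their own encoding), and in practice a maintained encoder library is a better choice than a hand-rolled one.

```java
final class HtmlEscaper {
    // Encode the characters that can change the meaning of HTML element content.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```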

Leaked passwords from other websites – one of the issues with incorrect storage of passwords is password reuse. Even if you store passwords properly, a random online store may not and if your users use the same email and password there, an attacker may try to steal their data from your site. Not all accounts will be compromised, but the more popular your service is, the more accounts will be affected.

How to avoid it? There’s not much you can do to make other websites store passwords correctly. But you can encourage the use of passphrases, you can encourage 2-factor authentication in case of sensitive data, or you can avoid having passwords at all and use an external OAuth/OpenID provider (this has its own issues, but they may be smaller than those of password reuse). Also have some rate-limiting in place so that a single IP (or an IP range) is not able to try and access many accounts consecutively.

Employees sending emails with unprotected Excel sheets – especially non-technical organizations and non-technical employees tend to just want to get their job done, so they may send large Excel sheets with personal data to colleagues or partners in other companies. Then once someone’s email account or server is breached, the data gets breached as well.

How to prevent it? Have internal procedures against sending personal data in Excel sheets, or at least have people zip them and send passwords through a side channel (messenger/sms). You can have organization-wide software that scans outgoing emails for attachments with Excel sheets that contain personal data and have these emails blocked.
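The rate-limiting part can be as simple as a fixed-window counter per IP. The sketch below is one minimal way to do it, with the current time passed in so the behaviour is easy to test; the `LoginRateLimiter` name and the limits are assumptions, and a production setup would also need to evict old entries and probably share state across servers.

```java
import java.util.HashMap;
import java.util.Map;

final class LoginRateLimiter {
    private final int maxAttempts;
    private final long windowMillis;
    // ip -> {window start time, attempts in this window}
    private final Map<String, long[]> windows = new HashMap<>();

    LoginRateLimiter(int maxAttempts, long windowMillis) {
        this.maxAttempts = maxAttempts;
        this.windowMillis = windowMillis;
    }

    // Returns true if this login attempt is allowed, false if the IP exceeded its quota.
    synchronized boolean allow(String ip, long nowMillis) {
        long[] state = windows.get(ip);
        if (state == null || nowMillis - state[0] >= windowMillis) {
            state = new long[] {nowMillis, 0};
            windows.put(ip, state);
        }
        if (state[1] >= maxAttempts) {
            return false;
        }
        state[1]++;
        return true;
    }
}
```

In a login handler this would be called as `limiter.allow(remoteIp, System.currentTimeMillis())` before the password is even checked.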

Data breaches are prevented by having good information security. And information security is hard. And it’s the right combination of security practices and security products that minimize the risk of incidents. Many organizations choose not to focus on infosec, as it’s not their core business or they estimate that the risk is worth it, viewing breaches, internal actors manipulating data and other incidents as something that can’t happen to them. Until it happens.

The post Types of Data Breaches and How To Prevent Them appeared first on Bozho's tech blog.

Resources on Distributed Hash Tables

Post Syndicated from Bozho original https://techblog.bozho.net/resources-on-distributed-hash-tables/

Distributed p2p technologies have always been fascinating to me. Bittorrent is cool not because you can download pirated content for free, but because it’s an amazing piece of technology.

At some point I read and researched a lot about how DHTs (distributed hash tables) work. DHTs are not part of the original bittorrent protocol, but after trackers were increasingly under threat to be closed for copyright infringment, “trackerless” features were added to the protocol. A DHT is distributed among all peers and holds information about which peer holds what data. Once you are connected to a peer, you can query it for their knowledge on who has what.

During my research (which had no particular purpose) I took note of many resources that I thought useful for understanding how DHTs work and possibly implementing something on top of them in the future. In fact, a DHT is a “shared database”, “just like” a blockchain. You can’t trust it as much, but proving digital events does not require a blockchain anyway. My point here is – there is a lot more cool stuff to distributed / p2p systems than blockchain. And maybe way more practical stuff.

It’s important to note that the DHT used in BitTorrent is Kademlia. You’ll see a lot about it below.
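For a first feel of how Kademlia decides which peers to ask, here is a minimal sketch of its XOR distance metric: the distance between two IDs is their bitwise XOR read as an unsigned number, and a lookup moves towards the nodes whose IDs are closest to the key. The class and method names are hypothetical, and the IDs in the usage note are tiny for readability (the BitTorrent DHT uses 160-bit IDs).

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of Kademlia's XOR metric; not a full DHT implementation.
class KademliaSketch {

    // Distance between two IDs is their bitwise XOR, interpreted as an unsigned integer.
    static BigInteger distance(BigInteger a, BigInteger b) {
        return a.xor(b);
    }

    // Returns up to k node IDs closest to the target key, nearest first.
    static List<BigInteger> closest(List<BigInteger> nodeIds, BigInteger target, int k) {
        List<BigInteger> sorted = new ArrayList<>(nodeIds);
        sorted.sort(Comparator.comparing((BigInteger id) -> distance(id, target)));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }
}
```

With known node IDs 1, 4, 7 and 12 and a target key of 5, the XOR distances are 4, 1, 2 and 9, so closest(..., 2) returns the nodes 4 and 7. A real node would then query those peers for even closer ones and repeat.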

Anyway, the point of this post is to share the resources that I collected. For my own reference and for everyone who wants to start somewhere on the topic of DHTs.

I hope the list is interesting and useful. It’s not trivial to think of other uses of DHTs, but simply knowing about them and how they work is a good thing.

The post Resources on Distributed Hash Tables appeared first on Bozho's tech blog.

Automate Access Control for User-Specific Entities

Post Syndicated from Bozho original https://techblog.bozho.net/automate-access-control-for-user-specific-entities/

Practically every web application is supposed to have multiple users and each user has some data – posts, documents, messages, whatever. And the most obvious thing to do is to protect these entities from being obtained by users that are not the rightful owners of these resources.

Unfortunately, this is not the easiest thing to do. I don’t mean it’s hard, it’s just not as intuitive as simply returning the resources. When you are writing your /record/{recordId} endpoint, a database query for the recordId is the immediate thing you do. Only then comes the concern of checking whether this record belongs to the currently authenticated user.

Frameworks don’t give you a hand here, because this access control and ownership logic is domain-specific. There’s no obvious generic way to define the ownership. It depends on the entity model and the relationships between entities. In some cases it can be pretty complex, involving a lookup in a join table (for many-to-many relationships).

But you should automate this, for two reasons. First, manually doing these checks on every endpoint/controller method is tedious and makes the code ugly. Second, it’s easy to forget to add these checks, especially if there are new developers.

You can do these checks in several places, all the way to the DAO, but in general you should fail as early as possible, so these checks should be on a controller (endpoint handler) level. In the case of Java and Spring, you can use annotations and a HandlerInterceptor to automate this. In case of any other language or framework, there are similar approaches available – some pluggable way to describe the ownership relationship to be checked.

Below is an example annotation to put on each controller method:

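@Retention(RetentionPolicy.RUNTIME) // java.lang.annotation; needed so the interceptor can read it at runtime
@Target(ElementType.METHOD)
public @interface VerifyEntityOwnership {
    String entityIdParam() default "id";
    Class<?> entityType();
}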

Then you define the interceptor (which, of course, should be configured to be executed):

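import java.util.Map;
import java.util.UUID;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.web.method.HandlerMethod;
import org.springframework.web.servlet.HandlerMapping;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

// On Spring 5.3+ the adapter is deprecated; implement HandlerInterceptor directly instead
public class VerifyEntityOwnershipInterceptor extends HandlerInterceptorAdapter {

    private static final Logger logger = LoggerFactory.getLogger(VerifyEntityOwnershipInterceptor.class);

    @Autowired
    private OrganizationService organizationService;
    @Autowired
    private MessageService messageService;
    @Autowired
    private UserService userService;

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        // static resources and other non-controller handlers carry no annotation to check
        if (!(handler instanceof HandlerMethod)) {
            return true;
        }
        Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
        // assuming spring-security with a custom authentication token type
        if (authentication instanceof ApiAuthenticationToken) {
            AuthenticationData authenticationData = ((ApiAuthenticationToken) authentication).getAuthenticationData();

            UUID clientId = authenticationData.getClientId();
            HandlerMethod handlerMethod = (HandlerMethod) handler;
            VerifyEntityOwnership annotation = handlerMethod.getMethodAnnotation(VerifyEntityOwnership.class);
            if (annotation == null) {
                logger.warn("No VerifyEntityOwnership annotation found on method {}", handlerMethod.getMethod().getName());
                return true;
            }
            String entityId = getParam(request, annotation.entityIdParam());
            if (entityId != null) {
                if (annotation.entityType() == User.class) {
                    User user = userService.get(entityId);
                    if (user == null || !user.getClientId().equals(clientId)) {
                        // returning false alone would send an empty 200 response
                        response.setStatus(HttpServletResponse.SC_FORBIDDEN);
                        return false;
                    }
                } else if (annotation.entityType() == Message.class) {
                    Message message = messageService.get(entityId);
                    if (message == null || !message.getClientId().equals(clientId)) {
                        response.setStatus(HttpServletResponse.SC_FORBIDDEN);
                        return false;
                    }
                } // .... more
            }
        }
        return true;
    }

    // looks for the ID first in the request parameters, then in the path variables
    @SuppressWarnings("unchecked")
    private String getParam(HttpServletRequest request, String paramName) {
        String value = request.getParameter(paramName);
        if (value != null) {
            return value;
        }
        Map<String, String> pathVariables = (Map<String, String>) request.getAttribute(HandlerMapping.URI_TEMPLATE_VARIABLES_ATTRIBUTE);
        return pathVariables != null ? pathVariables.get(paramName) : null;
    }
}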
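
The interceptor only takes effect once it is registered with Spring MVC. A minimal sketch of that registration, assuming the interceptor is itself a Spring bean (for example annotated with @Component); the WebConfig class name and the "/**" path pattern are assumptions, not taken from the original project:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

// Assumed configuration class; the name and the path pattern are illustrative.
@Configuration
public class WebConfig implements WebMvcConfigurer {

    private final VerifyEntityOwnershipInterceptor ownershipInterceptor;

    public WebConfig(VerifyEntityOwnershipInterceptor ownershipInterceptor) {
        this.ownershipInterceptor = ownershipInterceptor;
    }

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Run the ownership check before every matching controller method.
        registry.addInterceptor(ownershipInterceptor).addPathPatterns("/**");
    }
}
```

Narrowing the path pattern to the API endpoints keeps the "no annotation" warning from firing for unrelated handlers.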

You see that this presumes the need for custom logic per type. If your model is simple, you can make that generic – make all your entities implement some Owned interface with a getClientId() method. Then simply have a dao.get(id, entityClass); and avoid having entity-specific logic.
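
A minimal sketch of that generic variant could look like the following. Only the Owned interface, getClientId() and the dao.get(id, entityClass) shape come from the text above; EntityDao, InMemoryDao and isOwnedBy are hypothetical names, and the in-memory DAO exists only to make the sketch runnable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of a type-independent ownership check.
class OwnershipSketch {

    // Every user-specific entity exposes the client that owns it.
    interface Owned {
        UUID getClientId();
    }

    // Stand-in for a real DAO; returns null when no entity of that type has the ID.
    interface EntityDao {
        <T extends Owned> T get(String id, Class<T> entityClass);
    }

    static class Message implements Owned {
        private final UUID clientId;
        Message(UUID clientId) { this.clientId = clientId; }
        @Override public UUID getClientId() { return clientId; }
    }

    // In-memory DAO used only to exercise the check.
    static class InMemoryDao implements EntityDao {
        private final Map<String, Owned> store = new HashMap<>();
        void put(String id, Owned entity) { store.put(id, entity); }
        @Override
        public <T extends Owned> T get(String id, Class<T> entityClass) {
            Owned entity = store.get(id);
            return entityClass.isInstance(entity) ? entityClass.cast(entity) : null;
        }
    }

    // One check for all entity types; a missing entity is treated as not owned.
    static boolean isOwnedBy(EntityDao dao, Class<? extends Owned> type, String entityId, UUID clientId) {
        Owned entity = dao.get(entityId, type);
        return entity != null && clientId.equals(entity.getClientId());
    }
}
```

With this in place the if/else chain in the interceptor collapses into a single isOwnedBy call, provided the annotation's entityType() is declared as Class<? extends Owned>.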

Notice the warning that gets printed when there is no annotation on a method – this is there to indicate that you might have forgotten to add one. Some endpoints may not require ownership check – for them you can have a special @IgnoreEntityOwnership annotation. The point is to make a conscious decision to not verify the ownership, rather than to forget about it and introduce a security issue.
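
The resulting three-way decision can be sketched as follows. The two annotation names come from the text; the Decision enum, the decide methods and the demo controller are hypothetical and only illustrate the idea that a method with neither annotation is treated as a probable omission.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Sketch of the verify / consciously skip / forgotten decision.
class OwnershipDecisionSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface VerifyEntityOwnership {
        String entityIdParam() default "id";
        Class<?> entityType();
    }

    // Marker: the developer consciously decided that no ownership check is needed.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IgnoreEntityOwnership {
    }

    enum Decision { VERIFY, SKIP, UNDECIDED }

    static Decision decide(Method method) {
        if (method.isAnnotationPresent(VerifyEntityOwnership.class)) {
            return Decision.VERIFY;
        }
        if (method.isAnnotationPresent(IgnoreEntityOwnership.class)) {
            return Decision.SKIP;
        }
        // Neither annotation: probably an omission, so the caller should log a warning.
        return Decision.UNDECIDED;
    }

    // Convenience lookup by name for the demo; returns null if no such method exists.
    static Decision decide(Class<?> controller, String methodName) {
        for (Method m : controller.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                return decide(m);
            }
        }
        return null;
    }

    static class Message { }

    // Example controller methods used only to exercise decide().
    static class DemoController {
        @VerifyEntityOwnership(entityType = Message.class)
        public void getMessage() { }

        @IgnoreEntityOwnership
        public void health() { }

        public void forgotten() { }
    }
}
```

A stricter setup could run the same check over all controller methods in a test and fail the build on any UNDECIDED result instead of only logging a warning.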

What I’m saying might be obvious. But I’ve seen many examples of this omission, including production government projects. And as I said, frameworks don’t force you to consider that aspect, because they can’t do it in a generic way – web frameworks are usually not concerned with your entity model, and your ORM is not concerned with your controllers. There are comprehensive frameworks that handle all of these aspects, but even they don’t have generic mechanisms for that (at least not that I’m aware of).

Security includes applying a set of good practices and principles to a system. But it also includes procedures and automations that help developers and admins avoid omitting something that they are generally aware of, but happen to forget every now and then. And the less tedious a security principle is to apply, the more likely it is to be applied consistently.

The post Automate Access Control for User-Specific Entities appeared first on Bozho's tech blog.