Tag Archives: Developer tips

Restoring Cassandra Priam Backup With sstableloader

Post Syndicated from Bozho original https://techblog.bozho.net/restoring-cassandra-priam-backup-with-sstableloader/

I’ve previously written about setting up Cassandra and Priam for backup and cluster management. The example that I gave for backup restore there, however, is not applicable in every situation – it may not work on a completely separate cluster, for example. Or in case of partial restore to just one table, rather than the whole database.

In such cases you may choose to do a restore using the sstableloader utility. It has a straightforward syntax:

sudo sstableloader -d, -ts /etc/cassandra/conf/truststore.jks \
   -ks /etc/cassandra/conf/node.jks -f /etc/cassandra/conf/cassandra.yaml  \

If you look at your Priam-generated backup, it looks like you can just copy the files (e.g. via s3 aws cp on AWS) for the particular tables and sstableloader import them. There’s a catch, however. In order to save space, Priam is using Snappy to compress all of the files. So if you try to feed them to any Cassandra utility, it will complain that they are corrupted.

So you have to decompress them before using sstableloader or anything else. But how? Well, Priam offers a service for that – you call it by passing the absolute path to a compressed file and the absolute path to where the uncompressed should be placed and it does the simple job of streaming the original through a decompressor. For decompressing an entire backup, I’ve written a python script. It assumes a certain structure, but you can parameterize it to make it more flexible. Here’s the code (excuse my non-idiomatic Python, I’m only using it for simple scripting):

#! /usr/bin/env python
# python script used to pass each backup file through the decompression facility of Priam (using Snappy)
# so that it can be used with sstableloader for restore
import os
import requests

rootdir = '/home/ec2-user/backup'
target = '/home/ec2-user/keyspace'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        fullpath = os.path.join(subdir, file)
        parent = os.path.join(fullpath, os.pardir)
        table = os.path.basename(os.path.abspath(parent))
        targetdir = target + "/" + table + "/"
        if not os.path.exists(targetdir):

        url = 'http://localhost:8080/Priam/REST/v1/cassadmin/decompress?in=' + fullpath + '&out=' + target + "/" + table + "/" + file

Now you have decompressed backup files that you can restore using sstableloader. It may take some time if you have a lot of data, and you should not run the restore at the same time a snapshot backup is performed, as it may fail (was warned by the documentation)

As a general note here, it’s very important to have backups but it’s much more important to be able to restore from them. A backup is useless if you don’t have a restore procedure. And simply having the tools available (e.g. Priam) doesn’t mean you can a restore procedure ready to execute. You should be doing test restores on active staging data as well as full restores on an empty, newly formed cluster, as there are different restore scenarios.

The post Restoring Cassandra Priam Backup With sstableloader appeared first on Bozho's tech blog.

The Personal Data Store Pattern

Post Syndicated from Bozho original https://techblog.bozho.net/the-personal-data-store-pattern/

With the recent trend towards data protection and privacy, as well as the requirements of data protection regulations like GDPR and CCPA, some organizations are trying to reorganize their personal data so that it has a higher level of protection.

One path that I’ve seen organizations take is to apply the (what I call) “Personal data store” pattern. That is, to extract all personal data from existing systems and store it in a single place, where it’s accessible via APIs (or in some cases directly through the database). The personal data store is well guarded, audited, has proper audit trail and anomaly detection, and offers privacy-preserving features.

It makes sense to focus one’s data protection efforts predominantly in one place rather than scatter it across dozens of systems. Of course it’s far from trivial to migrate so much data from legacy systems to a new module and then upgrade them to still be able to request and use it when needed. That’s why in some cases the pattern is applied only to sensitive data – medical, biometric, credit cards, etc.

For the sake of completeness, there’s something else called “personal data stores” and it means an architecture where the users themselves store their own data in order to be in control. While this is nice in theory, in practice very few users have the capacity to do so, and while I admire the Solid project, for example, I don’t think it is viable pattern for many organizations, as in many cases users don’t directly interact with the company, but the company still processes large amounts of their personal data.

So, the personal data store pattern is an architectural approach to personal data protection. It can be implemented as a “personal data microservice”, with CRUD operations on predefined data entities, an external service can be used (e.g. SentinelDB, a project of mine), or it can just be a centralized database that has some proxy in front of it to control the access patterns. You an imagine it as externalizing your application’s “users” table and its related tables.

It sounds a little bit like a data warehouse for personal data, but the major difference is that it’s used for operational data, rather than (just) analysis and reporting. All (or most) of your other applications/microservices interact constantly with the personal data store whenever they need to access or update (or “forget”) personal data.

Some of the main features of such a personal data store, the combination of which protect against data breaches, in my view, include:

  • Easy to use interface (e.g. RESTful web services or simply SQL) – systems that integrate with the personal data store should be built in a way that a simple DAO layer implementation gets swapped and then data that was previously accessed form a local database is now obtained from the personal data store. This is not always easy, as ORM technologies add a layer of complexity.
  • High level of general security – servers protected with 2FA, access control, segregated networks, restricted physical access, firewalls, intrusion prevention systems, etc. The good things is that it’s easier to apply all the best practices applied to a single system instead of applying it (and keeping it that way) to every system.
  • Encryption – but not just “data at rest” encryption; especially sensitive data can and should be encrypted with well protected and rotated keys. That way the “honest but curious” admin won’t be able to extract anything form the underlying database
  • Audit trail – all infosec and data protection standards and regulations focus on accountability and traceability. There should not be a way to extract or modify personal data without leaving a trace (and ideally, that trace should be protected as well)
  • Anomaly detection – checking if there is something strange/anomalous in the data access patterns. Such strange access patterns can mean a data breach is happening, and the personal data store can actively block it. There is a lot of software out there that does anomaly detection on network traffic, but it’s much better if the rules (or machine learning) are domain-specific. “Monitor for increased traffic to those servers” is one thing, but it’s much better to be able to say “monitor for out-of-the ordinary accesses to personal data of such and such kind”
  • Pseudonymization – many systems that need the personal data don’t actually need to know who it is about. That includes marketing, including outsourcing to 3rd parties, reporting functionalities, etc. So the personal data store can return data that does not allow a person do be identified, but a pseudo-ID instead. That way, when updates are made back to the personal data store, they can still refer to a particular person, via the pseudonymous ID, but the application that extracted the data in the first place doesn’t get to know who the data was about. This is useful in scenarios where data has to be (temporarily or not) stored in a database that lies outside the personal datastore.
  • Authentication – if the company offers user authentication, this can be done via the personal data store. Passwords, two-factor authentication secrets and other means of authentication are personal data, and an important one as well. An organization may use a single-sign-on internally (e.g. Active Directory), but it doesn’t make sense to put customers there, too, so they are usually stored in a database. During authentication, the personal data store accepts all necessary credentials (username, password, 2FA code), and return a token to be used for subsequent calls or to be used a a session cookie token.
  • GDPR (or CCPA or similar) functionalities – e.g. export of all data about a person, forgetting a person. That’s an often overlooked problem, but “give me all data about me that you have” is an enormous issue with large companies that have dozens of systems. It’s next to impossible to extract the data in a sensible way from all the systems. Tracking all these requests is itself a requirement, so the personal data store can keep track of them to present to auditors if needed.

That’s all easier said than done. In organizations that have already many systems working alongside and processing personal data, migration can be costly. So it’s a good idea to introduce it as early as possible, and have a plan (even if it lasts for years) to move at least sensitive personal data to the well protected silo. This silo is a data engineering effort, a system refactoring effort and an organizational effort. The benefits, though, are reduced long-term cost and reduced risks for data breaches and non-compliance.

The post The Personal Data Store Pattern appeared first on Bozho's tech blog.

Remote Log Collection on Windows

Post Syndicated from Bozho original https://techblog.bozho.net/remote-log-collection-on-windows/

Every organization needs to collect logs from multiple sources in order to put them in either a log collector or SIEM (or a dedicated audit trail solution). And there are two options for that – using an agent and agentless.

Using an agent is easy – you install a piece of software on each machine that generates logs and it forwards them wherever needed. This is however not preferred by many organizations as it complicates things – upgrading to new versions, keeping track of dozens of configurations, and potentially impacting performance of the target machines.

So some organizations prefer to collect logs remotely, or use standard tooling, already present on the target machine. For Linux that’s typically syslog, where forwarding is configured. Logs can also be read remotely via SCP/SSH.

However, on Windows things are less straightforward. You need to access the Windows Event Log facility remotely, but there is barely a single place that describes all the required steps. This blogpost comes close, but I’d like to provide the full steps, as there are many, many things that one may miss. It is a best practice to use a non-admin, service account for that and you have to give multiple permissions to allow reading the event logs remotely.

There are also multiple ways to read the logs remotely:

  • Through the Event Viewer UI – it’s the simplest to get right, as only one domain group is required for access
  • Through Win32 native API calls (and DCOM) – i.e. EvtOpenSession and the related methods
  • Through PowerShell Get-WinEvent (Get-EventLog is a legacy cmdlet that doesn’t support remoting)
  • Through WMI directly (e.g. this or this. To be honest, I don’t know whether the native calls and the powershell commands don’t use WMI and/or CIM underneath as well – probably.

So, in order to get these options running, the following configurations have to be done:

  1. Allow the necessary network connections to the target machines (through network rules and firewall rules, if applicable)
  2. Go to Windows Firewall -> Inbound rules and enable the rules regarding “Remote log management”
  3. Create a service account and configure it in the remote collector. The other option is to have an account on the collector machine that is given the proper access, so that you can use the integrated AD authentication
  4. Add the account to the following domain groups: Event log readers, Distributed COM users. The linked article above mentions “Remote management users” as well, but that’s optional if you just want to read the logs
  5. Give the “Manage auditing and security log” privilege to the service account through group policies (GPO) or via “local security policy”. Find it under User Rights Assignment > Manage auditing and security log
  6. Give WMI access – open “wmimgmt” -> right click -> properties > Security -> Advanced and allow the service account to “Execute Methods”, “Provider Write”, “Enable Account”, “Remote Enable”. To be honest, I’m not sure exactly which folder that should be applied to, and applying it to the root may be too wide, so you’d have to experiment
  7. Give registry permissions: Regedit -> Local machine -> System\CurrentControlSet\Services\eventlog\Security -> right click -> permissions and add the service account. According to the linked post you also have to modify a particular registry entry, but that’s not required just for reading the log. This step is probably the most bizarre and unexpected one.
  8. Make sure you have DCOM rights. This comes automatically wit the DCOM group, but double check via DCOMCnfg -> right click -> COM security
  9. Grant permissions for the service account on c:\windows\system32\winevt. This step is not required for “simple” reading of the logs, but I’ve seen it in various places, so in some scenarios you might need to check it
  10. Make sure the application or service that is reading the logs remotely has sufficient permissions – it can usually run with admin privileges, because it’s on a separate, dedicated machine.
  11. Restart services – that is optional, but can be done just in case: Restart “Windows Remote Management (WS-Management)” and “Windows Event Log” on the target machine

As you can see, there are many things that you can miss, and there isn’t a single place in any documentation to list those steps (though there are good guides like this that go in a slightly different direction).

I can’t but make a high-level observation here – the need to do everything above is an example of how security measures can “explode” and become really hard to manage. There are many service, groups, privileges, policies, inbound rules and whatnot, instead of just “Allow remote log reading for this user”. I know it’s inherently complex, but maybe security products should make things simpler by providing recipes for typical scenarios. Following guides in some blog is definitely worse than running a predefined set of commands. And running the “Allow remote access to event log” recipe would do just what you need. Of course, knowing which recipe to run and how to parameterize it would require specific knowledge, but you can’t do security without trained experts.

The post Remote Log Collection on Windows appeared first on Bozho's tech blog.

Protecting JavaScript Files (From Magecart-Style Attacks)

Post Syndicated from Bozho original https://techblog.bozho.net/protecting-javascript-files-from-magecart-attacks/

Most web pages now consist of multiple JavaScript files that are included in various ways (via >script< or in some more dynamic fashion, bundled and minified or not). But since these scripts interact with everything on the page, they can be a security risk.

And Magecart showcased that risk – the group attacked multiple websites, including British Airways and Ticketmaster, and stole a few hundred thousand credit card numbers.

It is a simple attack where attacker inserts a malicious javascript snippet into a trusted javascript file, collects credit card details entered into payment forms and sends them to an attacker-owned website. Obviously, the easy part is writing the malicious javascript; the hard part is getting it on the target website.

Many websites rely on externally hosted assets (including scripts) – be it a CDN, or a dedicated asset server (as in the case of British Airways). These externally hosted assets may be vulnerable in several ways:

  • Asset servers may be less protected than the actual server, because they are just static assets, what could go wrong?
  • Credentials to access CDN configuration may be leaked which can lead to an attacker replacing the original source scripts with their own
  • Man-in-the-middle attacks are possible if the asset server is misconfigured (e.g. allowing TLS downgrade attack)
  • The external service (e.g. CND) that was previously trusted can go rogue – that’s unlikely with big providers, but smaller and cheaper ones are less predictable

Once the attackers have replaced the script, they are silently collecting data until they are caught. And this can be a long time.

So how to protect against those attacks? A typical advice is to introduce a content security policy, which will allow scripts from untrusted domains to be executed. This is a good idea, but doesn’t help in the scenario where a trusted domain is compromised. There are several main approaches, and I’ll summarize them below:

  • Subresource integrity – this is a browser feature that lets you specify the hash of a script file and validates that hash when the page loads. If it doesn’t match the hash of the actually loaded script, the script is blocked. This sounds great, but has several practical implications. First, it means you need to complicate your build pipeline so that it calculates the hashes of minified and bundled resources and inject those hashes in the page templates. It’s a tedious process, but it’s doable. Then there are the dynamically loaded scripts where you can’t use this feature, and there are the browsers that don’t support it fully (Edge, IE and Safari on mobile). And finally, if you don’t have a good build pipeline (which many small websites don’t), a very small legitimate change in the script can break your entire website.
  • Don’t use external services – that sounds straightforward but it isn’t always. CDNs exist for a reason and optimize your site loading speeds and therefore ranking, internal policies may require using a dedicated asset server, sometimes plugins (e.g. for WordPress) may fetch external resources. An exception to this rule is allowed if you somehow sandbox the third party script (e.g. via iframe as explained in the link above)
  • Secure all external servers properly – if you can do that, that’s great – upgrade the supported cipher suites, monitor for 0days, use only highly trusted CDNs. Regardless of anything, you should obviously always strive to do that. But it requires expertise and resources, which may not be available to every company and every team.

There is one more scenario that may sound strange – if an attacker hacks into your main application server(s), they can replace the scripts with whatever they want. It sounds strange at first, because if they have access to the server, it’s game over anyway. But it’s not always full access with RCE – might be a limited access. Credit card numbers are usually not stored in plain text in the database, so having access to the application server may not mean you have access to the credit card numbers. And changing the custom backend code to collect the data is much more unpredictable and time-consuming than just replacing the scripts with malicious ones. None of the options above protect against that (as in this case the attacker may be able to change the expected hash for the subresource integrity check)

Because of the limitations of the above approaches, at my company we decided to provide a tool to monitor your website for such attacks. It’s called Scriptinel.com (short for Script Sentinel) and is currently in early beta. It’s mainly targeted at small website owners who can’t get any of the above 3 points, but can be used for sophisticated websites as well.

What it does is straightforward – it scans a given URL, extracts all scripts from it (even the dynamic ones), and starts monitoring them for changes with periodic requests. If it discovers a change, it notifies the website owner so that they can react.

This means that the attacker may have a few minutes to collect data, but time is an important factor here – this is not a “SELECT *” data breach; it relies on customers using the website. So a few minutes minimizes the damage. And it doesn’t break your website (I guess we can have a script to include that blocks the page if scriptinel has found discrepancies). It also doesn’t require changes in the build process to include hashes. Of course, such a reactive approach is not perfect, especially if there is nobody to react, but monitoring is a good idea regardless of whether other approaches are used.

There is the issue of protected pages and pages that are not directly accessible via a GET request – e.g. a payment page. For that you can enter your javascript files individually, rather than having the tool scan the page. We can add a more sophisticated user journey scan, with specifying credentials and steps to reach the protected pages, but for now that seems unnecessary.

How does it solve the “main server compromised” problem? Well, nothing solves that perfectly, as the attacker can do changes that serve the legitimate version of the script to your monitoring servers (identifying them by IP) and the modified scripts to everyone else. This can be done on the compromised external asset servers as well (though not with leaked CDN credentials). However this implies the attacker knows that Scriptinel is used, knows the IP addresses of our scanners, and has gained sufficient control to server different versions based on IP. This raises the bar significantly, and can even be made impossible to pull off if we regularly change the IP addresses in a significantly large IP range.

Such functionality may be available in some enterprise security suites, though I’m not aware of it (if it exists somewhere, please let me know).

Overall, the problem is niche, but tough, and not solving it can lead to serious data breaches even if your database is perfectly protected. Scriptinel is a simple to use, good enough solution (and one that’s arguably better than the other options).

Good information security is the right combination of knowledge, implementation of best practices and tools to help you with that. And I maybe Scriptinel is one such tool.

The post Protecting JavaScript Files (From Magecart-Style Attacks) appeared first on Bozho's tech blog.

Let’s Annotate Our Methods With The Features They Implement

Post Syndicated from Bozho original https://techblog.bozho.net/lets-annotate-our-methods-with-the-features-they-implement/

Writing software consists of very little actual “writing”, and much more thinking, designing, reading, “digging”, analyzing, debugging, refactoring, aligning and meeting others.

The reading and digging part is where you try to understand what has been implemented before, why it has been implemented, and how it works. In larger projects it becomes increasingly hard to find what is happening and why – there are so many classes that interfere, and so many methods participate in implementing a particular feature.

That’s probably because there is a mismatch between the programming units (classes, methods) and the business logic units (features). Product owners want a “password reset” feature, and they don’t care if it’s done using framework configuration, custom code split in three classes, or one monolithic controller method that does that job.

This mismatch is partially addressed by the so called BDD (behaviour driven development), as business people can define scenarios in a formalized language (although they rarely do, it’s still up to the QAs or developers to write the tests). But having your tests organized around features and behaviours doesn’t mean the code is, and BDD doesn’t help in making your way through the codebase in search of why and how something is implemented.

Another issue is linking a piece of code to the issue tracking system. Source control conventions and hooks allow for setting the issue tracker number as part of the commit, and then when browsing the code, you can annotate the file and see the issue number. However, due the the many changes, even a very strict team will end up methods that are related to multiple issues and you can’t easily tell which is the proper one.

Yet another issue with the lack of a “feature” unit in programming languages is that you can’t trivially reuse existing projects to start a new one. We’ve all been there – you have a similar project and you want to get a skeleton to get thing running faster. And while there are many tools to help that (Spring Boot, Spring Roo, and other scaffolding utilities), they can rarely deliver what you need – you always have to tweak something, delete something, customize some configuration, as defaults are almost never practical.

And I have a simple proposal that will help with the issues above. As with any complex problem, simple ideas don’t solve everything, but are at least a step forward.

The proposal is in the title – let’s annotate our methods with the features they implement. Let’s have @Feature(name = "Forgotten password", issueTrackerCode="PROJ-123"). A method can implement multiple features, but that is generally discouraged by best practices (e.g. the single responsibility principle). The granularity of “feature” is something that has to be determined by each team and is the tricky part – sometimes an epic describes a feature, sometimes individual stories or even subtasks do. A definition of a feature should be agreed upon and every new team member should be told what to do and how to interpret it.

There is of course a lot of complexity, e.g. for generic methods like DAO methods, utility methods, or methods that are reused in too many places. But they also represent features, it’s just that these features are horizontal. “Data access layer” is a feature – a more technical one indeed, but it counts, and maybe deserves a story in the issue tracker.

Your features can actually be listed in one or several enums, grouped by type – business, horizontal, performance, etc. That way you can even compose features – e.g. account creation contains the logic itself, database access, a security layer.

How does such a proposal help?

  • Consciousnesses about the single responsibility of methods and that code should be readable
  • Provides a rationale for the existence of each method. Even if a proper comment is missing, the annotation will put a method (or a class) in context
  • Helps navigating code and fixing issues (if you can see all places where a feature is implemented, you are more likely to spot an issue)
  • Allows tools to analyze your features – amount, complexity, how chaotic a feature is spread across the code base, test coverage per feature, etc.
  • Allows tools to use existing projects for scaffolding for new ones – you specify the features you want to have, and they are automatically copied

At this point I’m supposed to give a link to a GitHub project for a feature annotation library. But it doesn’t make sense to have a single-annotation project. It can easily be part of guava or something similar Or can be manually created in each project. The complex part – the tools that will do the scanning and analysis, deserve separate projects, but unfortunately I don’t have time to write one.

But even without the tools, the concept of annotating methods with their high-level features is I think a useful one. Instead of trying to deduce why is this method here and what requirements does it have to implement (and were all necessary tests written at the time), such an annotation can come handy.

The post Let’s Annotate Our Methods With The Features They Implement appeared first on Bozho's tech blog.

JKS: Extending a Self-Signed Certificate

Post Syndicated from Bozho original https://techblog.bozho.net/jks-extending-a-self-signed-certificate/

Sometimes you don’t have a PKI in place but you still need a key and a corresponding certificate to sign stuff (outside of the TLS context). And after the certificate in initially generated jks file expires, you have few options – either generate an entirely new keypair, or somehow “extend” the existing certificate. This is useful mostly for testing and internal systems, but still worth mentioning.

Extending certificates is generally not possible – once they expire, they’re done. However, you can have a new certificate with the same private key and a longer period. This sounds like something that should be easy to do, but it turns it it isn’t that easy with keytool. Even with my favourite tool, keystore explorer, it’s not immediately possible.

In order to reuse the private key to have a new, longer certificate, you need to do the following:

  1. Export the private key (with keytool & openssl or through the keystore-explorer UI, which is much simpler)
  2. Make a certificate signing request (with keytool or through the keystore-explorer UI)
  3. Sign the request with the private key (i.e. self-signed)
  4. Import the certificate in the store to replace the old (expired) one

The last two steps seem to be not straightforward with keytool or keystore exporer. If you try to sign the request with your existing keystore keypair, the current certificate is used as the root of the chain (and you don’t want that). And you can’t remove the certificate and generate a new one.

So you need to use OpenSSL:

x509 -req -days 3650 -in req.csr -signkey private.key -sha256 -extfile extfile.cnf -out result.crt

The extfile.cnf is optional and is used if you want to specify extensions. E.g. for timestamping, the extension file looks like this:


After that “simply” create a new keystore and import the private key and the newly generated certificate. This is straightforward through the keystore-explorer UI, and much less easy through the command line.

You’ve noticed my preference for keytool-explorer. It is a great tool that makes working with keys and keystores easy and predictable, as opposed to command-line tools like keytool and openssl, which I’m sure nobody is able to use without googling every single command. Of course, if you have to do very specific or odd stuff, you’ll have to revert to command line, but for most operations the UI is sufficient (unless you have to automate it, in which case, obviously, use the CLI).

You’d rarely need to do what I’ve shown above, but in case you have to, I hope the hints above were useful.

The post JKS: Extending a Self-Signed Certificate appeared first on Bozho's tech blog.

Multiple Cache Configurations with Caffeine and Spring Boot

Post Syndicated from Bozho original https://techblog.bozho.net/multiple-cache-configurations-with-caffeine-and-spring-boot/

Caching is key for performance of nearly every application. Distributed caching is sometimes needed, but not always. In many cases a local cache would work just fine and there’s no need for the overhead and complexity of the distributed cache.

So, in many applications, including plain Spring and Spring Boot, you can use @Cacheable on any method and its result will be cached so that the next time the method is invoked, the cached result is returned.

Spring has some default cache manager implementations, but external libraries are always better and more flexible than simple implementations. Caffeine, for example is a high-performance Java cache library. And Spring Boot comes with a CaffeineCacheManager. So, ideally, that’s all you need – you just create a cache manager bean and you have caching for your @Cacheable annotated-methods.

However, the provided cache manager allows you to configure just one cache specification. Cache specifications include the expiry time, initial capacity, max size, etc. So all of your caches under this cache manager will be created with a single cache spec. The cache manager supports a list of predefined caches as well as dynamically created caches, but on both cases a single cache spec is used. And that’s rarely useful for production. Built-in cache managers are something you have to be careful with, as a general rule.

There are a few blogposts that tell you how to define custom caches with custom specs. However, these options do not support the dynamic, default cache spec usecase that the built-in manager supports. Ideally, you should be able to use any name in @Cacheable and automatically a cache should be created with some default spec, but you should also have the option to override that for specific caches.

That’s why I decided to use a simpler approach than defining all caches in code that allows for greater flexibility. It extends the CaffeineCacheManager to provide that functionality:

 * Extending Caffeine cache manager to allow flexible per-cache configuration
public class FlexibleCaffeineCacheManager extends CaffeineCacheManager implements InitializingBean {
    private Map<String, String> cacheSpecs = new HashMap<>();

    private Map<String, Caffeine<Object, Object>> builders = new HashMap<>();

    private CacheLoader cacheLoader;

    public void afterPropertiesSet() throws Exception {
        for (Map.Entry<String, String> cacheSpecEntry : cacheSpecs.entrySet()) {
            builders.put(cacheSpecEntry.getKey(), Caffeine.from(cacheSpecEntry.getValue()));

    protected Cache<Object, Object> createNativeCaffeineCache(String name) {
        Caffeine<Object, Object> builder = builders.get(name);
        if (builder == null) {
            return super.createNativeCaffeineCache(name);

        if (this.cacheLoader != null) {
            return builder.build(this.cacheLoader);
        } else {
            return builder.build();

    public Map<String, String> getCacheSpecs() {
        return cacheSpecs;

    public void setCacheSpecs(Map<String, String> cacheSpecs) {
        this.cacheSpecs = cacheSpecs;

    public void setCacheLoader(CacheLoader cacheLoader) {
        this.cacheLoader = cacheLoader;

In short, it create one caffeine builder per spec and uses that instead of the default builder when a new cache is needed.

Then a sample XML configuration would look like this:

<bean id="cacheManager" class="net.bozho.util.FlexibleCaffeineCacheManager">
    <property name="cacheSpecification" value="expireAfterWrite=10m"/>
    <property name="cacheSpecs">
            <entry key="statistics" value="expireAfterWrite=1h"/> 

With Java config it’s pretty straightforward – you just set the cacheSpecs map.

While Spring has already turned into a huge framework that provides all kinds of features, it hasn’t abandoned the design principles of extensibility.

Extending built-in framework classes is something that happens quite often, and it should be in everyone’s toolbox. These classes are created with extension in mind – you’ll notice that many methods in the CaffeineCacheManager are protected. So we should make use of that whenever needed.

The post Multiple Cache Configurations with Caffeine and Spring Boot appeared first on Bozho's tech blog.

Command-line SQL Client for IBM i 7.x (AS/400)

Post Syndicated from Bozho original https://techblog.bozho.net/command-line-sql-client-for-ibm-i-7-x-as-400/

In the category of “niche blogposts”, this is probably the “nichest”. But it might be useful, so I’ll share it.

Recently I had to interface an IBM i system (think “mainframe”, though not exactly) from a command-line-only Linux environment. The interesting part is that you can access the IBM i files through SQL syntax, as they are tabular in nature. However, I failed to find a proper tool for that. There are UI tools by IBM but they don’t work in a headless environment.

So I decided to write my own simple tool – a command line SQL client for IBM i 7.x. It relies on an IBM-provided library (which, by the way, fails in some cases in headless environments as well, as it tries to prompt for certain data using awt/swing). The tool can be used after building it with maven and by simply executing SQL queries after connecting:

java -jar as400-sql-client.jar <connectionString> <username> <password>

It can be can be fond here. Apart from being useful in navigating the IBM system, it relies in the jline project which allows you to create rich command line tools that support autocomplete, history, coloring, etc.

I hope that nobody will need this tool, but in the rare case that someone does need it, I hope to save them hours of struggle.

The post Command-line SQL Client for IBM i 7.x (AS/400) appeared first on Bozho's tech blog.

7 Questions To Ask Yourself About Your Code

Post Syndicated from Bozho original https://techblog.bozho.net/7-questions-to-ask-yourself-about-your-code/

I was thinking the other days – why writing good code is so hard? Why the industry still hasn’t got to producing quality software, despite years of efforts, best practices, methodologies, tools. And the answer to those questions is anything but simple. It involves economic incentives, market realities, deadlines, formal education, industry standards, insufficient number of developers on the market, etc. etc.

As an organization, in order to produce quality software, you have to do a lot. Setup processes, get your recruitment right, be able to charge the overhead of quality to your customers, and actually care about that.

But even with all the measures taken, you can’t guarantee quality code. First, because that’s subjective, but second, because it always comes down to the individual developers. And not simply whether they are capable of writing quality software, but also whether they are actually doing it.

And as a developer, you may fit the process and still produce mediocre code. This is why my thoughts took me to the code from the eyes of the developer, but in the context of software as a whole. Tools can automatically catch code styles issues, cyclomatic complexity, large methods, too many method parameters, circular dependencies, etc. etc. But even if you cover those, you are still not guaranteed to have produced quality software.

So I came up with seven questions that we as developers should ask ourselves each time we commit code.

  1. Is it correct? – does the code implement the specification. If there is no clear specification, did you do a sufficient effort to find out the expected behaviour. And is that behaviour tested somehow – by automated tests preferably, or at least by manual testing,.
  2. Is it complete? – does it take care of all the edge cases, regardless of whether they are defined in the spec or not. Many edge cases are technical (broken connections, insufficient memory, changing interfaces, etc.).
  3. Is it secure? – does it prevent misuse, does it follow best security practices, does it validate its input, does it prevent injections, etc. Is it tested to prove that it is secure against these known attacks. Security is much more than code, but the code itself can introduce a lot of vulnerabilities.
  4. Is it readable and maintainable? – does it allow other people to easily read it, follow it and understand it? Does it have proper comments, describing how a certain piece of code fits into the big picture, does it break down code in small, readable units.
  5. Is it extensible? – does it allow being extended with additional use cases, does it use the appropriate design patterns that allow extensibility, is it parameterizable and configurable, does it allow writing new functionality without breaking old one, does it cover a sufficient percentage of the existing functionality with tests so that changes are not “scary”.
  6. Is it efficient? – does work well under high load, does it care about algorithmic complexity (without being prematurely optimized), does it use batch processing, does it read avoid loading big chunks of data in memory at once, does it make proper use of asynchronous processing.
  7. Is it something to be proud of? – does it represent every good practice that your experience has taught you? Not every piece of code is glorious, as most perform mundane tasks, but is the code something to be proud of or something you’d hope nobody sees? Would you be okay to put it on GitHub?

I think we can internalize those questions. Will asking them constantly make a difference? I think so. Would we magically get quality software If every developer asked themselves these questions about their code? Certainly not. But we’d have better code, when combined with existing tools, processes and practices.

Quality software depends on many factors, but developers are one of the most important ones. Bad software is too often our fault, and by asking ourselves the right questions, we can contribute to good software as well.

The post 7 Questions To Ask Yourself About Your Code appeared first on Bozho's tech blog.

How to create secure software? Don’t blink! [talk]

Post Syndicated from Bozho original https://techblog.bozho.net/how-to-create-secure-software-dont-blink-talk/

Last week Acronis (famous for their TrueImage) organized a conference in Sofia about cybersecurity for developers and I was invited to give a talk.

It’s always hard to pick a topic for a talk on a developer conference that is not too narrowly focused — if you choose something too high level, you can be uesless to the audience and seen as a “bullshitter”; if you pick something too specific, half of the audience may be bored because it is not their area of work.

So I chose the middle ground — an overview talk with as much specifics as possible in it. I tried to tell interesting stories of security vulnerabilities to illustrate my points. The talk is split in several parts:

  • Purpose of attacks
  • Front-end vulnerabilities (examples and best practices)
  • Back-end vulnerabilities (examples and best practices)
  • Infrastructure vulnerabilities (examples and best practices)
  • Human factor vulnerabilities (examples and best practices)
  • Thoughts on how this fits into the bigger picture of software security

You can watch the 30 minutes video here:

If you would like to download my slides, click here. or view them at SlideShare:

The point is — security is hard and there are a million things to watch for and a million things that can go wrong. You should minimize risk by knowing and following as much best practices as possible. And you should not assume you are secure, as even the best companies make rookie mistakes.

The security mindset, which is partly formalized by secure coding practices, is at the core of having a secure software. Asking yourself constantly “what could go wrong” will make software more secure. It is a whole other topic of how to make all software more secure, not just the ones we are creating, but it is less technical and goes through the topics public policies, financial incentives to invest in security and so on.

For technical people it’s important to know how to make a focused effort to make a system more secure. And to apply what we know, because we may know a lot and postpone it for “some other sprint”.

And as a person from the audience asked me — is not blinking really the way? Of course not, that effort won’t be justified. But if we cover as much of the risks as possible, that will give us some time to blink.

The post How to create secure software? Don’t blink! [talk] appeared first on Bozho's tech blog.

Avoid Lists in Cassandra

Post Syndicated from Bozho original https://techblog.bozho.net/avoid-lists-in-cassandra/

Apache Cassandra is fast and scalable database which over the years became almost as easy to use as a traditional SQL database. At least on the surface.

You an use SQL-like queries, but they have a lot of limitations; you have a schema, but it’s not as flexible to modify it as in a SQL database; you have the same tabular structure with a primary key, but it’s more complicated due to the differentiation between partition key and sorting key. And there are a lot of underlying details that seem irrelevant at first, but are crucial for performance and data consistency, like tombstones, SSTable compaction and so on.

But I want to discuss the “list” column type, as recently we’ve had a very elusive issue with it. We are in the business of guaranteeing data integrity, and that’s why our records are not updated, ever. This is a good fit for Cassandra, as updates are tricky to get right. But on one of our deployments we noticed something strange – very rarely, the hash of the data in a particular entry out of millions would not match upon comparison with the indexed data. Upon investigation, we noticed that a column of type “list” got duplicate values. It was not an issue with the code, because in this particular case the code was always using Collections.singletonList(..)

It appears that Cassandra is trying to be clever and when it sees identical entries in a batch insert, instead of overriding one with the other, it tries to merge them, resulting in a list with duplicate entries. Accounts of the issue are reported here and here.

Now, batches are a difficult topic and one of those things that look straight-forward but aren’t. In most cases, batches are an anti-pattern. There are cases where batches are useful, but it’s more rare than expected. That’s because of the distributed nature of Cassandra. Another complication comes from whether you are using token-aware or toke-unaware client policy, i.e. whether your client knows where each record belongs in order to send the request to it. I won’t go into details about batches, as they are well explained in the two linked articles.

Back to lists – since in our case we don’t have identical records in a batch, the issue was probably manifested because of a network timeout where the client didn’t receive confirmation of the write and re-attempted sending the same statement again. Whether being in a batch or not affects it, I can’t be sure. But it’s probably safer to assume that it might happen with or without a batch. I.e. lists can be merged in unexpected situations.

This is a serious reason for not using lists at all. Additional arguments are given by Walmart

Sets should be preferred to Lists as Sets (and Maps) avoid read-before-write pattern for updates and deletes

And this is just for a small number of items. Using collections for a large number of items (e.g. thousands) is another issue, as you can’t load the items in portions – they are all read at once.

In a Java application, for example, you can easily substitution the List with a Set even if the underlying column is of type List and that would help temporarily avoid the issues – data may still be duplicated in the database, but at least the application will work with unique values. Have in mind though, that ordering is not guaranteed by the Java Set, so if it matters for your logic, make sure you order by some well-defined comparison criteria.

The general advice of “avoid lists” (and “avoid batches”) paints an accurate picture of Cassandra. It looks straightforward to use, but once you get to production, you may realize there were some suboptimal design decisions.

The post Avoid Lists in Cassandra appeared first on Bozho's tech blog.

Certificate Transparency Verification in Java

Post Syndicated from Bozho original https://techblog.bozho.net/certificate-transparency-verification-in-java/

So I had this naive idea that it would be easy to do certificate transparency verification as part of each request in addition to certificate validity checks (in Java).

With half of the weekend sacrificed, I can attest it’s not that trivial. But what is certificate transparency? In short – it’s a publicly available log of all TLS certificates in the world (which are still called SSL certificates even though SSL is obsolete). You can check if a log is published in that log and if it’s not, then something is suspicious, as CAs have to push all of their issued certificates to the log. There are other use-cases, for example registering for notifications for new certificates for your domains to detect potentially hijacked DNS admin panels or CAs (Facebook offers such a tool for free).

What I wanted to do is the former – make each request from a Java application verify the other side’s certificate in the certificate transparency log. It seems that this is not available out of the box (if it is, I couldn’t find it. In one discussion about JEP 244 it seems that the TLS extension related to certificate transparency was discussed, but I couldn’t find whether it’s supported in the end).

I started by thinking you could simply get the certificate, and check its inclusion in the log by the fingerprint of the certificate. That would’ve been too easy – the logs to allow for checking by hash, however it’s not the fingerprint of a certificate, but instead a signed certificate timestamp – a signature issued by the log prior to inclusion. To quote the CT RFC:

The SCT (signed certificate timestamp) is the log’s promise to incorporate the certificate in the Merkle Tree

A merkle tree is a very cool data structure that allows external actors to be convinced that something is within the log by providing an “inclusion proof” which is much shorter than the whole log (thus saving a lot of bandwidth). In fact the coolness of merkle trees is why I was interested in certificate transparency in the first place (as we use merkle trees in my current log-oriented company)

So, in order to check for inclusion, you have to somehow obtain the SCT. I initially thought it would be possible with the Certificate Transparency Java library, but you can’t. Once you have it, you can use the client to check it in the log, but obtaining it is harder. (Note: for server-side verification it’s fine to query the log via HTTP; browsers, however, use DNS queries in order to preserve the anonymity of users).

Obtaining the SCT can be done in three ways, depending on what the server and/or log and/or CA have chosen to support: the SCT can be included in the certificate, or it can be provided as a TLS extension during the TLS handshake, or can be included in the TLS stapling response, again during the handshake. Unfortunately, the few certificates that I checked didn’t have the SCT stored within them, so I had to go to a lower level and debug the TLS handshake.

I enabled TLS hadnshake verbose output, and lo and behold – there was nothing there. Google does include SCTs as a TLS extension (according to Qualys), but the Java output didn’t say anything about it.

Fortunately (?) Google has released Conscrypt – a Java security provider based Google’s fork of OpenSSL. Things started to get messy…but I went for it, included Conscrypt and registered it as a security provider. I had to make a connection using the Conscrypt TrustManager (initialized with all the trusted certs in the JDK):

KeyStore trustStore = KeyStore.getInstance("JKS");
trustStore.load(new FileInputStream(System.getenv("JAVA_HOME") + "/lib/security/cacerts"), "changeit".toCharArray());
ctx.init(null,new TrustManager[] {new TrustManagerImpl(trustStore, 
    null, null, null, logStore, null, 
    new StrictCTPolicy())}, new SecureRandom());

URL url = new URL("https://google.com");
HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();

And of course it didn’t work initially, because Conscrypt doesn’t provide implementations of some core interfaces needed – the CTLogStore and CTPolicy classes. The CTLogStore actually is the important bit that holds information about all the known logs (I still find it odd to call a “log provider” simply “log”, but that’s the accepted terminology). There is a list of known logs, in JSON form, which is cool, except it took me a while to figure (with external help) what are exactly those public keys. What are they – RSA, ECC? How are they encoded? You can’t find that in the RFC, nor in the documentation. It can be seen here that it’s ” DER encoding of the SubjectPublicKeyInfo ASN.1 structure “. Ugh.

BouncyCastle to the rescue. My relationship with BouncyCastle is a love-hate one. I hate how unintuitive it is and how convoluted its APIs are, but I love that it has (almost) everything cryptography-related that you may ever need. After some time wasted with trying to figure how exactly to get that public key converted to a PublicKey object, I found that using PublicKeyFactory.createKey(Base64.getDecoder().decode(base64Key)); gives you the parameters of whatever algorithm is used – it can return Elliptic curve key parameters or RSA key parameters. You just have to then wrap them in another class and pass them to another factory (typical BouncyCastle), and hurray, you have the public key.

Of course now Google’s Conscrypt didn’t work again, because after the transformations the publicKey’s encoded version was not identical to the original bytes, and so the log ID calculation was wrong. But I fixed that by some reflection, and finally, it worked – the certificate transparency log was queried and the certificate was shown to be valid and properly included in the log.

The whole code can be found here. And yes, it uses several security providers, some odd BouncyCastle APIs and some simple implementations that are missing in Google’s provider. Known certificates may be cached so that repeated calls to the log are not performed, but that’s beyond the scope of my experiment.

Certificate transparency seems like a thing that’s core to the internet nowadays. And yet, it’s so obscure and hard to work with.

Why the type of public key in the list is not documented (they should at least put an OID next to the public key, because as it turns out, not all logs use elliptic curves – two of them use RSA). Probably there’s a good explanation, but why include the SCT in the log rather than the fingerprint of the certificate? Why not then mandate inclusion of the SCT in the certificate, which would require no additional configuration of the servers and clients, as opposed to including it in the TLS handshake, which does require upgrades?

As far as I know, the certificate transparency initiative is now facing scalability issues because of the millions of Let’s encrypt certificates out there. Every log (provider) should serve the whole log to everyone that requests it. It is not a trivial thing to solve, and efforts are being put in that direction, but no obvious solution is available at the moment.

And finally, if Java doesn’t have an easy way to do that, with all the crypto libraries available, I wonder what’ the case for other languages. Do they support certificate transparency or they need upgrades?

And maybe we’re all good because browsers supports it, but browsers are not the only thing that makes HTTP requests. API calls are a massive use-case and if they can be hijacked, the damage can be even bigger than individual users being phished. So I think more effort should be put in two things:
1. improving the RFC and 2. improving the programming ecosystem. I hope this post contributes at least a little bit.

The post Certificate Transparency Verification in Java appeared first on Bozho's tech blog.

Integrating Applications As Heroku Add-Ons

Post Syndicated from Bozho original https://techblog.bozho.net/integrating-applications-as-heroku-add-ons/

Heroku is a popular Platform-as-a-Service provider and it offers vendors the option to be provided as add-ons. Add-ons can be used by Heroku customers in different ways, but a typical scenario would be “Start a database”, “Start an MQ”, or “Start a logging solution”. After you add the add-on to your account, you can connect to the chosen database, MQ, logging solution or whatever.

Integrating as Heroku add-on is allegedly simple, and Heroku provides good documentation on how to do it. However, there are some pitfalls and so I’d like to share my experience in providing our services (Sentinel Trails and SentinelDB) as Heroku add-ons.

Both are SaaS (one is a logging solution, the other one – a cloud datastore), and so when a Heroku customer wants to add it to their account, we have to just create an account for them on our end.

In order to integrate with Heroku, you need to implement several endpoints:

  • provisioning – the initial creation of the resources (= account)
  • plan change – since Heroku supports multiple subscription plans, this should also be reflected on your end
  • deprovisioning – if a user stops using your service, you may want to free some resources
  • SSO – allows users to log in your service by clicking an icon in the Heroku console.

Implementing these endpoints following the tutorial should be straightforward, but it isn’t exactly. Hence I’m sharing our Spring MVC controller that handles it – you can check it here.

A few important bits:

  • You may choose not to obtain a token if you don’t plan to interact with the Heroku API further.
  • We are registering the user with a fake email in the form of <resourceId>@heroku.com. However, you may choose to use the token to fetch the emails of team members and collaborators, as described here.
  • The most important piece of data is the resource_id – store it in your users (or organizations) table and consider adding an index to be able to retrieve records by it quickly.
  • Return your keys and secrets as part of the provisioning request. They will be set as environment variables in Heroku
  • All of the requests are made from the Heroku servers to your server directly, except the SSO call. It is invoked in the browsers and so you should set the session cookie/token in the response. That way the user will be logged in your service.
  • When you generate your addon manifest, make sure you update the endpoint URLs

After you’re done, the alpha version appears in the marketplace (e.g. here and here). You should then have some alpha users to test the add-ons before it can be visible in the marketplace.

Integrating SaaS solutions with existing cloud providers is a good thing, and I’m happy that Heroku provides an automated way to do that. (AWS, for example, also has a marketplace, but integration there feels a bit strange and unpolished (I’ve hit some issues that were manually resolved by the AWS team).

Since many companies are choosing IaaS or PaaS for their services, having the ability to easily integrate an add-on service is very useful. I’d even go further and propose some level standardization for cloud add-ons, but I guess time will tell if we really need it, or we can spare a few days per provider.

The post Integrating Applications As Heroku Add-Ons appeared first on Bozho's tech blog.

Types of Data Breaches and How To Prevent Them

Post Syndicated from Bozho original https://techblog.bozho.net/types-of-data-breaches-and-how-to-prevent-them/

Data breaches happen practically every day. Personal, including financial and medical data leak to cyber criminals as well as intelligence agencies. Some notable breaches include the Equifax breach, where dozens of personal data fields were leaked, and the recent Marriott breach, where passports, credit cards and locations of people at a given time were breached.

I’ve been doing some data protection consultancy as well as working on a data protection product and decided to classify the types of data breaches and give recommendations on how they can be addressed. We don’t always get to know how exactly the breaches happen, but from what is published in news articles and post-mortems, we can have a good overview on the breach landscape.

Control over target server – if an attacker is able to connect to a target server and gains full or partial control on it, they can do anything, including running SELECT * FROM ... , copying files, etc. How do attackers gain such control? In many ways, most notably RCE (remote code execution) vulnerabilities and weak admin authentication.

How to prevent it? Follow best security practices – regularly update libraries and software to get security patches, do not run native commands from within the application layer, open only necessary ports (80 and 443) to the outside world, configure 2-factor authentication for administrator login. Aim at having an intrusion detection / prevention system. Encrypt your data, and make the encryption as granular as possible for the most sensitive data (e.g. for SentinelDB we utilize per-record encryption) to avoid SELECT * breaches.

SQL injections – this is a rookie mistake that unfortunately still happens. It allows attackers to manipulate your SQL queries and inject custom bits in them that allows them to extract more data than they are supposed to.

How to prevent it? Use prepared statements for your queries. Never ever concatenate user input in order to construct queries. Run regular code reviews and use code inspection tools to catch such instances.

Unencrypted backups – the main system may be well protected, but attackers are usually after the weak spots. Storing backups might be such – if you store unencrypted backups that are accessible via weak authentication (e.g. over FTP via username/password), then someone may try to attack this weaker spot. Even if the backup is encrypted, the key can be placed alongside it, which makes the encryption practically useless.

How to prevent it? Encrypt you backups, store them in a way that’s as strongly protected as your servers (e.g. 2FA, internal-network/VPN only), and have your decryption key in a hardware security module (or equivalent, e.g. AWS KMS).

Personal data in logs – another weak spot other than the backups may be your logs. They usually lie on separate servers, and are not as well guarded. That’s usually okay, since logs don’t contain personal information, but sometimes they do. I recently stumbled upon a large company’s website that had their directory structure unprotected and they kept their access logs files alongside their static resources. In addition to that, they passed personal information as GET parameters, so you could get a lot of information by just getting the access logs. Needless to say, I did a responsible disclosure and the issue was fixed, but it was a potential breach.

How to prevent it? Don’t store personal information in logs. Avoid submitting forms with a GET method. Regularly review the code to check whether personal data is not logged. Make sure your logs are stored in a way as protected as your production servers and your backups. It could be a cloud service, it could be a local installation of an open source package, but don’t overlook the security of the log collection system.

Data pushed to unprotected storage – a recent Alteryx/Experian leak was just that – data placed on a (somewhat) public S3 bucket was breached. If you place personal data in weakly protected public stores (AWS S3, file sharing services, FTPs), then you are waiting for trouble to happen.

How to prevent it? Don’t put personal data publicly. How to prevent that from happening – always review your S3 buckets and FTP servers policies. Have internal procedures that disallow sharing personal data without protecting it with at least a password shared by a side-channel (messenger/sms).

Unrestricted API calls – that’s what caused the Facebook-Cambridge Analytics issue. No matter how secure your servers are, if you expose the data through your API without access restriction, rate-limiting, fraud-detection, audit trail, then your security is no use – someone will “scrape” your data through the API.

How to prevent it? Do not expose too much personal data over public or easily accessible APIs. Vet API users and inform your users whenever their data is being shared with third parties, via API or otherwise.

Internal actor – all of the woes above can happen due to poor security or due to internal actors. Even if your network is well guarded, an admin can go rogue and leak the data. For many reasons, nonincluding financial. An privileged internal actor has access to perform SELECT *, can decrypt the backups, can pretend to be a trusted API partner.

How to prevent it? Good operational security. A single sentence like that may sound easy, but it’s not. I don’t have a full list of things that have to be in place to guard against internal breaches – there are technical, organizational and legal measures to be taken. Have unmodifiable audit trail. Have your Intrusion prevention system (or logging solution) also detect anomalous internal behaviour. Have procedures that require two admins to work together in order to log in (e.g. split key) to the most. If the data is sensitive, do background checks on the privileged admins. And many more things that fall into the “operational security” umbrella.

Man-in-the-middle attacks – MITM can be used to extract data from active users only. It works on website without HTTPS, or in case the attacker has somehow installed a wildcard certificate on the target machine (and before you say that’s too unlikely – it happens way too often to be ignored). In case of a successful MITM attack, the attacker can extract all data that’s being transferred.

How to prevent it? First – use HTTPS. Always. Redirect HTTP to HTTPS. Use HSTS. Use certificate pinning if you control the updates of the application (e.g. through an app store). The root certificate attack unfortunately cannot be circumvented. Sorry, just hope that your users haven’t installed such shitty software. Fortunately, this won’t lead to massive breaches, only data of active users that are being targeted may leak.

JavaScript injection / XSS – if somehow an attacker can inject javascript into your website, they can collect data being entered. This is what happened in the recent British Airways breach. A remember a potential attack on NSW (Australia) elections, where the piwick analytics script was loaded from an external server that was vulnerable to a TLS downgrade attack which allowed an attacker to replace the script and thus interfere with the election registration website.

How to prevent it? Follow the XSS protection cheat sheet by OWASP. Don’t include scripts from dodgy third party domains. Make sure third party domains, including CDNs, have a good security level (e.g. run Qualys SSL test).

Leaked passwords from other websites – one of the issues with incorrect storage of passwords is password reuse. Even if you store passwords properly, a random online store may not and if your users use the same email and password there, an attacker may try to steal their data from your site. Not all accounts will be compromised, but the more popular your service is, the more accounts will be affected.

How to avoid it? There’s not much you can do to make other websites store passwords correctly. But you can encourage the use of pass phrases , you can encourage 2-factor authentication in case of sensitive data, or you can avoid having passwords at all and use an external OAuth/OpenID provider (this has its own issues, but they may be smaller than those of password reuse). Also have some rate-limiting in place so that a single IP (or an IP range) is not able to try and access many accounts consecutively.

Employees sending emails with unprotected excel sheets – especially non-technical organizations and non-technical employees tend to just want to get their job done, so they may send large excel sheets with personal data to colleagues or partners in other companies. Then once someone’s email account or server is breached, the data gets breached as well.

How to prevent it? Have internal procedures against sending personal data in excel sheets, or at least have people zip them and send passwords through a side channel (messenger/sms). You can have an organization-wide software that scans outgoing emails for attachments with excel sheets that contain personal data and have these email blocked.

Data breaches are prevented by having good information security. And information security is hard. And it’s the right combination of security practices and security products that minimize the risk of incidents. Many organizations choose not to focus on infosec, as it’s not their core business or they estimate that the risk is worth it, viewing breaches, internal actors manipulating data and other incidents as something that can’t happen to them. Until it happens.

The post Types of Data Breaches and How To Prevent Them appeared first on Bozho's tech blog.

Resources on Distributed Hash Tables

Post Syndicated from Bozho original https://techblog.bozho.net/resources-on-distributed-hash-tables/

Distributed p2p technologies have always been fascinating to me. Bittorrent is cool not because you can download pirated content for free, but because it’s an amazing piece of technology.

At some point I read and researched a lot about how DHTs (distributed hash tables) work. DHTs are not part of the original bittorrent protocol, but after trackers were increasingly under threat to be closed for copyright infringment, “trackerless” features were added to the protocol. A DHT is distributed among all peers and holds information about which peer holds what data. Once you are connected to a peer, you can query it for their knowledge on who has what.

During my research (which was with no particular purpose) I took a note on many resources that I thought useful for understanding how DHTs work and possibly implementing something ontop of them in the future. In fact, a DHT is a “shared database”, “just like” a blockchain. You can’t trust it as much, but proving digital events does not require a blockchain anyway. My point here is – there is a lot more cool stuff to distributed / p2p systems than blockchain. And maybe way more practical stuff.

It’s important to note that the DHT used in BitTorrent is Kademlia. You’ll see a lot about it below.

Anyway, the point of this post is to share the resources that I collected. For my own reference and for everyone who wants to start somewhere on the topic of DHTs.

I hope the list is interesting and useful. It’s not trivial to think of other uses of DHTs, but simply knowing about them and how they work is a good thing.

The post Resources on Distributed Hash Tables appeared first on Bozho's tech blog.

Automate Access Control for User-Specific Entities

Post Syndicated from Bozho original https://techblog.bozho.net/automate-access-control-for-user-specific-entities/

Practically every web application is supposed to have multiple users and each user has some data – posts, documents, messages, whatever. And the most obvious thing to do is to protect these entities from being obtained by users that are not the rightful owners of these resources.

Unfortunately, this is not the easiest thing to do. I don’t mean it’s hard, it’s just not as intuitive as simply returning the resources. When you are your /record/{recordId} endpoint, a database query for the recordId is the immediate thing you do. Only then comes the concern of checking whether this record belongs to the currently authenticated user.

Frameworks don’t give you a hand here, because this access control and ownership logic is domain-specific. There’s no obvious generic way to define the ownership. It depends on the entity model and the relationships between entities. In some cases it can be pretty complex, involving a lookup in a join table (for many-to-many relationships).

But you should automate this, for two reasons. First, manually doing these checks on every endpoint/controller method is tedious and makes the code ugly. Second, it’s easier to forget to add these checks, especially if there are new developers.

You can do these checks in several places, all the way to the DAO, but in general you should fail as early as possible, so these checks should be on a controller (endpoint handler) level. In the case of Java and Spring, you can use annotations and a HandlerInterceptor to automate this. In case of any other language or framework, there are similar approaches available – some pluggable way to describe the ownership relationship to be checked.

Below is an example annotation to put on each controller method:

public @interface VerifyEntityOwnership {
    String entityIdParam() default "id";
    Class<?> entityType();

Then you define the interceptor (which, of course, should be configured to be executed)

public class VerifyEntityOwnershipInterceptor extends HandlerInterceptorAdapter {

    private static final Logger logger = LoggerFactory.getLogger(VerifyEntityOwnershipInterceptor.class);
    private OrganizationService organizationService;

    private MessageService MessageService;
    private UserService userService;
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {

        Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
        // assuming spring-security with a custom authentication token type
        if (authentication instanceof ApiAuthenticationToken) {
            AuthenticationData authenticationData = ((ApiAuthenticationToken) authentication).getAuthenticationData();

            UUID clientId = authenticationData.getClientId();
            HandlerMethod handlerMethod = (HandlerMethod) handler;
            VerifyEntityOwnership annotation = handlerMethod.getMethodAnnotation(VerifyEntityOwnership.class);
            if (annotation == null) {
                logger.warn("No VerifyEntityOwnership annotation found on method {}", handlerMethod.getMethod().getName());
                return true;
            String entityId = getParam(request, annotation.entityIdParam());
            if (entityId != null) {
                if (annotation.entityType() == User.class) {
                    User user = userService.get(entityId);
                    if (!user.getClientId().equals(clientId)) {
                       return false;
                } else if (annotation.entityType() == Message.class) {
                    Message record = messageService.get(entityId);
                    if (!message.getClientId().equals(clientId) {
                        return false;
                } // .... more

        return true;
    private String getParam(HttpServletRequest request, String paramName) {
        String value = request.getParameter(paramName);
        if (value != null) {
            return value;
        Map<String, String> pathVariables = (Map<String, String>) request.getAttribute(HandlerMapping.URI_TEMPLATE_VARIABLES_ATTRIBUTE);
        return pathVariables.get(paramName);

You see that this presumes the need for custom logic per type. If your model is simple, you can make that generic – make all your entities implement some `Owned interface with getClientId() method that all of them define. Then simply have a dao.get(id, entityClass); and avoid having entity-specific logic.

Notice the warning that gets printed when there is no annotation on a method – this is there to indicate that you might have forgotten to add one. Some endpoints may not require ownership check – for them you can have a special @IgnoreEntityOwnership annotation. The point is to make a conscious decision to not verify the ownership, rather than to forget about it and introduce a security issue.

What I’m saying might be obvious. But I’ve seen many examples of this omission, including production government projects. And as I said, frameworks don’t force you to consider that aspect, because they can’t do it in a generic way – web frameworks are usually not concerned with your entity model, and your ORM is not concerned with your controllers. There are comprehensive frameworks that handle all of these aspects, but even they don’t have generic mechanisms for that (at least not that I’m aware of).

Security includes applying a set of good practices and principles to a system. But it also includes procedures and automations that help developers and admins in not omitting something that they are generally aware of, but happen to forget every now and then. And the less tedious a security principle is to apply, the more likely it will be consistently applied.

The post Automate Access Control for User-Specific Entities appeared first on Bozho's tech blog.

Scaling Horizontally on AWS [talk]

Post Syndicated from Bozho original https://techblog.bozho.net/scaling-horizontally-on-aws-talk/

On a recent conference (HackConf) I gave a talk where I tried to summarize how to do deployment and horizontal scaling on AWS. It is an overview of AWS resources (instance, load balancers, auto-scaling groups, security groups) as well as how to use CloudFormation to script your stack.

It briefly mentions the application layer and how it should look like (because another talk on the same conference was focused on that part). My point here is summarized as: ““You cannot scale an unscalable application”. But the talk continues to discuss AWS specific things, although many of them have their nearly identical counterparts in other IaaS providers (e.g. Google Cloud, Azure).

The video of the talk can be seen here:

And the slides are here:

As someone summarized on twitter: “That talk is approximately a year worth of learning experience with AWS in 40 minutes”. This is a benefit and a drawback, as it might be too condensed and too shallow, but I think I’ve covered important bits with enough depth for a starting point.

One of my points was that for simpler setups you don’t need fancy tools and platforms (docker, kubernetes, etc.) — as you’ll have to use bash anyway, you can go with just bash + CloudFormation and have a perfectly good, highly-available, blue-green deployment setup.

The other main points where: “think about your infrastructure as code”, and “consider all your resources dispensable, as they will surely die at some point”. Overall, I hope the talk is useful for everyone using or planning to use AWS, or any other IaaS provider.

The post Scaling Horizontally on AWS [talk] appeared first on Bozho's tech blog.

Typical Workarounds For Compliant Logs

Post Syndicated from Bozho original https://techblog.bozho.net/typical-workarounds-for-compliant-logs/

You may think you have logs. Chances are, you can rely on them only for tracing exceptions and debugging. But you can’t rely on them for compliance, forensics, or any legal matter. And that may be none of your concern as an engineer, but it is one of those important non-functional requirements that, if not met, are ultimately our fault.

Because of my audit trail startup, I’m obviously both biased and qualified to discuss logs (I’ve previously described what audit trail / audit logs are and how they can be maintained). And while they are a only a part of the security aspects, and certainly not very exciting, they are important, especially from a legal standpoint. That’s why many standards and laws – including ISO 27001, PCI-DSS, HIPAA, SOC 2, PSD2, GDPR – require audit trail functionality. And most of them have very specific security requirements.

PCI-DSS has a bunch of sections with audit trail related requirements:

“Malicious users often attempt to alter audit logs to hide their actions, and a record of access allows an organization to trace any inconsistencies or potential tampering of the logs to an individual account. [..]”

10.5 Secure audit trails so they cannot be altered. Often a malicious individual who has entered the network will attempt to edit the audit logs in order to hide their activity. Without adequate protection of audit logs, their completeness, accuracy, and integrity cannot be guaranteed, and the audit logs can be rendered useless as an investigation tool after a compromise.

Use file-integrity monitoring or change-detection software on logs to ensure that existing log data cannot be changed without generating alerts (although new data being added should not cause an alert).

ISO 27001 Annex A also speaks about protecting the audit trail against tampering

A.12.4.2 Protection of log information
Logging facilities and log information shall be protected against tampering and unauthorized access

From my experience, sadly, logs are rarely fully compliant. However, auditors are mostly fine with that and certification is given, even though logs can be tampered with. I decided to collect and list the typical workarounds the the secure, tamper-evident/tamper-protected audit logs.

  • We don’t need them secured – you almost certainly do. If you need to be compliant – you do. If you don’t need to be compliant, but you have a high-value / high-impact system, you do. If it doesn’t have to be compliant and it’s low-value or low-impact, than yes. You don’t need much security for that anyway (but be careful to not underestimate the needed security)
  • We store it in multiple places with different access – this is based on the assumption that multiple administrators won’t conspire to tamper with the logs. And you can’t guarantee that, of course. But even if you are sure they won’t, you can’t prove that to external parties. Imagine someone sues you and you provide logs as evidence. If the logs are not tamper-evident, the other side can easily claim you have fabricated logs and make them inadmissible evidence.
  • Our system is so complicated, nobody knows how to modify data without breaking our integrity checks – this is the “security through obscurity” approach and it will probably work well … until it doesn’t.
  • We store it with external provider – external log providers usually claim they provide compliance. And they do provide many aspects of compliance, mainly operational security around the log collection systems. Besides, in that case you (or your admins) can’t easily modify the externally stored records. Some providers may give you the option to delete records, which isn’t great for audit trail. The problem with such services is that they keep the logs operational for relatively short periods of time and then export them in a form that can be tampered with. Furthermore, you can’t really be sure that the logs are not tampered with. Yes, the provider is unlikely to care about your logs, but having that as the main guarantee doesn’t sound perfect.
  • We are using trusted timestamping – and that’s great, it covers many aspects of the logs integrity. AWS is timestamping their CloudTrail log files and it’s certainly a good practice. However, it comes short in just one scenario – someone deleting an entire timestamped file. And because it’s a whole file, rather than record-by-record, you won’t know which record exactly was targeted. There’s another caveat – if the TSA is under your control, you can backdate timestamps and therefore can’t prove that you didn’t fabricate logs.

These approaches are valid and are somewhere on a non-zero point on the compliance spectrum. Is having four copies of the data accessible to different admins better than having just one? Yup. Is timestamping with a local TSA better than not timestamping? Sure. Is using an external service more secure than using a local one? Yes. Are they sufficient to get certified? Apparently yes. Is this the best that can be done? No. Do you need the best? Sometimes yes.

Most of these measures are organizational rather than technical, and organizational measures can be reversed or circumvented much more easily than technical ones. And I may be too paranoid, but when I was a government advisor, I had to be paranoid when it comes to security.

And what do I consider a non-workaround? There is a lot of research around tamper-evident logging, tamper-evident data structures, merkle trees, hash chains. Here’s just one example. It should have been mainstream by now, but it isn’t. We’ve apparently settled on suboptimal procedures (even when it comes to cost), and we’ve interpreted the standards loosely, looking for a low-hanging fruit. And that’s fine for some organizations. I guess.

It takes time for a certain good practice to become mainstream, and it has to be obvious. Tamper-evident logging isn’t obvious. We had gradually become more aware about properly storing passwords, using TLS, multi-factor authentication. But security is rarely a business priority, as seen in reports about what drives security investments (it’s predominantly “compliance”).

As a practical conclusion – if you are going to settle for a workaround, at least choose a better one. Not having audit trail or not making any effort to protect it from tampering should be out of the question.

The post Typical Workarounds For Compliant Logs appeared first on Bozho's tech blog.

A Caveat With AWS Shared Resources

Post Syndicated from Bozho original https://techblog.bozho.net/a-caveat-with-aws-shared-resources/

Recently I’ve been releasing a new build, as usual utilizing a blue-green deployment by switching the DNS record to point to the load balancer of the previously “spare” group. But before I switched the DNS, I checked the logs of the newly launched version and noticed something strange – continuous HTTP errors from our web frameworks (Spring MVC) that a certain endpoint does not support the HTTP method.

The odd thing was – I didn’t have such an endpoint at all. I enabled further logging and it turned out that the request URL was not about my domain at all. The spare group, not yet having traffic directed at it, was receiving requests pointed at a completely different domain, which I didn’t own.

I messaged the domain owner, as well as AWS, to inform them of the issue. The domain owner said they have no idea what that is and that they don’t have any unused or forgotten AWS resources. AWS, however, responded as follows:

The ELB service scales dynamically as traffic demand changes, therefore when scaling occurs, the ELB service will take IP addresses from the AWS unused public IP address pool and assign them to the ELB nodes that are provisioned for you. The foreign domain name you see here in your case, likely belongs to another AWS customer who’s AWS resource is no longer using one of the IP addresses that your ELB node now has as it was released to the AWS unused IP pool at some stage, their web clients are very likely excessively caching DNS for these DNS names (not respecting DNS TTL), or their own DNS servers are configured with static entries and are therefore communicating with an IP address that now belongs to your ELB. The ELB adding and removing IPs from Route53 is briefly described in [Link 1] and the TTL attached to the DNS name is 60 seconds. Provided that clients respect the TTL, there should be no such issues.

I can simply ignore the traffic, but what happens if I’m in this role – after a burst my IP gets released, but some client (or some intermediate DNS resolver) has cached the information for longer than instructed. Then requests to my service, including passwords, API keys, etc. will be forwarded to someone else.

Using HTTPS might help in case of browsers, as the the certificate of the new load balancer will not match my domain, but in case of other tools that don’t perform this validation or have it cached, HTTPS won’t help, unless there’s certificate pinning implemented.

AWS say they can’t fix that at the load balancer, but they actually can, by keeping a mapping between IPs, owners and Host headers. It won’t be trivial, but it’s worth exploring in case my experience is not an exceptional scenario. Whether it’s worth fixing if HTTPS solves it – probably not.

So this is yet another reason to always use HTTPS and to force HTTPS if connection is made over HTTP. But also a reminder to not do clever client-side IP caching (let the DNS resolvers handle that) and to always verify the server certificate.

The post A Caveat With AWS Shared Resources appeared first on Bozho's tech blog.

Writing Big JSON Files With Jackson

Post Syndicated from Bozho original https://techblog.bozho.net/writing-big-json-files-with-jackson/

Sometimes you need to export a lot of data to JSON to a file. Maybe it’s “export all data to JSON”, or the GDPR “Right to portability”, where you effectively need to do the same.

And as with any big dataset, you can’t just fit it all in memory and write it to a file. It takes a while, it reads a lot of entries from the database and you need to be careful not to make such exports overload the entire system, or run out of memory.

Luckily, it’s fairly straightforward to do that, with a the help Jackson’s SequenceWriter and optionally of piped streams. Here’s how it would look like:

    private ObjectMapper jsonMapper = new ObjectMapper();
    private ExecutorService executorService = Executors.newFixedThreadPool(5);

    public ListenableFuture<Boolean> export(UUID customerId) {
        try (PipedInputStream in = new PipedInputStream();
                PipedOutputStream pipedOut = new PipedOutputStream(in);
                GZIPOutputStream out = new GZIPOutputStream(pipedOut)) {
            Stopwatch stopwatch = Stopwatch.createStarted();

            ObjectWriter writer = jsonMapper.writer().withDefaultPrettyPrinter();

            try(SequenceWriter sequenceWriter = writer.writeValues(out)) {
                Future<?> storageFuture = executorService.submit(() ->
                       storageProvider.storeFile(getFilePath(customerId), in));

                int batchCounter = 0;
                while (true) {
                    List<Record> batch = readDatabaseBatch(batchCounter++);
                    for (Record record : batch) {

                // wait for storing to complete

            logger.info("Exporting took {} seconds", stopwatch.stop().elapsed(TimeUnit.SECONDS));

            return AsyncResult.forValue(true);
        } catch (Exception ex) {
            logger.error("Failed to export data", ex);
            return AsyncResult.forValue(false);

The code does a few things:

  • Uses a SequenceWriter to continuously write records. It is initialized with an OutputStream, to which everything is written. This could be a simple FileOutputStream, or a piped stream as discussed below. Note that the naming here is a bit misleading – writeValues(out) sounds like you are instructing the writer to write something now; instead it configures it to use the particular stream later.
  • The SequenceWriter is initialized with true, which means “wrap in array”. You are writing many identical records, so they should represent an array in the final JSON.
  • Uses PipedOutputStream and PipedInputStream to link the SequenceWriter to a an InputStream which is then passed to a storage service. If we were explicitly working with files, there would be no need for that – simply passing a FileOutputStream would do. However, you may want to store the file differently, e.g. in Amazon S3, and there the putObject call requires an InputStream from which to read data and store it in S3. So, in effect, you are writing to an OutputStream which is directly written to an InputStream, which, when attampted to be read from, gets everything written to another OutputStream
  • Storing the file is invoked in a separate thread, so that writing to the file does not block the current thread, whose purpose is to read from the database. Again, this would not be needed if simple FileOutputStream was used.
  • The whole method is marked as @Async (spring) so that it doesn’t block execution – it gets invoked and finishes when ready (using an internal Spring executor service with a limited thread pool)
  • The database batch reading code is not shown here, as it varies depending on the database. The point is, you should fetch your data in batches, rather than SELECT * FROM X.
  • The OutputStream is wrapped in a GZIPOutputStream, as text files like JSON with repetitive elements benefit significantly from compression

The main work is done by Jackson’s SequenceWriter, and the (kind of obvious) point to take home is – don’t assume your data will fit in memory. It almost never does, so do everything in batches and incremental writes.

The post Writing Big JSON Files With Jackson appeared first on Bozho's tech blog.