Tag Archives: Developer tips

Seven Legacy Integration Patterns

Post Syndicated from Bozho original https://techblog.bozho.net/seven-legacy-integration-patterns/

If we have to integrate two (or more) systems nowadays, we know – we either use an API or, more rarely, some message queue.

Unfortunately, many systems in the world do not support API integration. And many more are being created as we speak that don’t have APIs. So when you inevitably have to integrate with them, you are left with imperfect choices to make. Below are seven patterns to integrate with legacy systems (or not-so-legacy systems that are built in legacy ways).

Initially I wanted to title this “bad integration patterns”. But if you don’t have other options, they are not bad – they are inevitable. What’s bad is the fact that so many systems continue to be built without integration in mind.

I’ve seen all of these, on more than one occasion. And I’ve heard many more stories about them. Unfortunately, they are not the exception (fortunately, they are also not the rule, at least not anymore).

  1. Files on FTP – one application uploads files (XML, CSV, other) to an FTP server (or another shared resource) and the other one reads them via a scheduled job, parses them and optionally produces a response – either on the same FTP server, or via email. Sharing files like that is certainly not ideal in terms of integration – you don’t get the real-time status of your request, and other aspects are trickier to get right – versioning, high availability, authentication, security, traceability (audit trail).
  2. Shared database – two applications sharing the same database may sound like a recipe for disaster, but it’s not uncommon to see it in the wild. If you are lucky, one application will be read-only. But breaking changes to the database structure and security concerns are major issues. You can only use this type of integration if you expose your database directly to another application, which you normally don’t want to do.
  3. Full daily dump – instead of sharing an active database, some organizations do a full dump of their data every day or week and provide it to the other party for import. Obvious data privacy issues exist with that, as it’s a bad idea to have full dumps of your data flying around (in some cases on DVDs or portable HDDs), in addition to everything mentioned above – versioning, authentication, etc.
  4. Scraping – when an app has no API, it’s still possible to extract data from it or push data to it – via the user interface. With web applications that’s easier, as they “speak” HTML and HTTP. With desktop apps, screen scraping has emerged as an option. The so-called RPA (Robotic Process Automation) software relies on all types of scraping to integrate legacy systems. It’s very fragile and requires complicated (and sometimes expensive) tooling to get right. Not to mention the security aspect, which requires storing credentials in non-hashed form somewhere in order to let the scraper log in.
  5. Email – when the sending or receiving system doesn’t support other forms of integration, email comes as a last resort. If you can trigger something by checking a mailbox or if an email is produced after some event happens in the sending system, this may be all you need to integrate. Obviously, email is a very bad means of integration – it’s unstructured, it can fail for many reasons, and it’s just not meant for software integration. You can attach structured data, if you want to get extra inventive, but if you can get both ends to support the same format, you can probably get them extended to support proper APIs.
  6. Adapters – you can develop a custom module that has access to the underlying database, but exposes a proper API. That’s an almost acceptable solution, as you can have a properly written (sort-of) microservice independent of the original application, and other systems won’t know they are integrating with a legacy piece of software (a minimal sketch follows the list). It’s tricky to get right in some cases, however, as you have to understand the state space of the database well. Read-only is easy; writing is much harder or next to impossible.
  7. Paper – no, I’m not making this up. There are cases where one organization prints some data and then the other organization (or department) receives the paper documents (by mail or otherwise) and enters them into its system. Expensive projects exist out there that aim to remove the paper component and introduce actual integration, as paper-based input is error-prone and slow. The few legitimate scenarios for a paper-based step are when you need extra security, and the paper trail, combined with the fact that paper is effectively air-gapped, may give you that. But even then it shouldn’t be the only transport layer.
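
As an illustration of pattern 6, here is a minimal sketch of a read-only adapter using Spring and JdbcTemplate. The table, columns and endpoint are hypothetical – the point is that the adapter owns the knowledge of the legacy schema and exposes a clean API on top of it:

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

import java.util.Map;

@RestController
public class LegacyCustomerAdapterController {

    private final JdbcTemplate jdbcTemplate;

    public LegacyCustomerAdapterController(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // read-only view over the legacy schema; writing would require understanding
    // (and not corrupting) the state space maintained by the legacy application
    @GetMapping("/api/customers/{id}")
    public Map<String, Object> getCustomer(@PathVariable long id) {
        return jdbcTemplate.queryForMap(
                "SELECT id, name, email FROM legacy_customers WHERE id = ?", id);
    }
}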

If you need to do any of the above, it’s usually because at least one of the systems is stuck and can’t be upgraded. It’s either too legacy to touch, or the vendor is gone, or adding an API is “not on their roadmap” and would be too expensive.

If you are building a system, always provide an API. Some other system will have to integrate with it, sooner or later. It’s not sustainable to build closed systems and delay the integration question until it’s needed. Assume it’s always needed.

Fancy ESBs may be able to patch things quickly with one of the approaches above and integrate the “unintegratable”, but heavy reliance on an ESB is an indicator of too many legacy or low-quality systems.

But simply having an API doesn’t cut it either. If you don’t support versioning and backward-compatible APIs, you’ll be in an even more fragile state, as you’ll be breaking existing integrations as you progress.

Enterprise integration is tricky. But, as with many things in software, it’s best handled in the applications that we build. If we build them right, things are much easier. Otherwise, organizations have to resort to the legacy approaches mentioned above and introduce complexity, fragility, security and privacy risks, and a general feeling of low quality that has to be supported by increasingly unhappy people.

Using Multiple Dynamic Caches With Spring

Post Syndicated from Bozho original https://techblog.bozho.net/using-multiple-dynamic-caches-with-spring/

In a third post about cache managers in Spring (spread over a long period of time), I’d like to expand on the previous two by showing how to configure multiple cache managers that dynamically create caches.

Spring has CompositeCacheManager which, in theory, should allow using more than one cache manager. It works by asking the underlying cache managers whether they have a cache with the requested name or not. The problem arises when you need dynamically-created caches, based on some global configuration. And that’s the common scenario, when you don’t want to manually define caches, but instead want to just add @Cacheable and have Spring (and the underlying cache manager) create the cache for you with some reasonable defaults.

That’s great until you need to have more than one cache manager. For example – one for a local cache and one for a distributed cache. In many cases a distributed cache is needed; however, not all method calls need to be distributed – some can be local to the instance that handles them, and you don’t want to burden your distributed cache with stuff that can be kept locally. Can you configure a distributed cache provider to designate some caches as local, even though they are handled by the distributed cache provider? Maybe, but I don’t guarantee it will be trivial.

So, faced with that issue, I had to devise a simple mechanism for designating some caches as “distributed” and some as “local”. Using CompositeCacheManager alone would not do it, so I extended the distributed cache manager (in this case, Hazelcast, but it can be done with any provider):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.spring.cache.HazelcastCacheManager;
import org.springframework.cache.Cache;

/**
 * Hazelcast cache manager that handles only cache names with a specified prefix for distributed caches
 */
public class OptionalHazelcastCacheManager extends HazelcastCacheManager {

    private static final String DISTRIBUTED_CACHE_PREFIX = "d:";

    public OptionalHazelcastCacheManager(HazelcastInstance hazelcast) {
        super(hazelcast);
    }

    @Override
    public Cache getCache(String name) {
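        // for cache names without the prefix, return null so that
        // CompositeCacheManager falls through to the next cache manager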
        if (name == null || !name.startsWith(DISTRIBUTED_CACHE_PREFIX)) {
            return null;
        }

        return super.getCache(name);
    }
}

And the corresponding composite cache manager configuration:

    <bean id="cacheManager" class="org.springframework.cache.support.CompositeCacheManager">
        <property name="cacheManagers">
            <list>
                <bean id="hazelcastCacheManager" class="com.yourcompany.util.cache.OptionalHazelcastCacheManager">
                    <constructor-arg ref="hazelcast" />
                </bean>

                <bean id="caffeineCacheManager" class="com.yourcompany.util.cache.FlexibleCaffeineCacheManager">
                    <property name="cacheSpecification" value="expireAfterWrite=10m"/>
                    <property name="cacheSpecs">
                        <map>
                            <entry key="statistics" value="expireAfterWrite=1h"/>
                        </map>
                    </property>
                </bean>
            </list>
        </property>
    </bean>

That basically means that any cache with a name starting with d: (for “distributed”) should be handled by the distributed cache manager; otherwise, it proceeds to the next cache manager (Caffeine in this case). So when you want to define a method with a cacheable result, you have to decide whether it’s @Cacheable("d:cachename") or just @Cacheable("cachename").
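
For illustration, a hypothetical service could then mix the two like this (the class and method names are made up; the “statistics” cache name matches the cacheSpecs entry above):

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class ReportService {

    // no prefix - handled by the local Caffeine cache manager
    // (the "statistics" cache gets the 1-hour expiry from cacheSpecs above)
    @Cacheable("statistics")
    public String dailyStatistics(String day) {
        return expensiveLocalComputation(day);
    }

    // "d:" prefix - handled by the distributed Hazelcast cache manager
    @Cacheable("d:userProfiles")
    public String userProfile(String userId) {
        return expensiveRemoteLookup(userId);
    }

    private String expensiveLocalComputation(String day) {
        return "stats-for-" + day;
    }

    private String expensiveRemoteLookup(String userId) {
        return "profile-of-" + userId;
    }
}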

That’s probably one of many ways to approach that issue, but I like it for its simplicity. Caching is hard (distributed caching even more so), and while we are lucky to have Spring abstract most of that, we sometimes have to handle special cases ourselves.

Six Things I Learned from Mr. Robot

Post Syndicated from Bozho original https://techblog.bozho.net/six-things-i-learned-from-mr-robot/

Mr. Robot is an amazingly accurate series about a hacker (Elliot Alderson) and his, uhm, undertakings. Many cybersecurity experts consult on the series, so every hack that happens is actually properly executed, using real tools and commands that do exactly what an infosec expert would expect. Nothing shown on screen is the usual bullshit TV hacking. And that is awesome and scary in a number of ways.

Obviously, this post is full of spoilers, so SPOILER ALERT, if you haven’t watched it, go and do that and then continue reading. Another disclaimer I have to make is that you should use the tools and techniques below only for ethical hacking, penetration testing and other legal activities.

The series has many hacks happening, from connecting to a neighbor’s WiFi, guessing people’s social media passwords and installing malware via flash drives, to complex social engineering plots, an HSM hack, the use of steganography and other advanced stuff. There’s even a tool pack with everything that’s used in the series. I’m fairly well-versed in the cybersecurity domain so even though I haven’t performed any hacks, most of the things happening on screen were familiar. But I did learn some interesting things from the series, which I’d like to share, as they sparked my interest and made me read a bit more. Many hacking scenes require you to pause and try to read what’s on screen, but that makes it even more fun.

  • proxychains – many tools are in use in the series, and Kali Linux is often used. But I wasn’t familiar with one particular tool that I think is crucial. Proxychains. I know, I’m a n00b to have never heard about it. What it does is basically running every command via a SOCKS or HTTP proxy. By default it uses Tor, but as Elliot knows that the NSA might be watching the Tor exit nodes, he switches to a regular proxy to execute the main attack in Season 1. proxychains curl https://google.com would open Google via the pre-configured proxy. This is very important to avoid being caught, and makes attribution and forensics much harder. To be fair, I didn’t see exactly where Elliot uses proxychains, but several people claim he uses it. Even if it’s not on the show, I learned about it after reading about the hacks, so it counts.
  • shred -z – I know and I’ve used the shred command. However, I didn’t know it has a “-z” argument which “add a final overwrite with zeros to hide shredding”. I guess there are more of these minor tool options here and there, but that’s an important point about covering one’s tracks. If you don’t do the zeroing, eventually, with proper forensics, they might discover that something was there (and was shredded).
  • How encryption ransomware works – ransomware that encrypts all your files and then asks for bitcoin in order to decrypt them is something fairly well known. However, I didn’t know exactly how these ransomware programs work – e.g. what they use for encryption. It appears that Darlene may have modified CryptoLocker in order to encrypt E Corp’s data. I was intrigued by that because at the end of season 3 Elliot finds a key that can undo the hack that rendered E Corp’s data unusable, and he sends an RSA private key (via ProtonMail). Why would they encrypt something with RSA if they explicitly said they used AES? Well, it turns out that CryptoLocker (and others) work by generating one AES key per file, encrypting the file, then encrypting the AES key with the RSA public key and storing the encrypted AES key alongside the file (a sketch of that scheme follows the list). That’s a sensible approach because the private key never leaves the attacker and can’t be intercepted. But using asymmetric encryption to encrypt large volumes is not a good idea (it’s slow). So whenever asymmetric encryption is used, it’s always used to encrypt a symmetric key that is then used to encrypt/decrypt the actual data. And the fact that they reused existing ransomware rather than creating something from scratch allowed Mr. Robot to keep the RSA private key (that can be used to decrypt everything) rather than throwing it away.
  • UPS devices explode – I guess I’m just not good with hardware. I used to own a UPS, and I know it’s a battery and batteries sometimes malfunction (hello, Samsung), but the fact that the firmware controls important physical parameters that could in theory lead to the UPS exploding was not something I had thought about (shame on me, especially given that for one semester I’ve been taking chemistry classes). The answers in this thread are interesting – basically it’s absolutely expected for a UPS to explode (just not all of them at the same time). In the end, it’s very unlikely that the particular hack would succeed, but exploding UPSs are a reality. And that is an important point – you can’t trust your hardware either.
  • IMSI catchers – while I was aware that mobile communication standards (SS7) are insecure, I didn’t know exactly how. I’ve worked on telecom projects, but more on the business logic/billing side, so I was unaware of devices available to be used for MITM attacks. IMSI catchers allow intercepting one’s mobile phone communication based on the fact that mobile phones try to optimize their coverage, combined with the lack of cell authentication. The bottom line here is that you can’t rely on plain calls or SMS to be secure. So don’t use SMS for 2FA (shown as weak in season 4). Basically, use Signal.
  • HSM “cloning” – this is a complicated scene that shows how Angela copies keys from an HSM. It’s explained in detail here and here. The premise of the hack was that Elliot had upgraded the UPS firmware to only accept updates that are signed by E Corp’s private key. That means that you can’t compromise them unless you own the keys; however, the keys are stored on a hardware security module that doesn’t allow them to be copied. I knew what an HSM is and what it does; what I didn’t know is the procedure for performing an HSM backup. It turns out they used an actual SafeNet HSM to back up the original one, which then allowed them to sign whatever they wanted on behalf of E Corp. The private keys never left the HSMs, of course (which is the whole point of HSMs), but by backing it up on one that’s controlled by the adversary, they effectively “leaked” the keys. The HSM backup procedure is complex and requires USB keys to unlock certain stages of the process. Normally these keys should be stored in safes (and not lying around, as depicted), but not strictly following procedures is not uncommon in organizations.
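
To illustrate the hybrid encryption scheme from the ransomware point above, here is a minimal sketch in Java – a fresh AES key per file, with only that (small) AES key encrypted using the attacker’s RSA public key. All data, key sizes and cipher modes here are illustrative:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class HybridEncryptionSketch {
    public static void main(String[] args) throws Exception {
        byte[] fileContents = "some file contents".getBytes(StandardCharsets.UTF_8);

        // a fresh symmetric AES key is generated per file (fast for large data)
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey fileKey = keyGen.generateKey();

        Cipher aes = Cipher.getInstance("AES");
        aes.init(Cipher.ENCRYPT_MODE, fileKey);
        byte[] encryptedFile = aes.doFinal(fileContents);

        // only the small AES key is encrypted with the RSA public key and stored
        // alongside the file; the RSA private key never leaves the attacker
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair rsaKeys = kpg.generateKeyPair();
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, rsaKeys.getPublic());
        byte[] encryptedFileKey = rsa.doFinal(fileKey.getEncoded());

        // decryption is only possible with the RSA private key
        rsa.init(Cipher.DECRYPT_MODE, rsaKeys.getPrivate());
        SecretKey recoveredKey = new SecretKeySpec(rsa.doFinal(encryptedFileKey), "AES");
        aes.init(Cipher.DECRYPT_MODE, recoveredKey);
        System.out.println(new String(aes.doFinal(encryptedFile), StandardCharsets.UTF_8));
    }
}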

The scary thing about the series is that although it’s fiction, everything is realistic. Every hack depicted can happen in real life, and probably has happened. The social engineering parts may be a bit optimistic, but they are entirely plausible. The “wildest” hacks require some suspension of disbelief, but even they are at least theoretically possible. The feeling that everything can be hacked and exploited is, unfortunately, realistic. You get that same feeling after watching a few DEF CON or CCC talks. And that’s why cybersecurity is important. Because things that are not hacked yet are not safe, they are just not interesting enough.

AWS Alarms for Application Errors

Post Syndicated from Bozho original https://techblog.bozho.net/aws-alarms-for-application-errors/

Monitoring is key for any real-world application. You have to know what’s happening and be alerted in real time if something goes wrong. AWS has CloudWatch for that, and it gives you a lot of metrics automatically. But there are some that you have to define yourself. And then you need to define proper alarms.

Here I’ll focus on four:

  • High number of application errors
  • High number of application warnings
  • High number of 5xx errors on the load balancer
  • High number of 4xx errors on the load balancer

First, the prerequisites:

  • You need to be using CloudFormation to automate everything. You can create all of those things manually, but automation is a big plus
  • If using CloudFormation, you’d preferably have a sub-stack for configuring alarms
  • You need to be collecting your logs with CloudWatch logs

If you are not using CloudWatch logs, here’s a simple config file and script to enable them:

{
  "agent": {
    "metrics_collection_interval": 10,
    "region": "eu-west-1",
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "{{logPath}}",
            "log_group_name": "{{logGroupName}}",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%Y-%m-%d %H:%M:%S"
          }
        ]
      }
    }
  }
}
# install AWS CloudWatch monitor
mkdir cloud-watch-agent
cd cloud-watch-agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
./install.sh

aws s3 cp s3://$BUCKET_NAME/cloudwatch-agent-config.json /var/config/cloudwatch-agent-config.json
sed -i -- 's|{{logPath}}|/var/log/application.log|g' /var/config/cloudwatch-agent-config.json
sed -i -- 's|{{logGroupName}}|app_node|g' /var/config/cloudwatch-agent-config.json

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/var/config/cloudwatch-agent-config.json -s

Now you have to define two things: log metric filters and alarms. The CloudFormation code below creates both:

    "HighAppErrorsNotification": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "InsufficientDataActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "AlarmDescription": "Notify if there are too many application errors",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": "1",
        "MetricName": "ApplicationErrors",
        "Namespace": "LogMetrics",
        "Period": "900",
        "Statistic": "Sum",
        "Threshold": "5",
        "TreatMissingData": "ignore"
      }
    },
    "ErrorMetricFilter": {
      "Type": "AWS::Logs::MetricFilter",
      "Properties": {
        "LogGroupName": "app_node",
        "FilterPattern": "ERROR",
        "MetricTransformations": [
          {
            "DefaultValue": 0,
            "MetricValue": "1",
            "MetricNamespace": "LogMetrics",
            "MetricName": "ApplicationErrors"
          }
        ]
      }
    },

If you need to do that manually, go to the CloudWatch Logs homepage, select the log group (app_node) and use the “Create metric filter” button on top. It lets you specify the pattern to look for (“ERROR” in this case). When you have that ready, you can create an alarm based on it, through Alarms -> Create alarm. Look up the metric by name and select it to trigger the alarm (in the example above, it gets triggered if there are 5 or more errors within 900 seconds).

You can then create an identical alarm for warnings (pattern to look for: “WARN”). The threshold there might be higher, e.g. 10 or 20. But that depends on your application logging patterns.
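
For completeness, here is a sketch of the corresponding metric filter (mirroring ErrorMetricFilter above); the alarm itself is identical to HighAppErrorsNotification, with MetricName set to ApplicationWarnings and a higher Threshold (e.g. 10):

    "WarningMetricFilter": {
      "Type": "AWS::Logs::MetricFilter",
      "Properties": {
        "LogGroupName": "app_node",
        "FilterPattern": "WARN",
        "MetricTransformations": [
          {
            "DefaultValue": 0,
            "MetricValue": "1",
            "MetricNamespace": "LogMetrics",
            "MetricName": "ApplicationWarnings"
          }
        ]
      }
    },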

Then there are the 5xx load balancer error alarms. In CloudFormation they would look like this:

   "TooMany5xxErrorsWebAppAlarmNotification": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "InsufficientDataActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "AlarmDescription": "Notify if there are too many 5xx errors",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Dimensions": [
          {
            "Name": "LoadBalancer",
            "Value": {
              "Ref": "WebAppALBId"
            }
          }
        ],
        "TreatMissingData": "notBreaching",
        "EvaluationPeriods": "1",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Namespace": "AWS/ApplicationELB",
        "Period": "60",
        "Statistic": "Sum",
        "Threshold": "2"
      }
    }

You can again create that manually – look for the HTTPCode_Target_5XX_Count metric in the metric selection screen for the alarm. You have several options there; the most straightforward is to select the per-AppELB metric. And again, the same approach can be used for 4xx errors (HTTPCode_Target_4XX_Count).

Getting this running with CloudFormation (and even manually) is not as straightforward as it seems. The right combination of metric names, namespaces and values is not obvious and the relevant documentation is not the first thing that pops up. So I decided to share something that works, as it may take some time experimenting before getting it to that state.

But even outside of CloudFormation or AWS context, monitoring and alerting in case of a high number of application errors, warnings and HTTP errors is a must. And automating the creation of those alarms is the recommended approach.

Automating the Change of WordPress Permalinks

Post Syndicated from Bozho original https://techblog.bozho.net/automating-the-change-of-wordpress-permalinks/

WordPress is powering 35% of websites. And while it may not be seen as very complex or interesting, it is one of the most prevalent technologies of our time. And many developers, even if they are not working with PHP, have to support some WordPress installation (e.g. a blog like this one). And unfortunately, there are still basic things that you can’t easily do. Plugins do help, but they don’t always have the proper functionality.

Today I had to change permalinks on a website. From https://example.com/post-name to https://example.com/blog/post-name. And WordPress allows that, except there’s a problem – when you change it, the old links stop working (404). Not only does that ruin your SEO, any previous share of your post will no longer work. I’ve pointed that out a long time ago when Spring broke their documentation links. Bottom line: you don’t want that to happen.

Instead, you want to do a 301 redirect (and not a 302, which appears to also break your SEO). But all tutorials that are easily found online assume you can manually configure redirection through some plugin. And if you have 100-200 posts, that’s a lot of tedious work. There are plugins that allegedly monitor your posts for changes and create the redirects automatically. That may work if you manually edit a post, but it didn’t work for me when changing the permalink format settings.

So how to do it in an automated way and without disrupting your website? You’d need a little bit more than a plugin – namely SQL and Regex. Here are the steps:

  1. Install and activate the Redirection plugin
  2. Open your blog’s hosting admin page and navigate to phpMyAdmin to browse your MySQL database. That’s the typical setup. If you are using another database admin tool, you’ll know what to do
  3. Run a query to fetch the post name of all published posts: SELECT post_name FROM `wpgk_posts` where post_type = "post" and post_status = "publish" LIMIT 200 (if you have more than 200 posts, increase the limit; if you don’t put a limit, phpMyAdmin puts a default one)
  4. Expand the retrieved text (with the big T), select all records with the checkbox below and click “Copy to clipboard”
  5. Paste in Notepad++ (or any regex-supporting text editor; or, if you are running Linux, just create a file and you’ll be able to do the regex replaces with sed – a shell sketch follows the list)
  6. Replace all tabs that you may have copied (\t)
  7. Then replace (.+) with /$1/,/blog/$1/ (if you are redirecting /postname to /blog/postname, otherwise you have to write the appropriate regex)
  8. Change the permalink format settings
  9. Import the resulting CSV file in the Redirection plugin. It interprets it as: source,target. That way you have redirected all your existing posts to their new permalink (the redirects are created in the chosen group). Note the trailing slash. Just in case, you may repeat the same operation to produce another CSV without the trailing slashes, but the canonical WordPress URL is with the trailing slash.
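
If you prefer the command line over Notepad++, steps 5-7 can be collapsed into a single pipeline – a sketch, assuming the copied post names are saved in post_names.txt:

# strip any copied tabs and build the source,target CSV for the Redirection plugin
tr -d '\t' < post_names.txt | sed -E 's|(.+)|/\1/,/blog/\1/|' > redirects.csv
# each line now looks like: /post-name/,/blog/post-name/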

It feels a bit awkward that such an obvious thing, which probably happens often, would need such a non-trivial set of steps to automate. But software, no matter how user-friendly and “no code”, still requires knowledge of basic things like SQL and regex. And maybe that’s good.

An AWS Elasticsearch Post-Mortem

Post Syndicated from Bozho original https://techblog.bozho.net/aws-elasticsearch-post-mortem/

So it happened that we had a production issue on the SaaS version of LogSentinel – our Elasticsearch stopped indexing new data. There was no data loss, as Elasticsearch is just secondary storage, but it caused some issues for our customers (they could not see the real-time data on their dashboards). Below is a post-mortem analysis – what happened, why it happened, how we handled it and how we can prevent it.

Let me start with a background of how the system operates – we accept audit trail entries (logs) through a RESTful API (or syslog), and push them to a Kafka topic. Then the Kafka topic is consumed to store the data in the primary storage (Cassandra) and index it for better visualization and analysis in Elasticsearch. The managed AWS Elasticsearch service was chosen because it saves you all the overhead of cluster management, and as a startup we want to minimize our infrastructure management efforts. That’s a blessing and a curse, as we’ll see below.

We have alerting enabled on many elements, including the Elasticsearch storage space and the number of application errors in the log files. This allows us to respond quickly to issues. So the “high number of application errors” alarm triggered. Indexing was blocked due to FORBIDDEN/8/index write. We have a call in our system that re-enables it, so I tried to run it, but after less than a minute indexing was blocked again. This meant that our Kafka consumers failed to process the messages, which is fine, as we have a sufficient message retention period in Kafka, so no data can be lost.

I investigated the possible reasons for such a block. And there are two, according to Amazon – increased JVM memory pressure and low disk space. I checked the metrics and everything looked okay – JVM memory pressure was barely reaching 70% (and 75% is the threshold), and there was more than 200 GiB of free storage. There was only one WARN in the Elasticsearch application logs (it was “node failure”, but after that there were no issues reported).

There was another strange aspect of the issue – there were twice as many nodes as configured. This usually happens during upgrades, as AWS uses blue/green deployment for Elasticsearch, but we hadn’t done any upgrades recently. These additional nodes usually go away after a short period of time (after the redeployment/upgrade is ready), but they wouldn’t go away in this case.

Being unable to SSH into the actual machine, being unable to unblock the indexing through Elasticsearch means, and being unable to shut down or restart the nodes, I raised a ticket with support. And after a few hours and a few exchanged messages, the problem was clear and resolved.

The main reason for the issue is twofold. First, we had a configuration that didn’t reflect the cluster status – we had assumed a few more nodes, and our shard and replica configuration meant we had unassigned replicas (more on shards and replicas here and here). The best practice is to have nodes > number of replicas, so that each node gets one replica (plus the main shard). Having unassigned shard replicas is not bad per se, and there are legitimate cases for it. Ours can probably be seen as a misconfiguration, but not one with immediate negative effects. We chose those settings in part because it’s not possible to change some settings in AWS after a cluster is created. And opening and closing indexes is not supported.

The second issue is the AWS Elasticsearch logic for calculating free storage in the circuit breaker that blocks indexing. So even though there was 200+ GiB of free space on each of the existing nodes, AWS Elasticsearch thought we were out of space and blocked indexing. There was no way for us to see that, as we only see the available storage, not what AWS thinks is available. The calculation gets the total number of shards+replicas and multiplies it by the per-shard storage. Which means unassigned replicas that do not take actual space are calculated as if they take up space. That logic is counterintuitive (if not plain wrong), and there is hardly a way to predict it.

This logic appears to be triggered when a blue/green deployment occurs – so in normal operation the actual remaining storage space is checked, but during upgrades, the shard-based check is triggered. That blocked the entire cluster. But what triggered the blue/green deployment process?

We occasionally need access to Kibana, and because of our strict security rules it is not accessible to anyone by default. So we temporarily change the access policy to allow access from our office IP(s). This change is not expected to trigger a new deployment, and had never led to one. AWS documentation, however, states:

In most cases, the following operations do not cause blue/green deployments: Changing access policy, Changing the automated snapshot hour, If your domain has dedicated master nodes, changing data instance count.
There are some exceptions. For example, if you haven’t reconfigured your domain since the launch of three Availability Zone support, Amazon ES might perform a one-time blue/green deployment to redistribute your dedicated master nodes across Availability Zones.

There are other exceptions, apparently, and one of them happened to us. That led to the blue/green deployment, which in turn, because of our flawed configuration, triggered the index block based on the odd logic that treats unassigned replicas as taking up storage space.

How we fixed it – we recreated the index with fewer replicas and started a reindex (it takes data from the primary source and indexes it in batches). That reduced the size taken, and AWS manually intervened to “unstick” the blue/green deployment. Once the problem was known, the fix was easy (and we had to recreate the index anyway due to other index configuration changes). It’s appropriate to (once again) say how good AWS support is, in both fixing the issue and communicating it.

As I said in the beginning, this did not mean there was data loss, because we have Kafka keep the messages for a sufficient amount of time. However, once the index was writable, we expected the consumer to continue from the last successful message – we had specifically written transactional behaviour that committed the offsets only after successfully storing the data in the primary storage and successfully indexing it. Unfortunately, the Kafka client we are using had auto-commit turned on, which we had overlooked. So the consumer skipped past the failed messages. They are still in Kafka and we are processing them with a separate tool, but that showed us that our assumption was wrong and that the fact that the code calls “commit” doesn’t actually mean anything.
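
For reference, here is a minimal sketch (plain Kafka consumer API, with hypothetical topic and group names) of the behaviour we had assumed – auto-commit disabled and offsets committed only after successful processing:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "audit-log-indexer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // the crucial bit: without this, offsets are committed in the background
        // regardless of whether processing succeeded
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("audit-log"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    storeAndIndex(record.value()); // primary storage + Elasticsearch
                }
                // commit only after every record in the batch has been processed
                consumer.commitSync();
            }
        }
    }

    private static void storeAndIndex(String entry) {
        // placeholder for the actual Cassandra write and Elasticsearch indexing
    }
}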

So, the morals of the story:

  • Monitor everything. Bad things happen, it’s good to learn about them quickly.
  • Check your production configuration and make sure it’s adequate to the current needs. Be it replicas, JVM sizes, disk space, number of retries, auto-scaling rules, etc.
  • Be careful with managed cloud services. They save a lot of effort but also take control away from you. And they may have issues for which your only choice is contacting support.
  • If providing managed services, make sure you show enough information about potential edge cases. An error console, an activity console, or something, that would allow the customer to know what is happening.
  • Validate your assumptions about default settings of your libraries. (Ideally, libraries should warn you if you are doing something not expected in the current state of configuration)
  • Make sure your application is fault-tolerant, i.e. that failure in one component doesn’t stop the world and doesn’t lead to data loss.

To sum it up, a rare event unexpectedly triggered a blue/green deployment, where a combination of flawed configuration and flawed free space calculation resulted in an unwritable cluster. Fortunately, no data was lost, and at least I learned something.

Running a Safe Database Cluster in AWS With Auto-Scaling Groups

Post Syndicated from Bozho original https://techblog.bozho.net/running-a-safe-database-cluster-in-aws-with-auto-scaling-groups/

When you have to run a scalable application on AWS, your database must also be scalable. It’s easier to scale the stateless application layer, where each node is mostly disposable – even if a node in a 3-node cluster fails, you can just fire up another one and nobody notices.

The database layer is stateful and therefore there’s a risk of losing data. Having just a single node is not an option, as a node can always go down and that would mean downtime. So you need multiple nodes in a cluster to make sure your application is highly available and fault tolerant (I won’t go into the differences in terminology).

What database am I talking about? It doesn’t matter. It can be a SQL or a NoSQL database – each has some form of clustering available, whether it’s active-active or active-passive.

Now, for AWS in particular, you can choose RDS (or another managed option), which will handle it for you. But if there’s no managed option (e.g. Cassandra), or you don’t feel the managed option is giving you enough control, or it is more expensive, or the version you require is not available, you have to manage the database layer yourself. I won’t go into the details of how to configure the database-specific clustering – you should check the documentation of the particular database for that. I’ll try to give some tips on how to safely run the infrastructure that supports the database cluster.

And here come auto-scaling groups. They allow you to have a group of identical nodes (based on a launch configuration), and the ASG makes sure you always have at least X healthy nodes by starting new nodes if existing ones fail (it can also automatically kill unhealthy nodes – that is, nodes that do not respond to automated healthchecks).

That’s awesome for application nodes, but it can be an issue with database nodes. You don’t necessarily want your database node killed if it gets unresponsive for a while. That’s why below I’ve compiled a list of tips to avoid pitfalls. Unfortunately, many of them are not available through CloudFormation, so you have to do them manually. And document them so that you don’t forget in case you need to recreate the stack:

  • Set the minimum number of nodes to 1. It protects from accidentally setting the “Desired” count to 0 as part of experimenting with other, unrelated ASGs
  • Make sure you have enabled termination protection for each instance as well as scale-in termination protection per ASG.
  • In the ASG settings there’s “Suspended Processes”. Make sure you suspend “Terminate” and “ReplaceUnhealthy” (a CLI sketch of these settings follows the list).
  • Make sure that in your LaunchConfiguration the EBS volume is not deleted in case of termination. Why do you need that, given that you have disabled all termination options? Well, termination can occasionally happen due to issues with the underlying host, or a node may be scheduled for decommission
  • If at some point you need to restore from an EBS volume: 1. let the ASG spawn a new node, 2. temporarily add “Launch” to the suspended actions, 3. detach the root volume of the node, 4. attach the old EBS volume to /dev/xvda, 5. start the node.
  • Set up a lifecycle policy (through CloudFormation or manually) to back up the database EBS volumes. Make sure you set the proper tags on the volumes (and this can only be done manually)
  • Make sure the ASG can spawn instances in multiple availability zones (in case one goes down)
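
Some of the protection settings above can also be applied from the AWS CLI – a hedged sketch, with hypothetical ASG and instance identifiers:

# suspend the ASG processes that could kill a database node
aws autoscaling suspend-processes --auto-scaling-group-name db-cluster-asg \
  --scaling-processes Terminate ReplaceUnhealthy

# protect a specific instance from scale-in
aws autoscaling set-instance-protection --auto-scaling-group-name db-cluster-asg \
  --instance-ids i-0123456789abcdef0 --protected-from-scale-in

# enable EC2 termination protection for the same instance
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
  --disable-api-termination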

If you follow that, your auto-scaling groups will not behave exactly as auto-scaling groups. You can still configure automatically increasing the number of nodes in case of increased load, but the rest of the features are rarely a good idea for database layers – you’d prefer to resolve your database issues on existing machines, even if stopped temporarily, rather than just spawn new ones.

But you should embrace failure. Even with all termination protections, you have to assume everything may fail and die and you should have a clear path to restoring your nodes.

Where Is This Coming From?

Post Syndicated from Bozho original https://techblog.bozho.net/where-is-this-coming-from/

In enterprise software the number one question you have to answer as a developer almost every day is “Where is this coming from?”. When trying to fix bugs, when developing new features, when refactoring. You have to be able to trace the code flow and to figure out where a certain value is coming from.

And the bigger the codebase is, the more complicated it is to figure out where something (some value or combination of values) is coming from. In theory it’s either from the user interface or from the database, but we all know it’s always more complicated. I learned that very early in my career when I had to navigate a huge telecom codebase in order to implement features and fix bugs that were a few dozen lines of code in total.

Answering the question means navigating the code easily, debugging and tracing changes to the values passed around. And while that seems obvious, it isn’t so obvious in the grand scheme of things.

Frameworks, architectures, languages, coding styles and IDEs that obscure the answer to the question “where is this coming from?” make things much worse – for the individual developer and for the project in general. Let me give a few examples.

Scala, about which I have mixed feelings, gives you a lot of cool features. And some awful ones, like implicits. An implicit is something like a global variable, except there are nested implicit scopes. When you need one of those global variables, you just add the “implicit” keyword and you get the value from the innermost scope available that matches the type of the parameter you want to set. And in larger projects it’s not trivial to chase where that implicit value has been set. It can take hours of debugging to figure out why something has a particular value, only to find out that some unrelated part of the code has touched the relevant implicits. That makes it really hard to trace where stuff is coming from and is therefore bad for enterprise codebases, at least for me.

Another Scala feature is partially applied functions. You have a function foo(a, b, c) (that’s not the correct syntax, of course). You have one parameter known at some point, and the other two parameters known at a later point. So you can call the function partially and pass the resulting partially applied function to the next function, and so on until you have the other arguments available. So you can do bar(foo(a)), which means that in bar(..) you can call foo(b, c). Of course, at that point, the question “where did the value of a come from” is harder to answer. The feature is really cool if used properly (I’ve used it, and was proud of it), but it should be limited to smaller parts of the codebase. If you start tossing partially applied functions all over the place, it becomes a mess. And unfortunately, I’ve seen that as well.

Enough about Scala. The microservices architecture (which I also have mixed feelings about) also complicates the ability of a developer to trace what’s happening. If for a given request you invoke 3-4 external systems, which both return data and manipulate data, it becomes much harder to debug your application. Instead of putting a breakpoint or doing a call hierarchy, you have to track the parameters of each interaction with each microservice. It’s news to nobody that microservices are harder to debug, but I just wanted to put that in the context of answering the “where is this coming from” question.

Dynamic typing is another example. I’ve included that as part of my arguments for why I prefer static typing. Java IDEs have “Call hierarchy”, which is the single most useful IDE functionality for large enterprise software (for me even more important than the refactoring functionality). You really can trace every bit of possible code flow, not only in your codebase, but also in your dependencies, which often hide the important details (chances are, you’ll be putting breakpoints and inspecting 3rd party code rather often). Dynamic typing doesn’t give you the ability to do that properly. doSomething called on an unknown-at-compile-time type can be any method with that name. And tracing where stuff is coming from becomes much harder.

Code generation is something that I’ve always avoided. It takes input from text files (in whatever language they are) and generates code, turning the question “where is this coming from” to “why has this been generated that way”.

Message queues and async programming in general – message passing obscures the source and destination of a given piece of data; a message queue adds complexity to the communication between modules. With microservices you at least have API calls; with queues, you have multiple abstractions between the sender and recipient (exchanges, topics, queues). And that’s a general drawback of asynchronous programming – there’s something in between the program flow that does “async magic” and spits something out on the other end – but is it transformed, is it delayed, is it lost and retried, is it still waiting?

By all these examples I’m not saying you should not use message queues, code generation, dynamic languages, microservices or Scala (though for some I’d really advise against it). All of these things have their strengths, and they have been chosen exactly for those strengths. A message queue was probably chosen because you want to really decouple producer and consumer. Scala was chosen for its expressiveness. Microservices were chosen because a monolith had become really hard to manage with multiple teams and multiple languages.

But we should try to minimize the “damage” of not being able to easily trace the program flow and not being able to quickly answer “where is this coming from”. Impose a “no implicits” rule on your Scala codebase. Use code generation for simpler components (e.g. DTOs for protobuf). Use message queues with predictable message/queue/topic/exchange names and some slightly verbose debug logging. Make sure your microservices have corresponding SDKs with consistent naming and that they can be run locally without additional effort to ease debugging.

It is expected that the bigger and more complex a project is, the harder it will be to trace where stuff is going. But do try to make it as easy as possible, even if it costs a little extra effort in the design and coding phase. You’ll be designing and writing that feature for a week. And you (and others) will be supporting and expanding it for the next 10 years.

One-Month of Microsoft DKIM Failure and Thoughts on Technical Excellence

Post Syndicated from Bozho original https://techblog.bozho.net/one-month-of-microsoft-dkim-failure-and-thoughts-on-technical-excellence/

Last month I published an article about getting email settings right. I had recently configured DKIM on our company domain and was happy to share the experience. However, the next day I started receiving DMARC error reports saying that our Office365-backed emails were failing DKIM, which can mean they were going to spam.

That was quite surprising as I have followed every step in the Microsoft guide. Let me first share a bit about that as well, as it will contribute to my final point.

In order to enable DKIM on a normal email provider, you just have to get the DKIM selector value from some settings screen and copy it into your DNS admin dashboard (in a new ._domainkey record). Microsoft, instead, makes you do the following (an example of the resulting records follows the list):

  • Connect to Exchange PowerShell
  • Oh, you have 2FA enabled? Sorry, that doesn’t work – follow this tutorial instead
  • Oh, that fails when downloading their custom application? An obscure stackoverflow answer tells you that you should not download it with Firefox or Chrome – it fails to run then. You should download it with Internet Explorer (or Edge) and then it works. Just Microsoft.
  • Now that you are connected, follow the DKIM guide
  • In addition to PowerShell, you should also go to the Exchange admin in your Office365 admin center, and enable DKIM
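
For reference, the end result of all these steps is two CNAME records that look roughly like the following (the domain and tenant names here are hypothetical, following the format from the Microsoft guide):

selector1._domainkey.example.com  CNAME  selector1-example-com._domainkey.contoso.onmicrosoft.com
selector2._domainkey.example.com  CNAME  selector2-example-com._domainkey.contoso.onmicrosoft.com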

Yes, you can imagine you can waste some time here. But it finally works, and you set the CNAME records and assume that what they wrote holds – namely, that the TXT records to which the CNAME records point are generated on Microsoft’s end (for onmicrosoft.com), so anyone trying to validate a signature will pick up the right public key from there and everything will be perfect. Well, no. Here’s an excerpt from a test mail I sent to my Gmail in December:

DKIM: ‘FAIL’ with domain logsentinel.com
ARC-Authentication-Results: i=2; mx.google.com;
dkim=temperror (no key for signature) header.i=@logsentinel.com header.s=selector1

I immediately reported the issue in a way that I think is quite clear, although succinct:

Hello, I configured DKIM following your tutorial (https://docs.microsoft.com/en-us/microsoft-365/security/office-365-security/use-dkim-to-validate-outbound-email), however the TXT records at your end that are supposed to be generated are not generated – so me setting a CNAME records points to a missing TXT record. I tried disabling and reenabling, and rotating the keys, but the TXT record is still missing, which means my emails get in spam folders, because the DMARC policy epxects a DKIM record. Can you please fix that and create the required TXT records?

What followed was 35 days of long phone calls and repetitive emails with me trying to explain what the issue was and Microsoft support not getting it. As a side note, the Office365 support center doesn’t handle tickets with so much communication – it doesn’t have pagination, so I can’t see the original note online (only in my inbox).

Five minutes into the first of several 20-minute calls, my wife, who was around, already knew exactly what the issue was (she’s quite technical, yes), but somehow Microsoft support didn’t get it. We went through multiple requests for me to provide DMARC reports (and just the Gmail report wasn’t enough, they had to be from 2 others as well), screenshots of my DNS administration screen, screenshots to prove the record was missing (but mxtoolbox is some third-party tool they can’t trust), even me sharing my computer for a remote desktop session (no way, company policy doesn’t allow random semi-competent people to touch my machine). I ended up providing an nslookup (a Microsoft tool) screenshot to show that the record was missing.

In the meantime I tried key rotation again from the Exchange admin UI. That got stuck for a week, so I raised a separate ticket. While we were communicating on that ticket as well, the rotation finished, but the keys didn’t change. What changed (as I later figured out) was the active selector. We’ll get to that later.

After the initial few days I executed Get-DkimSigningConfig | fl to get the DKIM record values and just created TXT records with them (instead of CNAMEs), so that our mails wouldn’t end up in spam. Of course, having DKIM fail doesn’t automatically mean you’ll end up in spam, even though our DMARC policy says so, but it’s a risk one shouldn’t take. With the TXT records set everything was working okay, so I didn’t have an urgent problem anymore. That allowed me to continue slowly trying to resolve the issue with Microsoft support, for another month.

Two weeks ago they acknowledged something was missing on their end. Miracle! So the last few days of the horror were about me having to set the CNAME records back (and get rid of my TXT records) so that they could fix the missing records on their end. I know, that sounds ridiculous – you don’t need a CNAME pointing to a TXT record in order for that TXT record to appear, but who knows what obscure and shitty programs and procedures they have, so I complied. What followed was a request for more screenshots and more DKIM reports.

No. You have nslookup. No screenshot of mine is better than an nslookup. The TTL for my entries is 300s, so there should be no issue with caching (and screenshots won’t help with that either).

Then finally, after having spoken to or emailed probably 5-6 different people, there was an email that made sense:

Good Afternoon,

As per the public documentation you followed (https://docs.microsoft.com/en-us/microsoft-365/security/office-365-security/use-dkim-to-validate-outbound-email) it is mentioned to have TXT records created in Microsoft DNS server for both Selector CNAME Records. we agree with that.

Our product engineer team has confirmed that there is a design level change was rolled out and it is still not published in public documentation. To be precise, we don’t require TXT record for both CNAME Records published. The TXT record will be published only for the Active Selector.

LastChecked 2020-01-01 07:40:58Z
KeyCreationTime 2019-12-28 18:37:38Z
RotateOnDate 2020-01-01 18:37:38Z
SelectorBeforeRotateOnDate selector1
SelectorAfterRotateOnDate selector2

Above info confirms that the active selector is “Selector2” and we have the TXT record published. So we don’t have to wait for the TXT record for Selector1 in our scenario.

Also we require the Selector1 CNAME record in your DNS, whenever next rotate happen. Microsoft will publish the other TXT record. This shouldn’t affect your environment at any chance.

Alright. So, when I triggered key rotation, I fixed the result of the original bug that led to no records being present. The rotation made the active selector “selector2” and generated its TXT record. Selector1’s record is currently missing, but it’s inactive, so that is not an issue. What “active selector” means is an interesting question. It’s nowhere in the documentation, but there’s this article from 2016 that sheds some light. Microsoft has to swap the selectors on each rotation. According to the article, rotation happens every week, but that’s no longer the case – since my manual rotation there hasn’t been any automatic one.

So to summarize – there was an initial bug that happened who knows why, then my stuck-for-a-week manually triggered rotation fixed it, but because they haven’t documented the actual process, there was no way for me to know that the missing TXT record is fine and will (hopefully) be generated on the next rotation (which is my current pending question for them – when will that happen, so that I can check if it worked).

To highlight the issues here: first, a bug. Second, missing/outdated documentation. Third, incompetent (though friendly) support. Fourth, a convoluted procedure to activate a basic feature. And that’s just for keeping emails out of spam in an email product.

I can expand those issues to any Microsoft cloud offering. We’ve been doing some testing on Azure and it’s a mess. At first an account didn’t even work. No reason, just fails. I had to use another account. The UI is buggy. Provisioning resources takes ages with no estimate. Documentation is far from perfect. Tools are meh.

And yet, Microsoft is growing its cloud business. Allegedly, they are the top cloud provider in the world. They aren’t, obviously, but through bundling Office365 and Azure in the same reporting, they appear bigger than AWS. Azure is most likely smaller than AWS. But still, Satya Nadella has brought Microsoft to market success again. Because technical excellence doesn’t matter in enterprise sales. You have your sales reps, nice conferences, probably some kickbacks, (alleged) easy integration with legacy Windows stuff that matters for most enterprise customers. And for the decision maker that chooses Microsoft over Amazon or Google, the argument of occasional bugs, poor documentation or bad support is irrelevant.

But it is relevant for the end results. For the engineers and the products they build. Unfortunately, tech companies have prioritized alleged business value over technical excellence so much that they can no longer deliver the business value. If the outgoing email of an entire organization goes to spam because Microsoft has a bug in Office365 DKIM, where’s the business value of having a managed email provider? If the people-hours saved from going from on-prem to Azure are wasted because of poor technology, where’s the business value? Probably decision makers don’t factor that in, but they should.

I feel technical excellence has been in decline even in big tech companies, let alone smaller ones. And while that has led to projects going over budget, poor quality and poor security, the fact that the systems aren’t that critical has made it possible for the tech sector to get away with it. And even grow.

But let me tell you about a big player in a different industry that stopped caring about technical excellence. Boeing. This article tells the story of shifting Boeing from an engineering organization to one that cuts quality expenses and outsources core capabilities. That led to the death of people. Their planes crashed because of that shift in priorities. That led to a steep decline in sales, a moderate decline in stock and a bailout.

If you de-prioritize technical excellence, and you are in a critical sector, you lose. If you are an IT vendor, you can get away with it, for a while. Maybe forever? I hope not, otherwise everything in tech will be broken for a long time. And digital transformation will be a painful process around bugs and poor tools.

One-Time Passwords Do Not Provide Non-Repudiation

Post Syndicated from Bozho original https://techblog.bozho.net/one-time-passwords-do-not-provide-non-repudiation/

The title is obvious and it could’ve been a tweet rather than a blogpost. But let me expand.

OTP, or one-time password, used to be mainstream with online banking. You get a dedicated device that generates a 6-digit code which you enter into your online banking in order to login or confirm a transaction. The device (token) is airgapped (with no internet connection), and has a preinstalled secret key that cannot be leaked. This key is symmetric and is used to (depending on the algorithm) compute an HMAC over the current time step or a counter, truncate the result and turn it into 6 digits. The server owns the same secret key, does the same operation and compares the resulting 6 digits with the ones provided. If you want a more precise description of (T)OTP – check Wikipedia.
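To illustrate, here is a minimal TOTP sketch in Java, following the HMAC-and-truncate scheme described above (a simplified version of the standard algorithm; the hardcoded secret is just an example, and there is no base32 handling or clock-drift tolerance):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;

public class TotpSketch {
    // Derives a 6-digit code from a shared secret and the current 30-second time step
    public static String generate(byte[] sharedSecret, long unixTimeSeconds) throws Exception {
        long timeStep = unixTimeSeconds / 30;
        byte[] message = ByteBuffer.allocate(8).putLong(timeStep).array();

        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
        byte[] hash = mac.doFinal(message);

        // Dynamic truncation: take 4 bytes at an offset derived from the last nibble of the HMAC
        int offset = hash[hash.length - 1] & 0x0f;
        int binary = ((hash[offset] & 0x7f) << 24)
                | ((hash[offset + 1] & 0xff) << 16)
                | ((hash[offset + 2] & 0xff) << 8)
                | (hash[offset + 3] & 0xff);

        return String.format("%06d", binary % 1_000_000);
    }

    public static void main(String[] args) throws Exception {
        byte[] secret = "12345678901234567890".getBytes(); // example secret, normally provisioned on the token
        System.out.println(generate(secret, System.currentTimeMillis() / 1000));
    }
}

The important point is that both the token and the server can run exactly this code, because both hold the same secret – which is why, as discussed below, there is no non-repudiation.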

Non-repudiation is a property of, in this case, a cryptographic scheme, that means the author of a message cannot deny the authorship. In the banking scenario, this means you cannot dispute that it was indeed you who entered those 6 digits (because, allegedly, only you possess the secret key because only you possess the token).

Hardware tokens are going out of fashion and are being replaced by mobile apps that are much easier to use and still represent a physical device. There is one major difference, though – the phone is connected to the internet, which introduces additional security risks. But if we set those aside, it’s fine to have the secret key on a smartphone, especially if it supports secure per-app storage that is not accessible by 3rd party apps, or even a hardware security module with encryption support, which the latest phones do.

Software “tokens” (mobile apps) often rely on OTPs as well, even though they don’t display the code to the user. This makes little sense, as OTPs are short precisely because people need to enter them. If you send an OTP via a RESTful API, why not just send the whole ciphertext? Or better – do a digital signature on a server-provided challenge.

But that’s not the biggest problem with OTPs. They don’t give the bank (or whoever decides to use them as a means to confirm transactions) non-repudiation. Because OTPs rely on a shared secret. This means the bank, and any (curious) admin knows the shared secret. Especially if the shared secret is dynamically generated based on some transaction-related data, as I’ve seen in some cases. A bank data breach or a malicious insider can easily generate the same 6 digits as you, with your secure token (dedicated hardware or mobile).

Of course, there are exceptions. Apparently there’s ITU-T X.1156 – a non-repudiation framework for OTPs. It involves trusted 3rd parties, though, which is unlikely in a banking scenario. There are also HSMs (server-side hardware security modules) that can store secret keys without any option to leak them to the outside world and that support TOTP natively, so the device itself generates the next 6 digits. Someone with access to the HSM can still generate such a sequence, but hopefully the audit trail would indicate that this happened, and it is far less of an issue than just a database breach. Obviously (or, hopefully?) secrets are not stored in plaintext even outside of HSMs, but chances are they are wrapped with a master key in an HSM. Which means that someone can bulk-decrypt them and leak them.

And this “can” is very important. No matter how unlikely it is, due to internal security policies and whatnot, there is no cryptographic non-repudiation with OTPs. A customer can easily deny entering the code to confirm a transaction. Whether this will hold in court is a complicated legal issue – the processes in the bank can be audited to see if a breach is possible, whether any of the admins has a motive to abuse their privilege, whether there are logs of the user’s activities that can be linked to their device based on network logs from internet carriers, and so on. So you can achieve some limited level of non-repudiation through proper procedures. But it is not much different than authorizing the transaction with your password (at least the password is not a shared secret, hopefully).

But why bother with all of that if you have a mobile app with an internet connection? You can just create a digital signature on a unique per-transaction challenge and you get proper non-repudiation. The phone can generate a private key on enrollment and send its public key to the server, which can later be used to verify the signature. You don’t have the bandwidth limitation of the human brain that led to the requirement of 6 digits, so the signature length can be arbitrarily long (even GPRS can handle digital signatures).

Using the right tool for the job doesn’t have just technical implications, it also has legal and business implications. If you can achieve non-repudiation with standard public-key cryptography, if your customers have phones with secure hardware modules and you are doing online mobile-app-based authentication and authorization anyway, just drop the OTP.
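For comparison, here is a minimal sketch of the challenge-signing approach with standard Java crypto. The enrollment and transport details are assumptions; on a real phone the private key would be generated and kept in the device’s secure hardware:

import java.security.*;

public class ChallengeSignatureSketch {
    public static void main(String[] args) throws Exception {
        // On enrollment: the device generates a key pair and sends only the public key to the server
        KeyPairGenerator generator = KeyPairGenerator.getInstance("EC");
        generator.initialize(256);
        KeyPair keyPair = generator.generateKeyPair();

        // Per transaction: the server sends a unique challenge, the device signs it
        byte[] challenge = "transaction-id:42;amount:100.00;nonce:83f1".getBytes();
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(keyPair.getPrivate());
        signer.update(challenge);
        byte[] signature = signer.sign();

        // On the server: verification with the enrolled public key; only the holder of the private key could have produced this
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(keyPair.getPublic());
        verifier.update(challenge);
        System.out.println("Signature valid: " + verifier.verify(signature));
    }
}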

The post One-Time Passwords Do Not Provide Non-Repudiation appeared first on Bozho's tech blog.

Getting Email Sending Settings Right

Post Syndicated from Bozho original https://techblog.bozho.net/getting-email-sending-settings-right/

Email. The most common means of internet communication, which has been around for ages and which nobody has been able to replace. And yet, it is so hard to get its configuration right that messages still end up in spam.

Setting up your own email server is a nightmare, of course, but even cloud email providers don’t let you skip any of the steps – you have to know them and you have to do them. And if you miss something, business will suffer, as a portion of your important emails will end up in spam. The sad part is that even if you get all of them right, emails can still end up in spam, but that is on the spam filters and there’s little you can do about it. (I dread the moment when an outgoing server will run the email through standard spam filters with standard sets of configurations to see if it would get flagged as spam or not.)

This won’t be the first “how to configure email” overview article that mentions all the pitfalls, but most of the ones I see omit some important details. But if you’ve seen one, you are probably familiar with what has to be done – configure SPF, DKIM and DMARC. But doing that in practice is trickier, as this meme (by a friend of mine) implies:

So, you have an organization that wants to send email from its “example.com” domain. Most tutorials assume that you want to send email from one server, which is almost never the case. You need it for the corporate emails (which would in many cases be a hosted or cloud MS Exchange, or Google Suite), you need it for system emails from your applications (one or more of them), which use either another internal server or a cloud email provider, e.g. Amazon SES, you also need it for your website, which uses the hosting provider email server, and you need it for your email campaigns, e.g. via Mailchimp or Sendgrid.

All of the email providers mentioned above have some parts of the picture in their documentation, but it doesn’t work in combination, as most providers wrongly assume they are the only one. Their examples assume that and their automated verifications assume that – e.g. Microsoft checks if your SPF record matches exactly what they provide, rather than checking if their servers are allowed by your more complex SPF record, which includes all of the above providers.

So let’s get to the individual items you have to configure. Most of them are DNS records, which explains why a technical person in each organization has to do it manually, rather than each service pushing it there automatically after some API authentication:

  • SPF (Sender Policy Framework) – a DNS record that lists the permitted senders (IP addresses) and an instruction flag on what to do with those that don’t match. In the typical scenario you need to include multiple senders’ policies rather than listing IP addresses, as they can change. E.g. in order to use Office365, you have to add include:spf.protection.outlook.com. Note that this should be a TXT record, but some DNS providers support a special record type – SPF – and some older software may expect it, which means you should support both records with identical values. The syntax is straightforward but sometimes tricky, so you can use a tool to generate and validate it (see the example records after this list).
  • DKIM (DomainKeys Identified Mail) – a DNS record that lets email senders sign their emails. The DNS record includes the public key used to verify the signature. Why is it needed if there’s SPF? Among other things (like non-repudiation), because with SPF the From header can still be spoofed. Not that DKIM always helps with that, but in combination with DMARC it does. How does DKIM work in a multi-sender scenario? You have multiple DKIM selectors, which means multiple TXT records. Usually every provider will recommend its own selector (e.g. selector1._domainkey.example.com). Some may insist on being default._domainkey, and if two of them insist on that, you should contact support (the signed message references the selector, and verification will fail if the published record for that selector does not match). Email providers prefer CNAME over TXT records, as that allows them to rotate the keys without you having to change your DNS records.
  • DMARC (Domain-based Message Authentication, Reporting and Conformance) – this DNS record contains the policy according to which your emails should be validated – it enforces SPF and DKIM and tells the receiving side what to do if they fail. You can have one DMARC policy (again as a TXT record). The syntax is not exactly human readable, so use a tool to generate and validate it. An important aspect of DMARC is that you can receive reports in case of failures – you can specify an email where reports are sent and you can analyze them. There are services (like ReportURI) that can aggregate and analyze these reports. I prefer setting multiple report emails – one administrative and one for ReportURI.
  • PTR (pointer) – this is used for reverse DNS lookups – it maps an IP address to a domain name (as opposed to A records, which map domain names to IP addresses). Spam filters use it to check incoming email. The PTR records should exist for the servers that send the email, e.g. mail.example.com. External providers are likely to already have that record, so no need to worry about it, and in many cases you won’t be able to set it anyway if you don’t own/control the network.
  • Service-specific settings – you may configure your DNS records properly, but the service sending the emails (e.g. Office365, Mailchimp) might still be missing some configuration. In some cases you have to manually confirm your records in order to enable DKIM signing. With MS Exchange, for example, you have to execute a few PowerShell commands to generate and then confirm the DKIM records.
  • Blacklists – if you haven’t had everything set up correctly, or if one of your sending servers/services has been compromised, your domain and/or servers may be present in some blacklist. You have to check that. There are tools that aggregate blacklists and check against them, e.g. this one
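To make this more concrete, here is roughly what a set of records for a multi-provider setup on example.com might look like. All values are illustrative only – the exact includes, selectors and CNAME targets come from each provider’s documentation:

; SPF – a single TXT record listing all permitted senders
example.com.            IN TXT    "v=spf1 include:spf.protection.outlook.com include:amazonses.com include:servers.mcsv.net ~all"

; DKIM – one selector per provider, usually a CNAME to a provider-hosted key
selector1._domainkey    IN CNAME  selector1-example-com._domainkey.yourtenant.onmicrosoft.com.
k1._domainkey           IN CNAME  dkim.mcsv.net.

; DMARC – a single policy record with report addresses
_dmarc                  IN TXT    "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"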

After everything is done, you can run a spam test using some of the available online tools: e.g. this, this or this. And speaking of tools, MXToolbox has many useful tools to verify all aspects of email configuration.

By the way, this is not everything you can configure about your email. SMTP over TLS (SMTPS), MTA-STS, TLS RPT, DANE. I’ve omitted them because they are about encrypted communication and not about spam, but you should review them for proper email configuration.

By now (even if you already knew most of the things above) you are probably wondering “why did we get here?”. Why do we have to do so many things just to send a simple email? Well, first, it’s not that simple to have a universal messaging protocol. It looks simple to use, and that’s the great part of it, but it does hide some complexities. The second reason is that the SMTP protocol was not designed with security in mind. Spam and phishing were maybe not seen as such a big issue, and so the protocol has no built-in guarantees for anything – no encryption, no authentication, no non-repudiation, nothing.

That’s why this set of instruments evolved over time to add these security features to email (I haven’t talked about encryption, as it’s handled differently). Why did it have to be DNS-based? It’s the most logical solution, as it guarantees the ownership of the domain, which is what matters, even visually, to the recipient in the end. But it makes administration more complicated, as you are limited to one-line, semicolon- or space-separated formats. I think it would be helpful to have a way to delegate all of that to external services, e.g. via a single authenticating DNS record which points to a URL that provides all these policies. For example an EML record pointing to https://example.com/email-policies, which can publish them in a prettier and more readable (e.g. JSON) format, in a single place, rather than having to generate multiple records. Maybe that has its own cons, like having the policy server compromised.

But if anything is obvious it is that everything should be designed with security in mind. And every malicious scenario should be taken into account. Because adding security later makes things even more complicated.

The post Getting Email Sending Settings Right appeared first on Bozho's tech blog.

A Disk-Backed ArrayList

Post Syndicated from Bozho original https://techblog.bozho.net/a-disk-backed-arraylist/

It sometimes happens that your list can become too big to fit in memory and you have to do something in order to avoid running out of memory.

The proper way to do that is streaming – instead of fitting everything in memory, you should stream data from the source and discard the entries that are already processed.

However, there are cases when code that’s outside of your control requires a List and you can’t use streaming. These cases are rather rare but in case you hit them, you have to find a workaround. One is to re-implement the code to work with streaming, but depending on the way the library is written, it may not be possible. So the other option is to use a disk-backed list – one that works as a list, but underneath stores and loads elements from disk.

Searching for existing solutions results in several 3+ years old repos like this one and this one and this one.

And then there’s MapDB, which is great and supported. It’s mostly about maps, but it does support a List as well, as shown here.

And finally, you have the option to implement something simpler yourself, in case you need just iteration and almost nothing else. I’ve done it here – DiskBackedArrayList.java. It doesn’t support many of the List operations (not all unsupported methods are overridden to throw an exception yet, but they should be). Most importantly, it doesn’t support random adds, random gets, or toArray(). It’s purely “fill the collection” and then “iterate the collection”. It relies on ObjectOutputStream, which is not terribly efficient but is simple to use. Note that I’ve allowed a small in-memory prependList in case small amounts of data need to be prepended to the list.

The list gets filled in memory until a specified threshold is reached and then gets flushed to disk, clearing the memory, which starts getting filled again. This, too, could be more efficient – with background flushing in another thread that doesn’t interfere with adding elements to the list – but optimizations complicate things, and in this case the total running time was not an issue. Most importantly, the iterator() method is overridden to return a custom iterator that first streams the prepended list, then reads everything from disk, and finally iterates over the latest batch, which is still in memory. And finally, the clear() method should be called at the end in order to close the underlying stream. An output stream could be opened and closed on each flush, but ObjectOutputStream can’t be used in append mode due to an implementation detail about writing headers first.
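To illustrate the core flow, here is a heavily simplified sketch (not the linked implementation itself – no prependList, no error handling, and most List methods are omitted):

import java.io.*;
import java.util.*;

// Append-only, iterate-only list that spills to disk above a threshold (illustrative sketch)
public class DiskBackedListSketch<E extends Serializable> implements Iterable<E> {
    private final int threshold;
    private final File file;
    private final ObjectOutputStream out;
    private final List<E> memoryBuffer = new ArrayList<>();
    private int onDiskCount = 0;

    public DiskBackedListSketch(int threshold) throws IOException {
        this.threshold = threshold;
        this.file = File.createTempFile("disk-backed-list", ".bin");
        // One stream kept open for the lifetime of the list, because ObjectOutputStream can't append
        this.out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
    }

    public void add(E element) throws IOException {
        memoryBuffer.add(element);
        if (memoryBuffer.size() >= threshold) {
            for (E e : memoryBuffer) {
                out.writeObject(e);
            }
            out.flush();
            onDiskCount += memoryBuffer.size();
            memoryBuffer.clear();
        }
    }

    @Override
    public Iterator<E> iterator() {
        try {
            // First stream what is on disk, then iterate the in-memory tail
            ObjectInputStream in = onDiskCount > 0
                    ? new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)))
                    : null;
            Iterator<E> memoryIterator = memoryBuffer.iterator();
            return new Iterator<E>() {
                int readFromDisk = 0;
                @Override public boolean hasNext() {
                    return readFromDisk < onDiskCount || memoryIterator.hasNext();
                }
                @SuppressWarnings("unchecked")
                @Override public E next() {
                    try {
                        if (readFromDisk < onDiskCount) {
                            readFromDisk++;
                            return (E) in.readObject();
                        }
                        return memoryIterator.next();
                    } catch (Exception ex) {
                        throw new RuntimeException(ex);
                    }
                }
            };
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
    }

    public void clear() throws IOException {
        out.close();
        file.delete();
        memoryBuffer.clear();
    }
}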

So basically we hide the streaming approach underneath a List interface – it’s still streaming elements and discarding them when not needed. Ideally this should be done at the source of the data (e.g. a database, message queue, etc.) rather than using the disk as overflow space, but there are cases where using the disk is fine. This implementation is a starting point, as it’s not tested in production, but illustrates that you can adapt existing classes to use different data access patterns if needed.

The post A Disk-Backed ArrayList appeared first on Bozho's tech blog.

Near Real-Time Indexing With ElasticSearch

Post Syndicated from Bozho original https://techblog.bozho.net/near-real-time-indexing-with-elasticsearch/

Choosing your indexing strategy is hard. The Elasticsearch documentation does have some general recommendations, and there are some tips from other companies, but it also depends on the particular usecase. In the typical scenario you have a database as the source of truth, and you have an index that makes things searchable. And you can have the following strategies:

  • Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient.
  • Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as it has to be precisely fetched with (from, to) criteria from the database, and your index lags behind the actual data by the number of seconds (or minutes) between scheduled job runs
  • Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour.
  • Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you have to have some sort of healthcheck based on the data in the database
  • Hybrid – do a combination of the above; for example if you need to enrich the raw data and update the index at a later stage, you can queue items in memory and then use “store in database, index with scheduled job” to update the index and fill in any missing item. Or you can index as some parts of the data come, and use another strategy for the more active types of data

We have recently decided to implement the “queue in memory” approach (in combination with another one, as we have to do some scheduled post-processing anyway). And the first attempt was to use a class provided by the Elasticsearch client – the BulkProcessor. The logic is clear – accumulate index requests in memory and flush them to Elasticsearch in batches either if a certain limit is reached, or at a fixed time interval. So at most every X seconds and at most at every Y records there will be a batch index request. That achieves near real-time indexing without putting too much stress on Elasticsearch. It also allows multiple bulk indexing requests at the same time, as per Elasticsearch recommendations.

However, we are using the REST API (via Jest) which is not supported by the BulkProcessor. We tried to plug a REST indexing logic instead of the current native one, and although it almost worked, in the process we noticed something worrying – the internalAdd method, which gets invoked every time an index request is added to the bulk, is synchronized. Which means threads will block, waiting for each other to add stuff to the bulk. This sounded suboptimal and risky for production environments, so we went for a separate implementation. It can be seen here – ESBulkProcessor.

It allows multiple threads to flush to Elasticsearch simultaneously, but only one thread (using a lock) to consume from the queue in order to form the batches. Since this is a fast operation, it’s fine to have it serialized. And it’s not because the concurrent queue can’t handle multiple threads reading from it – it can; but reaching the condition for forming the bulk by multiple threads at the same time would result in several small batches rather than one big one, hence the need for only one consumer at a time. That is not a huge problem, so the lock could be removed; it’s important to note, though, that it is not a blocking lock.
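The gist of the approach can be sketched roughly like this (a simplified illustration, not the actual ESBulkProcessor; the bulk HTTP call is a placeholder):

import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of "queue in memory, flush in batches" near real-time indexing
public class BulkIndexerSketch {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    private final ReentrantLock consumerLock = new ReentrantLock();
    private final int maxBatchSize = 500;

    public BulkIndexerSketch(long flushIntervalMillis) {
        // Time-based flush so that small trickles of data still reach the index regularly
        Executors.newSingleThreadScheduledExecutor()
                .scheduleAtFixedRate(this::flush, flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public void add(String indexRequest) {
        queue.add(indexRequest);
        if (queue.size() >= maxBatchSize) {
            flush(); // size-based flush; the calling thread does the work
        }
    }

    private void flush() {
        // Only one thread forms a batch at a time, so we get one big bulk request instead of many small ones.
        // tryLock() means other threads don't block - they just skip and keep adding to the queue.
        if (!consumerLock.tryLock()) {
            return;
        }
        try {
            List<String> batch = new ArrayList<>(maxBatchSize);
            String item;
            while (batch.size() < maxBatchSize && (item = queue.poll()) != null) {
                batch.add(item);
            }
            if (!batch.isEmpty()) {
                sendBulkRequest(batch);
            }
        } finally {
            consumerLock.unlock();
        }
    }

    private void sendBulkRequest(List<String> batch) {
        // Placeholder: in reality this would be a _bulk HTTP request to Elasticsearch (e.g. via Jest or the REST client)
        System.out.println("Indexing batch of " + batch.size() + " documents");
    }
}

The tryLock() is what makes the consumer non-blocking – a thread that fails to acquire the lock simply continues, and its items will be picked up by whichever thread flushes next.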

This has been in production for a while now and doesn’t seem to have any issues. I will report any changes if they arise due to increased load or edge cases.

It’s important to reiterate the issue if this is the only indexing logic – your application node may fail and you may end up with missing data in Elasticsearch. We are not in that scenario, and I’m not sure which is the best approach to remedy it – be it a partial reindex of recent data in case of a failed server, or a batch process that checks whether there are mismatches between the database and the index. Of course, we should also say that you may not always have a database – sometimes Elasticsearch is all you have for data storage, and in that case some sort of queue persistence is needed.

The ultimate goal is to have a near real-time indexing as users will expect to see their data as soon as possible, while at the same time not overwhelming the Elasticsearch cluster.

The topic of “what’s the best way to index data” is huge and I hope I’ve clarified it at least a little bit and that our contribution makes sense for other scenarios as well.

The post Near Real-Time Indexing With ElasticSearch appeared first on Bozho's tech blog.

A Technical Guide to CCPA

Post Syndicated from Bozho original https://techblog.bozho.net/a-technical-guide-to-ccpa/

CCPA, or the California Consumer Privacy Act, is the upcoming “small GDPR” that applies to all companies that have users from California (i.e. it has extraterritorial application). It is not as massive as GDPR, but you may want to follow its general recommendations.

A few years ago I wrote a technical GDPR guide. Now I’d like to do the same with CCPA. GDPR is much more prescriptive about the fact that you should protect users’ data, whereas CCPA seems to be mainly concerned with the rights of the users – to be informed, to opt out of having their data sold, and to be forgotten. That focus is mainly because other laws in California and the US have provisions about protecting confidentiality of data and about data breaches; in that regard GDPR is a more holistic piece of legislation, whereas CCPA covers mostly the aspect of users’ rights (or “consumers”, which is the term used in CCPA). I’ll use “user”, as it’s the term more often used in technical discussions.

I’ll list below some important points from CCPA – this is not an exhaustive list of requirements for a software system, but it aims to highlight some important bits. And, obviously, I’m not a lawyer, but I’ve been doing data protection consultations and products (like SentinelDB) for the past several years, so I’m qualified to talk about the technical side of privacy regulations.

  • Right of access – you should be able to export (in a human-readable format, and preferably in a machine-readable one as well) all the data that you have collected about an individual – their account details, their orders, their preferences, their posts and comments, etc.
  • Deletion – you should delete any data you hold about the user. Exceptions apply, of course, including data used for prevention of fraud, other legal reasons, needed for debugging, necessary to complete the business requirement, or anything that the user can reasonably expect. From a technical perspective, this means you most likely have to delete what’s in your database, but other places where you have personal data, like logs or analytics, can be skipped (provided you don’t use it to reconstruct user profiles, of course)
  • Notify 3rd party providers that received data from you – when data deletion is requested, you have to somehow send notifications to wherever you’ve sent personal data. This can be a SaaS like Mailchimp, Salesforce or Hubspot, or it can be someone you sold the data to (apparently that’s a major thing in CCPA). So ideally you should know where data has been sent and invoke APIs for forgetting it. Fortunately, most of these companies are already compliant with GDPR anyway, so they have these endpoints exposed – you just have to add the logic. If your company sells data by posting dumps to S3 or sending Excel sheets via email, you have a bigger problem, as you have to keep track of those activities and send unstructured requests (e.g. emails).
  • Data lineage – this is not spelled out as a requirement, but it follows from multiple articles, including the one for deletion, as well as the one for disclosing whom data was sent to and where data came from in your system (in order to know if you can re-sell it, among other things). In order to avoid buying expensive data lineage solutions, you can either have a spreadsheet (in case of simpler processes), or come up with a meaningful way to tag your data. For example, using a separate table with columns (ID, table, sourceType, sourceId, sourceDetails), where ID and table identify a record of personal data in your database, sourceType is the way you ingested the data (e.g. API call, S3, email) and sourceId is the identifier that you can use to track how it came into your system – an API key, an S3 bucket name, an email “from”, or even a company registration ID (data might still be passed around on flash drives, I guess). A similar table can hold the outgoing data (with targetType and targetId). It’s a simplified implementation, but it might work in cases where a spreadsheet would be too cumbersome to take care of.
  • Age restriction – if you’ve had the opportunity to know the age of a person whose data you have, you should check it. That means not ignoring the age or date of birth field when you import data from 3rd parties, and also politely asking users about their age. You can’t sell that data, so you need to know which records are automatically opted out. If you never ever sell data, well, it’s still a good idea to keep it (per GDPR)
  • Don’t discriminate if users have used their privacy rights – that’s more of a business requirement, but as technical people we should know that we are not allowed to have logic based on users having exercised their CCPA (or GDPR) rights. From a data organization perspective, I’d put rights requests in a separate database from the actual data, to make such discrimination harder to implement – you can’t just do a SQL query to check if someone should get a better price; you’d need cross-system integration, and that might dissuade product owners from breaking the law. Furthermore, it will be a good sign in case of audits.
  • “Do Not Sell My Personal Information” – this should be on the homepage if you have to comply with CCPA. It’s a bit of a harsh requirement, but it should take users to a form where they can opt out of having their data sold. As mentioned in a previous point, this could be a different system to hold users’ CCPA preferences. It might be easier to just have a set of columns in the users’ table, of course.
  • Identifying users is an important aspect. CCPA speaks about “verifiable requests”. So if someone drops you an email saying “I want my data deleted”, you should be able to confirm it’s really them. In an online system that can be a button in the user profile (for opting out, for deletion, or for data access) – if they know the password, it’s fairly certain it’s them. However, in some cases users don’t have accounts in the system. In that case there should be other ways to identify them. SSN sounds like one, and although it’s a terrible thing to use for authentication, with the lack of universal digital identity, especially in the US, it’s hard not to use it at least as part of the identifying information. But it can’t be the only thing – it’s not a password, it’s an identifier. So users sharing their SSN (if you have it), their phone or address, passport or driving license might be some data points to collect for identifying them. Note that once you collect that data, you can’t use it for other purposes, even if you are tempted to. CCPA also requires toll-free phone support, which is hardly applicable to non-US companies that nonetheless have customers in California, but it poses the question of identifying people online based on real-world data rather than account credentials. And please don’t ask users about their passwords over the phone; just initiate a request on their behalf in the system and direct them to login and confirm it. There should be additional guidelines for identifying users as per 1798.185(a)(7).
  • Deidentification and aggregate consumer information – aggregated information, e.g. statistics, is not personal data, unless you are able to extract personal data from it (e.g. if the statistics are split per town and age and you have only two users in a given town, you can easily see who is who). Aggregated data is distinct from deidentified data, which is data that has had its identifiers removed. Simply removing identifiers, though, might not be sufficient to deidentify data – based on several other data points, like IP address (+ logs), physical address (+ snail mail history), phone (+ phone book), one can be uniquely identified. If you can’t reasonably identify a person based on a set of data, it can be considered deidentified. Do make the mental exercise of thinking how to deidentify your data, as then it’s much easier to share it (or sell it) to third parties. Probably nobody minds being part of aggregated statistics sold to someone, or an anonymized account used for trend analysis.
  • Pseudonymization is a measure to be taken in many scenarios to protect data. CCPA mentions it particularly in a research context, but I’d support a generic pseudonymization functionality. That means replacing the identifying information with a pseudonym that’s not reversible unless a secret piece of data is used. Think of it (and you can do that quite literally) as encrypting the identifier(s) with a secret key to form the pseudonym (a minimal sketch follows after this list). You can then give that data to third parties to work with (e.g. to do market segmentation) and have them give it back to you. You can then decrypt the pseudonyms and fill the obtained market segment(s) into your own database. The 3rd party doesn’t get personal information, but you still get the relevant data
  • Audit trail is not explicitly stated as a requirement, but since you have the obligation to handle users requests and track the use of their data in and outside of your system, it’s a good idea to have a form of audit trail – who did what with which data; who handled a particular user request; how was the user identified in order to perform the request, etc.
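As an illustration of the pseudonymization bullet above, here is a minimal sketch that encrypts an identifier with a secret key to form a reversible pseudonym. Key management, which is the hard part, is left out, and all names are illustrative:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

// Pseudonym = Base64(random IV || AES-GCM ciphertext of the identifier); reversible only with the secret key
public class PseudonymizerSketch {
    private static final int IV_LENGTH = 12;
    private final SecretKey key;
    private final SecureRandom random = new SecureRandom();

    public PseudonymizerSketch(SecretKey key) {
        this.key = key;
    }

    public String pseudonymize(String identifier) throws Exception {
        byte[] iv = new byte[IV_LENGTH];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(identifier.getBytes(StandardCharsets.UTF_8));
        byte[] result = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, result, 0, iv.length);
        System.arraycopy(ciphertext, 0, result, iv.length, ciphertext.length);
        return Base64.getUrlEncoder().encodeToString(result);
    }

    public String depseudonymize(String pseudonym) throws Exception {
        byte[] data = Base64.getUrlDecoder().decode(pseudonym);
        byte[] iv = Arrays.copyOfRange(data, 0, IV_LENGTH);
        byte[] ciphertext = Arrays.copyOfRange(data, IV_LENGTH, data.length);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        PseudonymizerSketch p = new PseudonymizerSketch(generator.generateKey());
        String pseudonym = p.pseudonymize("customer-42");
        System.out.println(pseudonym + " -> " + p.depseudonymize(pseudonym));
    }
}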

As CCPA is not concerned with data confidentiality requirements, I won’t repeat my GDPR advice about using encryption whenever possible (notably, for backups), or about internal security measures for authentication.

CCPA is focused on the rights of your users and you should be able to handle them (and track how you handled them). You can have manual and spreadsheet based processes if you are not too big, and you should definitely check with your legal team if and to what extent CCPA applies to your company. But if you have implemented the GDPR data subject rights, it’s likely that you are already compliant with CCPA in terms of the overall system architecture, except for a few minor details.

The post A Technical Guide to CCPA appeared first on Bozho's tech blog.

Restoring Cassandra Priam Backup With sstableloader

Post Syndicated from Bozho original https://techblog.bozho.net/restoring-cassandra-priam-backup-with-sstableloader/

I’ve previously written about setting up Cassandra and Priam for backup and cluster management. The example that I gave for backup restore there, however, is not applicable in every situation – it may not work on a completely separate cluster, for example, or in case of a partial restore of just one table rather than the whole database.

In such cases you may choose to do a restore using the sstableloader utility. It has a straightforward syntax:

sudo sstableloader -d 172.35.1.2,172.35.1.3 -ts /etc/cassandra/conf/truststore.jks \
   -ks /etc/cassandra/conf/node.jks -f /etc/cassandra/conf/cassandra.yaml  \
    ~/keyspacename/table-0edcc420c19011e7a8c37656dd492a94

If you look at your Priam-generated backup, it looks like you can just copy the files (e.g. via aws s3 cp on AWS) for the particular tables and have sstableloader import them. There’s a catch, however. In order to save space, Priam uses Snappy to compress all of the files. So if you try to feed them to any Cassandra utility, it will complain that they are corrupted.

So you have to decompress them before using sstableloader or anything else. But how? Well, Priam offers a service for that – you call it by passing the absolute path to a compressed file and the absolute path to where the uncompressed one should be placed, and it does the simple job of streaming the original through a decompressor. For decompressing an entire backup, I’ve written a python script. It assumes a certain structure, but you can parameterize it to make it more flexible. Here’s the code (excuse my non-idiomatic Python, I’m only using it for simple scripting):

#! /usr/bin/env python
# python script used to pass each backup file through the decompression facility of Priam (using Snappy)
# so that it can be used with sstableloader for restore
import os
import requests

rootdir = '/home/ec2-user/backup'
target = '/home/ec2-user/keyspace'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        fullpath = os.path.join(subdir, file)
        parent = os.path.join(fullpath, os.pardir)
        table = os.path.basename(os.path.abspath(parent))
        targetdir = target + "/" + table + "/"
        if not os.path.exists(targetdir):
            os.makedirs(targetdir)

        url = 'http://localhost:8080/Priam/REST/v1/cassadmin/decompress?in=' + fullpath + '&out=' + target + "/" + table + "/" + file
        print(url)
        requests.get(url)

Now you have decompressed backup files that you can restore using sstableloader. It may take some time if you have a lot of data, and you should not run the restore at the same time as a snapshot backup is being performed, as it may fail (I was warned about that by the documentation).

As a general note here, it’s very important to have backups, but it’s much more important to be able to restore from them. A backup is useless if you don’t have a restore procedure. And simply having the tools available (e.g. Priam) doesn’t mean you have a restore procedure ready to execute. You should be doing test restores on active staging data as well as full restores on an empty, newly formed cluster, as there are different restore scenarios.

The post Restoring Cassandra Priam Backup With sstableloader appeared first on Bozho's tech blog.

The Personal Data Store Pattern

Post Syndicated from Bozho original https://techblog.bozho.net/the-personal-data-store-pattern/

With the recent trend towards data protection and privacy, as well as the requirements of data protection regulations like GDPR and CCPA, some organizations are trying to reorganize their personal data so that it has a higher level of protection.

One path that I’ve seen organizations take is to apply the (what I call) “Personal data store” pattern. That is, to extract all personal data from existing systems and store it in a single place, where it’s accessible via APIs (or in some cases directly through the database). The personal data store is well guarded, audited, has proper audit trail and anomaly detection, and offers privacy-preserving features.

It makes sense to focus one’s data protection efforts predominantly in one place rather than scatter them across dozens of systems. Of course, it’s far from trivial to migrate so much data from legacy systems to a new module and then upgrade those systems to still be able to request and use it when needed. That’s why in some cases the pattern is applied only to sensitive data – medical, biometric, credit cards, etc.

For the sake of completeness, there’s something else called “personal data stores”, meaning an architecture where the users themselves store their own data in order to be in control. While this is nice in theory, in practice very few users have the capacity to do so, and while I admire the Solid project, for example, I don’t think it is a viable pattern for many organizations, as in many cases users don’t directly interact with the company, but the company still processes large amounts of their personal data.

So, the personal data store pattern is an architectural approach to personal data protection. It can be implemented as a “personal data microservice”, with CRUD operations on predefined data entities; an external service can be used (e.g. SentinelDB, a project of mine); or it can just be a centralized database that has some proxy in front of it to control the access patterns. You can imagine it as externalizing your application’s “users” table and its related tables.

It sounds a little bit like a data warehouse for personal data, but the major difference is that it’s used for operational data, rather than (just) analysis and reporting. All (or most) of your other applications/microservices interact constantly with the personal data store whenever they need to access or update (or “forget”) personal data.

Some of the main features of such a personal data store, the combination of which protect against data breaches, in my view, include:

  • Easy to use interface (e.g. RESTful web services or simply SQL) – systems that integrate with the personal data store should be built in a way that a simple DAO layer implementation gets swapped, and then data that was previously accessed from a local database is now obtained from the personal data store (a minimal sketch of that follows after this list). This is not always easy, as ORM technologies add a layer of complexity.
  • High level of general security – servers protected with 2FA, access control, segregated networks, restricted physical access, firewalls, intrusion prevention systems, etc. The good thing is that it’s easier to apply all the best practices to a single system instead of applying them (and keeping them that way) to every system.
  • Encryption – but not just “data at rest” encryption; especially sensitive data can and should be encrypted with well protected and rotated keys. That way the “honest but curious” admin won’t be able to extract anything from the underlying database
  • Audit trail – all infosec and data protection standards and regulations focus on accountability and traceability. There should not be a way to extract or modify personal data without leaving a trace (and ideally, that trace should be protected as well)
  • Anomaly detection – checking if there is something strange/anomalous in the data access patterns. Such strange access patterns can mean a data breach is happening, and the personal data store can actively block it. There is a lot of software out there that does anomaly detection on network traffic, but it’s much better if the rules (or machine learning) are domain-specific. “Monitor for increased traffic to those servers” is one thing, but it’s much better to be able to say “monitor for out-of-the-ordinary accesses to personal data of such and such kind”
  • Pseudonymization – many systems that need the personal data don’t actually need to know who it is about. That includes marketing, including outsourcing to 3rd parties, reporting functionalities, etc. So the personal data store can return data that does not allow a person to be identified, but a pseudo-ID instead. That way, when updates are made back to the personal data store, they can still refer to a particular person via the pseudonymous ID, but the application that extracted the data in the first place doesn’t get to know who the data was about. This is useful in scenarios where data has to be (temporarily or not) stored in a database that lies outside the personal data store.
  • Authentication – if the company offers user authentication, this can be done via the personal data store. Passwords, two-factor authentication secrets and other means of authentication are personal data, and important personal data at that. An organization may use single sign-on internally (e.g. Active Directory), but it doesn’t make sense to put customers there, too, so they are usually stored in a database. During authentication, the personal data store accepts all necessary credentials (username, password, 2FA code), and returns a token to be used for subsequent calls or as a session cookie.
  • GDPR (or CCPA or similar) functionalities – e.g. export of all data about a person, forgetting a person. That’s an often overlooked problem, but “give me all data about me that you have” is an enormous issue with large companies that have dozens of systems. It’s next to impossible to extract the data in a sensible way from all the systems. Tracking all these requests is itself a requirement, so the personal data store can keep track of them to present to auditors if needed.
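To illustrate the DAO swap mentioned in the first bullet, the integration can be as simple as an interface with two implementations – one backed by the local database and one calling the personal data store. Everything below (names, endpoints, the client abstraction) is hypothetical:

import java.util.Optional;

// The rest of the application depends only on this interface
interface UserDataDao {
    Optional<UserRecord> findById(String userId);
    void update(UserRecord user);
    void forget(String userId); // GDPR/CCPA "forget me"
}

// Original implementation, reading the local "users" table (details omitted)
class LocalDatabaseUserDataDao implements UserDataDao {
    public Optional<UserRecord> findById(String userId) { /* SELECT ... FROM users */ return Optional.empty(); }
    public void update(UserRecord user) { /* UPDATE users ... */ }
    public void forget(String userId) { /* DELETE or anonymize locally */ }
}

// Drop-in replacement that talks to the personal data store over its (hypothetical) REST API
class PersonalDataStoreUserDataDao implements UserDataDao {
    private final PersonalDataStoreClient client; // wraps HTTP calls, auth, audit metadata

    PersonalDataStoreUserDataDao(PersonalDataStoreClient client) {
        this.client = client;
    }
    public Optional<UserRecord> findById(String userId) {
        return client.get("/users/" + userId, UserRecord.class);
    }
    public void update(UserRecord user) {
        client.put("/users/" + user.id(), user);
    }
    public void forget(String userId) {
        client.delete("/users/" + userId);
    }
}

record UserRecord(String id, String email, String name) {}

// Hypothetical thin HTTP client for the personal data store
interface PersonalDataStoreClient {
    <T> Optional<T> get(String path, Class<T> type);
    void put(String path, Object body);
    void delete(String path);
}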

That’s all easier said than done. In organizations that already have many systems working alongside each other and processing personal data, migration can be costly. So it’s a good idea to introduce it as early as possible, and to have a plan (even if it spans years) to move at least sensitive personal data to the well protected silo. This silo is a data engineering effort, a system refactoring effort and an organizational effort. The benefits, though, are reduced long-term costs and reduced risks of data breaches and non-compliance.

The post The Personal Data Store Pattern appeared first on Bozho's tech blog.

Remote Log Collection on Windows

Post Syndicated from Bozho original https://techblog.bozho.net/remote-log-collection-on-windows/

Every organization needs to collect logs from multiple sources in order to put them in either a log collector or SIEM (or a dedicated audit trail solution). And there are two options for that – using an agent and agentless.

Using an agent is easy – you install a piece of software on each machine that generates logs and it forwards them wherever needed. This is however not preferred by many organizations as it complicates things – upgrading to new versions, keeping track of dozens of configurations, and potentially impacting performance of the target machines.

So some organizations prefer to collect logs remotely, or use standard tooling, already present on the target machine. For Linux that’s typically syslog, where forwarding is configured. Logs can also be read remotely via SCP/SSH.

However, on Windows things are less straightforward. You need to access the Windows Event Log facility remotely, but there is barely a single place that describes all the required steps. This blogpost comes close, but I’d like to provide the full steps, as there are many, many things that one may miss. It is a best practice to use a non-admin, service account for that and you have to give multiple permissions to allow reading the event logs remotely.

There are also multiple ways to read the logs remotely:

  • Through the Event Viewer UI – it’s the simplest to get right, as only one domain group is required for access
  • Through Win32 native API calls (and DCOM) – i.e. EvtOpenSession and the related methods
  • Through PowerShell Get-WinEvent (Get-EventLog is a legacy cmdlet that doesn’t support remoting) – an example invocation is shown after the configuration steps below
  • Through WMI directly (e.g. this or this). To be honest, I don’t know whether the native calls and the PowerShell cmdlets use WMI and/or CIM underneath as well – probably they do.

So, in order to get these options running, the following configurations have to be done:

  1. Allow the necessary network connections to the target machines (through network rules and firewall rules, if applicable)
  2. Go to Windows Firewall -> Inbound rules and enable the rules regarding “Remote log management”
  3. Create a service account and configure it in the remote collector. The other option is to have an account on the collector machine that is given the proper access, so that you can use the integrated AD authentication
  4. Add the account to the following domain groups: Event log readers, Distributed COM users. The linked article above mentions “Remote management users” as well, but that’s optional if you just want to read the logs
  5. Give the “Manage auditing and security log” privilege to the service account through group policies (GPO) or via “local security policy”. Find it under User Rights Assignment > Manage auditing and security log
  6. Give WMI access – open “wmimgmt” -> right click -> properties > Security -> Advanced and allow the service account to “Execute Methods”, “Provider Write”, “Enable Account”, “Remote Enable”. To be honest, I’m not sure exactly which folder that should be applied to, and applying it to the root may be too wide, so you’d have to experiment
  7. Give registry permissions: Regedit -> Local machine -> System\CurrentControlSet\Services\eventlog\Security -> right click -> permissions and add the service account. According to the linked post you also have to modify a particular registry entry, but that’s not required just for reading the log. This step is probably the most bizarre and unexpected one.
  8. Make sure you have DCOM rights. This comes automatically with the Distributed COM users group, but double check via DCOMCnfg -> right click -> COM security
  9. Grant permissions for the service account on c:\windows\system32\winevt. This step is not required for “simple” reading of the logs, but I’ve seen it in various places, so in some scenarios you might need to check it
  10. Make sure the application or service that is reading the logs remotely has sufficient permissions – it can usually run with admin privileges, because it’s on a separate, dedicated machine.
  11. Restart services – that is optional, but can be done just in case: Restart “Windows Remote Management (WS-Management)” and “Windows Event Log” on the target machine
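Once the steps above are done, a quick sanity check is to try reading a few events remotely, for example with PowerShell (the machine and account names below are placeholders):

Get-WinEvent -ComputerName target-machine -LogName Security -MaxEvents 10 -Credential (Get-Credential DOMAIN\svc-logreader)

If this returns events, the Event Viewer and API-based options should work with the same account as well.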

As you can see, there are many things that you can miss, and there isn’t a single place in any documentation to list those steps (though there are good guides like this that go in a slightly different direction).

I can’t help but make a high-level observation here – the need to do everything above is an example of how security measures can “explode” and become really hard to manage. There are many services, groups, privileges, policies, inbound rules and whatnot, instead of just “Allow remote log reading for this user”. I know it’s inherently complex, but maybe security products should make things simpler by providing recipes for typical scenarios. Following guides in some blog is definitely worse than running a predefined set of commands. And running an “Allow remote access to event log” recipe would do just what you need. Of course, knowing which recipe to run and how to parameterize it would require specific knowledge, but you can’t do security without trained experts.

The post Remote Log Collection on Windows appeared first on Bozho's tech blog.

Protecting JavaScript Files (From Magecart-Style Attacks)

Post Syndicated from Bozho original https://techblog.bozho.net/protecting-javascript-files-from-magecart-attacks/

Most web pages now consist of multiple JavaScript files that are included in various ways (via <script> or in some more dynamic fashion, bundled and minified or not). But since these scripts interact with everything on the page, they can be a security risk.

And Magecart showcased that risk – the group attacked multiple websites, including British Airways and Ticketmaster, and stole a few hundred thousand credit card numbers.

It is a simple attack where the attacker inserts a malicious JavaScript snippet into a trusted JavaScript file, collects credit card details entered into payment forms and sends them to an attacker-owned website. Obviously, the easy part is writing the malicious JavaScript; the hard part is getting it onto the target website.

Many websites rely on externally hosted assets (including scripts) – be it a CDN, or a dedicated asset server (as in the case of British Airways). These externally hosted assets may be vulnerable in several ways:

  • Asset servers may be less protected than the actual server, because they are just static assets, what could go wrong?
  • Credentials to access CDN configuration may be leaked which can lead to an attacker replacing the original source scripts with their own
  • Man-in-the-middle attacks are possible if the asset server is misconfigured (e.g. allowing TLS downgrade attack)
  • The external service (e.g. CDN) that was previously trusted can go rogue – that’s unlikely with big providers, but smaller and cheaper ones are less predictable

Once the attackers have replaced the script, they are silently collecting data until they are caught. And this can be a long time.

So how do you protect against those attacks? A typical piece of advice is to introduce a content security policy, which will prevent scripts from untrusted domains from being executed. This is a good idea, but it doesn’t help in the scenario where a trusted domain is compromised. There are several main approaches, and I’ll summarize them below:

  • Subresource integrity – this is a browser feature that lets you specify the hash of a script file and validates that hash when the page loads. If it doesn’t match the hash of the actually loaded script, the script is blocked (see the sketch after this list for how the hash is computed). This sounds great, but has several practical implications. First, it means you need to complicate your build pipeline so that it calculates the hashes of minified and bundled resources and injects those hashes into the page templates. It’s a tedious process, but it’s doable. Then there are the dynamically loaded scripts, where you can’t use this feature, and the browsers that don’t support it fully (Edge, IE and Safari on mobile). And finally, if you don’t have a good build pipeline (which many small websites don’t), a very small legitimate change in the script can break your entire website.
  • Don’t use external services – that sounds straightforward but it isn’t always. CDNs exist for a reason and optimize your site loading speeds and therefore ranking, internal policies may require using a dedicated asset server, sometimes plugins (e.g. for WordPress) may fetch external resources. An exception to this rule is allowed if you somehow sandbox the third party script (e.g. via iframe as explained in the link above)
  • Secure all external servers properly – if you can do that, that’s great – upgrade the supported cipher suites, monitor for 0days, use only highly trusted CDNs. Regardless of anything, you should obviously always strive to do that. But it requires expertise and resources, which may not be available to every company and every team.
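As a small illustration of the subresource integrity option above, the integrity attribute is just a base64-encoded digest of the exact bytes of the served script. Here is a minimal sketch of computing it as part of a build step (the file path is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Base64;

public class SriHashSketch {
    public static void main(String[] args) throws Exception {
        // Hash the exact bytes of the bundled/minified script that will be served
        byte[] script = Files.readAllBytes(Paths.get("dist/app.min.js"));
        byte[] digest = MessageDigest.getInstance("SHA-384").digest(script);
        String integrity = "sha384-" + Base64.getEncoder().encodeToString(digest);

        // The value goes into the script tag, e.g.:
        // <script src="https://cdn.example.com/app.min.js" integrity="sha384-..." crossorigin="anonymous"></script>
        System.out.println(integrity);
    }
}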

There is one more scenario that may sound strange – if an attacker hacks into your main application server(s), they can replace the scripts with whatever they want. It sounds strange at first, because if they have access to the server, it’s game over anyway. But it’s not always full access with RCE – it might be limited access. Credit card numbers are usually not stored in plain text in the database, so having access to the application server may not mean you have access to the credit card numbers. And changing the custom backend code to collect the data is much more unpredictable and time-consuming than just replacing the scripts with malicious ones. None of the options above protect against that (as in this case the attacker may be able to change the expected hash for the subresource integrity check)

Because of the limitations of the above approaches, at my company we decided to provide a tool to monitor your website for such attacks. It’s called Scriptinel.com (short for Script Sentinel) and is currently in early beta. It’s mainly targeted at small website owners who can’t apply any of the above 3 points, but it can be used for sophisticated websites as well.

What it does is straightforward – it scans a given URL, extracts all scripts from it (even the dynamic ones), and starts monitoring them for changes with periodic requests. If it discovers a change, it notifies the website owner so that they can react.

This means that the attacker may have a few minutes to collect data, but time is an important factor here – this is not a “SELECT *” data breach; it relies on customers using the website. So a few minutes minimizes the damage. And it doesn’t break your website (I guess we can have a script to include that blocks the page if scriptinel has found discrepancies). It also doesn’t require changes in the build process to include hashes. Of course, such a reactive approach is not perfect, especially if there is nobody to react, but monitoring is a good idea regardless of whether other approaches are used.

There is the issue of protected pages and pages that are not directly accessible via a GET request – e.g. a payment page. For that you can enter your javascript files individually, rather than having the tool scan the page. We can add a more sophisticated user journey scan, with specifying credentials and steps to reach the protected pages, but for now that seems unnecessary.

How does it solve the “main server compromised” problem? Well, nothing solves that perfectly, as the attacker can make changes that serve the legitimate version of the script to your monitoring servers (identifying them by IP) and the modified scripts to everyone else. This can be done on the compromised external asset servers as well (though not with leaked CDN credentials). However, this implies the attacker knows that Scriptinel is used, knows the IP addresses of our scanners, and has gained sufficient control to serve different versions based on IP. This raises the bar significantly, and can even be made impossible to pull off if we regularly change the IP addresses within a significantly large IP range.

Such functionality may be available in some enterprise security suites, though I’m not aware of it (if it exists somewhere, please let me know).

Overall, the problem is niche, but tough, and not solving it can lead to serious data breaches even if your database is perfectly protected. Scriptinel is a simple to use, good enough solution (and one that’s arguably better than the other options).

Good information security is the right combination of knowledge, implementation of best practices and tools to help you with that. And maybe Scriptinel is one such tool.

The post Protecting JavaScript Files (From Magecart-Style Attacks) appeared first on Bozho's tech blog.

Let’s Annotate Our Methods With The Features They Implement

Post Syndicated from Bozho original https://techblog.bozho.net/lets-annotate-our-methods-with-the-features-they-implement/

Writing software consists of very little actual “writing”, and much more thinking, designing, reading, “digging”, analyzing, debugging, refactoring, aligning and meeting others.

The reading and digging part is where you try to understand what has been implemented before, why it has been implemented, and how it works. In larger projects it becomes increasingly hard to find what is happening and why – there are so many classes involved, and so many methods participate in implementing a particular feature.

That’s probably because there is a mismatch between the programming units (classes, methods) and the business logic units (features). Product owners want a “password reset” feature, and they don’t care if it’s done using framework configuration, custom code split in three classes, or one monolithic controller method that does that job.

This mismatch is partially addressed by the so called BDD (behaviour driven development), as business people can define scenarios in a formalized language (although they rarely do, it’s still up to the QAs or developers to write the tests). But having your tests organized around features and behaviours doesn’t mean the code is, and BDD doesn’t help in making your way through the codebase in search of why and how something is implemented.

Another issue is linking a piece of code to the issue tracking system. Source control conventions and hooks allow for setting the issue tracker number as part of the commit, and then when browsing the code, you can annotate the file and see the issue number. However, due to the many changes, even a very strict team will end up with methods that are related to multiple issues, and you can’t easily tell which is the proper one.

Yet another issue with the lack of a “feature” unit in programming languages is that you can’t trivially reuse existing projects to start a new one. We’ve all been there – you have a similar project and you want to get a skeleton to get things running faster. And while there are many tools to help with that (Spring Boot, Spring Roo, and other scaffolding utilities), they can rarely deliver what you need – you always have to tweak something, delete something, customize some configuration, as defaults are almost never practical.

And I have a simple proposal that will help with the issues above. As with any complex problem, simple ideas don’t solve everything, but are at least a step forward.

The proposal is in the title – let’s annotate our methods with the features they implement. Let’s have @Feature(name = "Forgotten password", issueTrackerCode="PROJ-123"). A method can implement multiple features, but that is generally discouraged by best practices (e.g. the single responsibility principle). The granularity of “feature” is something that has to be determined by each team and is the tricky part – sometimes an epic describes a feature, sometimes individual stories or even subtasks do. A definition of a feature should be agreed upon and every new team member should be told what to do and how to interpret it.
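
As an illustration, here is a minimal sketch of what such an annotation could look like (the attribute names follow the example above; runtime retention and the option to put it on classes are assumptions, chosen so that analysis tools can later discover the annotation via reflection):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// assuming runtime retention, so that analysis tools can discover the annotation via reflection
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.TYPE})
public @interface Feature {
    String name();
    String issueTrackerCode() default "";
}

And the usage would be as simple as:

@Feature(name = "Forgotten password", issueTrackerCode = "PROJ-123")
public void resetPassword(String email) {
    // generate a reset token, store it and send the email
}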

There is of course a lot of complexity, e.g. for generic methods like DAO methods, utility methods, or methods that are reused in too many places. But they also represent features, it’s just that these features are horizontal. “Data access layer” is a feature – a more technical one indeed, but it counts, and maybe deserves a story in the issue tracker.

Your features can actually be listed in one or several enums, grouped by type – business, horizontal, performance, etc. That way you can even compose features – e.g. account creation contains the logic itself, database access, a security layer.
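
A hypothetical variant (the names are made up for illustration) keeps the features in an enum, so that the annotation can only reference known features and composition becomes explicit:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// all features of the project; grouping by type is done informally here, via comments
public enum Features {
    // business features
    FORGOTTEN_PASSWORD,
    ACCOUNT_CREATION,
    // horizontal features
    DATA_ACCESS_LAYER,
    SECURITY_LAYER
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface ImplementsFeatures {
    Features[] value();
}

// composition: account creation is the business logic plus the horizontal features it relies on
@ImplementsFeatures({Features.ACCOUNT_CREATION, Features.DATA_ACCESS_LAYER, Features.SECURITY_LAYER})
public void createAccount(String username, String email) {
    // account creation logic
}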

How does such a proposal help?

  • Raises awareness of the single responsibility of methods and of the need for readable code
  • Provides a rationale for the existence of each method. Even if a proper comment is missing, the annotation will put a method (or a class) in context
  • Helps with navigating code and fixing issues (if you can see all places where a feature is implemented, you are more likely to spot an issue)
  • Allows tools to analyze your features – amount, complexity, how chaotic a feature is spread across the code base, test coverage per feature, etc.
  • Allows tools to use existing projects for scaffolding for new ones – you specify the features you want to have, and they are automatically copied

At this point I’m supposed to give a link to a GitHub project for a feature annotation library. But it doesn’t make sense to have a single-annotation project. It can easily become part of Guava or something similar, or it can be manually created in each project. The complex part – the tools that will do the scanning and analysis – deserves separate projects, but unfortunately I don’t have time to write one.
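
To give an idea of what the scanning part could look like, here is a minimal, hypothetical sketch using plain reflection over a known list of classes (a real tool would scan the whole classpath instead), assuming the @Feature annotation from the sketch above:

import java.lang.reflect.Method;
import java.util.List;

public class FeatureReport {

    // prints a "feature -> method" mapping for the given classes
    public static void print(List<Class<?>> classes) {
        for (Class<?> clazz : classes) {
            for (Method method : clazz.getDeclaredMethods()) {
                Feature feature = method.getAnnotation(Feature.class);
                if (feature != null) {
                    System.out.printf("%s (%s) -> %s.%s%n",
                            feature.name(), feature.issueTrackerCode(),
                            clazz.getSimpleName(), method.getName());
                }
            }
        }
    }
}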

But even without the tools, the concept of annotating methods with their high-level features is, I think, a useful one. Instead of trying to deduce why a method is there and what requirements it implements (and whether all the necessary tests were written at the time), such an annotation can come in handy.

The post Let’s Annotate Our Methods With The Features They Implement appeared first on Bozho's tech blog.

JKS: Extending a Self-Signed Certificate

Post Syndicated from Bozho original https://techblog.bozho.net/jks-extending-a-self-signed-certificate/

Sometimes you don’t have a PKI in place but you still need a key and a corresponding certificate to sign stuff (outside of the TLS context). And after the certificate in the initially generated jks file expires, you have a few options – either generate an entirely new keypair, or somehow “extend” the existing certificate. This is useful mostly for testing and internal systems, but still worth mentioning.

Extending certificates is generally not possible – once they expire, they’re done. However, you can have a new certificate with the same private key and a longer validity period. This sounds like something that should be easy to do, but it turns out it isn’t that easy with keytool. Even with my favourite tool, keystore explorer, it’s not immediately possible.

In order to reuse the private key to have a new, longer certificate, you need to do the following:

  1. Export the private key (with keytool & openssl or through the keystore-explorer UI, which is much simpler)
  2. Make a certificate signing request (with keytool or through the keystore-explorer UI)
  3. Sign the request with the private key (i.e. self-signed)
  4. Import the certificate in the store to replace the old (expired) one

The last two steps are not straightforward with keytool or keystore explorer. If you try to sign the request with your existing keystore keypair, the current certificate is used as the root of the chain (and you don’t want that). And you can’t remove the certificate and generate a new one.

So you need to use OpenSSL:

openssl x509 -req -days 3650 -in req.csr -signkey private.key -sha256 -extfile extfile.cnf -out result.crt

The extfile.cnf is optional and is used if you want to specify extensions. E.g. for timestamping, the extension file looks like this:

extendedKeyUsage=critical,timeStamping

After that “simply” create a new keystore and import the private key and the newly generated certificate. This is straightforward through the keystore-explorer UI, and much less easy through the command line.
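
If you do have to script it, here is a rough sketch of how this last step could be done in Java instead (the file names, alias, password and the assumption of an unencrypted PKCS#8 RSA key are made up for the example):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyFactory;
import java.security.KeyStore;
import java.security.PrivateKey;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import java.security.spec.PKCS8EncodedKeySpec;
import java.util.Base64;

public class RebuildKeystore {

    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray();

        // load the reused private key (assumed to be an unencrypted PKCS#8 PEM)
        String pem = new String(Files.readAllBytes(Paths.get("private.key")), StandardCharsets.US_ASCII)
                .replace("-----BEGIN PRIVATE KEY-----", "")
                .replace("-----END PRIVATE KEY-----", "")
                .replaceAll("\\s", "");
        byte[] der = Base64.getDecoder().decode(pem);
        PrivateKey privateKey = KeyFactory.getInstance("RSA")
                .generatePrivate(new PKCS8EncodedKeySpec(der));

        // load the newly issued certificate (result.crt from the openssl step)
        Certificate cert;
        try (FileInputStream in = new FileInputStream("result.crt")) {
            cert = CertificateFactory.getInstance("X.509").generateCertificate(in);
        }

        // create a fresh JKS and store the key together with its new certificate
        KeyStore store = KeyStore.getInstance("JKS");
        store.load(null, null); // initialize an empty keystore
        store.setKeyEntry("mykey", privateKey, password, new Certificate[] { cert });
        try (FileOutputStream out = new FileOutputStream("keystore.jks")) {
            store.store(out, password);
        }
    }
}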

You’ve noticed my preference for keystore explorer. It is a great tool that makes working with keys and keystores easy and predictable, as opposed to command-line tools like keytool and openssl, which I’m sure nobody is able to use without googling every single command. Of course, if you have to do very specific or odd stuff, you’ll have to revert to the command line, but for most operations the UI is sufficient (unless you have to automate it, in which case, obviously, use the CLI).

You’d rarely need to do what I’ve shown above, but in case you have to, I hope the hints above were useful.

The post JKS: Extending a Self-Signed Certificate appeared first on Bozho's tech blog.