Tag Archives: Developer tips

Content-Security-Policy Nonce with Spring Security

Post Syndicated from Bozho original https://techblog.bozho.net/content-security-policy-nonce-with-spring-security/

Content-Security-Policy is important for web security. Yet it’s not mainstream, its syntax is hard, it’s rather prohibitive and tools rarely have flexible support for it.

While Spring Security does have a built-in Content Security Policy (CSP) configuration, it only allows you to specify the policy as a string, not build it dynamically. And in some cases you need more than that.

In particular, CSP discourages the use of inline JavaScript, because it introduces vulnerabilities. If you really need it, you can use unsafe-inline, but that’s a bad approach, as it negates the whole point of CSP. The alternative presented on that page is to use a hash or a nonce.

I’ll explain how to use a nonce with Spring Security, if you are using .and().headers().contentSecurityPolicy(policy). The policy string is static, so you can’t generate a random nonce for each request – and a static nonce is useless. So first, you define a CSP nonce filter:

public class CSPNonceFilter extends GenericFilterBean {
    private static final int NONCE_SIZE = 32; //recommended is at least 128 bits/16 bytes
    private static final String CSP_NONCE_ATTRIBUTE = "cspNonce";

    private SecureRandom secureRandom = new SecureRandom();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain) throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        byte[] nonceArray = new byte[NONCE_SIZE];

        secureRandom.nextBytes(nonceArray);

        String nonce = Base64.getEncoder().encodeToString(nonceArray);
        request.setAttribute(CSP_NONCE_ATTRIBUTE, nonce);

        chain.doFilter(request, new CSPNonceResponseWrapper(response, nonce));
    }

    /**
     * Wrapper to fill the nonce value
     */
    public static class CSPNonceResponseWrapper extends HttpServletResponseWrapper {
        private String nonce;

        public CSPNonceResponseWrapper(HttpServletResponse response, String nonce) {
            super(response);
            this.nonce = nonce;
        }

        @Override
        public void setHeader(String name, String value) {
            if (name.equals("Content-Security-Policy") && StringUtils.isNotBlank(value)) {
                super.setHeader(name, value.replace("{nonce}", nonce));
            } else {
                super.setHeader(name, value);
            }
        }

        @Override
        public void addHeader(String name, String value) {
            if (name.equals("Content-Security-Policy") && StringUtils.isNotBlank(value)) {
                super.addHeader(name, value.replace("{nonce}", nonce));
            } else {
                super.addHeader(name, value);
            }
        }
    }
}

And then you register it with Spring Security using .addFilterBefore(new CSPNonceFilter(), HeaderWriterFilter.class).

The policy string should contain `nonce-{nonce}`, which gets replaced with a random nonce on each request.
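Putting the two together, the security configuration could look roughly like this (a minimal sketch – the exact directives in the policy are just an example and should be adjusted to your application):

import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
import org.springframework.security.web.header.HeaderWriterFilter;

@EnableWebSecurity
public class SecurityConfig extends WebSecurityConfigurerAdapter {

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http
            // register the nonce filter before the HeaderWriterFilter (explained below)
            .addFilterBefore(new CSPNonceFilter(), HeaderWriterFilter.class)
            .headers()
                // {nonce} is replaced per request by the CSPNonceResponseWrapper
                .contentSecurityPolicy("script-src 'self' 'nonce-{nonce}'")
                .and()
            .and()
            .authorizeRequests().anyRequest().authenticated();
    }
}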

The filter is set before the HeaderWriterFilter so that it can wrap the response and intercept all calls that set headers. It can’t be done by simply overriding the headers after they are set by the HeaderWriterFilter (using response.setHeader(..)), because by then the response is already committed and overriding does nothing.

Then in your pages where you for some reason need inline scripts, you can use:

<script nonce="{{ cspNonce }}">...</script>

(I’m using the Pebble template syntax, but you can use any template engine to output the request attribute “cspNonce”.)

Once again, inline JavaScript is rarely a good idea, but sometimes it’s necessary, at least temporarily – if you are adding a CSP to a legacy application, for example, and can’t rewrite everything.

We should have CSP everywhere, but building the policy should be aided by the frameworks we use, otherwise it’s rather tedious to write a proper policy that doesn’t break your application and is secure at the same time.

The post Content-Security-Policy Nonce with Spring Security appeared first on Bozho's tech blog.

Releasing Often Helps With Analyzing Performance Issues

Post Syndicated from Bozho original https://techblog.bozho.net/releasing-often-helps-with-analyzing-performance-issues/

Releasing often is a good thing. It’s cool, and helps us deliver new functionality quickly, but I want to share one positive side-effect – it helps with analyzing production performance issues.

We do releases every 5 to 10 days, and after a recent release the application CPU usage jumped twofold (the lines on the chart are differently colored because we use blue-green deployment):

What are the typical ways to find performance issues with production loads?

  • Connect a profiler directly to production – tricky, as it requires managing network permissions and might introduce unwanted overhead
  • Run performance tests against a staging or local environment and do profiling there – good, except your performance tests might not hit exactly the functionality that causes the problem (this is what happened in our case, as the problem was caused by some particular types of API calls that weren’t present in our performance tests). Also, performance tests can be tricky
  • Do a thread dump (and heap dump) and analyze them locally – a good step, but requires some luck and a lot of experience analyzing dumps, even if equipped with the right tools
  • Check your git history / release notes for what change might have caused it – this is what helped us resolve the issue. And it was possible because there were only 10 days of commits between the releases.

We could go through all of the commits and spot potential performance issues. Most of them turned out not to be a problem, and one seemingly unproblematic piece was discovered to be the culprit after commenting it out for a brief period and deploying a quick release without it, to test the hypothesis. I’ll share a separate post about the particular issue, but we would have wasted a lot more time if that release had contained 3 months’ worth of commits rather than 10 days’.

Sometimes it’s not an obvious spike in the CPU or memory, but a more gradual issue that you introduce at some point and that starts being a problem a few months later. That’s what happened a few months ago, when we noticed a steady growth in CPU usage along with the growth of ingested data. Logical in theory, but the CPU usage grew faster than the data ingestion rate, which isn’t good.

So we had to answer the question “when did it start growing” in order to pinpoint the release that introduced the issue. Because that release had only 5 days of commits, it was much easier to find the culprit.

All of the above techniques are useful and should be employed at the right time. But releasing often gives you a hand with analyzing where a performance issue is coming from.

The post Releasing Often Helps With Analyzing Performance Issues appeared first on Bozho's tech blog.

Syntactic Sugar Is Not Always Good

Post Syndicated from Bozho original https://techblog.bozho.net/syntactic-sugar-is-not-always-good/

This write-up is partly inspired by a recent post by Vlad Mihalcea on LinkedIn about the recently introduced text blocks in Java. More about them can be read here.

Now, that’s a nice feature. I used it in Scala several years ago, and other languages also have it, so it seems like a no-brainer to introduce it in Java.

But, syntactic sugar (please don’t argue whether that’s precisely syntactic sugar or not) can be problematic and lead to “syntactic diabetes”. It has two possible issues.

The less important one is consistency – if you can do one thing in multiple, equally valid ways, that introduces inconsistency in the code and pointless arguments about “the right way to do things”. In this context – for 2-line strings, do you use a text block or not? Should you do multi-line formatting for simple strings or not? Should you configure checkstyle rules to reject one or the other option, and in what circumstances?

The second, and bigger problem, is code readability. I know it sounds counter-intuitive, but bear with me. The example that Vlad gave illustrates that – do you want to have a 20-line SQL query in your Java code? I’d say no – you’d better extract that to a separate file, and using some form of templating, populate the right values. This is certainly readable when you browse the code:

String query = QueryUtils.loadQuery("age-query.sql", timestamp, diff, other);
// or
String query = QueryUtils.loadQuery("age-query.sql",
        Arrays.asList("param1Name", "param2Name"),
        Arrays.asList(param1Value, param2Value));

Queries can be placed in /src/main/resources and loaded as templates (and cached) by the QueryUtils class. And because text blocks didn’t exist before, you wouldn’t be tempted to put ugly concatenated string queries inside your code.
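QueryUtils is not a library class – it’s a small utility you’d write yourself. A minimal sketch of the second variant above, assuming named placeholders like :paramName inside the .sql files and a simple in-memory cache (and note that in real code you’d keep the placeholders and bind the values via prepared statements rather than inlining them):

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueryUtils {

    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    public static String loadQuery(String fileName, List<String> paramNames, List<Object> paramValues) {
        String query = CACHE.computeIfAbsent(fileName, QueryUtils::readFromClasspath);
        for (int i = 0; i < paramNames.size(); i++) {
            // for illustration only – in production bind the values with a PreparedStatement
            query = query.replace(":" + paramNames.get(i), String.valueOf(paramValues.get(i)));
        }
        return query;
    }

    private static String readFromClasspath(String fileName) {
        // the .sql files live in /src/main/resources and end up on the classpath root
        try (InputStream in = QueryUtils.class.getResourceAsStream("/" + fileName)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}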

But now, with this feature, you are tempted to do that, because, well, it looks good. Same goes for Elasticsearch queries, for JSON templates and whatnot. With this “sugar” you have more incentive to just put them in the code, where they, arguably, make the code less readable. If you really have to debug the query, as opposed to assuming its semantics by its name and relying on a proper implementation, you can easily go to age-query.sql and work with it. Just like when you extract a private method with some implementation details so that it makes the calling method more readable and easy to follow.

Both problems have manifested themselves in my Scala experience, which I’ve summed up in my talk “Scala – the good, the bad and the very ugly” (only slides available). Scala allows you to express things nicely and in multiple ways, which leads to inconsistencies and horrible code readability in some cases.

Counter-intuitively, sometimes syntactic improvements may be worse for code readability, because they introduce complexity and because they make it easier to do the wrong thing.

That’s not a universal complaint, and certainly syntactic sugar is needed – you don’t have to write List<String> list = new ArrayList<String>() if you can use the diamond operator. But each such feature should be carefully thought through, not just for how easy it makes writing the code, but also for how easy it is to read it, and more importantly – what type of code it incentivizes.

The post Syntactic Sugar Is Not Always Good appeared first on Bozho's tech blog.

Creating a CentOS Startup Screen

Post Syndicated from Bozho original https://techblog.bozho.net/creating-a-centos-startup-screen/

When distributing bundled software, you have multiple options, but if we exclude fancy newcomers like Docker and Kubernetes, you’re left with the following options: an installer (for Windows), a package (rpm or deb) for the corresponding distro, a tarball with a setup shell script that creates the necessary configurations, and a virtual machine (or virtual appliance).

All of these options are applicable in different scenarios, but distributing a ready-to-deploy virtual machine image is considered standard for enterprise software. Your machine has all the dependencies it needs (because it might not be permitted to connect to the internet), and it just has to be fired up.

But typically you’d want some initial configuration or at least have the ability to show the users how to connect to your (typically web-based) application. And so creating a startup screen is what many companies choose to do. Below is a simple way to do that on CentOS, which is my distro of preference. (There are other resources on the topic, e.g. this one, but it relies on /etc/inittab which is deprecated in CentOS 8).

useradd startup -m
yum -y install dialog

sed -i -- "s/-o '-p -- \\u' --noclear/--autologin startup --noclear/g" /usr/lib/systemd/system/getty@tty1.service

chmod +x /install/startup.sh
echo "exec /install/startup.sh" >> /home/startup/.bashrc

systemctl daemon-reload

With the code above you are creating a startup user and auto-logging that user in in place of the regular login prompt. Replacing the ExecStart line in getty@tty1.service is doing exactly that.

Then the script adds the invocation of a startup bash script to the .bashrc, which gets run when the user is logged in. What this script does is entirely up to you; below is a simple demo using the dialog command (which we just installed above):

#!/bin/sh
# Based on https://askubuntu.com/questions/1705/how-can-i-create-a-select-menu-in-a-shell-script

HEIGHT=15
WIDTH=70
CHOICE_HEIGHT=4
BACKTITLE="Your Company"
TITLE="Your Product setup"
BIND_IP=`ifconfig | sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p'`
INFO="Welcome to MyProduct.\n\n\nWeb access URL: https://$BIND_IP\n\n\n\nFor more information visit https://docs.example.com"

CHOICE=$(dialog --clear \
                --backtitle "$BACKTITLE" \
                --title "$TITLE" \
                --msgbox "$INFO" \
                $HEIGHT $WIDTH \
                2>&1 >/dev/tty)

clear
echo 'Enter password for user "root":'
su root

This dialog shows just some basic information, but you can extend it to allow users to make choices and input some parameters. More importantly, it gets the current IP address and shows it to the user. That’s not something they can’t do themselves in other ways, but it’s friendlier to show it like that. And you can’t hard-code that, because in each installation it will have a different IP (even if not using DHCP, you should let the user set the static IP that they’ve assigned rather than forcing one on them). At the end of the script it switches to the root user.

Security has to be considered here – your startup user should not be allowed to do anything meaningful in the system, because it is automatically logged in without a password. According to this answer, using exec sort-of solves that problem (e.g. if you mistype the root password, you are back in the startup.sh script rather than at the console).

I agree that’s a rare use-case but I thought I’d share this “arcane” knowledge.

The post Creating a CentOS Startup Screen appeared first on Bozho's tech blog.

My Advice To Developers About Working With Databases: Make It Secure

Post Syndicated from Bozho original https://techblog.bozho.net/my-advice-to-developers-about-working-with-databases-make-it-secure/

Last month Ben Brumm asked me for the one piece of advice I’d give to developers working with databases (in reality – almost all of us). He published mine, as well as many others’ answers, here, but I’d like to share it with my readers as well.

If I had to give developers working with databases one piece of advice, it would be: make it secure. Everything else you’ll figure out in time – how to structure your tables, how to use an ORM, how to optimize queries, how to use indexes, how to do multitenancy. But security may not be on the list of requirements, and it may be too late when the need becomes obvious.

So I’d focus on several things:

  • Prevent SQL injections – make sure you use an ORM or prepared statements rather than building queries with string concatenation (see the sketch after this list). Otherwise a malicious actor can inject anything in your queries and turn them into a DROP DATABASE query, or worse – one that exfiltrates all the data.
  • Support encryption in transit – this often has to be supported by the application’s driver configuration, e.g. by trusting a particular server certificate. Unencrypted communication, even within the same datacenter, is a significant risk and that’s why databases support encryption in transit. (You should also think about encryption at rest, but that’s more of an Ops task)
  • Have an audit log at the application level – “who did what” is a very important question from a security and compliance point of view. And no native database functionality can consistently answer the question “who” – it’s the application that manages users. So build an audit trail layer that records who did what changes to what entities/tables.
  • Consider record-level encryption for sensitive data – a database can be dumped in full by those who have access (or gain access maliciously). This is how data breaches happen. Sensitive data (like health data, payment data, or even API keys, secrets or tokens) benefits from being encrypted with an application-managed key, so that access to the database alone doesn’t reveal that data. Another option, often used for credit cards, is tokenization, which shifts the encryption responsibility to the tokenization providers. Managing the keys is hard, but even a basic approach is better than nothing.
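To make the first point concrete, here’s a minimal JDBC sketch (the table and column names are made up) contrasting string concatenation with a prepared statement:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UserQueries {

    public boolean userExists(Connection connection, String email) throws SQLException {
        // UNSAFE: a value like "' OR '1'='1" (or worse) becomes part of the SQL itself
        // String sql = "SELECT 1 FROM users WHERE email = '" + email + "'";

        // SAFE: the value is bound as a parameter and never interpreted as SQL
        String sql = "SELECT 1 FROM users WHERE email = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, email);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}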

Security is often viewed as an “operations” responsibility, and this has led to a lot of tools that try to solve the above problems without touching the application – web application firewalls, heuristics for database access monitoring, trying to extract the current user, etc. But the application is the right place for many of these protections (although certainly not the only place), and as developers we need to be aware of the risks and best practices.

The post My Advice To Developers About Working With Databases: Make It Secure appeared first on Bozho's tech blog.

Discovering an OSSEC/Wazuh Encryption Issue

Post Syndicated from Bozho original https://techblog.bozho.net/discovering-an-ossec-wazuh-encryption-issue/

I’m trying to get the Wazuh agent (a fork of OSSEC, one of the most popular open source security tools, used for intrusion detection) to talk to our custom backend (namely, our LogSentinel SIEM Collector) to allow us to reuse the powerful Wazuh/OSSEC functionalities for customers that want to install an agent on each endpoint rather than just one collector that “agentlessly” reaches out to multiple sources.

But even though there’s good documentation on the message format and encryption, I couldn’t successfully decrypt the messages. (I’ll refer to both Wazuh and OSSEC, as the functionality is almost identical in both, with the distinction that Wazuh added AES support in addition to Blowfish.)

That led me to a two-day investigation of possible reasons. The first side-discovery was the undocumented OpenSSL auto-padding of keys and IVs described in my previous article. Then it led me to actually writing C code (and copying the relevant Wazuh/OSSEC pieces) in order to debug the issue. With Wazuh/OSSEC I was generating one ciphertext, and with Java and the openssl CLI – a different one.

I made sure the key, key size, IV and mode (CBC) were identical, that they were equally padded and that OpenSSL’s EVP API was correctly used. All of that was confirmed and yet there was a mismatch, and therefore I could not decrypt the Wazuh/OSSEC message on the other end.

After discovering the 0-padding, I also discovered a mistake in the documentation, which used a static IV of FEDCA9876543210 rather than the one found in the code, where the 0 precedes the 9 – FEDCA0987654321. But that didn’t fix the issue either; it only got me one step closer.

A side-note here on IVs – Wazuh/OSSEC is using a static IV, which is a bad practice. The issue was reported 5 years ago, but it is minor, because they are using some additional randomness per message that remediates the use of a static IV; it’s just not idiomatic to do it that way and may have unexpected side-effects.

So, after debugging the C code, I got to a simple piece of code that could be used to reproduce the issue and asked a question on Stack Overflow. Five minutes after posting the question I found another, related question that had the answer – using hex strings like that in C doesn’t work. Instead, they should be encoded: char *iv = (char *)"\xFE\xDC\xBA\x09\x87\x65\x43\x21\x00\x00\x00\x00\x00\x00\x00\x00";. So, the value is not the bytes corresponding to the hex string, but the ASCII codes of each character in the hex string. I validated that on the receiving Java end.
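The check boils down to comparing the two interpretations of the string – the decoded hex bytes vs. the ASCII code of each character – and feeding the latter to SecretKeySpec/IvParameterSpec. A minimal sketch (the IV value here is just an example, not necessarily the exact one from the source):

import java.nio.charset.StandardCharsets;

public class HexStringCheck {
    public static void main(String[] args) {
        String hexIv = "FEDCBA0987654321"; // example value – use the exact string from the Wazuh/OSSEC source

        // what you might expect: the decoded hex value, i.e. 8 bytes FE DC BA 09 87 65 43 21
        // what the C code actually uses: the ASCII code of each character, i.e. 16 bytes
        byte[] asciiBytes = hexIv.getBytes(StandardCharsets.US_ASCII);

        StringBuilder sb = new StringBuilder();
        for (byte b : asciiBytes) {
            sb.append(String.format("%02X ", b));
        }
        // prints 46 45 44 43 42 41 30 39 38 37 36 35 34 33 32 31 – these are the bytes
        // to pass to new IvParameterSpec(..), and the same applies to the key
        System.out.println(sb);
    }
}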

This has implications for the documentation, as well as for the whole scheme. Because the Wazuh/OSSEC AES key is: MD5(password) + MD5(MD5(agentName) + MD5(agentID)){0, 15}, the 2nd part is practically discarded, because MD5(password) is 32 characters (= 32 ASCII codes/bytes), which is already the length of the AES key. This means the key is derived from a significantly smaller pool of options – each key byte can take only one of 16 values (the ASCII codes of the hex digits), rather than one of 256.

I raised an issue with Wazuh. Although this can be seen as a vulnerability (due to the reduced key space), it’s rather minor from a security point of view, and as communication is mostly happening within the corporate network, I don’t think it has to be privately reported and fixed immediately.

Yet, I made a recommendation for introducing an additional configuration option to allow transitioning to an updated protocol without causing backward compatibility issues. In fact, I’d go further and recommend using TLS/DTLS rather than a home-grown, AES-based scheme. Mutual authentication can be achieved through TLS mutual authentication rather than through a shared secret.

It’s satisfying to discover issues in popular software, especially when they are not written in your “native” programming language. And as a rule of thumb – encodings often cause problems, so we should be extra careful with them.

The post Discovering an OSSEC/Wazuh Encryption Issue appeared first on Bozho's tech blog.

OpenSSL Key and IV Padding

Post Syndicated from Bozho original https://techblog.bozho.net/openssl-key-and-iv-padding/

OpenSSL is an omnipresent tool when it comes to encryption. While in Java we are used to the native Java implementations of cryptographic primitives, most other languages rely on OpenSSL.

Yesterday I was investigating the encryption used by one open source tool written in C, and two things looked strange: they were using a 192 bit key for AES 256, and they were using a 64-bit IV (initialization vector) instead of the required 128 bits (in fact, it was even a 56-bit IV).

But somehow, magically, OpenSSL didn’t complain the way my Java implementation did, and encryption worked. So, I figured, OpenSSL is doing some padding of the key and IV. But what? Is it prepending zeroes, is it appending zeroes, is it doing PKCS padding or ISO/IEC 7816-4 padding, or any of the other alternatives? I had to know if I wanted to make my Java counterpart supply the correct key and IV.

It was straightforward to test with the following commands:

# First generate the ciphertext by encrypting input.dat which contains "testtesttesttesttesttest"
$ openssl enc -aes-256-cbc -nosalt -e -a -A -in input.dat -K '7c07f68ea8494b2f8b9fea297119350d78708afa69c1c76' -iv 'FEDCBA987654321' -out input-test.enc

# Then test decryption with the same key and IV
$ openssl enc -aes-256-cbc -nosalt -d -a -A -in input-test.enc -K '7c07f68ea8494b2f8b9fea297119350d78708afa69c1c76' -iv 'FEDCBA987654321'
testtesttesttesttesttest

# Then test decryption with different paddings
$ openssl enc -aes-256-cbc -nosalt -d -a -A -in input-test.enc -K '7c07f68ea8494b2f8b9fea297119350d78708afa69c1c76' -iv 'FEDCBA9876543210'
testtesttesttesttesttest

$ openssl enc -aes-256-cbc -nosalt -d -a -A -in input-test.enc -K '7c07f68ea8494b2f8b9fea297119350d78708afa69c1c760' -iv 'FEDCBA987654321'
testtesttesttesttesttest

$ openssl enc -aes-256-cbc -nosalt -d -a -A -in input-test.enc -K '7c07f68ea8494b2f8b9fea297119350d78708afa69c1c76000' -iv 'FEDCBA987654321'
testtesttesttesttesttest

$ openssl enc -aes-256-cbc -nosalt -d -a -A -in input-test.enc -K '07c07f68ea8494b2f8b9fea297119350d78708afa69c1c76' -iv 'FEDCBA987654321'
bad decrypt

So, OpenSSL is padding keys and IVs with zeroes until they meet the expected size. Note that if -aes-192-cbc is used instead of -aes-256-cbc, decryption will fail, because OpenSSL will pad it with fewer zeroes and so the key will be different.
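If you do need your Java code to interoperate with something that relies on this behavior, the equivalent is to right-pad (or truncate) the key and IV bytes to the expected sizes yourself. A minimal sketch, assuming AES-256 in CBC mode (and, as noted below, something you should avoid relying on):

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class OpenSslZeroPadding {

    // OpenSSL appears to right-pad short keys/IVs with zero bytes; Arrays.copyOf does the same
    static byte[] zeroPad(byte[] input, int targetLength) {
        return Arrays.copyOf(input, targetLength);
    }

    public static void main(String[] args) throws Exception {
        byte[] shortKey = new byte[24]; // e.g. a 192-bit key handed to AES-256
        byte[] shortIv = new byte[7];   // e.g. a 56-bit IV

        SecretKeySpec key = new SecretKeySpec(zeroPad(shortKey, 32), "AES"); // 32 bytes for an AES-256 key
        IvParameterSpec iv = new IvParameterSpec(zeroPad(shortIv, 16));      // 16 bytes for the CBC IV

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key, iv);
        byte[] ciphertext = cipher.doFinal("testtesttesttesttesttest".getBytes(StandardCharsets.UTF_8));
        System.out.println(Base64.getEncoder().encodeToString(ciphertext));
    }
}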

Not an unexpected behavior, but I’d prefer it to report incorrect key sizes rather than “do magic”, especially when it’s not easy to find exactly what magic it’s doing. I couldn’t find it documented, and the comments to this SO question hint in the same direction. In fact, for plaintext padding, OpenSSL uses PKCS padding (which is documented), so it’s extra confusing that it’s using zero-padding here.

In any case, follow the advice from the Stack Overflow answer and don’t rely on this padding – always provide the key and IV of the right size.

The post OpenSSL Key and IV Padding appeared first on Bozho's tech blog.

ElasticSearch Multitenancy With Routing

Post Syndicated from Bozho original https://techblog.bozho.net/elasticsearch-multitenancy-with-routing/

Elasticsearch is great, but optimizing it for high load is always tricky. This won’t be yet another “Tips and tricks for optimizing Elasticsearch” article – there are many great ones out there. I’m going to focus on one narrow use-case – multitenant systems, i.e. those that support multiple customers/users (tenants).

You can build a multitenant search engine in three different ways:

  • Cluster per tenant – this is the hardest to manage and requires a lot of devops automation. Depending on the types of customers it may be worth it to completely isolate them, but that’s rarely the case
  • Index per tenant – this can be fine initially, and requires little additional coding (you just parameterize the “index” parameter in the URL of the queries), but it’s likely to cause problems as the customer base grows. Also, supporting consistent mappings and settings across indexes may be trickier than it sounds (e.g. some may reject an update and others may not depending on what’s indexed). Moving data to colder indexes also becomes more complex.
  • Tenant-based routing – this means you put everything in one cluster but you configure your search routing to be tenant-specific, which allows you to logically isolate data within a single index.

The last one seems to be the preferred option in general. What is routing? The Elasticsearch blog has a good overview and documentation. The idea lies in the way Elasticsearch handles indexing and searching – it splits data into shards (each shard is a separate Lucene index and can be replicated on more than one node). A shard is a logical grouping within a single Elasticsearch node. When no custom routing is used and an index request comes in, the document ID is used to determine which shard will store the data. However, during search, Elasticsearch doesn’t know which shards have the data, so it has to ask multiple shards and gather the results. Related to that, there’s the newly introduced adaptive replica selection, where the proper shard replica is selected intelligently, rather than using round-robin.

Custom routing allows you to specify a routing value when indexing a document and then a search can be directed only to the shard that has the same routing value. For example, at LogSentinel when we index a log entry, we use the data source id (applicationId) for routing. So each application (data source) that generates logs has a separate identifier which allows us to query only that data source’s shard. That way, even though we may have a thousand clients with a hundred data sources each, a query will be precisely targeted to where the data for that particular customer’s data source lies.
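With the Java high-level REST client, for example, this is a single extra call per request – a sketch with made-up index and field names (and assuming the 7.x client API):

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;

public class RoutedLogRepository {

    private final RestHighLevelClient client;

    public RoutedLogRepository(RestHighLevelClient client) {
        this.client = client;
    }

    public void indexLogEntry(String applicationId, String id, String json) throws IOException {
        IndexRequest request = new IndexRequest("logs")
                .id(id)
                .source(json, XContentType.JSON)
                .routing(applicationId); // all documents of this data source land on the same shard
        client.index(request, RequestOptions.DEFAULT);
    }

    public void search(String applicationId, String text) throws IOException {
        SearchRequest request = new SearchRequest("logs")
                .routing(applicationId) // only the shard for this routing value is queried
                .source(new SearchSourceBuilder()
                        .query(QueryBuilders.matchQuery("message", text)));
        client.search(request, RequestOptions.DEFAULT);
    }
}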

This is key for horizontally scaling multitenant applications. When there are terabytes of data and billions of documents, many shards will be needed (in order to avoid large and heavy shards that cause performance issues). Finding data in this haystack requires the ability to know where to look.

Note that you can (and probably should) make routing required in these cases – each indexed document is then required to have a routing key, otherwise an implementation oversight may lead to documents being indexed without one and to queries that needlessly hit all shards.

Using custom routing you are practically turning one large Elasticsearch cluster into smaller sections, logically separated based on meaningful identifiers. In our case, it is not a userId/customerId, but one level deeper – there are multiple shards per customer, but depending on the use-case, it can be one shard per customer, using the userId/customerId. Using more than one shard per customer may complicate things a little – for example having too many shards per customer may require searches that span too many shards, but that’s not necessarily worse than not using routing.

There are some caveats – the isolation of customer data has to be handled in the application layer (whereas for the first two approaches data is segregated operationally). If there’s an application bug or lack of proper access checks, one user can query data from other users’ shards by specifying their routing key. It’s the role of the application in front of Elasticsearch to only allow queries with routing keys belonging to the currently authenticated user.

There are cases when the first two approaches to multitenancy are viable (e.g. a few very large customers), but in general the routing approach is the most scalable one.

The post ElasticSearch Multitenancy With Routing appeared first on Bozho's tech blog.

Bulk vs Individual Compression

Post Syndicated from Bozho original https://techblog.bozho.net/bulk-vs-individual-compression/

I’d like to share something very brief and very obvious – that compression works better with large amounts of data. That is, if you have to compress 100 sentences you’d better compress them in bulk rather than one sentence at a time. Let me illustrate that:

public static void main(String[] args) throws Exception {
    List<String> sentences = new ArrayList<>();
    for (int i = 0; i < 100; i ++) {
        StringBuilder sentence = new StringBuilder();
        for (int j = 0; j < 100; j ++) { 
          sentence.append(RandomStringUtils.randomAlphabetic(10)).append(" "); 
        } 
        sentences.add(sentence.toString()); 
    } 
    byte[] compressed = compress(StringUtils.join(sentences, ". ")); 
    System.out.println(compressed.length); 
    System.out.println(sentences.stream().collect(Collectors.summingInt(sentence -> compress(sentence).length)));
}

The compress method is using commons-compress to easily generate results for multiple compression algorithms:

public static byte[] compress(String str) {
   if (str == null || str.length() == 0) {
       return new byte[0];
   }
   ByteArrayOutputStream out = new ByteArrayOutputStream();
   try (CompressorOutputStream gzip = new CompressorStreamFactory()
           .createCompressorOutputStream(CompressorStreamFactory.GZIP, out)) {
       gzip.write(str.getBytes("UTF-8"));
       gzip.close();
       return out.toByteArray();
   } catch (Exception ex) {
       throw new RuntimeException(ex);
   }
}

The results are as follows, in bytes (note that there’s some randomness, so algorithms are not directly comparable):

Algorithm     Bulk     Individual
GZIP          6590     10596
LZ4_FRAMED    9214     10900
BZIP2         6663     12451

Why is that an obvious result? Because of the way most compression algorithms work – they look for patterns in the raw data and create a map of those patterns (a very rough description).

How is that useful? In big data scenarios where the underlying store supports per-record compression (e.g. a database or search engine), you may save a significant amount of disk space if you bundle multiple records into one stored/indexed record.

This is not generically useful advice, though. You should check the particular datastore implementation. For example, MS SQL Server supports both row and page compression. Cassandra does compression at the SSTable level, so it may not matter how you structure your rows. Certainly, if storing data in files, storing it in one file and compressing it is more efficient than compressing multiple files separately.

Disk space is cheap so playing with data bundling and compression may be seen as premature optimization. But in systems that operate on large datasets it’s a decision that can save you a lot of storage costs.

The post Bulk vs Individual Compression appeared first on Bozho's tech blog.

Encryption Overview [Webinar]

Post Syndicated from Bozho original https://techblog.bozho.net/encryption-overview-webinar/

“Encryption” has turned into a buzzword, especially after privacy standards and regulation vaguely mention it and vendors rush to provide “encryption”. But what does it mean in practice? I did a webinar (hosted by my company, LogSentinel) to explain the various aspects and pitfalls of encryption.

You can register to watch the webinar here, or view it embedded below:

And here are the slides:

Of course, encryption is a huge topic, worth a whole course rather than just a webinar, but I hope I’m providing good starting points. The interesting technique that we employ in our company is “searchable encryption”, which allows you to have encrypted data and still search in it. There are many more very nice (and sometimes niche) applications of encryption and cryptography in general, as Bruce Schneier mentions in his recent interview. These applications can solve very specific problems with information security and privacy that we face today. We only need to make them mainstream or at least increase awareness.

The post Encryption Overview [Webinar] appeared first on Bozho's tech blog.

Blockchainizing Existing Databases

Post Syndicated from Bozho original https://techblog.bozho.net/blockchainizing-existing-databases/

Blockchain has been a buzzword for the past several years and it hasn’t lived up to its promises (yet). The value proposition usually includes vague claims about trust and unmodifiability, but that has rarely brought demonstrable improvement to existing processes.

There are dozens of blockchain projects, networks, protocols, “standards”, and all of them can in some way help you solve either data integrity issues (guarantee that data has not been tampered with) or multi-party trust issues (several companies participating in one process shouldn’t have to trust each other in order to have automated cross-organization business processes).

However, deploying and integrating a separate blockchain solution is usually a large project in itself and especially in the COVID-19 crisis likely gets postponed because of the questionable return on investment.

But for the enterprise, blockchain is largely a shared database. Sharing data with other participants in a given business process in a secure way that doesn’t allow any of the participants to cheat. And this can be achieved not by adding a whole new blockchain infrastructure that would in turn integrate with existing systems (which in many cases can’t be integrated easily because they don’t have APIs), but by “blockchainizing” the existing database.

Ideally, what I’m describing should be a project itself, which is either deployed alongside the database, or as part of an application. And what it can do is as follows:

  • Select tables and columns to share with other participants – obviously only parts of the database should be shared with others
  • Define shared data model and data transformations – sometimes data has to be transformed, or masked, in order to meet regulatory or privacy requirements. Certainly, the databases of participants will differ and they have to be aligned to a common model.
  • Track inserts, updates and deletes, sign them and send them to the peers
  • Manage a PKI with a private key shared with the rest of the participants (or some of them)
  • Generate Merkle trees based on the “transactions” (inserts, updates and deletes) and expose them to be verified regularly by the peers (a minimal sketch of computing such a root follows this list)
  • Provide an admin dashboard to view various aspects of the system – activity, status of peers, configuration options
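To make the Merkle tree point a bit more concrete, a bare-bones sketch of computing a root hash over a batch of serialized transactions might look like the following (SHA-256 and the duplicate-last-node strategy are choices, not a given):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class MerkleRoot {

    // computes a Merkle root over serialized transactions (inserts/updates/deletes);
    // peers can recompute it over the data they received and compare the roots
    public static byte[] compute(List<String> serializedTransactions) throws Exception {
        if (serializedTransactions.isEmpty()) {
            throw new IllegalArgumentException("No transactions to hash");
        }
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        List<byte[]> level = new ArrayList<>();
        for (String tx : serializedTransactions) {
            level.add(sha256.digest(tx.getBytes(StandardCharsets.UTF_8)));
        }
        while (level.size() > 1) {
            List<byte[]> nextLevel = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = i + 1 < level.size() ? level.get(i + 1) : left; // duplicate the last node on odd levels
                sha256.reset();
                sha256.update(left);
                sha256.update(right);
                nextLevel.add(sha256.digest());
            }
            level = nextLevel;
        }
        return level.get(0);
    }
}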

Effectively, that’s also an integration effort. Defining shared data models and transformations. It also involves setting up some piece of software that does the “blockchaining” and communication with peers. But since the goal is usually to have a shared database, it makes sense to go directly at the database level, rather than providing an append-only key-value store which is then queried in order to fill the actual database.

Can such an approach be just a configuration option for existing solutions like Openchain, Hyperledger, Corda? It could be – allowing them to stream changes directly to and from an existing database in a predefined fashion.

This post is in the “random ideas” category – things that I’ve thought could be implemented, but never found the time to do. I think blockchain should be brought down to earth and viewed as an infrastructure component, much like enabling database encryption or adding another data source to an ESB for the sake of integrating two systems. Because the business case for blockchain is usually this – integrate several systems and don’t allow them to cheat. I think this can be achieved by plugging in at the database level of the systems being integrated.

The post Blockchainizing Existing Databases appeared first on Bozho's tech blog.

Seven Legacy Integration Patterns

Post Syndicated from Bozho original https://techblog.bozho.net/seven-legacy-integration-patterns/

If we have to integrate two (or more) systems nowadays, we know – we either use an API or, more rarely, some message queue.

Unfortunately, many systems in the world do not support API integration. And many more are being created as we speak that don’t have APIs. So when you inevitably have to integrate with them, you are left with imperfect choices to make. Below are seven patterns to integrate with legacy systems (or not-so-legacy systems that are built in legacy ways).

Initially I wanted to call these “bad integration patterns”. But if you don’t have other options, they are not bad – they are inevitable. What’s bad is the fact that so many systems continue to be built without integration in mind.

I’ve seen all of these, on more than one occasion. And I’ve heard many more stories about them. Unfortunately, they are not the exception (fortunately, they are also not the rule, at least not anymore).

  1. Files on FTP – one application uploads files (XML, CSV, other) to an FTP (or other shared resource) and the other one reads them via a scheduled job, parses them and optionally spits out a response – either to the same FTP, or via email. Sharing files like that is certainly not ideal in terms of integration – you don’t get real-time status of your request, and other aspects are trickier to get right – versioning, high availability, authentication, security, traceability (audit trail).
  2. Shared database – two applications sharing the same database may sound like a recipe for disaster, but it’s not uncommon to see it in the wild. If you are lucky, one application will be read-only. But breaking changes to the database structure and security concerns are major issues. You can only use this type of integration if you expose your database directly to another application, which you normally don’t want to do.
  3. Full daily dump – instead of sharing an active database, some organizations do a full dump of their data every day or week and provide it to the other party for import. Obvious data privacy issues exist with that, as it’s a bad idea to have full dumps of your data flying around (in some cases on DVDs or portable HDDs), in addition to everything mentioned above – versioning, authentication, etc.
  4. Scraping – when an app has no API, it’s still possible to extract data from it or push data to it – via the user interface. With web applications that’s easier, as they “speak” HTML and HTTP. With desktop apps, screen scraping has emerged as an option. The so-called RPA software (Robotic process automation) relies on all types of scraping to integrate legacy systems. It’s very fragile and requires complicated (and sometimes expensive) tooling to get right. Not to mention the security aspect, which requires storing credentials in non-hashed form somewhere in order to let the scraper log in.
  5. Email – when the sending or receiving system doesn’t support other forms of integration, email comes as a last resort. If you can trigger something by connecting to a mailbox, or if an email is produced after some event happens in the sending system, this may be all you need to integrate. Obviously, email is a very bad means of integration – it’s unstructured, it can fail for many reasons, and it’s just not meant for software integration. You can attach structured data, if you want to get extra inventive, but if you can get both ends to support the same format, you can probably get them extended to support proper APIs.
  6. Adapters – you can develop a custom module that has access to the underlying database, but exposes a proper API. That’s an almost acceptable solution, as you can have a properly written (sort-of) microservice independent of the original application, and other systems won’t know they are integrating with a legacy piece of software. It’s tricky to get it right in some cases, however, as you have to understand the state space of the database well. Read-only is easy; writing is much harder or next to impossible.
  7. Paper – no, I’m not making this up. There are cases where one organization prints some data and then the other organization (or department) receives the paper documents (by mail or otherwise) and inputs them into their system. Expensive projects exist out there that aim to remove the paper component and introduce actual integration, as paper-based input is error-prone and slow. The few legitimate scenarios for a paper-based step are when you need extra security, and the paper trail, combined with the fact that paper is effectively air-gapped, may give you that. But even then it shouldn’t be the only transport layer.

If you need to do any of the above, it’s usually because at least one of the systems is stuck and can’t be upgraded. It’s either too legacy to touch, or the vendor is gone, or adding an API is “not on their roadmap” and would be too expensive.

If you are building a system, always provide an API. Some other system will have to integrate with it, sooner or later. It’s not sustainable to build closed systems and delay the integration question for when it’s needed. Assume it’s always needed.

Fancy ESBs may be able to patch things quickly with one of the approaches above and integrate the “unintegratable”, but heavy reliance on an ESB is an indicator of too many legacy or low-quality systems.

But simply having an API doesn’t cut it either. If you don’t support versioning and backward-compatible APIs, you’ll be in an even more fragile state, as you’ll be breaking existing integrations as you progress.

Enterprise integration is tricky. But, as with many things in software, it’s best handled in the applications that we build. If we build them right, things are much easier. Otherwise, organizations have to revert to the legacy approaches mentioned above and introduce complexity, fragility, security and privacy risks and a general feeling of low-quality that has to be supported by increasingly unhappy people.

The post Seven Legacy Integration Patterns appeared first on Bozho's tech blog.

Using Multiple Dynamic Caches With Spring

Post Syndicated from Bozho original https://techblog.bozho.net/using-multiple-dynamic-caches-with-spring/

In this third post about cache managers in Spring (written over a long period of time), I’d like to expand on the previous two by showing how to configure multiple cache managers that dynamically create caches.

Spring has CompositeCacheManager which, in theory, should allow using more than one cache manager. It works by asking the underlying cache managers whether they have a cache with the requested name or not. The problem with that is when you need dynamically-created caches, based on some global configuration. And that’s the common scenario, when you don’t want to manually define caches, but instead want to just add @Cacheable and have spring (and the underlying cache manager) create the cache for you with some reasonable defaults.

That’s great until you need to have more than one cache manager. For example – one for a local cache and one for a distributed cache. In many cases a distributed cache is needed; however not all method calls need to be distributed – some can be local to the instance that handles them, and you don’t want to burden your distributed cache with stuff that can be kept locally. Whether you can configure a distributed cache provider to designate some caches as local, even though they are handled by the distributed cache provider – maybe, but I don’t guarantee it will be trivial.

So, faced with that issue, I had to devise some simple mechanism of designating some caches as “distributed” and some as “local”. Using CompositeCacheManager alone would not do it, so I extended the distributed cache manager (in this case, Hazelcast, but it can be done with any provider):

/**
 * Hazelcast cache manager that handles only cache names with a specified prefix for distributed caches
 */
public class OptionalHazelcastCacheManager extends HazelcastCacheManager {

    private static final String DISTRIBUTED_CACHE_PREFIX = "d:";

    public OptionalHazelcastCacheManager(HazelcastInstance hazelcast) {
        super(hazelcast);
    }

    @Override
    public Cache getCache(String name) {
        if (name == null || !name.startsWith(DISTRIBUTED_CACHE_PREFIX)) {
            return null;
        }

        return super.getCache(name);
    }
}

And the corresponding composite cache manager configuration:

    <bean id="cacheManager" class="org.springframework.cache.support.CompositeCacheManager">
        <property name="cacheManagers">
            <list>
                <bean id="hazelcastCacheManager" class="com.yourcompany.util.cache.OptionalHazelcastCacheManager">
                    <constructor-arg ref="hazelcast" />
                </bean>

                <bean id="caffeineCacheManager" class="com.yourcompany.util.cache.FlexibleCaffeineCacheManager">
                    <property name="cacheSpecification" value="expireAfterWrite=10m"/>
                    <property name="cacheSpecs">
                        <map>
                            <entry key="statistics" value="expireAfterWrite=1h"/>
                        </map>
                    </property>
                </bean>
            </list>
        </property>
    </bean>

That basically means that any cache with a name starting with d: (for “distributed”) will be handled by the distributed cache manager; otherwise the composite proceeds to the next cache manager (Caffeine in this case). So when you want to define a method with a cacheable result, you have to decide whether it’s @Cacheable("d:cachename") or just @Cacheable("cachename").
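For illustration, usage could look like this (the service and cache names are made up):

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class ReportService {

    // "d:" prefix -> handled by OptionalHazelcastCacheManager, i.e. shared between instances
    @Cacheable("d:reports")
    public String loadSharedReport(String reportId) {
        return expensiveLoad(reportId);
    }

    // no prefix -> the Hazelcast manager returns null and Caffeine handles it locally
    @Cacheable("renderedTemplates")
    public String renderTemplate(String templateId) {
        return expensiveRender(templateId);
    }

    private String expensiveLoad(String reportId) { return "report-" + reportId; }          // placeholder for a slow call
    private String expensiveRender(String templateId) { return "template-" + templateId; }  // placeholder for a slow call
}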

That’s probably one of many ways to approach that issue, but I like it for its simplicity. Caching is hard (distributed caching even more so), and while we are lucky to have Spring abstract most of that, we sometimes have to handle special cases ourselves.

The post Using Multiple Dynamic Caches With Spring appeared first on Bozho's tech blog.

Six Things I Learned from Mr. Robot

Post Syndicated from Bozho original https://techblog.bozho.net/six-things-i-learned-from-mr-robot/

Mr. Robot is an amazingly accurate series about a hacker (Elliot Alderson) and his, uhm, undertakings. Many cybersecurity experts consult on the series, so every hack that happens is actually properly executed, using real tools and commands that do exactly what an infosec expert would expect. Nothing shown on screen is the usual bullshit TV hacking. And that is awesome and scary in a number of ways.

Obviously, this post is full of spoilers, so SPOILER ALERT, if you haven’t watched it, go and do that and then continue reading. Another disclaimer I have to make is that you should use the tools and techniques below only for ethical hacking, penetration testing and other legal activities.

The series has many hacks happening, from connecting to a neighbor’s WiFi, guessing people’s social media passwords and installing malware via flash drives, to complex social engineering plots, an HSM hack, the use of steganography and other advanced stuff. There’s even a tool pack with everything that’s used in the series. I’m fairly well-versed in the cybersecurity domain so even though I haven’t performed any hacks, most of the things happening on screen were familiar. But I did learn some interesting things from the series, which I’d like to share, as they sparked my interest and made me read a bit more. Many hacking scenes require you to pause and try to read what’s on screen, but that makes it even more fun.

  • proxychains – many tools are in use in the series, and Kali Linux is often used. But I wasn’t familiar with one particular tool that I think is crucial. Proxychains. I know, I’m a n00b to have never heard about it. What it does is basically running every command via a SOCKS or HTTP proxy. By default it uses Tor, but as Elliot knows that the NSA might be watching the Tor exit nodes, he switches to a regular proxy to execute the main attack in Season 1. proxychains curl https://google.com would open Google via the pre-configured proxy. This is very important to avoid being caught, and makes attribution and forensics much harder. To be fair, I didn’t see exactly where Elliot uses proxychains, but several people claim he uses it. Even if it’s not on the show, I learned about it after reading about the hacks, so it counts.
  • shred -z – I know and I’ve used the shred command. However, I didn’t know it has a “-z” argument which “add a final overwrite with zeros to hide shredding”. I guess there are more of these minor tool options here and there, but that’s an important point about covering one’s tracks. If you don’t do the zeroing, eventually, with proper forensics, they might discover that something was there (and was shredded).
  • How encryption ransomware works – ransomware that encrypts all your files and then asks for bitcoin in order to decrypt them is something fairly known. However, I didn’t know exactly how these ransomware programs work – e.g. what they use for encryption. It appears that Darlene may have modified CryptoLocker in order to encrypt E Corp’s data. I was interested by that because at the end of season 3 Elliot finds a key that can undo the hack that rendered E Corp’s data unusable and he sends an RSA private key (via ProtonMail). Why would they encrypt something with RSA if they explicitly said they used AES? Well, it turns out that CryptoLocker (and others) work by generating one AES key per file, encrypting the file, then encrypting the AES key with the RSA public key and storing the encrypted AES key alongside the file. That’s a sensible approach because the private key never leaves the attacker and can’t be intercepted. But using asymmetric encryption to encrypt large volumes is not a good idea (it’s slow). So whenever asymmetric encryption is used, it’s always used to encrypt a symmetric key that is then used to encrypt/decrypt the actual data. And the fact that they reused existing ransomware rather than creating something from scratch allowed Mr. Robot to keep the RSA private key (that can be used to decrypt everything) rather than throwing it away. (A minimal sketch of this hybrid scheme follows the list.)
  • UPS devices explode – I guess I’m just not good with hardware. I used to own a UPS, and I know it’s a battery and batteries sometimes malfunction (hello, Samsung), but the fact that the firmware controls important physical parameters that could in theory lead to the UPS exploding was not something I had thought about (shame on me, especially given that for one semester I’ve been taking chemistry classes). The answers in this thread are interesting – basically it’s absolutely expected for a UPS to explode (just not all of them at the same time). In the end, it’s very unlikely that the particular hack would succeed, but exploding UPSs are a reality. And that is an important point – you can’t trust your hardware either.
  • IMSI catchers – while I was aware that mobile communication standards (SS7) are insecure, I didn’t know exactly how. I’ve worked on telecom projects, but more on the business logic/billing side, so I was unaware of devices available to be used for MITM attacks. IMSI catchers allow intercepting one’s mobile phone communication based on the fact that mobile phones try to optimize their coverage combined with the lack of cell authentication. The bottom line here is you can’t rely on plain calls or SMS to be secure. So don’t use SMS for 2FA (shown as weak in season 4). Basically, use Signal.
  • HSM “cloning” – this is a complicated scene that shows how Angela copies keys from an HSM. It’s explained in detail here and here. The premise of the hack was that Elliot had upgraded the UPS firmware to only accept updates that are signed by E Corp’s private key. That means that you can’t compromise them unless you own the keys; however the keys are stored on a hardware security module that doesn’t allow them to be copied. I knew what an HSM is and what it does; what I didn’t know is the procedure for performing an HSM backup. It turns out they used an actual SafeNet HSM to back up the original one, which then allowed them to sign whatever they wanted on behalf of E Corp. The private keys never left the HSMs, of course (which is the whole point of HSMs), but by backing it up on one that’s controlled by the adversary, they effectively “leaked” the keys. The HSM backup procedure is complex and requires USB keys to unlock certain stages of the process. Normally these keys should be stored in safes (and not lying around, as depicted), but not strictly following procedures is not uncommon in organizations.
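Since the ransomware encryption scheme is what sparked my interest, here is a minimal Java sketch of the general hybrid approach described above – a generic illustration, not CryptoLocker’s actual implementation:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.SecureRandom;

public class HybridEncryptionSketch {

    public static void main(String[] args) throws Exception {
        // the attacker's RSA key pair; only the public key ships with the malware
        KeyPairGenerator rsaGen = KeyPairGenerator.getInstance("RSA");
        rsaGen.initialize(2048);
        KeyPair rsaKeys = rsaGen.generateKeyPair();

        byte[] fileContent = "some file content".getBytes(StandardCharsets.UTF_8);

        // 1. a fresh AES key per file
        KeyGenerator aesGen = KeyGenerator.getInstance("AES");
        aesGen.init(256);
        SecretKey aesKey = aesGen.generateKey();

        // 2. encrypt the file with the AES key (fast, suitable for large data)
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher aes = Cipher.getInstance("AES/CBC/PKCS5Padding");
        aes.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(iv));
        byte[] encryptedFile = aes.doFinal(fileContent);

        // 3. encrypt the AES key with the RSA public key and store it alongside the file;
        //    only the holder of the RSA private key can recover it and decrypt the file
        Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
        rsa.init(Cipher.ENCRYPT_MODE, rsaKeys.getPublic());
        byte[] encryptedAesKey = rsa.doFinal(aesKey.getEncoded());

        System.out.println(encryptedFile.length + " encrypted bytes, " + encryptedAesKey.length + " bytes of wrapped key");
    }
}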

The scary thing about the series is that although it’s fiction, everything is realistic. Every hack depicted can happen in real life, and probably has happened. The social engineering parts may be a bit optimistic, but they are entirely plausible. The “wildest” hacks require some suspension of disbelief but even they are theoretically possible at least. The feeling that everything can be hacked and exploited is, unfortunately, realistic. You get that same feeling after watching a few DEF CON or CCC talks. And that’s why cybersecurity is important. Because things that are not hacked yet are not safe, they are just not interesting enough.

The post Six Things I Learned from Mr. Robot appeared first on Bozho's tech blog.

AWS Alarms for Application Errors

Post Syndicated from Bozho original https://techblog.bozho.net/aws-alarms-for-application-errors/

Monitoring is key for any real-world application. You have to know what’s happening and be alerted in real time if something wrong is happening. AWS has CloudWatch for that, and gives you a lot of metrics automatically. But there are some that you have to define yourself. And then you need to define proper alarms.

Here I’ll focus on four:

  • High number of application errors
  • High number of application warnings
  • High number of 5xx errors on the load balancer
  • High number of 4xx errors on the load balancer

First, the prerequisites:

  • You need to be using CloudFormation to automate everything. You can create all of those things manually, but automation is a big plus
  • If using CloudFormation, you’d preferably have a sub-stack for configuring alarms
  • You need to be collecting your logs with CloudWatch logs

If you are not using CloudWatch logs, here’s a simple config file and script to enable them:

{
  "agent": {
    "metrics_collection_interval": 10,
    "region": "eu-west-1",
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "{{logPath}}",
            "log_group_name": "{{logGroupName}}",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%Y-%m-%d %H:%M:%S"
          }
        ]
      }
    }
  }
}
# install AWS CloudWatch monitor
mkdir cloud-watch-agent
cd cloud-watch-agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
./install.sh

aws s3 cp s3://$BUCKET_NAME/cloudwatch-agent-config.json /var/config/cloudwatch-agent-config.json
sed -i -- 's|{{logPath}}|/var/log/application.log|g' /var/config/cloudwatch-agent-config.json
sed -i -- 's|{{logGroupName}}|app_node|g' /var/config/cloudwatch-agent-config.json

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/var/config/cloudwatch-agent-config.json -s

Now you have to define two things: log metrics and alarms. The CloudFormation code below creates both:

    "HighAppErrorsNotification": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "InsufficientDataActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "AlarmDescription": "Notify if there are too many application errors",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": "1",
        "MetricName": "ApplicationErrors",
        "Namespace": "LogMetrics",
        "Period": "900",
        "Statistic": "Sum",
        "Threshold": "5",
        "TreatMissingData": "ignore"
      }
    },
    "ErrorMetricFilter": {
      "Type": "AWS::Logs::MetricFilter",
      "Properties": {
        "LogGroupName": "app_node",
        "FilterPattern": "ERROR",
        "MetricTransformations": [
          {
            "DefaultValue": 0,
            "MetricValue": "1",
            "MetricNamespace": "LogMetrics",
            "MetricName": "ApplicationErrors"
          }
        ]
      }
    },

If you need to do that manually, go to the CloudWatch Logs homepage, select the log group (app_node) and use the “Create metric filter” button on top. It lets you specify the pattern to look for (“ERROR” in this case). When you have that ready, you can create an alarm based on it, through Alarms -> Create alarm. Look up the metric by name and select it to trigger the alarm (in the example above, it gets triggered if there are 5 or more errors within 900 seconds).
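
If you prefer the command line to clicking through the console, here’s a minimal sketch of the same two steps with the AWS CLI (the alarm name and the SNS topic ARN are placeholders you’d replace with your own):

# create a metric filter that counts "ERROR" occurrences in the app_node log group
aws logs put-metric-filter \
  --log-group-name app_node \
  --filter-name application-errors \
  --filter-pattern "ERROR" \
  --metric-transformations metricName=ApplicationErrors,metricNamespace=LogMetrics,metricValue=1,defaultValue=0

# alarm if there are 5 or more errors within a 900-second period
aws cloudwatch put-metric-alarm \
  --alarm-name high-app-errors \
  --metric-name ApplicationErrors \
  --namespace LogMetrics \
  --statistic Sum \
  --period 900 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data ignore \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:notification-topic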

You can then create an identical alarm for warnings (pattern to look for: “WARN”). The threshold there might be higher, e.g. 10 or 20. But that depends on your application logging patterns.

Then there are the load balancer 5xx error alarms. In CloudFormation it would look like this:

   "TooMany5xxErrorsWebAppAlarmNotification": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "InsufficientDataActions": [
          {
            "Ref": "NotificationTopicId"
          }
        ],
        "AlarmDescription": "Notify if there are too many 5xx errors",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Dimensions": [
          {
            "Name": "LoadBalancer",
            "Value": {
              "Ref": "WebAppALBId"
            }
          }
        ],
        "TreatMissingData": "notBreaching",
        "EvaluationPeriods": "1",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Namespace": "AWS/ApplicationELB",
        "Period": "60",
        "Statistic": "Sum",
        "Threshold": "2"
      }
    }

You can again create that manually – look for the HTTPCode_Target_5XX_Count metric in the metric selection screen for the alarm. You have several options there; the most straightforward is to select the per-AppELB metric. And again, the same approach can be used for 4xx errors (HTTPCode_Target_4XX_Count).

Getting this running with CloudFormation (and even manually) is not as straightforward as it seems. The right combination of metric names, namespaces and values is not obvious and the relevant documentation is not the first thing that pops up. So I decided to share something that works, as it may take some time experimenting before getting it to that state.

But even outside of CloudFormation or AWS context, monitoring and alerting in case of a high number of application errors, warnings and HTTP errors is a must. And automating the creation of those alarms is the recommended approach.

The post AWS Alarms for Application Errors appeared first on Bozho's tech blog.

Automating the Change of WordPress Permalinks

Post Syndicated from Bozho original https://techblog.bozho.net/automating-the-change-of-wordpress-permalinks/

WordPress is powering 35% of websites. And while it may not be seen as very complex or interesting, it is one of the most prevalent technologies of our time. And many developers, even if they are not working with PHP, have to support some WordPress installation (e.g. a blog like this one). And unfortunately, there are still basic things that you can’t easily do. Plugins do help, but they don’t always have the proper functionality.

Today I had to change permalinks on a website. From https://example.com/post-name to https://example.com/blog/post-name. And WordPress allows that, except there’s a problem – when you change it, the old links stop working (404). Not only does that ruin your SEO, but any previous share of your post will no longer work. I’ve pointed that out a long time ago when Spring broke their documentation links. Bottom line: you don’t want that to happen.

Instead, you want to do a 301 redirect (and not a 302, which appears to also break your SEO). But all tutorials that are easily found online assume you can manually configure redirection through some plugin. And if you have 100-200 posts, that’s a lot of tedious work. There are plugins that allegedly monitor your posts for changes and create the redirects automatically. That may work if you manually edit a post, but it didn’t work for me when changing the permalink format settings.

So how to do it in an automated way and without disrupting your website? You’d need a little bit more than a plugin – namely SQL and Regex. Here are the steps:

  1. Install and activate the Redirection plugin
  2. Open your blog’s hosting admin page and navigate to the PhpMyAdmin to browse your MySQL database. That’s the typical setup. If you are using another database admin tool, you’ll know what to do
  3. Run a query to fetch the post name of all published posts: SELECT post_name FROM `wpgk_posts` where post_type = "post" and post_status = "publish" LIMIT 200 (if you have more than 200 posts, increase the limit; if you don’t put a limit, PhpMyAdmin puts a default one)
  4. Expand the retrieved text (with the big T), select all records with the checkbox below and click “Copy to clipboard”
  5. Paste in Notepad++ (or any regex-supporting text editor, or, if you are running Linux, just create a file and then you’ll be able to do regex replaces with sed)
  6. Replace all tabs that you may have copied (\t)
  7. Then replace (.+) with /$1/,/blog/$1/ (if you are redirecting /postname to /blog/postname, otherwise you have to write the appropriate regex; see the sed sketch after this list)
  8. Change the permalink format settings
  9. Import the resulting CSV file in the Redirection plugin. It interprets it as: source,target. That way you have redirected all your existing posts to their new permalink (the redirects are created in the chosen group). Note the trailing slash. Just in case, you may repeat the same operation to produce another CSV without the trailing slashes, but the canonical WordPress URL is with the trailing slash.
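
For steps 6 and 7, here’s a minimal sed sketch, assuming you saved the copied post names to a file called posts.txt and you are redirecting /postname to /blog/postname:

# strip any tabs copied along with the values
sed -i 's/\t//g' posts.txt
# turn each post name into a "source,target" line with trailing slashes
sed -i -E 's|(.+)|/\1/,/blog/\1/|' posts.txt
# posts.txt is now the CSV to import into the Redirection plugin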

It feels a bit awkward that such an obvious thing, which probably happens often, would need such a non-trivial set of steps to do automatically. But software, no matter how user-friendly and “no code”, will require knowledge of basic things like SQL and regex. And maybe that’s good.

The post Automating the Change of WordPress Permalinks appeared first on Bozho's tech blog.

An AWS Elasticsearch Post-Mortem

Post Syndicated from Bozho original https://techblog.bozho.net/aws-elasticsearch-post-mortem/

So it happened that we had a production issue on the SaaS version of LogSentinel – our Elasticsearch stopped indexing new data. There was no data loss, as Elasticsearch is just secondary storage, but it caused some issues for our customers (they could not see the real-time data on their dashboards). Below is a post-mortem analysis – what happened, why it happened, how we handled it and how we can prevent it.

Let me start with a background of how the system operates – we accept audit trail entries (logs) through a RESTful API (or syslog), and push them to a Kafka topic. Then the Kafka topic is consumed to store the data in the primary storage (Cassandra) and index it for better visualization and analysis in Elasticsearch. The managed AWS Elasticsearch service was chosen because it saves you all the overhead of cluster management, and as a startup we want to minimize our infrastructure management efforts. That’s a blessing and a curse, as we’ll see below.

We have alerting enabled on many elements, including the Elasticsearch storage space and the number of application errors in the log files. This allows us to respond quickly to issues. So the “high number of application errors” alarm triggered. Indexing was blocked due to FORBIDDEN/8/index write. We have a system call that enables it, so I tried to run it, but after less than a minute it was blocked again. This meant that our Kafka consumers failed to process the messages, which is fine, as we have a sufficient message retention period in Kafka, so no data can be lost.

I investigated the possible reasons for such a block. And there are two, according to Amazon – increased JVM memory pressure and low disk space. I checked the metrics and everything looked okay – JVM memory pressure was barely reaching 70% (and 75% is the threshold), and there was more than 200GiB of free storage. There was only one WARN in the Elasticsearch application logs (“node failure”, but after that there were no issues reported).

There was another strange aspect of the issue – there were twice as many nodes as configured. This usually happens during upgrades, as AWS is using blue/green deployment for Elasticsearch, but we haven’t done any upgrade recently. These additional nodes usually go away after a short period of time (after the redeployment/upgrade is ready), but they wouldn’t go away in this case.

Being unable to SSH into the actual machine, being unable to unblock the indexing through Elasticsearch means, and being unable to shut down or restart the nodes, I raised a ticket with support. And after a few hours and a few exchanged messages, the problem was clear and resolved.

The main reason for the issue is two-fold. First, we had a configuration that didn’t reflect the cluster status – we had assumed a few more nodes, and our shard and replica configuration meant we had unassigned replicas (more on shards and replicas here and here). The best practice is to have nodes > number of replicas, so that each node gets one replica (plus the main shard). Having unassigned shard replicas is not bad per se, and there are legitimate cases for it. Ours can probably be seen as misconfiguration, but not one with immediate negative effects. We chose those settings in part because it’s not possible to change some settings in AWS after a cluster is created. And opening and closing indexes is not supported.
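
As an illustration of where the replica configuration lives (the endpoint, index name and numbers below are placeholders, not our actual setup): replicas are set per index, and with 3 data nodes one replica per primary shard can always be assigned, while a replica count equal to or greater than the node count cannot, because a node never holds two copies of the same shard.

curl -X PUT "https://<your-es-endpoint>/audit-logs" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }'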

The second issue is AWS Elasticsearch logic for calculating free storage in their circuit breaker that blocks indexing. So even though there were 200+ GiB free space on each of the existing nodes, AWS Elasticsearch thought we were out of space and blocked indexing. There was no way for us to see that, as we only see the available storage, not what AWS thinks is available. So, the calculation gets the total number of shards+replicas and multiplies it by the per-shard storage. Which means unassigned replicas that do not take actual space are calculated as if they take up space. That logic is counterintuitive (if not plain wrong), and there is hardly a way to predict it.

This logic appears to be triggered when blue/green deployment occurs – so in normal operation the actual remaining storage space is checked, but during upgrades, the shard-based check is triggered. That has blocked the entire cluster. But what triggered the blue/green deployment process?

We occasionally need access to Kibana, and because of our strict security rules it is not accessible to anyone by default. So we temporarily change the access policy to allow access from our office IP(s). This change is not expected to trigger a new deployment, and has never led to that. AWS documentation, however, states:

In most cases, the following operations do not cause blue/green deployments: Changing access policy, Changing the automated snapshot hour, If your domain has dedicated master nodes, changing data instance count.
There are some exceptions. For example, if you haven’t reconfigured your domain since the launch of three Availability Zone support, Amazon ES might perform a one-time blue/green deployment to redistribute your dedicated master nodes across Availability Zones.

There are other exceptions, apparently, and one of them happened to us. That led to the blue/green deployment, which in turn, because of our flawed configuration, triggered the index block based on the odd logic that counts unassigned replicas as taking up storage space.

How we fixed it – we recreated the index with fewer replicas and started a reindex (it takes data from the primary source and indexes it in batches). That reduced the size taken, and AWS manually intervened to “unstick” the blue/green deployment. Once the problem was known, the fix was easy (and we have to recreate the index anyway due to other index configuration changes). It’s appropriate to (once again) say how good AWS support is, in both fixing the issue and communicating it.

As I said in the beginning, this did not mean there was data loss, because we have Kafka keep the messages for a sufficient amount of time. However, once the index was writable, we expected the consumer to continue from the last successful message – we had specifically written transactional behaviour that committed the offsets only after successfully storing in the primary storage and successfully indexing. Unfortunately, the Kafka client we are using had auto-commit turned on, which we had overlooked. So the consumer skipped past the failed messages. They are still in Kafka and we are processing them with a separate tool, but that showed us that our assumption was wrong and that the fact that the code calls “commit” doesn’t necessarily mean anything.
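
A minimal sketch of the behaviour we thought we had (class, topic and method names are illustrative, not our actual code): auto-commit has to be explicitly disabled, otherwise the client keeps committing offsets in the background regardless of whether processing succeeded:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AuditTrailConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "audit-trail-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // without this, the client periodically commits offsets in the background
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("audit-trail"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    storeInPrimaryStorage(record.value());  // hypothetical: the Cassandra write
                    indexInElasticsearch(record.value());   // hypothetical: the ES indexing
                }
                // only now advance the offsets; if indexing throws, the messages are re-read
                consumer.commitSync();
            }
        }
    }

    private static void storeInPrimaryStorage(String message) { /* ... */ }

    private static void indexInElasticsearch(String message) { /* ... */ }
}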

So, the morals of the story:

  • Monitor everything. Bad things happen, it’s good to learn about them quickly.
  • Check your production configuration and make sure it’s adequate to the current needs. Be it replicas, JVM sizes, disk space, number of retries, auto-scaling rules, etc.
  • Be careful with managed cloud services. They save a lot of effort but also take control away from you. And they may have issues for which your only choice is contacting support.
  • If providing managed services, make sure you show enough information about potential edge cases. An error console, an activity console, or something, that would allow the customer to know what is happening.
  • Validate your assumptions about default settings of your libraries. (Ideally, libraries should warn you if you are doing something not expected in the current state of configuration)
  • Make sure your application is fault-tolerant, i.e. that failure in one component doesn’t stop the world and doesn’t lead to data loss.

To sum it up, a rare event unexpectedly triggered a blue/green deployment, where a combination of flawed configuration and flawed free space calculation resulted in an unwritable cluster. Fortunately, no data is lost and at least I learned something.

The post An AWS Elasticsearch Post-Mortem appeared first on Bozho's tech blog.

Running a Safe Database Cluster in AWS With Auto-Scaling Groups

Post Syndicated from Bozho original https://techblog.bozho.net/running-a-safe-database-cluster-in-aws-with-auto-scaling-groups/

When you have to run a scalable application on AWS, your database must also be scalable. It’s easier to scale the stateless application layer, where each node is mostly disposable – even if a node in a 3-node cluster fails, you can just fire up another one and nobody notices.

The database layer is stateful and therefore there’s a risk of losing data. Having just a single node is not an option, as a node can always go down and that would mean downtime. So you need multiple nodes in a cluster to make sure your application is highly available and fault tolerant (I won’t go into the differences in terminology).

What database am I talking about? It doesn’t matter. It can be a SQL or a NoSQL database – each has some form of clustering available. Whether it’s active-active or active-passive.

Now, for AWS in particular, you can choose RDS (or another managed option), which will handle it for you. But if there’s no managed option (e.g. Cassandra), or you don’t feel the managed option gives you enough control, or it’s more expensive, or the version you require is not available, you have to manage the database layer yourself. I won’t go into the details of how to configure the database-specific clustering – you should check the documentation of the particular database for that. I’ll try to give some tips on how to safely run the infrastructure that supports the database cluster.

And here come auto-scaling groups. They allow you to have a group of identical nodes (based on a launch configuration), and the ASG makes sure you always have at least X healthy nodes by starting new nodes if existing ones fail (it can also automatically kill unhealthy nodes, that is, nodes that do not respond to automated healthchecks).

That’s awesome for application nodes, but it can be an issue with database nodes. You don’t necessarily want your database node killed if it gets unresponsive for a while. That’s why below I’ve compiled a list of tips to avoid pitfalls. Unfortunately, many of them are not available through CloudFormation, so you have to do them manually. And document them so that you don’t forget in case you need to recreate the stack:

  • Set the minimum number of nodes to 1. It protects from accidentally setting the “Desired” count to 0 as part of experimenting with other, unrelated ASGs
  • Make sure you have enabled termination protection for each instance as well as scale-in termination protection per ASG.
  • In the ASG settings there’s “Suspended Processes”. Make sure you suspend “Terminate” and “ReplaceUnhealthy” (see the CLI sketch after this list).
  • Make sure that in your LaunchConfiguration the EBS volume is not deleted in case of termination. Why do you need that, given that you have disabled all termination options? Well, termination can occasionally happen due to issues with the underlying host, or a node may be scheduled for decommission
  • If at some point you need to restore from an EBS volume, 1. let the ASG spawn a new node 2. temporarily add “Launch” to the suspended actions 3. Detach the root volume of the node 4. attach the old EBS volume to /dev/xvda 5. start the node.
  • Set up a lifecycle policy (through CloudFormation or manually) to do a backup of the database EBS volumes. Make sure you set the proper tags on the volumes (and this can only be done manually)
  • Make sure the ASG can spawn instances in multiple availability zones (in case one goes down)
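
Here’s a rough sketch of the manual protections from the list above, done via the AWS CLI (the group and instance identifiers are placeholders):

# per-ASG: protect new and existing instances from scale-in termination
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name db-cluster-asg \
  --new-instances-protected-from-scale-in

aws autoscaling set-instance-protection \
  --auto-scaling-group-name db-cluster-asg \
  --instance-ids i-0123456789abcdef0 \
  --protected-from-scale-in

# per-ASG: stop the group from terminating or replacing nodes on its own
aws autoscaling suspend-processes \
  --auto-scaling-group-name db-cluster-asg \
  --scaling-processes Terminate ReplaceUnhealthy

# per-instance: EC2 termination protection
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --disable-api-termination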

If you follow that, your auto-scaling groups will not behave exactly as auto-scaling groups. You can still configure automatically increasing the number of nodes in case of increased load, but the rest of the features are rarely a good idea for database layers – you’d prefer to resolve your database issues on existing machines, even if stopped temporarily, rather than just spawn new ones.

But you should embrace failure. Even with all termination protections, you have to assume everything may fail and die and you should have a clear path to restoring your nodes.

The post Running a Safe Database Cluster in AWS With Auto-Scaling Groups appeared first on Bozho's tech blog.

Where Is This Coming From?

Post Syndicated from Bozho original https://techblog.bozho.net/where-is-this-coming-from/

In enterprise software the top one question you have to answer as a developer almost every day is “Where is this coming from?”. When trying to fix bugs, when developing new features, when refactoring. You have to be able to trace the code flow and to figure out where a certain value is coming from.

And the bigger the codebase is, the more complicated it is to figure out where something (some value or combination of values) is coming from. In theory it’s either from the user interface or from the database, but we all know it’s always more complicated. I learned that very early in my career when I had to navigate a huge telecom codebase in order to implement features and fix bugs that were in total a few dozen lines of code.

Answering the question means navigating the code easily, debugging and tracing changes to the values passed around. And while that seems obvious, it isn’t so obvious in the grand scheme of things.

Frameworks, architectures, languages, coding styles and IDEs that obscure the answer to the question “where is this coming from?” make things much worse – for the individual developer and for the project in general. Let me give a few examples.

Scala, for which I have mixed feelings, gives you a lot of cool features. And some awful ones, like implicits. An implicit is something like a global variable, except there are nested implicit scopes. When you need one of those global variables, you just add the “implicit” keyword and you get the value from the innermost scope available that matches the type of the parameter you want to set. And in larger projects it’s not trivial to chase where that implicit value has been set. It can take hours of debugging to figure out why something has a particular value, only to find out that some seemingly unrelated part of the code has touched the relevant implicits. That makes it really hard to trace where stuff is coming from and therefore is bad for enterprise codebases, at least for me.
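
A small, contrived sketch of the problem (not taken from any real codebase):

case class RequestContext(userId: String)

object ImplicitsExample {
  def audit(action: String)(implicit ctx: RequestContext): String =
    s"$action by ${ctx.userId}"

  // in a real project this may live in another file, trait or import, far from the call site
  implicit val currentUser: RequestContext = RequestContext("alice")

  def main(args: Array[String]): Unit =
    println(audit("delete-user")) // which implicit got picked? tracing that is the hard part
}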

Another Scala feature is partially applied functions. You have a function foo(a, b, c) (that’s not the correct syntax, of course). You have one parameter known at some point, and the other two parameters known at a later point. So you can apply the function partially and pass the resulting partially applied function to the next function, and so on, until you have the other arguments available. So you can do bar(foo(a)), which means that in bar(..) you can call foo(b, c). Of course, at that point, the question “where did the value of a come from” is harder to answer. The feature is really cool if used properly (I’ve used it, and was proud of it), but it should be limited to smaller parts of the codebase. If you start tossing partially applied functions all over the place, it becomes a mess. And unfortunately, I’ve seen that as well.
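
And a contrived sketch of what that looks like in real syntax:

object PartialApplicationExample {
  def foo(a: Int, b: Int, c: Int): Int = a + b + c

  // bar only sees a function of (b, c); the value of "a" was fixed somewhere else
  def bar(f: (Int, Int) => Int): Int = f(2, 3)

  def main(args: Array[String]): Unit = {
    val fooWithA: (Int, Int) => Int = foo(1, _, _) // "a" is fixed here
    println(bar(fooWithA)) // prints 6, but inside bar, where did the 1 come from?
  }
}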

Enough about Scala; the microservices architecture (which I also have mixed feelings about) also complicates the ability of a developer to trace what’s happening. If for a given request you invoke 3-4 external systems, which both return and manipulate data, it becomes much harder to debug your application. Instead of putting a breakpoint or doing a call hierarchy, you have to track the parameters of each interaction with each microservice. It’s news to nobody that microservices are harder to debug, but I just wanted to put that in the context of answering the “where is this coming from” question.

Dynamic typing is another example. I’ve included that as part of my arguments why I prefer static typing. Java IDEs have “Call hierarchy”. Which is the single most useful IDE functionality for large enterprise software (for me even more important than the refactoring functionality). You really can trace every bit of possible code flow, not only in your codebase, but also in your dependencies, which often hide the important details (chances are, you’ll be putting breakpoints and inspecting 3rd party code rather often). Dynamic typing doesn’t give you the ability to do that properly. doSomething called on an unknown-at-compile-time type can be any method with that name. And tracing where stuff is coming from becomes much harder.

Code generation is something that I’ve always avoided. It takes input from text files (in whatever language they are) and generates code, turning the question “where is this coming from” to “why has this been generated that way”.

Message queues and async programming in general – message passing obscures the source and destination of a given piece of data; a message queue adds complexity to the communication between modules. With microservices you at least have API calls; with queues, you have multiple abstractions between the sender and recipient (exchanges, topics, queues). And that’s a general drawback of asynchronous programming – there’s something in between the program flow that does “async magic” and spits something out on the other end – but is it transformed, is it delayed, is it lost and retried, is it still waiting?

By all these examples I’m not saying you should not use message queues, code generation, dynamic languages, microservices or Scala (though for some I’d really advise against it). All of these things have their strengths, and they have been chosen exactly for those strengths. A message queue was probably chosen because you want to really decouple producer and consumer. Scala was chosen for its expressiveness. Microservices were chosen because a monolith had become really hard to manage with multiple teams and multiple languages.

But we should try to minimize the “damage” of not being able to easily trace the program flow and not being able to quickly answer “where is this coming from”. Impose a “no-implicits” rule on your scala code base. Use code-generation for simpler components (e.g. DTOs for protobuf). Use message queues with predictable message/queue/topic/exchange names and some slightly verbose debug logging. Make sure your microservices have corresponding SDKs with consistent naming and that they can be run locally without additional effort to ease debugging.

It is expected that the bigger and more complex a project is, the harder it will be to trace where stuff is going. But do try to make it as easy as possible, even if it costs a little extra effort in the design and coding phase. You’ll be designing and writing that feature for a week. And you (and others) will be supporting and expanding it for the next 10 years.

The post Where Is This Coming From? appeared first on Bozho's tech blog.

One-Month of Microsoft DKIM Failure and Thoughts on Technical Excellence

Post Syndicated from Bozho original https://techblog.bozho.net/one-month-of-microsoft-dkim-failure-and-thoughts-on-technical-excellence/

Last month I published an article about getting email settings right. I had recently configured DKIM on our company domain and was happy to share the experience. However, the next day I started receiving DMARC error reports saying that our Office365-backed emails were failing DKIM, which can mean they end up in spam.

That was quite surprising as I have followed every step in the Microsoft guide. Let me first share a bit about that as well, as it will contribute to my final point.

In order to enable DKIM with a normal email provider, you just have to get the DKIM selector value from some settings screen and copy it into your DNS admin dashboard (in a new ._domainkey. record). Microsoft, instead, makes you do the following:

  • Connect to Exchange Powershell
  • Oh, you have 2FA enabled? Sorry, that doesn’t work – follow this tutorial instead
  • Oh, that fails when downloading their custom application? An obscure stackoverflow answer tells you that you should not download it with Firefox or Chrome – it fails to run then. You should download it with Internet Explorer (or Edge) and then it works. Just Microsoft.
  • Now that you are connected, follow the DKIM guide
  • In addition to PowerShell, you should also go to the Exchange admin in your Office365 admin center and enable DKIM

Yes, you can imagine you can waste some time here. But it finally works, and you set the CNAME records and assume that, as they wrote, the TXT records to which the CNAME records point are generated on Microsoft’s end (for onmicrosoft.com), so anyone trying to validate a signature will pick the right public key from there and everything will be perfect. Well, no. Here’s an excerpt from a test mail I sent to my Gmail in December:

DKIM: ‘FAIL’ with domain logsentinel.com
ARC-Authentication-Results: i=2; mx.google.com;
dkim=temperror (no key for signature) header.i=@logsentinel.com header.s=selector1

I immediately reported the issue in a way that I think is quite clear, although succinct:

Hello, I configured DKIM following your tutorial (https://docs.microsoft.com/en-us/microsoft-365/security/office-365-security/use-dkim-to-validate-outbound-email), however the TXT records at your end that are supposed to be generated are not generated – so me setting a CNAME record points to a missing TXT record. I tried disabling and reenabling, and rotating the keys, but the TXT record is still missing, which means my emails get in spam folders, because the DMARC policy expects a DKIM record. Can you please fix that and create the required TXT records?

What followed was 35 days of long phone calls and repetitive emails with me trying to explain what the issue is and Microsoft support not getting it. As a sidenote, Office365 support center doesn’t support tickets with so much communication – it doesn’t have pagination so I can’t see the original note online (only in my inbox).

On the 5th minute of my first-of-several 20-minute calls my wife, who was around, already knew exactly what the issue was (she’s quite technical, yes), but somehow Microsoft support didn’t get it. We went through multiple requests for me providing DMARC reports (and just the Gmail report wasn’t enough, it had to be from 2 others as well), screenshots of my DNS administration screen, screenshots to prove the record is missing (but mxtoolbox is some third-party tool they can’t trust), even me sharing my computer for a remote desktop session (no way, company policy doesn’t allow random semi-competent people to touch my machine). I ended up providing an nslookup (a Microsoft tool) screenshot to show that the record is missing.

In the meantime I tried key rotation again from the Exchange admin UI. That got stuck for a week, so I raised a separate ticket. While we were communicating on that ticket as well, the rotation finished, but the keys didn’t change. What changed was (as I later figured out) the active selector. We’ll get to that later.

After the initial few days I executed Get-DkimSigningConfig | fl to get the DKIM record values and just created TXT records with them (instead of CNAME), so that our mails don’t get in spam. Of course, having DKIM fail doesn’t automatically mean you’ll get in spam, even though our DMARC policy says so, but it’s a risk one shouldn’t take. With the TXT records set everything was working okay, so I didn’t have an urgent problem anymore. That allowed me to continue slowly trying to resolve the issue with Microsoft support, for another month.
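
For context, here’s roughly what the records look like, with placeholder domain and tenant names and a truncated key value: the prescribed setup is two CNAME records pointing at TXT records that Microsoft publishes under onmicrosoft.com, and my temporary workaround was to publish the values obtained from Get-DkimSigningConfig directly as TXT records:

; what the Microsoft guide prescribes (illustrative names)
selector1._domainkey.example.com.  IN CNAME  selector1-example-com._domainkey.contoso.onmicrosoft.com.
selector2._domainkey.example.com.  IN CNAME  selector2-example-com._domainkey.contoso.onmicrosoft.com.

; the temporary workaround: publish the public keys directly
selector1._domainkey.example.com.  IN TXT  "v=DKIM1; k=rsa; p=MIIBIjANBgkq..."
selector2._domainkey.example.com.  IN TXT  "v=DKIM1; k=rsa; p=MIIBIjANBgkq..."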

Two weeks ago they acknowledged something is missing on their end. Miracle! So the last few days of the horror were about me having to set the CNAME records back (and get rid of my TXT records) so that they can fix the missing records on their end. I know, that sounds ridiculous – you don’t need a CNAME to point to your TXT in order to get that TXT to appear, but who knows what obscure and shitty programs and procedures they have, so I complied. What followed was a request for more screenshots and more DKIM reports.

No. You have nslookup. No screenshot of mine is better than an nslookup. The TTL for my entries is 300s, so there should be no issue with caching (and screenshots won’t help with that either).

Then finally, after having spoken to or emailed probably 5-6 different people, there was an email that made sense:

Good Afternoon,

As per the public documentation you followed (https://docs.microsoft.com/en-us/microsoft-365/security/office-365-security/use-dkim-to-validate-outbound-email) it is mentioned to have TXT records created in Microsoft DNS server for both Selector CNAME Records. we agree with that.

Our product engineer team has confirmed that there is a design level change was rolled out and it is still not published in public documentation. To be precise, we don’t require TXT record for both CNAME Records published. The TXT record will be published only for the Active Selector.

LastChecked 2020-01-01 07:40:58Z
KeyCreationTime 2019-12-28 18:37:38Z
RotateOnDate 2020-01-01 18:37:38Z
SelectorBeforeRotateOnDate selector1
SelectorAfterRotateOnDate selector2

Above info confirms that the active selector is “Selector2” and we have the TXT record published. So we don’t have to wait for the TXT record for Selector1 in our scenario.

Also we require the Selector1 CNAME record in your DNS, whenever next rotate happen. Microsoft will publish the other TXT record. This shouldn’t affect your environment at any chance.

Alright. So, when I triggered key rotation, I fixed the results of the original bug that led to no records being present. The rotation made the active selector “selector2” and generated its TXT record. Selector1 is currently missing, but it’s inactive, so that is not an issue. What “active selector” means is an interesting question. It’s nowhere in the documentation, but there’s this article from 2016 that sheds some light. Microsoft has to swap the selectors on each rotation. According to the article, rotation happens every week, but that’s no longer the case – since my manual rotation there hasn’t been any automatic one.

So to summarize – there was an initial bug that happened who knows why, then my stuck-for-a-week manually triggered rotation fixed it, but because they haven’t documented the actual process, there was no way for me to know that the missing TXT record is fine and will (hopefully) be generated on the next rotation (which is my current pending question for them – when will that happen so that I check if it worked).

To highlight the issues here. First, a bug. Second, missing/outdated documentation. Third, incompetent (though friendly) support. Fourth, a convoluted procedure to activate a basic feature. And that’s for not getting emails in spam in an email product.

I can expand those issues to any Microsoft cloud offering. We’ve been doing some testing on Azure and it’s a mess. At first an account didn’t even work. No reason, just fails. I had to use another account. The UI is buggy. Provisioning resources takes ages with no estimate. Documentation is far from perfect. Tools are meh.

And yet, Microsoft is growing its cloud business. Allegedly, they are the top cloud provider in the world. They aren’t, obviously, but through bundling Office365 and Azure in the same reporting, they appear bigger than AWS. Azure is most likely smaller than AWS. But still, Satya Nadella has brought Microsoft to market success again. Because technical excellence doesn’t matter in enterprise sales. You have your sales reps, nice conferences, probably some kickbacks, (alleged) easy integration with the legacy Windows stuff that matters for most enterprise customers. And for the decision maker that chooses Microsoft over Amazon or Google, the argument of occasional bugs, poor documentation or bad support is irrelevant.

But it is relevant for the end results. For the engineers and the products they build. Unfortunately, tech companies have prioritized alleged business value over technical excellence so much that they can no longer deliver the business value. If the outgoing email of an entire organization goes to spam because Microsoft has a bug in Office365 DKIM, where’s the business value of having a managed email provider? If the people-hours saved from going from on-prem to Azure are wasted because of poor technology, where’s the business value? Probably decision makers don’t factor that in, but they should.

I feel technical excellence has been in decline even in big tech companies, let alone smaller ones. And while that has led to projects going over budget, poor quality and poor security, the fact that the systems aren’t that critical has made it possible for the tech sector to get away with it. And even grow.

But let me tell you of a big player in a different industry that stopped caring about technical excellence. Boeing. This article tells the story of shifting Boeing from an engineering organization to one that cuts quality expenses and outsources core capabilities. That led to the death of people. Their plane crashed because of that shift in priorities. That led to a steep decline in sales, a moderate decline in stock, and a bailout.

If you de-prioritize technical excellence, and you are in a critical sector, you lose. If you are an IT vendor, you can get away with it, for a while. Maybe forever? I hope not, otherwise everything in tech will be broken for a long time. And digital transformation will be a painful process around bugs and poor tools.

The post One-Month of Microsoft DKIM Failure and Thoughts on Technical Excellence appeared first on Bozho's tech blog.